Machine Learning/Kaggle Social Network Contest/Network Description: Difference between revisions

From Noisebridge
Jump to navigation Jump to search
(→‎Conectivity: effect of adding test data on clusters)
Line 9: Line 9:
"A digraph is strongly connected if every vertex is reachable from every other following the directions of the arcs. On the contrary, a digraph is weakly connected if its underlying undirected graph is connected. A weakly connected graph can be thought of as a digraph in which every vertex is "reachable" from every other but not necessarily following the directions of the arcs. A strong orientation is an orientation that produces a strongly connected digraph." [http://en.wikipedia.org/wiki/Glossary_of_graph_theory wikipedia]
"A digraph is strongly connected if every vertex is reachable from every other following the directions of the arcs. On the contrary, a digraph is weakly connected if its underlying undirected graph is connected. A weakly connected graph can be thought of as a digraph in which every vertex is "reachable" from every other but not necessarily following the directions of the arcs. A strong orientation is an orientation that produces a strongly connected digraph." [http://en.wikipedia.org/wiki/Glossary_of_graph_theory wikipedia]


* The Graph is '''not''' weakly connected
* The Training Graph is '''not''' weakly connected
* It contains 27 subgraphs This means that it can be broken down into at least two discrete subgraphs.
* It contains 27 subgraphs This means that it can be broken down into at least two discrete subgraphs.
** c.f. [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/clusters.html igraph clustering]
** c.f. [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/clusters.html igraph clustering]
Line 38: Line 38:
|    1
|    1
|}
|}
When I added all of the test data to the graph and then re-ran the cluster analysis it found 22 clusters instead of 27. The largest cluster grew by 72 vertices.
{| border="1"
|-
!| Cluster Size
| 1 
|  2 
|  3 
|  4 
|  5   
|  7 
| 10
| 23
| 37
| 1133394
| 1133466
|-
!| Train
| 1
| 13
| 3
| 2
| 2
| 1
| 1
| 2
| 1
| 1
| 0
|-   
!| Train + Test
| 1
| 13
| 2
| 1
| 1
| 1
| 1
| 1
| 0
| 0
| 1
|}
Is it more likely that clusters were created by removing nodes or that they merged due to randomly adding nodes?
* TODO figure out probs of adding and removing nodes under different sampling hypotheses.
* I'm guessing that the chances of a randomly generated edge joins the small clusters is very low.
* Diameter of the directed graph is 14
* Diameter of the directed graph is 14
** This is the longest of the shortest directed paths between two nodes
** This is the longest of the shortest directed paths between two nodes

Revision as of 00:53, 24 November 2010

Here we can put the descriptive statistics of the network:

  • Number of fully sampled nodes: 37,689
    • ie the unique "outnodes" in the edge list
  • Total number of nodes: 1,133,547
  • number of edges: 7,237,983

Conectivity

"A digraph is strongly connected if every vertex is reachable from every other following the directions of the arcs. On the contrary, a digraph is weakly connected if its underlying undirected graph is connected. A weakly connected graph can be thought of as a digraph in which every vertex is "reachable" from every other but not necessarily following the directions of the arcs. A strong orientation is an orientation that produces a strongly connected digraph." wikipedia

  • The Training Graph is not weakly connected
  • It contains 27 subgraphs This means that it can be broken down into at least two discrete subgraphs.
    • c.f. igraph clustering
    • There is one very large cluster containing all but 154 verticies, then 4 with size 10 - 37, 8 sized 3 - 7 and 13 size 2
    • note that igraph seems to create a vertex labelled 0 but the labels in the traindata file range from 1 to 1133547
  • I also grabbed the number of strongly connected subgraphs
Cluster Size 1 2 3 4 5 9 10 32464
freq 1100647 162 18 5 4 1 1 1

When I added all of the test data to the graph and then re-ran the cluster analysis it found 22 clusters instead of 27. The largest cluster grew by 72 vertices.

Cluster Size 1 2 3 4 5 7 10 23 37 1133394 1133466
Train 1 13 3 2 2 1 1 2 1 1 0
Train + Test 1 13 2 1 1 1 1 1 0 0 1

Is it more likely that clusters were created by removing nodes or that they merged due to randomly adding nodes?

  • TODO figure out probs of adding and removing nodes under different sampling hypotheses.
  • I'm guessing that the chances of a randomly generated edge joins the small clusters is very low.
  • Diameter of the directed graph is 14
    • This is the longest of the shortest directed paths between two nodes
    • R igraph
      • diameter (dg, directed = TRUE, unconnected = TRUE)
      • Was taking forever so I aborted (after 34 minutes...)
  • Total number of direct neighbours out: 7 275 672, in: 508 688, all: 7 473 273
    • For each of our 38k I calculated the number of outbound neighbours and summed it
    • R igraph: