# Machine Learning/Kaggle Social Network Contest/Network Description

(→Conectivity) |
(→Conectivity: effect of adding test data on clusters) |
||

Line 9: | Line 9: | ||

"A digraph is strongly connected if every vertex is reachable from every other following the directions of the arcs. On the contrary, a digraph is weakly connected if its underlying undirected graph is connected. A weakly connected graph can be thought of as a digraph in which every vertex is "reachable" from every other but not necessarily following the directions of the arcs. A strong orientation is an orientation that produces a strongly connected digraph." [http://en.wikipedia.org/wiki/Glossary_of_graph_theory wikipedia] | "A digraph is strongly connected if every vertex is reachable from every other following the directions of the arcs. On the contrary, a digraph is weakly connected if its underlying undirected graph is connected. A weakly connected graph can be thought of as a digraph in which every vertex is "reachable" from every other but not necessarily following the directions of the arcs. A strong orientation is an orientation that produces a strongly connected digraph." [http://en.wikipedia.org/wiki/Glossary_of_graph_theory wikipedia] | ||

− | * The Graph is '''not''' weakly connected | + | * The Training Graph is '''not''' weakly connected |

* It contains 27 subgraphs This means that it can be broken down into at least two discrete subgraphs. | * It contains 27 subgraphs This means that it can be broken down into at least two discrete subgraphs. | ||

** c.f. [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/clusters.html igraph clustering] | ** c.f. [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/clusters.html igraph clustering] | ||

Line 38: | Line 38: | ||

| 1 | | 1 | ||

|} | |} | ||

+ | |||

+ | When I added all of the test data to the graph and then re-ran the cluster analysis it found 22 clusters instead of 27. The largest cluster grew by 72 vertices. | ||

+ | {| border="1" | ||

+ | |- | ||

+ | !| Cluster Size | ||

+ | | 1 | ||

+ | | 2 | ||

+ | | 3 | ||

+ | | 4 | ||

+ | | 5 | ||

+ | | 7 | ||

+ | | 10 | ||

+ | | 23 | ||

+ | | 37 | ||

+ | | 1133394 | ||

+ | | 1133466 | ||

+ | |- | ||

+ | !| Train | ||

+ | | 1 | ||

+ | | 13 | ||

+ | | 3 | ||

+ | | 2 | ||

+ | | 2 | ||

+ | | 1 | ||

+ | | 1 | ||

+ | | 2 | ||

+ | | 1 | ||

+ | | 1 | ||

+ | | 0 | ||

+ | |- | ||

+ | !| Train + Test | ||

+ | | 1 | ||

+ | | 13 | ||

+ | | 2 | ||

+ | | 1 | ||

+ | | 1 | ||

+ | | 1 | ||

+ | | 1 | ||

+ | | 1 | ||

+ | | 0 | ||

+ | | 0 | ||

+ | | 1 | ||

+ | |} | ||

+ | |||

+ | Is it more likely that clusters were created by removing nodes or that they merged due to randomly adding nodes? | ||

+ | * TODO figure out probs of adding and removing nodes under different sampling hypotheses. | ||

+ | * I'm guessing that the chances of a randomly generated edge joins the small clusters is very low. | ||

+ | |||

* Diameter of the directed graph is 14 | * Diameter of the directed graph is 14 | ||

** This is the longest of the shortest directed paths between two nodes | ** This is the longest of the shortest directed paths between two nodes |

## Revision as of 00:53, 24 November 2010

Here we can put the descriptive statistics of the network:

- Number of fully sampled nodes: 37,689
- ie the unique "outnodes" in the edge list

- Total number of nodes: 1,133,547
- number of edges: 7,237,983

## Conectivity

"A digraph is strongly connected if every vertex is reachable from every other following the directions of the arcs. On the contrary, a digraph is weakly connected if its underlying undirected graph is connected. A weakly connected graph can be thought of as a digraph in which every vertex is "reachable" from every other but not necessarily following the directions of the arcs. A strong orientation is an orientation that produces a strongly connected digraph." wikipedia

- The Training Graph is
**not**weakly connected - It contains 27 subgraphs This means that it can be broken down into at least two discrete subgraphs.
- c.f. igraph clustering
- There is one very large cluster containing all but 154 verticies, then 4 with size 10 - 37, 8 sized 3 - 7 and 13 size 2
- note that igraph seems to create a vertex labelled 0 but the labels in the traindata file range from 1 to 1133547

- I also grabbed the number of strongly connected subgraphs

Cluster Size | 1 | 2 | 3 | 4 | 5 | 9 | 10 | 32464 |
---|---|---|---|---|---|---|---|---|

freq | 1100647 | 162 | 18 | 5 | 4 | 1 | 1 | 1 |

When I added all of the test data to the graph and then re-ran the cluster analysis it found 22 clusters instead of 27. The largest cluster grew by 72 vertices.

Cluster Size | 1 | 2 | 3 | 4 | 5 | 7 | 10 | 23 | 37 | 1133394 | 1133466 |
---|---|---|---|---|---|---|---|---|---|---|---|

Train | 1 | 13 | 3 | 2 | 2 | 1 | 1 | 2 | 1 | 1 | 0 |

Train + Test | 1 | 13 | 2 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |

Is it more likely that clusters were created by removing nodes or that they merged due to randomly adding nodes?

- TODO figure out probs of adding and removing nodes under different sampling hypotheses.
- I'm guessing that the chances of a randomly generated edge joins the small clusters is very low.

- Diameter of the directed graph is 14
- This is the longest of the shortest directed paths between two nodes
- R igraph
- diameter (dg, directed = TRUE, unconnected = TRUE)
- Was taking forever so I aborted (after 34 minutes...)

- Total number of direct neighbours out: 7 275 672, in: 508 688, all: 7 473 273
- For each of our 38k I calculated the number of outbound neighbours and summed it
- R igraph:
- sum(neighborhood.size(dg, 1, nodes=myGuys, mode="out"))
- mode = "in", "out" or "all"