Machine Learning/Kaggle Social Network Contest/Problem Representation: Difference between revisions

Latest revision as of 22:26, 21 November 2010

TODO[edit]

come up with a plan of attack.

Idea C - two hop neighbourhood[edit]

For each node of 38k sampled individuals calculate features based on the a two hop neighbourhood -ie friends of friends.

these neighbourhoods should make the problem a little more tractable.

Idea A - huge CSV[edit]

Construct a huge csv file containing each possible directed link and a bunch of features associated with it, then do some supervised learning on it.

It would have the following format

node_i, node_j, feature_ij_1, feature_ij_2, ...

The node_i's would come from the set of sampled users (ie the 38k outbound nodes).
The node_j's would come from the union of outbound and inbound nodes (1,133,518 of them)

The length of this would be huge. The file would need about (37689 * 1133547) - 1133547 = 42 721 119 336 rows.

Say each column took up took up 7 characters and there were 12 columns (ie 10 features) we'd have a row of size 84 bytes. This makes it about 3,342 gigabytes

If we just consider the 38k outbound nodes we'd still be dealing with a 112 GB file.

This number could be culled by considering just the nodes in some neighbourhood - but I figure that would only provide us with information about nodes which are connected.

Suggestion on how to tame the size: Perhaps breaking into two csvs with one for data columns about nodes and another for columns about edges. For example:

node_features.csv: node_id, inbound_degree, outbound_degree, clustering coefficient, etc, etc.
edge_features.csv: node_i, node_j, shortest_distance, density, etc, etc

If both types of data are in one datafile, we'd be probably be duplicating any single-node-centric data points for every single edge row. I understand we might need to ultimately need to create such a single file, but I feel like two files will help keep it manageable as we identify and calculate feature data in the short term.

UPDATE: Network Size is definitely going to be an issue. Traversing the network to calculate shortest_distance crashed and burned with memory shortfalls. Initially I thought it was manageable until I allowed my code to use the entire adjacency list. Even without outputting data for edges with no path between them, I think the data dump storage itself would also be a problem. I am going to try rerunning the shortest_distance search with a max depth limit of ~~six~~ ~~three~~ two to see if that makes it more manageable.
UPDATE 2: The BFS search approach I was taking to path finding was too slow since I kept retracing the data, so I just tried scripting a dump of all "friend-of-friends" edges. After 5.5 hours computation time I have "friend-of-friend" edge dump but it's a 6.5GB file. That said, I accidentally included duplicate edges, so hopefully after deduping it will be um smaller.

Idea B - online learning[edit]

We could perform some kind of online learning on the network where compute features based on a pair of nodes and then update of parameters. This would take 42 billion steps - which sounds like a lot.

Can whoever added/proposed this please flesh this out more? Curious to explore alternative approaches for sure, even if they seem computationally difficult.

@@ Line 1: / Line 1: @@
 == TODO ==
-* someone with large memory (>5.5GB) double check the number of unique nodes by loading it in networkx
 * come up with a plan of attack.
-== Idea A ==
+== Idea C - two hop neighbourhood ==
+For each node of 38k sampled individuals calculate features based on the a two hop neighbourhood -ie friends of friends.
+* these neighbourhoods should make the problem a little more tractable.
+== Idea A - huge CSV==
 Construct a huge csv file containing each possible directed link and a bunch of features associated with it, then do some supervised learning on it.
@@ Line 10: / Line 14: @@
 node_i, node_j, feature_ij_1, feature_ij_2, ...
-The length of this would be long. When loading 3M rows of the edge list file I get 732166 nodes which means that this file would need  (732 166^2) - 732 166 = 536 066 319 390 rows.
+* The node_i's would come from the set of sampled users (ie the 38k outbound nodes).
+* The node_j's would come from the union of outbound and inbound nodes (1,133,518 of them)
+The length of this would be huge. The file would need about  (37689 * 1133547) - 1133547 = 42 721 119 336 rows.
+Say each column took up took up 7 characters and there were 12 columns (ie 10 features) we'd have a row of size 84 bytes. This makes it about 3,342 gigabytes
+If we just consider the 38k outbound nodes we'd still be dealing with a 112 GB file.
+This number could be culled by considering just the nodes in some neighbourhood - but I figure that would only provide us with information about nodes which are connected.
-Say each column took up took up 7 characters and there were 12 columns (ie 10 features) we'd have a row of size 84 bytes. This makes it about 4.5 x10^13 bytes = 41 937 gigabytes
-This is just if we use the first 3 million rows.
+Suggestion on how to tame the size: Perhaps breaking into two csvs with one for data columns about nodes and another for columns about edges. For example:
+* node_features.csv: node_id, inbound_degree, outbound_degree, clustering coefficient, etc, etc.
+* edge_features.csv: node_i, node_j, shortest_distance, density, etc, etc
+If both types of data are in one datafile, we'd be probably be duplicating any single-node-centric data points for every single edge row.  I understand we might need to ultimately need to create such a single file, but I feel like two files will help keep it manageable as we identify and calculate feature data in the short term.
-(Note if I have miscounted the number of unique nodes and there really are only 38k we'd still be dealing with a 112 GB file.)
+UPDATE: Network Size is definitely going to be an issue.  Traversing the network to calculate shortest_distance crashed and burned with memory shortfalls.  Initially I thought it was manageable until I allowed my code to use the entire adjacency list.  Even without outputting data for edges with no path between them, I think the data dump storage itself would also be a problem. I am going to try rerunning the shortest_distance search with a max depth limit of <strike>six</strike> <strike>three</strike> two to see if that makes it more manageable.<br/>
+UPDATE 2: The BFS search approach I was taking to path finding was too slow since I kept retracing the data, so I just tried scripting a dump of all "friend-of-friends" edges.  After 5.5 hours computation time I have "friend-of-friend" edge dump but it's a 6.5GB file.  That said, I accidentally included duplicate edges, so hopefully after deduping it will be um smaller.
-This number could be culled by considering just the nodes in some neighbourhood - but I figure that would only provide us with information about nodes which are connected.
+== Idea B - online learning ==
+We could perform some kind of online learning on the network where compute features based on a pair of nodes and then update of parameters. This would take 42 billion steps - which sounds like a lot.
-== Idea B ==
+Can whoever added/proposed this please flesh this out more? Curious to explore alternative approaches for sure, even if they seem computationally difficult.
-We could perform some kind of online learning on the network where compute features based on a pair of nodes and then update of parameters. This would take 500 billion steps - which sounds like a lot (again just based on the first 3M rows from the edge file).

Machine Learning/Kaggle Social Network Contest/Problem Representation: Difference between revisions

Latest revision as of 22:26, 21 November 2010

Contents

TODO[edit]

Idea C - two hop neighbourhood[edit]

Idea A - huge CSV[edit]

Idea B - online learning[edit]

Navigation menu

Machine Learning/Kaggle Social Network Contest/Problem Representation: Difference between revisions

Latest revision as of 22:26, 21 November 2010

TODO[edit]

Idea C - two hop neighbourhood[edit]

Idea A - huge CSV[edit]

Idea B - online learning[edit]

Navigation menu

Search