Machine Learning/Kaggle Social Network Contest/Problem Representation

From Noisebridge
< Machine Learning | Kaggle Social Network Contest(Difference between revisions)
Jump to: navigation, search
(Created page with '== TODO == * someone with large memory (>5.5GB) double check the number of unique nodes by loading it in networkx * come up with a plan of attack. == Idea A == Construct a huge…')
 
m
 
(11 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 
== TODO ==
 
== TODO ==
* someone with large memory (>5.5GB) double check the number of unique nodes by loading it in networkx
 
 
* come up with a plan of attack.
 
* come up with a plan of attack.
  
== Idea A ==  
+
== Idea C - two hop neighbourhood ==
 +
For each node of 38k sampled individuals calculate features based on the a two hop neighbourhood -ie friends of friends.
 +
* these neighbourhoods should make the problem a little more tractable.
 +
 
 +
 
 +
== Idea A - huge CSV==  
 
Construct a huge csv file containing each possible directed link and a bunch of features associated with it, then do some supervised learning on it.
 
Construct a huge csv file containing each possible directed link and a bunch of features associated with it, then do some supervised learning on it.
  
Line 10: Line 14:
 
node_i, node_j, feature_ij_1, feature_ij_2, ...
 
node_i, node_j, feature_ij_1, feature_ij_2, ...
  
The length of this would be long. When loading 3M rows of the edge list file I get 732166 nodes which means that this file would need  (732 166^2) - 732 166 = 536 066 319 390 rows.  
+
* The node_i's would come from the set of sampled users (ie the 38k outbound nodes).
 +
* The node_j's would come from the union of outbound and inbound nodes (1,133,518 of them)
 +
 
 +
The length of this would be huge. The file would need about (37689 * 1133547) - 1133547 = 42 721 119 336 rows.
 +
 
 +
Say each column took up took up 7 characters and there were 12 columns (ie 10 features) we'd have a row of size 84 bytes. This makes it about 3,342 gigabytes
 +
 +
If we just consider the 38k outbound nodes we'd still be dealing with a 112 GB file.
 +
 
 +
This number could be culled by considering just the nodes in some neighbourhood - but I figure that would only provide us with information about nodes which are connected.
  
Say each column took up took up 7 characters and there were 12 columns (ie 10 features) we'd have a row of size 84 bytes. This makes it about 4.5 x10^13 bytes = 41 937 gigabytes
 
  
This is just if we use the first 3 million rows.
+
Suggestion on how to tame the size: Perhaps breaking into two csvs with one for data columns about nodes and another for columns about edges. For example:
 +
* node_features.csv: node_id, inbound_degree, outbound_degree, clustering coefficient, etc, etc.
 +
* edge_features.csv: node_i, node_j, shortest_distance, density, etc, etc
 +
If both types of data are in one datafile, we'd be probably be duplicating any single-node-centric data points for every single edge row.  I understand we might need to ultimately need to create such a single file, but I feel like two files will help keep it manageable as we identify and calculate feature data in the short term.
  
(Note if I have miscounted the number of unique nodes and there really are only 38k we'd still be dealing with a 112 GB file.)
+
UPDATE: Network Size is definitely going to be an issue.  Traversing the network to calculate shortest_distance crashed and burned with memory shortfalls.  Initially I thought it was manageable until I allowed my code to use the entire adjacency list.  Even without outputting data for edges with no path between them, I think the data dump storage itself would also be a problem. I am going to try rerunning the shortest_distance search with a max depth limit of <strike>six</strike> <strike>three</strike> two to see if that makes it more manageable.<br/>
 +
UPDATE 2: The BFS search approach I was taking to path finding was too slow since I kept retracing the data, so I just tried scripting a dump of all "friend-of-friends" edges.  After 5.5 hours computation time I have "friend-of-friend" edge dump but it's a 6.5GB file. That said, I accidentally included duplicate edges, so hopefully after deduping it will be um smaller.
  
This number could be culled by considering just the nodes in some neighbourhood - but I figure that would only provide us with information about nodes which are connected.  
+
== Idea B - online learning ==
 +
We could perform some kind of online learning on the network where compute features based on a pair of nodes and then update of parameters. This would take 42 billion steps - which sounds like a lot.
  
== Idea B ==
+
Can whoever added/proposed this please flesh this out more? Curious to explore alternative approaches for sure, even if they seem computationally difficult.
We could perform some kind of online learning on the network where compute features based on a pair of nodes and then update of parameters. This would take 500 billion steps - which sounds like a lot (again just based on the first 3M rows from the edge file).
+

Latest revision as of 22:26, 21 November 2010

Contents

[edit] TODO

  • come up with a plan of attack.

[edit] Idea C - two hop neighbourhood

For each node of 38k sampled individuals calculate features based on the a two hop neighbourhood -ie friends of friends.

  • these neighbourhoods should make the problem a little more tractable.


[edit] Idea A - huge CSV

Construct a huge csv file containing each possible directed link and a bunch of features associated with it, then do some supervised learning on it.

It would have the following format

node_i, node_j, feature_ij_1, feature_ij_2, ...

  • The node_i's would come from the set of sampled users (ie the 38k outbound nodes).
  • The node_j's would come from the union of outbound and inbound nodes (1,133,518 of them)

The length of this would be huge. The file would need about (37689 * 1133547) - 1133547 = 42 721 119 336 rows.

Say each column took up took up 7 characters and there were 12 columns (ie 10 features) we'd have a row of size 84 bytes. This makes it about 3,342 gigabytes

If we just consider the 38k outbound nodes we'd still be dealing with a 112 GB file.

This number could be culled by considering just the nodes in some neighbourhood - but I figure that would only provide us with information about nodes which are connected.


Suggestion on how to tame the size: Perhaps breaking into two csvs with one for data columns about nodes and another for columns about edges. For example:

  • node_features.csv: node_id, inbound_degree, outbound_degree, clustering coefficient, etc, etc.
  • edge_features.csv: node_i, node_j, shortest_distance, density, etc, etc

If both types of data are in one datafile, we'd be probably be duplicating any single-node-centric data points for every single edge row. I understand we might need to ultimately need to create such a single file, but I feel like two files will help keep it manageable as we identify and calculate feature data in the short term.

UPDATE: Network Size is definitely going to be an issue. Traversing the network to calculate shortest_distance crashed and burned with memory shortfalls. Initially I thought it was manageable until I allowed my code to use the entire adjacency list. Even without outputting data for edges with no path between them, I think the data dump storage itself would also be a problem. I am going to try rerunning the shortest_distance search with a max depth limit of six three two to see if that makes it more manageable.
UPDATE 2: The BFS search approach I was taking to path finding was too slow since I kept retracing the data, so I just tried scripting a dump of all "friend-of-friends" edges. After 5.5 hours computation time I have "friend-of-friend" edge dump but it's a 6.5GB file. That said, I accidentally included duplicate edges, so hopefully after deduping it will be um smaller.

[edit] Idea B - online learning

We could perform some kind of online learning on the network where compute features based on a pair of nodes and then update of parameters. This would take 42 billion steps - which sounds like a lot.

Can whoever added/proposed this please flesh this out more? Curious to explore alternative approaches for sure, even if they seem computationally difficult.

Personal tools