Machine Learning/Kaggle Social Network Contest/Problem Representation
- come up with a plan of attack.
Idea C - two hop neighbourhood
For each node of 38k sampled individuals calculate features based on the a two hop neighbourhood -ie friends of friends.
- these neighbourhoods should make the problem a little more tractable.
Idea A - huge CSV
Construct a huge csv file containing each possible directed link and a bunch of features associated with it, then do some supervised learning on it.
It would have the following format
node_i, node_j, feature_ij_1, feature_ij_2, ...
- The node_i's would come from the set of sampled users (ie the 38k outbound nodes).
- The node_j's would come from the union of outbound and inbound nodes (1,133,518 of them)
The length of this would be huge. The file would need about (37689 * 1133547) - 1133547 = 42 721 119 336 rows.
Say each column took up took up 7 characters and there were 12 columns (ie 10 features) we'd have a row of size 84 bytes. This makes it about 3,342 gigabytes
If we just consider the 38k outbound nodes we'd still be dealing with a 112 GB file.
This number could be culled by considering just the nodes in some neighbourhood - but I figure that would only provide us with information about nodes which are connected.
Suggestion on how to tame the size: Perhaps breaking into two csvs with one for data columns about nodes and another for columns about edges. For example:
- node_features.csv: node_id, inbound_degree, outbound_degree, clustering coefficient, etc, etc.
- edge_features.csv: node_i, node_j, shortest_distance, density, etc, etc
If both types of data are in one datafile, we'd be probably be duplicating any single-node-centric data points for every single edge row. I understand we might need to ultimately need to create such a single file, but I feel like two files will help keep it manageable as we identify and calculate feature data in the short term.
UPDATE: Jared is working on an initial edge_features.csv with just 3 columns (node_i, node_j, shortest_distance) and is already hitting some memory and storage limits. Traversing the graph isn't the hard part it's the shear number of total possible edges to output data for (as noted above). Rerunning now after some code improvements...crossing fingers. Might be ok on memory now but I'll possibly not have enough disk space on this machine for even shortest path feature. We might need to set up Hadoop or something if we wanna take this approach! ;)
Idea B - online learning
We could perform some kind of online learning on the network where compute features based on a pair of nodes and then update of parameters. This would take 42 billion steps - which sounds like a lot.
Can whoever added/proposed this please flesh this out more? Curious to explore alternative approaches for sure, even if they seem computationally difficult.