Machine Learning/Kaggle Social Network Contest/Problem Representation: Difference between revisions

From Noisebridge
Jump to navigation Jump to search
mNo edit summary
Line 28: Line 28:
== Idea B ==  
== Idea B ==  
We could perform some kind of online learning on the network where compute features based on a pair of nodes and then update of parameters. This would take 42 billion steps - which sounds like a lot.
We could perform some kind of online learning on the network where compute features based on a pair of nodes and then update of parameters. This would take 42 billion steps - which sounds like a lot.
== Idea B ==
We could perform some kind of online learning on the network where compute features based on a pair of nodes and then update of parameters. This would take 42 billion steps - which sounds like a lot.
Can whoever added/proposed this please flesh this out more? Curious to explore alternative approaches for sure, even if they seem computationally difficult.

Revision as of 18:53, 19 November 2010

TODO

  • come up with a plan of attack.

Idea A

Construct a huge csv file containing each possible directed link and a bunch of features associated with it, then do some supervised learning on it.

It would have the following format

node_i, node_j, feature_ij_1, feature_ij_2, ...

  • The node_i's would come from the set of sampled users (ie the 38k outbound nodes).
  • The node_j's would come from the union of outbound and inbound nodes (1,133,518 of them)

The length of this would be huge. The file would need about (37689 * 1133547) - 1133547 = 42 721 119 336 rows.

Say each column took up took up 7 characters and there were 12 columns (ie 10 features) we'd have a row of size 84 bytes. This makes it about 3,342 gigabytes

If we just consider the 38k outbound nodes we'd still be dealing with a 112 GB file.

This number could be culled by considering just the nodes in some neighbourhood - but I figure that would only provide us with information about nodes which are connected.


Suggestion on how to tame the size: Perhaps breaking into two csvs with one for data columns about nodes and another for columns about edges. For example:

  • node_features.csv: node_id, inbound_degree, outbound_degree, clustering coefficient, etc, etc.
  • edge_features.csv: node_i, node_j, shortest_distance, density, etc, etc

If both types of data are in one datafile, we'd be probably be duplicating any single-node-centric data points for every single edge row. I understand we might need to ultimately need to create such a single file, but I feel like two files will help keep it manageable as we identify and calculate feature data in the short term.

Idea B

We could perform some kind of online learning on the network where compute features based on a pair of nodes and then update of parameters. This would take 42 billion steps - which sounds like a lot.

Idea B

We could perform some kind of online learning on the network where compute features based on a pair of nodes and then update of parameters. This would take 42 billion steps - which sounds like a lot.

Can whoever added/proposed this please flesh this out more? Curious to explore alternative approaches for sure, even if they seem computationally difficult.