Machine Learning/Kaggle Social Network Contest/Problem Representation

From Noisebridge

Revision as of 00:40, 19 November 2010

TODO

  • come up with a plan of attack.

Idea A

Construct a huge CSV file containing each possible directed link and a set of features associated with it, then do some supervised learning on it.

It would have the following format:

node_i, node_j, feature_ij_1, feature_ij_2, ...

  • The node_i's would come from the set of sampled users (i.e. the 38k outbound nodes).
  • The node_j's would come from the union of outbound and inbound nodes (1,133,518 of them); a rough sketch of building such a file follows below.
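Here is a rough sketch of how such a file could be generated with networkx and the csv module. This is only a sketch under assumptions: the edge-list file name (train.csv), the output name (pair_features.csv), and the particular features (already-follows flag, common neighbours, endpoint degrees) are illustrative choices, not part of the contest spec.

 # Sketch of Idea A: enumerate (node_i, node_j) candidate pairs and write one
 # feature row per pair. File names and features are illustrative assumptions.
 import csv
 import networkx as nx
 
 G = nx.read_edgelist("train.csv", delimiter=",", create_using=nx.DiGraph())
 
 outbound = [n for n in G.nodes() if G.out_degree(n) > 0]   # the ~38k sampled users
 all_nodes = list(G.nodes())                                # outbound plus inbound nodes
 
 with open("pair_features.csv", "w", newline="") as f:
     writer = csv.writer(f)
     for i in outbound:
         followees = set(G.successors(i))
         for j in all_nodes:
             if i == j:
                 continue
             # Example features: already-follows flag, number of directed
             # two-step paths i -> x -> j, and the endpoint degrees.
             common = len(followees & set(G.predecessors(j)))
             writer.writerow([i, j, int(j in followees), common,
                              G.out_degree(i), G.in_degree(j)])

The inner loop over every possible node_j is exactly what makes the file blow up to the row counts estimated below.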

The length of this file would be huge: it would need about (37,689 * 1,133,547) - 1,133,547 = 42,721,119,336 rows.

Say each column took up 7 characters and there were 12 columns (i.e. 10 features); we'd have a row size of 84 bytes, which makes the whole file about 3,342 gigabytes.
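As a quick back-of-the-envelope check of those figures (the node counts are the estimates quoted above):

 # Back-of-the-envelope check of the row and size estimates above.
 n_outbound = 37_689           # sampled users (node_i)
 n_all = 1_133_547             # union of outbound and inbound nodes (node_j)
 
 rows = n_outbound * n_all - n_all         # 42,721,119,336 rows, as above
 bytes_per_row = 12 * 7                    # 12 columns x 7 characters = 84 bytes
 total_gib = rows * bytes_per_row / 2**30  # bytes -> gibibytes
 
 print(rows, round(total_gib))             # 42721119336 3342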

(Note: if I have miscounted the number of unique nodes and there really are only 38k of them, we'd still be dealing with a roughly 112 GB file.)

This number could be cut down by considering just the nodes in some neighbourhood, but I figure that would only provide us with information about nodes which are already connected.
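For what that culling might look like, here is a sketch that restricts node_j candidates to nodes within two hops of node_i in the undirected version of the graph; the cutoff of 2 and the file name are assumed choices.

 # Sketch: restrict node_j candidates to a local neighbourhood of node_i.
 import networkx as nx
 
 G = nx.read_edgelist("train.csv", delimiter=",", create_using=nx.DiGraph())
 UG = G.to_undirected()
 
 def neighbourhood_candidates(i, cutoff=2):
     """Return nodes within `cutoff` hops of i (excluding i itself)."""
     lengths = nx.single_source_shortest_path_length(UG, i, cutoff=cutoff)
     return [j for j in lengths if j != i]

The trade-off is the one noted above: every distant (and therefore unconnected) pair gets dropped from the training data.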

Idea B

We could perform some kind of online learning on the network, where we compute features for a pair of nodes and then update the parameters. This would take 42 billion steps, which sounds like a lot.
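A minimal sketch of what one such update could look like, assuming a logistic-regression-style SGD step over hand-built pair features; the learning rate, the four-feature layout, and the pair_features helper named in the final comment are hypothetical.

 # Sketch of Idea B: one SGD update per candidate (node_i, node_j) pair,
 # using logistic regression over a small feature vector.
 import math
 
 weights = [0.0] * 4    # one weight per pair feature (assumed layout)
 lr = 0.01              # learning rate (assumed)
 
 def sgd_update(features, label):
     """Single logistic-regression SGD step on one pair's features."""
     z = sum(w * x for w, x in zip(weights, features))
     z = max(min(z, 30.0), -30.0)          # clamp to avoid overflow in exp
     p = 1.0 / (1.0 + math.exp(-z))
     for k, x in enumerate(features):
         weights[k] -= lr * (p - label) * x
 
 # Streaming over the ~42 billion pairs means calling this once per pair,
 # e.g. sgd_update(pair_features(i, j), 1 if the edge i -> j exists else 0),
 # where pair_features is a hypothetical helper computing the Idea A features.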
