Machine Learning/Kaggle Social Network Contest/load data

From Noisebridge

Revision as of 23:19, 18 November 2010

How to load the network into networkx

networkx is a network analysis package for Python. It can be installed using easy_install.

The network can be loaded either with the read_edgelist function in networkx or by adding edges manually.

Method 1

import networkx as nx
DG = nx.read_edgelist('social_train.csv', create_using=nx.DiGraph(), nodetype=int, delimiter=',')
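To see what the arguments in Method 1 do, here is a minimal sketch that runs the same call on a tiny stand-in file. The three edges and the temporary file are made up for illustration; the real social_train.csv is assumed to have the same two-column, comma-separated layout.

```python
import os
import tempfile

import networkx as nx

# Hypothetical three-edge sample standing in for social_train.csv.
sample = "1,2\n1,3\n2,3\n"

# Write the sample to a temporary file so read_edgelist can open it by path.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write(sample)
    path = f.name

try:
    # Same call as Method 1, pointed at the sample file.
    DG = nx.read_edgelist(path, create_using=nx.DiGraph(), nodetype=int, delimiter=",")
finally:
    os.remove(path)

print(DG.number_of_nodes(), DG.number_of_edges())
print(DG.has_edge(1, 2), DG.has_edge(2, 1))
```

Here nodetype=int turns the node labels into integers rather than strings, and create_using=nx.DiGraph() keeps the edge direction, so the edge from 1 to 2 exists while the edge from 2 to 1 does not.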


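Loading 1M rows of the edge list this way took 21s on a Mac Pro with 3 GB of memory and a 2.8 GHz quad-core processor. An alternative method, which seems to run faster for me (Joe), loads the same rows in about 15s.

Method 2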

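import csv
import time

import networkx as nx

# time.clock() was removed in Python 3.8; perf_counter() is the replacement timer.
t0 = time.perf_counter()
DG = nx.DiGraph()

# Each row is "source,target". In Python 3 the csv module expects a text-mode
# file opened with newline=''.
with open('social_train.csv', newline='') as f:
    for row in csv.reader(f, delimiter=','):
        DG.add_edge(int(row[0]), int(row[1]))

print("Loaded in", time.perf_counter() - t0, "s")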
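Since the two methods are meant to be interchangeable, it may be worth checking that they build the same graph. The following is a sketch for Python 3 under stated assumptions: the helper names load_with_read_edgelist and load_with_csv are hypothetical, and the four-row sample is invented for the check.

```python
import csv
import os
import tempfile

import networkx as nx


def load_with_read_edgelist(path):
    """Method 1: let networkx parse the file."""
    return nx.read_edgelist(path, create_using=nx.DiGraph(), nodetype=int, delimiter=",")


def load_with_csv(path):
    """Method 2: parse with the csv module and add edges one at a time."""
    graph = nx.DiGraph()
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=","):
            graph.add_edge(int(row[0]), int(row[1]))
    return graph


# Hypothetical sample data; the repeated "1,2" row should collapse to one edge.
sample = "1,2\n2,3\n3,1\n1,2\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write(sample)
    path = f.name

try:
    g1 = load_with_read_edgelist(path)
    g2 = load_with_csv(path)
finally:
    os.remove(path)

# Compare node sets and directed edge sets of the two graphs.
same = set(g1.edges()) == set(g2.edges()) and set(g1.nodes()) == set(g2.nodes())
print("Graphs match:", same)
```

A DiGraph stores at most one edge per ordered pair of nodes, so duplicate rows in the file do not change the result with either method.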

Load time by number of edge-list rows:

Rows        1M     2M     3M
Method 1    20s    53s    103s
Method 2    15s    41s    86s
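
The timings above can be turned into rows per second and a relative speedup with a few lines of arithmetic. This is only a sketch: the numbers are copied from the table, and the variable names are chosen for illustration.

```python
# Timings in seconds, copied from the table above.
rows = [1_000_000, 2_000_000, 3_000_000]
timings = {
    "Method 1": [20, 53, 103],
    "Method 2": [15, 41, 86],
}

# Rows loaded per second for each method and file size.
throughput = {
    name: [n / t for n, t in zip(rows, secs)]
    for name, secs in timings.items()
}

# How many times faster Method 2 is than Method 1 at each size.
speedup = [m1 / m2 for m1, m2 in zip(timings["Method 1"], timings["Method 2"])]

for name, rates in throughput.items():
    print(name, [round(r) for r in rates])
print("speedup", [round(s, 2) for s in speedup])
```

On these figures both methods slow down per row as the graph grows (Method 1 falls from 50,000 rows/s at 1M rows to roughly 29,000 rows/s at 3M), and Method 2's advantage shrinks from about 1.33x to about 1.20x.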