Machine Learning/Kaggle Social Network Contest/load data: Difference between revisions

From Noisebridge
Jump to navigation Jump to search
No edit summary
Line 3: Line 3:


The network can be loaded using the [http://networkx.lanl.gov/reference/generated/networkx.read_edgelist.html read_edgelist] function in networkx or by manually adding edges
The network can be loaded using the [http://networkx.lanl.gov/reference/generated/networkx.read_edgelist.html read_edgelist] function in networkx or by manually adding edges
NOTE: John found that it took up about 5.5GB of memory to load the entire network. We may need to process it in chunks - or maybe decompose it into smaller sub networks.


Method 1
Method 1

Revision as of 23:24, 18 November 2010

How to load the network into networkx

There is a network analysis package for Python called networkx. This package can be installed using easy_install.

The network can be loaded using the read_edgelist function in networkx or by manually adding edges

NOTE: John found that it took up about 5.5GB of memory to load the entire network. We may need to process it in chunks - or maybe decompose it into smaller sub networks.

Method 1

import networkx as nx
DG = nx.read_edgelist('social_train.csv', create_using=nx.DiGraph(), nodetype=int, delimiter=',')


Method 2

import networkx as nx
import csv
import time

t0 = time.clock()
DG = nx.DiGraph()

netcsv = csv.reader(open('social_train.csv', 'rb'), delimiter=',')

for row in netcsv:
    tmp1 = int(row[0])
    tmp2 = int(row[1])
    DG.add_edge(tmp1, tmp2)


print "Loaded in ", str(time.clock() - t0), "s"
Rows 1M 2M 3M
Method 1 20s 53s 103s
Method 2 15s 41s 86s