Machine Learning/Kaggle Social Network Contest/load data

From Noisebridge
Jump to navigation Jump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

How to load the network into networkx

There is a network analysis package for Python called networkx. This package can be installed using easy_install.

The network can be loaded using the read_edgelist function in networkx or by manually adding edges

NOTE: John found that it took up about 5.5GB of memory to load the entire network. We may need to process it in chunks - or maybe decompose it into smaller sub networks.

Method 1

import networkx as nx
DG = nx.read_edgelist('social_train.csv', create_using=nx.DiGraph(), nodetype=int, delimiter=',')


Method 2

import networkx as nx
import csv
import time

t0 = time.clock()
DG = nx.DiGraph()

netcsv = csv.reader(open('social_train.csv', 'rb'), delimiter=',')

for row in netcsv:
    tmp1 = int(row[0])
    tmp2 = int(row[1])
    DG.add_edge(tmp1, tmp2)


print "Loaded in ", str(time.clock() - t0), "s"

Below is the time to load different numbers of row using the two methods on a 2.8Ghz Quad core machine with 3GB RAM. The second method seems quicker. Note that these are just based on single loads and are intended to be a guide rather than a rigorous analysis of the methods!

Rows 1M 2M 3M
Method 1 20s 53s 103s
Method 2 15s 41s 86s