# Machine Learning/Kaggle Social Network Contest/load data


#  R

##  igraph

The full dataset loads quickly with the R package igraph. With the full dataset loaded, R uses less than 900 MB of RAM.

Grab the package with:

```
install.packages("igraph")
```

```
library(igraph)
# Read the two-column edge list and build a directed graph from it.
data <- as.matrix(read.csv("social_train.csv", header = FALSE))
dg <- graph.edgelist(data, directed = TRUE)
```

Note that the resulting graph contains an additional vertex with id zero, which has no edges. If you delete this vertex, the id names will not be preserved, so it is best to leave it in.

# Python

##  How to load the network into networkx

There is a network analysis package for Python called networkx. This package can be installed using easy_install.

NOTE: John found that loading the entire network took about 5.5 GB of memory. We may need to process it in chunks, or decompose it into smaller sub-networks.
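One way the chunked approach might look is sketched below. It uses only the standard library and assumes the file is a two-column `source,target` edge list, as both loading methods on this page do. The function names `iter_edge_chunks` and `out_degree_counts` and the default chunk size are made up for illustration; nobody on the team has benchmarked this.

```python
import csv
import itertools


def iter_edge_chunks(lines, chunk_size=1_000_000):
    """Yield lists of (source, target) integer pairs, at most chunk_size long.

    `lines` is any iterable of CSV lines, e.g. an open file object.
    """
    reader = csv.reader(lines)
    rows = (row for row in reader if row)  # skip blank lines
    while True:
        chunk = [(int(row[0]), int(row[1]))
                 for row in itertools.islice(rows, chunk_size)]
        if not chunk:
            return
        yield chunk


def out_degree_counts(lines, chunk_size=1_000_000):
    """Count outgoing edges per node while holding only one chunk of edges."""
    counts = {}
    for chunk in iter_edge_chunks(lines, chunk_size):
        for src, _dst in chunk:
            counts[src] = counts.get(src, 0) + 1
    return counts
```

To try it on the contest file, something like `with open('social_train.csv', newline='') as f: counts = out_degree_counts(f)` should work. Only per-node statistics stay in memory, so this suits features that can be computed edge by edge; anything that needs the whole graph at once would still need the full load or a decomposition.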

Method 1

```
import networkx as nx
DG = nx.read_edgelist('social_train.csv', create_using=nx.DiGraph(), nodetype=int, delimiter=',')
```

Method 2

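```
import networkx as nx
import csv
import time

t0 = time.perf_counter()  # time.clock() was removed in Python 3.8
DG = nx.DiGraph()

# Assumed: the reader is opened on the same file as in Method 1.
with open('social_train.csv', newline='') as f:
    netcsv = csv.reader(f, delimiter=',')
    for row in netcsv:
        # Each row is "source,target"; add it as a directed edge.
        tmp1 = int(row[0])
        tmp2 = int(row[1])
        DG.add_edge(tmp1, tmp2)

print("Loaded in", time.perf_counter() - t0, "s")
```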

Below are the times to load different numbers of rows using the two methods on a 2.8 GHz quad-core machine with 3 GB of RAM. The second method seems quicker. Note that these are based on single loads and are intended as a guide rather than a rigorous analysis of the methods!

| Rows     | 1M   | 2M   | 3M    |
|----------|------|------|-------|
| Method 1 | 20 s | 53 s | 103 s |
| Method 2 | 15 s | 41 s | 86 s  |
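For anyone wanting to repeat this kind of measurement, a small harness along the following lines could time a load of only the first N rows. This is a sketch, not the script that produced the numbers above: `time_partial_load` is a hypothetical helper, and `add_edge` can be any callable that takes two node ids.

```python
import csv
import itertools
import time


def time_partial_load(lines, n_rows, add_edge):
    """Feed the first n_rows edges to add_edge; return (rows_read, seconds).

    `lines` is any iterable of "source,target" CSV lines, e.g. an open file.
    """
    reader = csv.reader(lines)
    start = time.perf_counter()
    rows_read = 0
    for row in itertools.islice(reader, n_rows):
        add_edge(int(row[0]), int(row[1]))
        rows_read += 1
    return rows_read, time.perf_counter() - start
```

With networkx, passing `DG.add_edge` for a fresh `DG = nx.DiGraph()` and an open handle on `social_train.csv` would time Method 2 on the first `n_rows` rows. Repeating each measurement a few times and averaging would give a fairer comparison than a single run.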

#  Ruby

##  Note on CSV Libraries

If you are using Ruby (like Jared) to load data in and out of CSV files, you should definitely try FasterCSV (require 'faster_csv') instead of the stock CSV (require 'csv'). For example, loading the adjacency list was ten times faster with FasterCSV than with the stock CSV.

```
require 'rubygems'
require 'faster_csv'
FasterCSV.foreach(filename, :quote_char => '"', :col_sep => ',', :row_sep => :auto) do |row|
  node_id = row.shift