Machine Learning/Kaggle Social Network Contest/load data

From Noisebridge

Latest revision as of 13:28, 23 November 2010


= R =

== igraph ==

The full dataset loads quickly using the R package igraph; with the full dataset loaded, R uses less than 900 MB of RAM.

Grab the package with:

<pre>
install.packages("igraph")
</pre>

Load the data using:

<pre>
data <- as.matrix(read.csv("social_train.csv", header = FALSE))
dg <- graph.edgelist(data, directed = TRUE)
</pre>

Note that the resulting graph contains an additional vertex with id zero. If you delete this vertex the vertex ids will not be preserved, so it is best to just leave it in; vertex zero has no edges.

= Python =

== How to load the network into networkx ==

There is a network analysis package for Python called [http://networkx.lanl.gov/ networkx]. This package can be installed using easy_install.

The network can be loaded using the [http://networkx.lanl.gov/reference/generated/networkx.read_edgelist.html read_edgelist] function in networkx or by manually adding edges.

NOTE: John found that it took about 5.5 GB of memory to load the entire network. We may need to process it in chunks, or perhaps decompose it into smaller sub-networks.
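One way to keep memory bounded is to stream the file and process fixed-size chunks of edges rather than holding the whole graph at once. A minimal sketch (assuming the same two-column `social_train.csv` layout; the chunk size is an arbitrary placeholder):

```python
import csv

def edge_chunks(path, chunk_size=1_000_000):
    """Yield the edge list as lists of (follower, followee) int pairs,
    at most chunk_size pairs at a time."""
    with open(path, newline='') as f:
        chunk = []
        for row in csv.reader(f):
            chunk.append((int(row[0]), int(row[1])))
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:  # final partial chunk
            yield chunk

# e.g. process each chunk without ever building the full graph:
# for chunk in edge_chunks('social_train.csv'):
#     update_counts(chunk)
```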

'''Method 1'''

<pre>
import networkx as nx
DG = nx.read_edgelist('social_train.csv', create_using=nx.DiGraph(), nodetype=int, delimiter=',')
</pre>

'''Method 2'''

<pre>
import networkx as nx
import csv
import time

t0 = time.clock()
DG = nx.DiGraph()

netcsv = csv.reader(open('social_train.csv', 'rb'), delimiter=',')

# add each edge manually, follower -> followee
for row in netcsv:
    tmp1 = int(row[0])
    tmp2 = int(row[1])
    DG.add_edge(tmp1, tmp2)

print "Loaded in ", str(time.clock() - t0), "s"
</pre>

Below are the times to load different numbers of rows using the two methods on a 2.8 GHz quad-core machine with 3 GB RAM. The second method seems quicker. Note that these are based on single runs and are intended as a guide rather than a rigorous analysis of the methods!

{| border="1"
|-
!| Rows
| 1M
| 2M
| 3M
|-
!| Method 1
| 20s
| 53s
| 103s
|-
!| Method 2
| 15s
| 41s
| 86s
|}
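Single-run timings like those above are easy to reproduce with a small harness; this sketch (the loader and file name are placeholders for whichever method you are measuring) times one call and returns both the result and the elapsed seconds:

```python
import time

def time_once(load_fn, *args):
    """Time a single call to load_fn(*args); return (result, seconds)."""
    t0 = time.perf_counter()
    result = load_fn(*args)
    return result, time.perf_counter() - t0

# e.g. compare the two networkx methods on the same file:
# DG, secs = time_once(nx.read_edgelist, 'social_train.csv')
```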


= Ruby =

== Note on CSV Libraries ==

If you happen to be using Ruby (like Jared) for loading data in and out of CSV files, you should definitely try [http://fastercsv.rubyforge.org/ FasterCSV] (require 'faster_csv') instead of the stock CSV (require 'csv'). For example, loading the adjacency list was literally ten times faster with FasterCSV than with the normal CSV.

== Loading Adjacency Lists ==

<pre>
require 'rubygems'
require 'faster_csv'

def load_adj_list_faster(filename)
  adj_list_hash = {}
  FasterCSV.foreach(filename, :quote_char => '"', :col_sep => ',', :row_sep => :auto) do |row|
    node_id = row.shift           # first column is the node id
    adj_list_hash[node_id] = row  # remaining columns are its neighbours
  end
  return adj_list_hash
end

adj_list_lookup = load_adj_list_faster('adj_list.out.csv')
rev_adj_list_lookup = load_adj_list_faster('reverse_adj_list.out.csv')
</pre>
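For comparison, an equivalent adjacency-list loader is a few lines with Python's standard csv module (assuming the same file layout: node id in the first column, neighbour ids in the remaining columns):

```python
import csv

def load_adj_list(filename):
    """Map each node id (first column) to its list of adjacent ids."""
    adj = {}
    with open(filename, newline='') as f:
        for row in csv.reader(f):
            node_id, *neighbours = row
            adj[node_id] = neighbours
    return adj
```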