Machine Learning/Kaggle Social Network Contest/load data
= R =
== igraph ==
The full dataset loads quickly with the R package igraph; with the full data set loaded, R uses less than 900MB of RAM.
Grab the package with:
<pre>
install.packages("igraph")
</pre>
Load the data using:
<pre>
data <- as.matrix(read.csv("social_train.csv", header = FALSE))
dg <- graph.edgelist(data, directed = TRUE)
</pre>
Note that the resulting graph contains an additional vertex with id zero, which has no edges. Deleting this vertex would renumber the remaining vertex ids, so it is best to leave it in place.
= Python =
== How to load the network into networkx ==
There is a network analysis package for Python called [http://networkx.lanl.gov/ networkx]. This package can be installed using easy_install.
The network can be loaded using the [http://networkx.lanl.gov/reference/generated/networkx.read_edgelist.html read_edgelist] function in networkx, or by manually adding edges.

NOTE: John found that it took about 5.5GB of memory to load the entire network. We may need to process it in chunks, or perhaps decompose it into smaller sub-networks.

'''Method 1'''
<pre>
import networkx as nx
DG = nx.read_edgelist('social_train.csv', create_using=nx.DiGraph(), nodetype=int, delimiter=',')
</pre>
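Given the memory warning above, it can be worth a cheap sanity pass over the edge file before committing to a full load. The sketch below uses only the standard library; <code>edge_file_stats</code> and <code>sample_edges.csv</code> are illustrative names (the tiny demo file stands in for the real <code>social_train.csv</code>):

```python
import csv

def edge_file_stats(filename):
    """Single streaming pass: count edges and distinct node ids."""
    nodes = set()
    edges = 0
    with open(filename, newline='') as f:
        for row in csv.reader(f):
            src, dst = int(row[0]), int(row[1])
            nodes.update((src, dst))
            edges += 1
    return edges, len(nodes)

# Tiny demo file standing in for social_train.csv
with open('sample_edges.csv', 'w') as f:
    f.write("1,2\n1,3\n2,3\n")

print(edge_file_stats('sample_edges.csv'))  # -> (3, 3)
```

Knowing the edge and node counts up front gives a rough sense of how large the in-memory graph will be before spending minutes (and gigabytes) loading it.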
'''Method 2'''
<pre>
import networkx as nx
import csv
import time

t0 = time.clock()
DG = nx.DiGraph()
netcsv = csv.reader(open('social_train.csv', 'rb'), delimiter=',')
for row in netcsv:
    tmp1 = int(row[0])
    tmp2 = int(row[1])
    DG.add_edge(tmp1, tmp2)
print "Loaded in ", str(time.clock() - t0), "s"
</pre>
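If the full graph will not fit in memory, many per-node statistics can be computed in one streaming pass without building a graph object at all. A minimal sketch (stdlib only; <code>out_degrees</code> and <code>sample_edges.csv</code> are illustrative, not contest files):

```python
import csv
from collections import Counter

def out_degrees(filename):
    """Out-degree per source node, computed without holding the graph in memory."""
    deg = Counter()
    with open(filename, newline='') as f:
        for row in csv.reader(f):
            deg[int(row[0])] += 1  # first column is the follower / edge source
    return deg

# Tiny demo file standing in for social_train.csv
with open('sample_edges.csv', 'w') as f:
    f.write("1,2\n1,3\n2,3\n")

deg = out_degrees('sample_edges.csv')
print(deg[1])  # -> 2 (node 1 has two outgoing edges in the demo file)
```

The same pattern works for in-degrees (count <code>row[1]</code> instead), which may be enough for simple baseline features without ever constructing the 5.5GB networkx graph.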
Below is the time to load different numbers of rows using the two methods on a 2.8GHz quad-core machine with 3GB of RAM. The second method seems quicker. Note that these figures are from single runs and are intended as a guide rather than a rigorous comparison of the two methods!
{| border="1"
|-
!|Rows
!|1M
!|2M
!|3M
|-
||Method 1
|| 20s
|| 53s
|| 103s
|-
||Method 2
|| 15s
|| 41s
|| 86s
|}
= Ruby =
== Note on CSV Libraries ==
If you happen to be using Ruby (like Jared) for loading data in and out of CSV files, you should definitely try [http://fastercsv.rubyforge.org/ FasterCSV] (require 'faster_csv') instead of the stock CSV library (require 'csv'). For example, loading the adjacency list was roughly ten times faster with FasterCSV than with the standard CSV library.
== Loading Adjacency Lists ==
<pre>
require 'rubygems'
require 'faster_csv'

# Load a CSV adjacency list into a hash of node_id => [adjacent node ids].
# The first column of each row is the node id; the rest are its neighbours.
def load_adj_list_faster(filename)
  adj_list_hash = {}
  FasterCSV.foreach(filename, :quote_char => '"', :col_sep => ',', :row_sep => :auto) do |row|
    node_id = row.shift
    adj_list_hash[node_id] = row
  end
  return adj_list_hash
end

adj_list_lookup = load_adj_list_faster('adj_list.out.csv')
rev_adj_list_lookup = load_adj_list_faster('reverse_adj_list.out.csv')
</pre>
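For comparison, the same adjacency-list load is a few lines with Python's built-in csv module. A sketch under the same assumptions (first column is the node id, remaining columns are neighbours); <code>load_adj_list</code> and <code>adj_sample.csv</code> are illustrative names standing in for the real <code>adj_list.out.csv</code> files:

```python
import csv

def load_adj_list(filename):
    """Build a dict of node_id -> list of adjacent node ids from a CSV file."""
    adj = {}
    with open(filename, newline='') as f:
        for row in csv.reader(f):
            adj[row[0]] = row[1:]  # first column: node id; rest: neighbours
    return adj

# Tiny demo file standing in for adj_list.out.csv
with open('adj_sample.csv', 'w') as f:
    f.write("1,2,3\n2,3\n")

print(load_adj_list('adj_sample.csv'))  # -> {'1': ['2', '3'], '2': ['3']}
```

As in the Ruby version, ids are kept as strings here; call <code>int()</code> on them if numeric keys are needed downstream.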
Latest revision as of 13:28, 23 November 2010