Machine Learning/Kaggle Social Network Contest/load data

= R =
== igraph ==
The full dataset loads quickly with the R package igraph. With the full dataset loaded, R uses less than 900 MB of RAM.
Install the package with:
<pre>
install.packages("igraph")
</pre>
Load the data using:
<pre>
library(igraph)
# Read the edge list as a matrix, one edge per row (the file has no header).
data <- as.matrix(read.csv("social_train.csv", header = FALSE))
dg <- graph.edgelist(data, directed = TRUE)
</pre>
Note that the resulting graph contains an additional vertex with id zero, which has no edges. If you delete this vertex, the vertex ids will not be preserved, so it is a good idea to leave it in.
= Python =
== How to load the network into networkx ==
There is a network analysis package for Python called [http://networkx.lanl.gov/ networkx]. This package can be installed using easy_install.
The network can be loaded using the [http://networkx.lanl.gov/reference/generated/networkx.read_edgelist.html read_edgelist] function in networkx or by manually adding edges.


NOTE: John found that loading the entire network took about 5.5 GB of memory. We may need to process it in chunks, or decompose it into smaller subnetworks.
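
One way to process the file in chunks is sketched below, using only the Python standard library. It assumes the file holds one comma-separated "source,target" pair per line with no header and that node ids are integers. The function names <code>iter_edge_chunks</code> and <code>out_degrees</code> and the default chunk size are made up for illustration.

```python
import csv
from collections import Counter
from itertools import islice


def iter_edge_chunks(path, chunk_size=100_000):
    """Yield lists of (source, target) pairs, at most chunk_size edges each."""
    with open(path, newline="") as handle:
        # Lazily parse rows; blank lines are skipped.
        edges = ((int(row[0]), int(row[1])) for row in csv.reader(handle) if row)
        while True:
            chunk = list(islice(edges, chunk_size))
            if not chunk:
                return
            yield chunk


def out_degrees(path, chunk_size=100_000):
    """Count outgoing edges per node while holding only one chunk in memory."""
    degrees = Counter()
    for chunk in iter_edge_chunks(path, chunk_size):
        degrees.update(source for source, _ in chunk)
    return degrees
```

Calling <code>out_degrees("social_train.csv")</code> would then give per-node out-degrees without building the whole graph. Other per-chunk statistics can be gathered in the same loop.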
'''Method 1'''
<pre>
import networkx as nx
</pre>
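
A minimal sketch of loading with <code>read_edgelist</code>, not necessarily the exact code used for the timings on this page. It assumes one comma-separated "source,target" pair per line, integer node ids, and a directed graph. The function name <code>load_with_read_edgelist</code> is hypothetical.

```python
import networkx as nx


def load_with_read_edgelist(path):
    """Load a directed graph from a 'source,target' edge list file."""
    # delimiter="," because the file is comma-separated (an assumption);
    # nodetype=int converts node labels from strings to integers.
    return nx.read_edgelist(
        path,
        delimiter=",",
        create_using=nx.DiGraph(),
        nodetype=int,
    )
```

For example, <code>g = load_with_read_edgelist("social_train.csv")</code> followed by <code>g.number_of_edges()</code> gives a quick check that the load worked.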


'''Method 2'''
<pre>
import networkx as nx
</pre>
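
A minimal sketch of adding edges manually, again not necessarily the code that was timed here. It assumes the same comma-separated "source,target" layout with integer ids. The function name <code>load_by_adding_edges</code> and its <code>max_rows</code> parameter are hypothetical; <code>max_rows</code> is only there to make partial loads easy.

```python
import csv

import networkx as nx


def load_by_adding_edges(path, max_rows=None):
    """Build a directed graph by reading the CSV and adding one edge per row."""
    graph = nx.DiGraph()
    with open(path, newline="") as handle:
        for index, row in enumerate(csv.reader(handle)):
            if max_rows is not None and index >= max_rows:
                break  # stop early when only the first max_rows rows are wanted
            if row:  # skip blank lines
                graph.add_edge(int(row[0]), int(row[1]))
    return graph
```

Nodes are created implicitly by <code>add_edge</code>, so there is no separate node-loading step.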


Below are the times to load different numbers of rows using the two methods on a 2.8 GHz quad-core machine with 3 GB RAM. The second method seems quicker. Note that these are based on single loads and are intended as a guide rather than a rigorous analysis of the methods!
{| border="1"
|-
!|Rows
| 86s
|}
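
Single-load timings of this kind can be collected with a small harness like the sketch below. It is an illustration only: <code>time_loader</code> and the stand-in loader <code>read_first_rows</code> are hypothetical names, and any function taking a path and a row limit could be passed in instead.

```python
import csv
import time
from itertools import islice


def read_first_rows(path, max_rows):
    """Stand-in loader: read the first max_rows rows of the CSV into a list."""
    with open(path, newline="") as handle:
        return [tuple(row) for row in islice(csv.reader(handle), max_rows)]


def time_loader(loader, path, row_counts):
    """Return {rows: seconds} from one timed call of loader per row count."""
    timings = {}
    for rows in row_counts:
        start = time.perf_counter()
        loader(path, rows)
        timings[rows] = time.perf_counter() - start
    return timings
```

Because each row count is loaded only once, the numbers are rough and will vary between runs.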
= Ruby =
== Note on CSV Libraries ==
If you happen to be using Ruby (like Jared) for loading data in and out of CSV files, you should definitely try [http://fastercsv.rubyforge.org/ FasterCSV] (require 'faster_csv') instead of the stock CSV library (require 'csv'). For example, loading the adjacency list was ten times faster with FasterCSV than with the stock CSV.
== Loading Adjacency Lists ==
<pre>
require 'rubygems'
require 'faster_csv'
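
# Load an adjacency-list CSV into a hash: node id => array of adjacent node ids.
# The first field of each row is the node id; the remaining fields are its neighbours.
# Ids are kept as the strings read from the file.
def load_adj_list_faster(filename)
  adj_list_hash = {}
  FasterCSV.foreach(filename, :quote_char => '"', :col_sep => ',', :row_sep => :auto) do |row|
    node_id = row.shift
    adj_list_hash[node_id] = row
  end
  adj_list_hash
end

adj_list_lookup = load_adj_list_faster('adj_list.out.csv')
rev_adj_list_lookup = load_adj_list_faster('reverse_adj_list.out.csv')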
</pre>