Machine Learning/Kaggle Social Network Contest
|Lit review||started||-||Lit Review|
|Load data||started||11/24||Load data|
|Choose problem representation||started||11/24||Problem Representation|
|Generate candidate features||0%||11/24||Features|
|fit to model||0%||Model|
|Win competition||0%||Prize Plan|
Official Contest Links
- Overview: http://kaggle.com/socialNetwork
- Data Details: http://kaggle.com/socialNetwork?viewtype=data (login required)
Official Data Downloads
- Official Training Data File: http://dl.dropbox.com/u/14895843/social-network-kaggle/social_train.csv
- Official Test Data File: http://dl.dropbox.com/u/14895843/social-network-kaggle/social_test.txt
- Official Sample Submission File: http://dl.dropbox.com/u/14895843/social-network-kaggle/sample_submission.csv
Key Contest Info
The data has been downloaded using the API of a social network. There are 7.2m contacts/edges of 38k users/nodes. These have been drawn randomly ensuring a certain level of closedness.
You are given 7,237,983 contacts/edges from a social network (social_train.zip). The first column is the outbound node and the second column is the inbound node. The ids have been encoded so that the users are anonymous. Ids reach from 1 to 1,133,547.
There are 37,689 outbound nodes and 1,133,518 inbound nodes. Most outbound nodes are also inbound nodes so that the total number of unique nodes is 1,133,547.
The way the contacts were sampled makes sure that the universe is roughly closed. Note that not every relationship is mutual.
The test dataset contains 8,960 edges from 8,960 unique outbound nodes (social_test.csv). Of those 4,480 are true and 4,480 are false edges. You are tasked to predict which are true (1) and which are false (0). You need to supply back a file with outbound node id,inbound node id,[0,1] in each row. This means you can assign a probability of being true to an edge. You are being scored on the AUC. A random model will have an AUC of 0.5, so you need to try to do better than that (ie have a higher AUC). Your entry should conform to the format in sample_submission.csv.
You are encouraged to explore techniques which explain the social network/graph. The best entrant should try to explain his approach/method to other users.
Don’t despair if your first couple of solutions score low, this is an explorative process.
Our Working Data Dumps
- Adjacency list based from the training data:
http://dl.dropbox.com/u/14895843/social-network-kaggle/adj_list.out.csv First column: outbound vertex Remaining columns: list of vertices to which it points Note: Useful when loaded up as a hashtable keyed on outbound vertex returning the list.
- Adjacency list of the reversed Graph:
http://dl.dropbox.com/u/14895843/social-network-kaggle/reverse_adj_list.out.csv First column: inbound vertex Remaining columns: list of vertices which point to it Note: This is useful if interested in following the edges backwards quickly. This is useful to load as a hashtable keyed on inbound vertex returning the list.
- Degree Features for all Nodes:
http://dl.dropbox.com/u/14895843/social-network-kaggle/node_degree_features.csv First column: Node Id Second column: Outbound Degree (count of the number of outbound edges from node) Third column: Inbound Degree (count of the number of inbound edges to node) Note: You can think of these as number of followees and followers (respectively). Additionally, note that only the first 32.7k rows have 'followees'
- [| Supervised Random Walks: Predicting and Recommending Links in Social Networks]
- Matrix Digraph Algs: http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/GraphAlgor/graphIntro.htm
- "Strongly Connected Components": http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/GraphAlgor/strongComponent.htm
- [| Another Google Book on Social Network Analysis]