Machine Learning/Kaggle Social Network Contest

From Noisebridge


Tasks                          Status   Target date  Subpage
Lit review                     started  -            Lit Review
Load data                      started  11/24        Load data
Describe network               started  11/24        Network Description
Choose problem representation  started  11/24        Problem Representation
Generate candidate features    0%       11/24        Features
Fit to model                   0%                    Model
Win competition                0%                    Prize Plan

Official Contest Links

Official Data Downloads

Key Contest Info

The data was downloaded through the API of a social network. It contains about 7.2 million contacts (edges) from about 38 thousand users (nodes), drawn at random in a way that ensures a certain level of closedness.

You are given 7,237,983 contacts/edges from a social network. The first column is the outbound node and the second column is the inbound node. The ids have been encoded so that the users are anonymous. Ids range from 1 to 1,133,547.

There are 37,689 outbound nodes and 1,133,518 inbound nodes. Most outbound nodes are also inbound nodes so that the total number of unique nodes is 1,133,547.
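The counts above can be checked against the edge list with a few set operations. This is a minimal sketch, assuming the edges are already available as (outbound, inbound) integer pairs; the function name `node_counts` and the toy edges are illustrative and not part of the contest materials.

```python
def node_counts(edges):
    """Return (n_outbound, n_inbound, n_unique) for an iterable of (src, dst) pairs."""
    outbound = set()
    inbound = set()
    for src, dst in edges:
        outbound.add(src)   # node that the edge leaves
        inbound.add(dst)    # node that the edge points to
    # A node that appears in both roles is counted once in the union.
    return len(outbound), len(inbound), len(outbound | inbound)


# Toy data only (hypothetical edges, not contest data).
toy_edges = [(1, 2), (1, 3), (2, 1), (3, 4)]
toy_counts = node_counts(toy_edges)
```

On the real training file the three numbers should come out as 37,689, 1,133,518 and 1,133,547 if the description above is accurate.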

The way the contacts were sampled makes sure that the universe is roughly closed. Note that not every relationship is mutual.

The test dataset contains 8,960 edges from 8,960 unique outbound nodes (social_test.csv). Of those, 4,480 are true and 4,480 are false edges. Your task is to predict which are true (1) and which are false (0). Submit a file in which each row has the form outbound node id,inbound node id,score, where the score lies in [0,1]; that is, you can assign each edge a probability of being true. Entries are scored on the AUC. A random model will have an AUC of 0.5, so you need to do better than that (i.e. achieve a higher AUC). Your entry should conform to the format in sample_submission.csv.
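To make the scoring concrete, here is a rough sketch of how AUC can be computed on a held-out set and how submission rows might be formatted. It uses the pairwise definition of AUC (the fraction of true/false pairs in which the true edge gets the higher score, with ties counted as one half). The function names `auc` and `submission_rows` are hypothetical, and whether the file needs a header row is not stated here, so check sample_submission.csv.

```python
def auc(labels, scores):
    """Pairwise AUC: P(score of a true edge > score of a false edge), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one true and one false edge")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def submission_rows(predictions):
    """Format (outbound id, inbound id, probability) triples as CSV rows."""
    # Assumption: no header row and plain decimal probabilities.
    return ["%d,%d,%.6f" % (src, dst, prob) for src, dst, prob in predictions]


# Toy data only (hypothetical labels and scores).
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.1]
toy_auc = auc(toy_labels, toy_scores)
toy_rows = submission_rows([(1, 2, 0.9), (3, 4, 0.25)])
```

The double loop is quadratic, which is still manageable for 4,480 true and 4,480 false edges; a rank-based formula would give the same value faster. Because only the ordering of the scores matters for AUC, any monotone score works as well as a calibrated probability.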

You are encouraged to explore techniques which explain the social network/graph. The best entrant should try to explain their approach/method to other users.

Don’t despair if your first couple of solutions score low; this is an exploratory process.

Our Working Data Dumps

  • Adjacency list built from the training data:
  First column: outbound vertex
  Remaining columns: list of vertices to which it points
  Note: Useful when loaded up as a hashtable keyed on outbound vertex returning the list.
  • Adjacency list of the reversed Graph:
  First column: inbound vertex
  Remaining columns: list of vertices which point to it
  Note: Useful for following the edges backwards quickly, when loaded up as a hashtable keyed on inbound vertex returning the list.
  • Degree Features for all Nodes:
  First column: Node Id
  Second column: Outbound Degree (count of the number of outbound edges from node)
  Third column: Inbound Degree (count of the number of inbound edges to node)
  Note: You can think of these as the numbers of followees and followers, respectively.
        Additionally, note that only the first 32.7k rows have 'followees' (a nonzero outbound degree).
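The three dumps above can be sketched in a few lines of Python. This is an illustration rather than the script that produced the files: the function names are hypothetical, and the comma separator in `parse_adjacency_line` is an assumption, since the delimiter of the dump files is not stated here.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build forward and reversed adjacency hashtables from (src, dst) pairs."""
    forward = defaultdict(list)   # outbound vertex -> vertices it points to
    reverse = defaultdict(list)   # inbound vertex -> vertices pointing to it
    for src, dst in edges:
        forward[src].append(dst)
        reverse[dst].append(src)
    return dict(forward), dict(reverse)


def parse_adjacency_line(line, sep=","):
    """Parse one dump row: first field is the key vertex, the rest its neighbours."""
    # Assumption: integer ids separated by `sep`.
    fields = [int(x) for x in line.strip().split(sep) if x]
    return fields[0], fields[1:]


def degree_table(forward, reverse):
    """Map node id -> (outbound degree, inbound degree) for every node seen."""
    nodes = set(forward) | set(reverse)
    return {n: (len(forward.get(n, [])), len(reverse.get(n, [])))
            for n in sorted(nodes)}


# Toy data only (hypothetical edges, not contest data).
toy_edges = [(1, 2), (1, 3), (2, 1), (3, 4)]
toy_forward, toy_reverse = build_adjacency(toy_edges)
toy_degrees = degree_table(toy_forward, toy_reverse)
```

Keeping both hashtables in memory makes neighbourhood lookups in either direction a single dictionary access, which is convenient when generating features such as common followers or followees for a candidate edge.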

Useful Links