Machine Learning/Kaggle Social Network Contest/Features

From Noisebridge
< Machine Learning | Kaggle Social Network Contest(Difference between revisions)
Jump to: navigation, search
(Joe's attempt)
 
(7 intermediate revisions by 2 users not shown)
Line 3: Line 3:
  
 
== Possible Features ==
 
== Possible Features ==
*nodeid
+
*Node Features
*nodetofollowid
+
**nodeid
*median path length
+
**outdegree
*shortest distance from nodeid to nodetofollowid
+
**indegree
*inbound edges
+
**local clustering coefficient
*outbound edges
+
**reciprocation of inbound probability (num of edges returned / num of inbound edges)
*clustering coefficient
+
**reciprocation of outbound probability (num of edges returned / num of outbound edges)
*reciprocation probability (num of edges returned / num of outbound edges)
+
  
The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future
+
*Edge Features
 +
**nodetofollowid
 +
**shortest distance nodeid to nodetofollowid
 +
**density? (<strike>median path length</strike>)
 +
**does reverse edge exist? (aka is nodetofollowid following nodeid?)
 +
**number of common friends
 +
**indegrees & outdegrees of nodetofollowid
  
From the Backstrom and Leskovec, for a node s and a potential target c
 
 
* Network features
 
* Network features
 
** unweighted random walk score
 
** unweighted random walk score
 +
** global clustering coefficient
 
** Adamic-Adar score
 
** Adamic-Adar score
** number of common friends
+
*** see [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.1370&rep=rep1&type=pdf original paper]
** indegrees and outdegrees of s
+
*** R igraph: [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/similarity.html similarity.invlogweighted]
*** the indegree is the number of edges coming into node s
+
 
*** the outdegree is the number of edges leaving node s
+
* Clustering
** indegrees and outdegrees of c
+
** membership of the same strongly connected cluster
 +
*** using [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/clusters.html igraph clusters]
 +
 
 +
The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future
 +
 
 +
== Joe's attempt ==
 +
I'm planning on collecting features based on an edge. Then sample the features over existing and randomly created edges and fit a logistic regression model to it.
 +
 
 +
For an edge from node s to node t I will calculate:
 +
# is there a directed edge from t to s?
 +
# the in-degree of s
 +
# the out-degree of s
 +
# the in-degree of t
 +
# the out-degree of t
 +
# RLD<sub>-1</sub>(s)
 +
# RLD<sub>1</sub>(s)
 +
# RLD<sub>0</sub>(s)
 +
# RLD<sub>-1</sub>(t)
 +
# RLD<sub>1</sub>(t)
 +
# RLD<sub>0</sub>(t)
 +
#AA<sub>0</sub><sup>1</sup>(s,t)
 +
# AA<sub>0</sub><sup>1.5</sup>(s,t)
 +
# AA<sub>0</sub><sup>2</sup>(s,t)
 +
# AA<sub>-1</sub><sup>1</sup>(s,t)
 +
# AA<sub>-1</sub><sup>1.5</sup>(s,t)
 +
# AA<sub>-1</sub><sup>2</sup>(s,t)
 +
# AA<sub>1</sub><sup>1</sup>(s,t)
 +
# AA<sub>1</sub><sup>1.5</sup>(s,t)
 +
# AA<sub>1</sub><sup>2</sup>(s,t)
 +
 
 +
where
 +
* RLD<sub>x</sub>(n) is 1 / log(0.1 + the x-degree of node n), where -1 = in, 1 = out and 0 = any. (RLD = reciprocal log of degree )
 +
** note that I add 0.1 so that nodes with degree 1 have a score of  1/log(1.1) = 10.49 rather than1/log(1) which is a divide by zero
 +
** logs are taken to base e
 +
 
 +
I define N<sub>x</sub><sup>h</sup>(n) to be the nodes reachable from ''n'' in ''h'' hops along either any edge (x = 0), edges from t towards s (x = -1)  or edges from s towards t (x = 1).
 +
 
 +
I define C<sub>x</sub><sup>h</sup>(s,t) as the set of common neighbours of s and t a distance of h hops from s and t, excluding nodes in a closer common neighbourhood ie
 +
* C<sub>x</sub><sup>h</sup>(s,t) = (N<sub>x</sub><sup>h</sup>(s) ∩ N<sub>-x</sub><sup>h</sup>(t)) \ ∪<sub>h' < h </sub>(N<sub>x</sub><sup>h'</sup>(s) ∩ N<sub>-x</sub><sup>h'</sup>(t))
 +
** h = 1.5 corresponds to nodes which are one hop from either s or t and two hops from either t or s
 +
* The sets C<sub>x</sub><sup>h</sup>(s,t) are distinct for different h.
 +
* It is directional, ie sometimes C<sub>x</sub><sup>h</sup>(s,t)≠C<sub>x</sub><sup>h</sup>(t,s)
 +
* AA is the Adamic-Adar score calculated over different common neighbourhoods.
 +
** the subscript 0, -1, 1 referes to neighbours reachable be following any, in or out node respectively
 +
** the superscript 1, 1.5 and 2 refer to the the number of hops from a focal node the neighbour is.
 +
* AA<sub>x</sub><sup>h</sup>(s,t) = sum<sub>n ∈ C<sub>x</sub><sup>h</sup>(s,t)</sub> RLD<sub>0</sub>(n)

Latest revision as of 16:50, 25 November 2010

[edit] TODO

  • Precisely define the listed features

[edit] Possible Features

  • Node Features
    • nodeid
    • outdegree
    • indegree
    • local clustering coefficient
    • reciprocation of inbound probability (num of edges returned / num of inbound edges)
    • reciprocation of outbound probability (num of edges returned / num of outbound edges)
  • Edge Features
    • nodetofollowid
    • shortest distance nodeid to nodetofollowid
    • density? (median path length)
    • does reverse edge exist? (aka is nodetofollowid following nodeid?)
    • number of common friends
    • indegrees & outdegrees of nodetofollowid
  • Clustering

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future

[edit] Joe's attempt

I'm planning on collecting features based on an edge. Then sample the features over existing and randomly created edges and fit a logistic regression model to it.

For an edge from node s to node t I will calculate:

  1. is there a directed edge from t to s?
  2. the in-degree of s
  3. the out-degree of s
  4. the in-degree of t
  5. the out-degree of t
  6. RLD-1(s)
  7. RLD1(s)
  8. RLD0(s)
  9. RLD-1(t)
  10. RLD1(t)
  11. RLD0(t)
  12. AA01(s,t)
  13. AA01.5(s,t)
  14. AA02(s,t)
  15. AA-11(s,t)
  16. AA-11.5(s,t)
  17. AA-12(s,t)
  18. AA11(s,t)
  19. AA11.5(s,t)
  20. AA12(s,t)

where

  • RLDx(n) is 1 / log(0.1 + the x-degree of node n), where -1 = in, 1 = out and 0 = any. (RLD = reciprocal log of degree )
    • note that I add 0.1 so that nodes with degree 1 have a score of 1/log(1.1) = 10.49 rather than1/log(1) which is a divide by zero
    • logs are taken to base e

I define Nxh(n) to be the nodes reachable from n in h hops along either any edge (x = 0), edges from t towards s (x = -1) or edges from s towards t (x = 1).

I define Cxh(s,t) as the set of common neighbours of s and t a distance of h hops from s and t, excluding nodes in a closer common neighbourhood ie

  • Cxh(s,t) = (Nxh(s) ∩ N-xh(t)) \ ∪h' < h (Nxh'(s) ∩ N-xh'(t))
    • h = 1.5 corresponds to nodes which are one hop from either s or t and two hops from either t or s
  • The sets Cxh(s,t) are distinct for different h.
  • It is directional, ie sometimes Cxh(s,t)≠Cxh(t,s)
  • AA is the Adamic-Adar score calculated over different common neighbourhoods.
    • the subscript 0, -1, 1 referes to neighbours reachable be following any, in or out node respectively
    • the superscript 1, 1.5 and 2 refer to the the number of hops from a focal node the neighbour is.
  • AAxh(s,t) = sumn ∈ Cxh(s,t) RLD0(n)
Personal tools