Machine Learning/Kaggle Social Network Contest/Features

From Noisebridge
Jump to: navigation, search


  • Precisely define the listed features

Possible Features

  • Node Features
    • nodeid
    • outdegree
    • indegree
    • local clustering coefficient
    • reciprocation of inbound probability (num of edges returned / num of inbound edges)
    • reciprocation of outbound probability (num of edges returned / num of outbound edges)
  • Edge Features
    • nodetofollowid
    • shortest distance nodeid to nodetofollowid
    • density? (median path length)
    • does reverse edge exist? (aka is nodetofollowid following nodeid?)
    • number of common friends
    • indegrees & outdegrees of nodetofollowid
  • Clustering

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future

Joe's attempt

I'm planning on collecting features based on an edge. Then sample the features over existing and randomly created edges and fit a logistic regression model to it.

For an edge from node s to node t I will calculate:

  1. the in-degree of s
  2. the out-degree of s
  3. the in-degree of t
  4. the out-degree of t
  5. RLD-1(s)
  6. RLD1(s)
  7. RLD0(s)
  8. RLD-1(t)
  9. RLD1(t)
  10. RLD0(t)
  11. AA01(s,t)
  12. AA01.5(s,t)
  13. AA02(s,t)
  14. AA-11(s,t)
  15. AA-11.5(s,t)
  16. AA-12(s,t)
  17. AA11(s,t)
  18. AA11.5(s,t)
  19. AA12(s,t)


  • RLDx(n) is 1 / log(the x-degree of node n), where -1 = in, 1 = out and 0 = any. (RLD = reciprocal log of degree )

I define Nxh(n) to be the nodes reachable from n in h hops along either any edge (x = 0), edges from t towards s (x = -1) or edges from s towards t (x = 1).

I define Cxh(s,t) as the set of common neighbours of s and t a distance of h hops from s and t, excluding nodes in a closer common neighbourhood ie

  • Cxh(s,t) = (Nxh(s) ∩ N-xh(t)) \ ∪h' < h (Nxh'(s) ∩ N-xh'(t))
    • h = 1.5 corresponds to nodes which are one hop from either s or t and two hops from either t or s
  • The sets Cxh(s,t) are distinct for different h.
  • It is directional, ie sometimes Cxh(s,t)≠Cxh(t,s)
  • AA is the Adamic-Adar score calculated over different common neighbourhoods.
    • the subscript 0, -1, 1 referes to neighbours reachable be following any, in or out node respectively
    • the superscript 1, 1.5 and 2 refer to the the number of hops from a focal node the neighbour is.
  • AAxh(s,t) = sumn ∈ Cxh(s,t) RLD0(n)
Personal tools