Machine Learning/Kaggle Social Network Contest/Features: Difference between revisions

From Noisebridge
Jump to navigation Jump to search
 
(3 intermediate revisions by the same user not shown)
Line 25: Line 25:
*** see [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.1370&rep=rep1&type=pdf original paper]
*** see [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.1370&rep=rep1&type=pdf original paper]
*** R igraph: [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/similarity.html similarity.invlogweighted]
*** R igraph: [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/similarity.html similarity.invlogweighted]
* Clustering
** membership of the same strongly connected cluster
*** using [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/clusters.html igraph clusters]


The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future
The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future
== Joe's attempt ==
I'm planning on collecting features based on an edge. Then sample the features over existing and randomly created edges and fit a logistic regression model to it.
For an edge from node s to node t I will calculate:
# is there a directed edge from t to s?
# the in-degree of s
# the out-degree of s
# the in-degree of t
# the out-degree of t
# RLD<sub>-1</sub>(s)
# RLD<sub>1</sub>(s)
# RLD<sub>0</sub>(s)
# RLD<sub>-1</sub>(t)
# RLD<sub>1</sub>(t)
# RLD<sub>0</sub>(t)
#AA<sub>0</sub><sup>1</sup>(s,t)
# AA<sub>0</sub><sup>1.5</sup>(s,t)
# AA<sub>0</sub><sup>2</sup>(s,t)
# AA<sub>-1</sub><sup>1</sup>(s,t)
# AA<sub>-1</sub><sup>1.5</sup>(s,t)
# AA<sub>-1</sub><sup>2</sup>(s,t)
# AA<sub>1</sub><sup>1</sup>(s,t)
# AA<sub>1</sub><sup>1.5</sup>(s,t)
# AA<sub>1</sub><sup>2</sup>(s,t)
where
* RLD<sub>x</sub>(n) is 1 / log(0.1 + the x-degree of node n), where -1 = in, 1 = out and 0 = any. (RLD = reciprocal log of degree )
** note that I add 0.1 so that nodes with degree 1 have a score of  1/log(1.1) = 10.49 rather than1/log(1) which is a divide by zero
** logs are taken to base e
I define N<sub>x</sub><sup>h</sup>(n) to be the nodes reachable from ''n'' in ''h'' hops along either any edge (x = 0), edges from t towards s (x = -1)  or edges from s towards t (x = 1).
I define C<sub>x</sub><sup>h</sup>(s,t) as the set of common neighbours of s and t a distance of h hops from s and t, excluding nodes in a closer common neighbourhood ie
* C<sub>x</sub><sup>h</sup>(s,t) = (N<sub>x</sub><sup>h</sup>(s) ∩ N<sub>-x</sub><sup>h</sup>(t)) \ ∪<sub>h' < h </sub>(N<sub>x</sub><sup>h'</sup>(s) ∩ N<sub>-x</sub><sup>h'</sup>(t))
** h = 1.5 corresponds to nodes which are one hop from either s or t and two hops from either t or s
* The sets C<sub>x</sub><sup>h</sup>(s,t) are distinct for different h.
* It is directional, ie sometimes C<sub>x</sub><sup>h</sup>(s,t)≠C<sub>x</sub><sup>h</sup>(t,s)
* AA is the Adamic-Adar score calculated over different common neighbourhoods.
** the subscript 0, -1, 1 referes to neighbours reachable be following any, in or out node respectively
** the superscript 1, 1.5 and 2 refer to the the number of hops from a focal node the neighbour is.
* AA<sub>x</sub><sup>h</sup>(s,t) = sum<sub>n ∈ C<sub>x</sub><sup>h</sup>(s,t)</sub> RLD<sub>0</sub>(n)

Latest revision as of 16:50, 25 November 2010

TODO[edit]

  • Precisely define the listed features

Possible Features[edit]

  • Node Features
    • nodeid
    • outdegree
    • indegree
    • local clustering coefficient
    • reciprocation of inbound probability (num of edges returned / num of inbound edges)
    • reciprocation of outbound probability (num of edges returned / num of outbound edges)
  • Edge Features
    • nodetofollowid
    • shortest distance nodeid to nodetofollowid
    • density? (median path length)
    • does reverse edge exist? (aka is nodetofollowid following nodeid?)
    • number of common friends
    • indegrees & outdegrees of nodetofollowid
  • Clustering

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future

Joe's attempt[edit]

I'm planning on collecting features based on an edge. Then sample the features over existing and randomly created edges and fit a logistic regression model to it.

For an edge from node s to node t I will calculate:

  1. is there a directed edge from t to s?
  2. the in-degree of s
  3. the out-degree of s
  4. the in-degree of t
  5. the out-degree of t
  6. RLD-1(s)
  7. RLD1(s)
  8. RLD0(s)
  9. RLD-1(t)
  10. RLD1(t)
  11. RLD0(t)
  12. AA01(s,t)
  13. AA01.5(s,t)
  14. AA02(s,t)
  15. AA-11(s,t)
  16. AA-11.5(s,t)
  17. AA-12(s,t)
  18. AA11(s,t)
  19. AA11.5(s,t)
  20. AA12(s,t)

where

  • RLDx(n) is 1 / log(0.1 + the x-degree of node n), where -1 = in, 1 = out and 0 = any. (RLD = reciprocal log of degree )
    • note that I add 0.1 so that nodes with degree 1 have a score of 1/log(1.1) = 10.49 rather than1/log(1) which is a divide by zero
    • logs are taken to base e

I define Nxh(n) to be the nodes reachable from n in h hops along either any edge (x = 0), edges from t towards s (x = -1) or edges from s towards t (x = 1).

I define Cxh(s,t) as the set of common neighbours of s and t a distance of h hops from s and t, excluding nodes in a closer common neighbourhood ie

  • Cxh(s,t) = (Nxh(s) ∩ N-xh(t)) \ ∪h' < h (Nxh'(s) ∩ N-xh'(t))
    • h = 1.5 corresponds to nodes which are one hop from either s or t and two hops from either t or s
  • The sets Cxh(s,t) are distinct for different h.
  • It is directional, ie sometimes Cxh(s,t)≠Cxh(t,s)
  • AA is the Adamic-Adar score calculated over different common neighbourhoods.
    • the subscript 0, -1, 1 referes to neighbours reachable be following any, in or out node respectively
    • the superscript 1, 1.5 and 2 refer to the the number of hops from a focal node the neighbour is.
  • AAxh(s,t) = sumn ∈ Cxh(s,t) RLD0(n)