Machine Learning/Kaggle Social Network Contest/Features: Difference between revisions

Latest revision as of 16:50, 25 November 2010

TODO[edit]

Precisely define the listed features

Possible Features[edit]

Node Features
- nodeid
- outdegree
- indegree
- local clustering coefficient
- reciprocation of inbound probability (num of edges returned / num of inbound edges)
- reciprocation of outbound probability (num of edges returned / num of outbound edges)

Edge Features
- nodetofollowid
- shortest distance nodeid to nodetofollowid
- density? (~~median path length~~)
- does reverse edge exist? (aka is nodetofollowid following nodeid?)
- number of common friends
- indegrees & outdegrees of nodetofollowid

Network features
- unweighted random walk score
- global clustering coefficient
- Adamic-Adar score
  - see original paper
  - R igraph: similarity.invlogweighted

Clustering
- membership of the same strongly connected cluster
  - using igraph clusters

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future

Joe's attempt[edit]

I'm planning on collecting features based on an edge. Then sample the features over existing and randomly created edges and fit a logistic regression model to it.

For an edge from node s to node t I will calculate:

is there a directed edge from t to s?
the in-degree of s
the out-degree of s
the in-degree of t
the out-degree of t
RLD_-1(s)
RLD₁(s)
RLD₀(s)
RLD_-1(t)
RLD₁(t)
RLD₀(t)
AA₀¹(s,t)
AA₀^1.5(s,t)
AA₀²(s,t)
AA_-1¹(s,t)
AA_-1^1.5(s,t)
AA_-1²(s,t)
AA₁¹(s,t)
AA₁^1.5(s,t)
AA₁²(s,t)

where

RLD_x(n) is 1 / log(0.1 + the x-degree of node n), where -1 = in, 1 = out and 0 = any. (RLD = reciprocal log of degree )
- note that I add 0.1 so that nodes with degree 1 have a score of 1/log(1.1) = 10.49 rather than1/log(1) which is a divide by zero
- logs are taken to base e

I define N_x^h(n) to be the nodes reachable from n in h hops along either any edge (x = 0), edges from t towards s (x = -1) or edges from s towards t (x = 1).

I define C_x^h(s,t) as the set of common neighbours of s and t a distance of h hops from s and t, excluding nodes in a closer common neighbourhood ie

C_x^h(s,t) = (N_x^h(s) ∩ N_-x^h(t)) \ ∪_{h' < h}(N_x^h'(s) ∩ N_-x^h'(t))
- h = 1.5 corresponds to nodes which are one hop from either s or t and two hops from either t or s
The sets C_x^h(s,t) are distinct for different h.
It is directional, ie sometimes C_x^h(s,t)≠C_x^h(t,s)
AA is the Adamic-Adar score calculated over different common neighbourhoods.
- the subscript 0, -1, 1 referes to neighbours reachable be following any, in or out node respectively
- the superscript 1, 1.5 and 2 refer to the the number of hops from a focal node the neighbour is.
AA_x^h(s,t) = sum_{n ∈ C_x^h(s,t)} RLD₀(n)

@@ Line 1: / Line 1: @@
+== TODO ==
+* Precisely define the listed features
 == Possible Features ==
-*nodeid
+*Node Features
-*nodetofollowid
+**nodeid
-*median path length
+**outdegree
-*shortest distance from nodeid to nodetofollowid
+**indegree
-*inbound edges
+**local clustering coefficient
-*outbound edges
+**reciprocation of inbound probability (num of edges returned / num of inbound edges)
-*clustering coefficient
+**reciprocation of outbound probability (num of edges returned / num of outbound edges)
-*reciprocation probability (num of edges returned / num of outbound edges)
+*Edge Features
+**nodetofollowid
+**shortest distance nodeid to nodetofollowid
+**density? (<strike>median path length</strike>)
+**does reverse edge exist? (aka is nodetofollowid following nodeid?)
+**number of common friends
+**indegrees & outdegrees of nodetofollowid
+* Network features
+** unweighted random walk score
+** global clustering coefficient
+** Adamic-Adar score
+*** see [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.1370&rep=rep1&type=pdf original paper]
+*** R igraph: [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/similarity.html similarity.invlogweighted]
+* Clustering
+** membership of the same strongly connected cluster
+*** using [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/clusters.html igraph clusters]
 The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future
+== Joe's attempt ==
+I'm planning on collecting features based on an edge. Then sample the features over existing and randomly created edges and fit a logistic regression model to it.
+For an edge from node s to node t I will calculate:
+# is there a directed edge from t to s?
+# the in-degree of s
+# the out-degree of s
+# the in-degree of t
+# the out-degree of t
+# RLD<sub>-1</sub>(s)
+# RLD<sub>1</sub>(s)
+# RLD<sub>0</sub>(s)
+# RLD<sub>-1</sub>(t)
+# RLD<sub>1</sub>(t)
+# RLD<sub>0</sub>(t)
+#AA<sub>0</sub><sup>1</sup>(s,t)
+# AA<sub>0</sub><sup>1.5</sup>(s,t)
+# AA<sub>0</sub><sup>2</sup>(s,t)
+# AA<sub>-1</sub><sup>1</sup>(s,t)
+# AA<sub>-1</sub><sup>1.5</sup>(s,t)
+# AA<sub>-1</sub><sup>2</sup>(s,t)
+# AA<sub>1</sub><sup>1</sup>(s,t)
+# AA<sub>1</sub><sup>1.5</sup>(s,t)
+# AA<sub>1</sub><sup>2</sup>(s,t)
+where
+* RLD<sub>x</sub>(n) is 1 / log(0.1 + the x-degree of node n), where -1 = in, 1 = out and 0 = any. (RLD = reciprocal log of degree )
+** note that I add 0.1 so that nodes with degree 1 have a score of  1/log(1.1) = 10.49 rather than1/log(1) which is a divide by zero
+** logs are taken to base e
+I define N<sub>x</sub><sup>h</sup>(n) to be the nodes reachable from ''n'' in ''h'' hops along either any edge (x = 0), edges from t towards s (x = -1)  or edges from s towards t (x = 1).
+I define C<sub>x</sub><sup>h</sup>(s,t) as the set of common neighbours of s and t a distance of h hops from s and t, excluding nodes in a closer common neighbourhood ie
+* C<sub>x</sub><sup>h</sup>(s,t) = (N<sub>x</sub><sup>h</sup>(s) ∩ N<sub>-x</sub><sup>h</sup>(t)) \ ∪<sub>h' < h </sub>(N<sub>x</sub><sup>h'</sup>(s) ∩ N<sub>-x</sub><sup>h'</sup>(t))
+** h = 1.5 corresponds to nodes which are one hop from either s or t and two hops from either t or s
+* The sets C<sub>x</sub><sup>h</sup>(s,t) are distinct for different h.
+* It is directional, ie sometimes C<sub>x</sub><sup>h</sup>(s,t)≠C<sub>x</sub><sup>h</sup>(t,s)
+* AA is the Adamic-Adar score calculated over different common neighbourhoods.
+** the subscript 0, -1, 1 referes to neighbours reachable be following any, in or out node respectively
+** the superscript 1, 1.5 and 2 refer to the the number of hops from a focal node the neighbour is.
+* AA<sub>x</sub><sup>h</sup>(s,t) = sum<sub>n ∈ C<sub>x</sub><sup>h</sup>(s,t)</sub> RLD<sub>0</sub>(n)

Machine Learning/Kaggle Social Network Contest/Features: Difference between revisions

Latest revision as of 16:50, 25 November 2010

TODO[edit]

Possible Features[edit]

Joe's attempt[edit]

Navigation menu

Search