Machine Learning/Kaggle Social Network Contest/Features: Difference between revisions
Jump to navigation
Jump to search
(3 intermediate revisions by the same user not shown) | |||
Line 25: | Line 25: | ||
*** see [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.1370&rep=rep1&type=pdf original paper] | *** see [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.1370&rep=rep1&type=pdf original paper] | ||
*** R igraph: [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/similarity.html similarity.invlogweighted] | *** R igraph: [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/similarity.html similarity.invlogweighted] | ||
* Clustering | |||
** membership of the same strongly connected cluster | |||
*** using [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/clusters.html igraph clusters] | |||
The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future | The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future | ||
== Joe's attempt == | |||
I'm planning on collecting features based on an edge. Then sample the features over existing and randomly created edges and fit a logistic regression model to it. | |||
For an edge from node s to node t I will calculate: | |||
# is there a directed edge from t to s? | |||
# the in-degree of s | |||
# the out-degree of s | |||
# the in-degree of t | |||
# the out-degree of t | |||
# RLD<sub>-1</sub>(s) | |||
# RLD<sub>1</sub>(s) | |||
# RLD<sub>0</sub>(s) | |||
# RLD<sub>-1</sub>(t) | |||
# RLD<sub>1</sub>(t) | |||
# RLD<sub>0</sub>(t) | |||
#AA<sub>0</sub><sup>1</sup>(s,t) | |||
# AA<sub>0</sub><sup>1.5</sup>(s,t) | |||
# AA<sub>0</sub><sup>2</sup>(s,t) | |||
# AA<sub>-1</sub><sup>1</sup>(s,t) | |||
# AA<sub>-1</sub><sup>1.5</sup>(s,t) | |||
# AA<sub>-1</sub><sup>2</sup>(s,t) | |||
# AA<sub>1</sub><sup>1</sup>(s,t) | |||
# AA<sub>1</sub><sup>1.5</sup>(s,t) | |||
# AA<sub>1</sub><sup>2</sup>(s,t) | |||
where | |||
* RLD<sub>x</sub>(n) is 1 / log(0.1 + the x-degree of node n), where -1 = in, 1 = out and 0 = any. (RLD = reciprocal log of degree ) | |||
** note that I add 0.1 so that nodes with degree 1 have a score of 1/log(1.1) = 10.49 rather than1/log(1) which is a divide by zero | |||
** logs are taken to base e | |||
I define N<sub>x</sub><sup>h</sup>(n) to be the nodes reachable from ''n'' in ''h'' hops along either any edge (x = 0), edges from t towards s (x = -1) or edges from s towards t (x = 1). | |||
I define C<sub>x</sub><sup>h</sup>(s,t) as the set of common neighbours of s and t a distance of h hops from s and t, excluding nodes in a closer common neighbourhood ie | |||
* C<sub>x</sub><sup>h</sup>(s,t) = (N<sub>x</sub><sup>h</sup>(s) ∩ N<sub>-x</sub><sup>h</sup>(t)) \ ∪<sub>h' < h </sub>(N<sub>x</sub><sup>h'</sup>(s) ∩ N<sub>-x</sub><sup>h'</sup>(t)) | |||
** h = 1.5 corresponds to nodes which are one hop from either s or t and two hops from either t or s | |||
* The sets C<sub>x</sub><sup>h</sup>(s,t) are distinct for different h. | |||
* It is directional, ie sometimes C<sub>x</sub><sup>h</sup>(s,t)≠C<sub>x</sub><sup>h</sup>(t,s) | |||
* AA is the Adamic-Adar score calculated over different common neighbourhoods. | |||
** the subscript 0, -1, 1 referes to neighbours reachable be following any, in or out node respectively | |||
** the superscript 1, 1.5 and 2 refer to the the number of hops from a focal node the neighbour is. | |||
* AA<sub>x</sub><sup>h</sup>(s,t) = sum<sub>n ∈ C<sub>x</sub><sup>h</sup>(s,t)</sub> RLD<sub>0</sub>(n) |
Latest revision as of 16:50, 25 November 2010
TODO[edit]
- Precisely define the listed features
Possible Features[edit]
- Node Features
- nodeid
- outdegree
- indegree
- local clustering coefficient
- reciprocation of inbound probability (num of edges returned / num of inbound edges)
- reciprocation of outbound probability (num of edges returned / num of outbound edges)
- Edge Features
- nodetofollowid
- shortest distance nodeid to nodetofollowid
- density? (
median path length) - does reverse edge exist? (aka is nodetofollowid following nodeid?)
- number of common friends
- indegrees & outdegrees of nodetofollowid
- Network features
- unweighted random walk score
- global clustering coefficient
- Adamic-Adar score
- see original paper
- R igraph: similarity.invlogweighted
- Clustering
- membership of the same strongly connected cluster
- using igraph clusters
- membership of the same strongly connected cluster
The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future
Joe's attempt[edit]
I'm planning on collecting features based on an edge. Then sample the features over existing and randomly created edges and fit a logistic regression model to it.
For an edge from node s to node t I will calculate:
- is there a directed edge from t to s?
- the in-degree of s
- the out-degree of s
- the in-degree of t
- the out-degree of t
- RLD-1(s)
- RLD1(s)
- RLD0(s)
- RLD-1(t)
- RLD1(t)
- RLD0(t)
- AA01(s,t)
- AA01.5(s,t)
- AA02(s,t)
- AA-11(s,t)
- AA-11.5(s,t)
- AA-12(s,t)
- AA11(s,t)
- AA11.5(s,t)
- AA12(s,t)
where
- RLDx(n) is 1 / log(0.1 + the x-degree of node n), where -1 = in, 1 = out and 0 = any. (RLD = reciprocal log of degree )
- note that I add 0.1 so that nodes with degree 1 have a score of 1/log(1.1) = 10.49 rather than1/log(1) which is a divide by zero
- logs are taken to base e
I define Nxh(n) to be the nodes reachable from n in h hops along either any edge (x = 0), edges from t towards s (x = -1) or edges from s towards t (x = 1).
I define Cxh(s,t) as the set of common neighbours of s and t a distance of h hops from s and t, excluding nodes in a closer common neighbourhood ie
- Cxh(s,t) = (Nxh(s) ∩ N-xh(t)) \ ∪h' < h (Nxh'(s) ∩ N-xh'(t))
- h = 1.5 corresponds to nodes which are one hop from either s or t and two hops from either t or s
- The sets Cxh(s,t) are distinct for different h.
- It is directional, ie sometimes Cxh(s,t)≠Cxh(t,s)
- AA is the Adamic-Adar score calculated over different common neighbourhoods.
- the subscript 0, -1, 1 referes to neighbours reachable be following any, in or out node respectively
- the superscript 1, 1.5 and 2 refer to the the number of hops from a focal node the neighbour is.
- AAxh(s,t) = sumn ∈ Cxh(s,t) RLD0(n)