# Machine Learning/Kaggle Social Network Contest/Features

From Noisebridge

< Machine Learning | Kaggle Social Network Contest(Difference between revisions)

(Created page with '== Possible Features == *nodeid *nodetofollowid *median path length *shortest distance from nodeid to nodetofollowid *inbound edges *outbound edges *clustering coefficient *recip…') |
(→Joe's attempt) |
||

(8 intermediate revisions by 2 users not shown) | |||

Line 1: | Line 1: | ||

+ | == TODO == | ||

+ | * Precisely define the listed features | ||

+ | |||

== Possible Features == | == Possible Features == | ||

− | *nodeid | + | *Node Features |

− | * | + | **nodeid |

− | * | + | **outdegree |

− | * | + | **indegree |

− | *inbound edges | + | **local clustering coefficient |

− | + | **reciprocation of inbound probability (num of edges returned / num of inbound edges) | |

− | * | + | **reciprocation of outbound probability (num of edges returned / num of outbound edges) |

− | *reciprocation probability (num of edges returned / num of outbound edges) | + | |

+ | *Edge Features | ||

+ | **nodetofollowid | ||

+ | **shortest distance nodeid to nodetofollowid | ||

+ | **density? (<strike>median path length</strike>) | ||

+ | **does reverse edge exist? (aka is nodetofollowid following nodeid?) | ||

+ | **number of common friends | ||

+ | **indegrees & outdegrees of nodetofollowid | ||

+ | |||

+ | * Network features | ||

+ | ** unweighted random walk score | ||

+ | ** global clustering coefficient | ||

+ | ** Adamic-Adar score | ||

+ | *** see [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.1370&rep=rep1&type=pdf original paper] | ||

+ | *** R igraph: [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/similarity.html similarity.invlogweighted] | ||

+ | |||

+ | * Clustering | ||

+ | ** membership of the same strongly connected cluster | ||

+ | *** using [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/clusters.html igraph clusters] | ||

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future | The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future | ||

+ | |||

+ | == Joe's attempt == | ||

+ | I'm planning on collecting features based on an edge. Then sample the features over existing and randomly created edges and fit a logistic regression model to it. | ||

+ | |||

+ | For an edge from node s to node t I will calculate: | ||

+ | # is there a directed edge from t to s? | ||

+ | # the in-degree of s | ||

+ | # the out-degree of s | ||

+ | # the in-degree of t | ||

+ | # the out-degree of t | ||

+ | # RLD<sub>-1</sub>(s) | ||

+ | # RLD<sub>1</sub>(s) | ||

+ | # RLD<sub>0</sub>(s) | ||

+ | # RLD<sub>-1</sub>(t) | ||

+ | # RLD<sub>1</sub>(t) | ||

+ | # RLD<sub>0</sub>(t) | ||

+ | #AA<sub>0</sub><sup>1</sup>(s,t) | ||

+ | # AA<sub>0</sub><sup>1.5</sup>(s,t) | ||

+ | # AA<sub>0</sub><sup>2</sup>(s,t) | ||

+ | # AA<sub>-1</sub><sup>1</sup>(s,t) | ||

+ | # AA<sub>-1</sub><sup>1.5</sup>(s,t) | ||

+ | # AA<sub>-1</sub><sup>2</sup>(s,t) | ||

+ | # AA<sub>1</sub><sup>1</sup>(s,t) | ||

+ | # AA<sub>1</sub><sup>1.5</sup>(s,t) | ||

+ | # AA<sub>1</sub><sup>2</sup>(s,t) | ||

+ | |||

+ | where | ||

+ | * RLD<sub>x</sub>(n) is 1 / log(0.1 + the x-degree of node n), where -1 = in, 1 = out and 0 = any. (RLD = reciprocal log of degree ) | ||

+ | ** note that I add 0.1 so that nodes with degree 1 have a score of 1/log(1.1) = 10.49 rather than1/log(1) which is a divide by zero | ||

+ | ** logs are taken to base e | ||

+ | |||

+ | I define N<sub>x</sub><sup>h</sup>(n) to be the nodes reachable from ''n'' in ''h'' hops along either any edge (x = 0), edges from t towards s (x = -1) or edges from s towards t (x = 1). | ||

+ | |||

+ | I define C<sub>x</sub><sup>h</sup>(s,t) as the set of common neighbours of s and t a distance of h hops from s and t, excluding nodes in a closer common neighbourhood ie | ||

+ | * C<sub>x</sub><sup>h</sup>(s,t) = (N<sub>x</sub><sup>h</sup>(s) ∩ N<sub>-x</sub><sup>h</sup>(t)) \ ∪<sub>h' < h </sub>(N<sub>x</sub><sup>h'</sup>(s) ∩ N<sub>-x</sub><sup>h'</sup>(t)) | ||

+ | ** h = 1.5 corresponds to nodes which are one hop from either s or t and two hops from either t or s | ||

+ | * The sets C<sub>x</sub><sup>h</sup>(s,t) are distinct for different h. | ||

+ | * It is directional, ie sometimes C<sub>x</sub><sup>h</sup>(s,t)≠C<sub>x</sub><sup>h</sup>(t,s) | ||

+ | * AA is the Adamic-Adar score calculated over different common neighbourhoods. | ||

+ | ** the subscript 0, -1, 1 referes to neighbours reachable be following any, in or out node respectively | ||

+ | ** the superscript 1, 1.5 and 2 refer to the the number of hops from a focal node the neighbour is. | ||

+ | * AA<sub>x</sub><sup>h</sup>(s,t) = sum<sub>n ∈ C<sub>x</sub><sup>h</sup>(s,t)</sub> RLD<sub>0</sub>(n) |

## Latest revision as of 16:50, 25 November 2010

## [edit] TODO

- Precisely define the listed features

## [edit] Possible Features

- Node Features
- nodeid
- outdegree
- indegree
- local clustering coefficient
- reciprocation of inbound probability (num of edges returned / num of inbound edges)
- reciprocation of outbound probability (num of edges returned / num of outbound edges)

- Edge Features
- nodetofollowid
- shortest distance nodeid to nodetofollowid
- density? (
~~median path length~~) - does reverse edge exist? (aka is nodetofollowid following nodeid?)
- number of common friends
- indegrees & outdegrees of nodetofollowid

- Network features
- unweighted random walk score
- global clustering coefficient
- Adamic-Adar score
- see original paper
- R igraph: similarity.invlogweighted

- Clustering
- membership of the same strongly connected cluster
- using igraph clusters

- membership of the same strongly connected cluster

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future

## [edit] Joe's attempt

I'm planning on collecting features based on an edge. Then sample the features over existing and randomly created edges and fit a logistic regression model to it.

For an edge from node s to node t I will calculate:

- is there a directed edge from t to s?
- the in-degree of s
- the out-degree of s
- the in-degree of t
- the out-degree of t
- RLD
_{-1}(s) - RLD
_{1}(s) - RLD
_{0}(s) - RLD
_{-1}(t) - RLD
_{1}(t) - RLD
_{0}(t) - AA
_{0}^{1}(s,t) - AA
_{0}^{1.5}(s,t) - AA
_{0}^{2}(s,t) - AA
_{-1}^{1}(s,t) - AA
_{-1}^{1.5}(s,t) - AA
_{-1}^{2}(s,t) - AA
_{1}^{1}(s,t) - AA
_{1}^{1.5}(s,t) - AA
_{1}^{2}(s,t)

where

- RLD
_{x}(n) is 1 / log(0.1 + the x-degree of node n), where -1 = in, 1 = out and 0 = any. (RLD = reciprocal log of degree )- note that I add 0.1 so that nodes with degree 1 have a score of 1/log(1.1) = 10.49 rather than1/log(1) which is a divide by zero
- logs are taken to base e

I define N_{x}^{h}(n) to be the nodes reachable from *n* in *h* hops along either any edge (x = 0), edges from t towards s (x = -1) or edges from s towards t (x = 1).

I define C_{x}^{h}(s,t) as the set of common neighbours of s and t a distance of h hops from s and t, excluding nodes in a closer common neighbourhood ie

- C
_{x}^{h}(s,t) = (N_{x}^{h}(s) ∩ N_{-x}^{h}(t)) \ ∪_{h' < h }(N_{x}^{h'}(s) ∩ N_{-x}^{h'}(t))- h = 1.5 corresponds to nodes which are one hop from either s or t and two hops from either t or s

- The sets C
_{x}^{h}(s,t) are distinct for different h. - It is directional, ie sometimes C
_{x}^{h}(s,t)≠C_{x}^{h}(t,s) - AA is the Adamic-Adar score calculated over different common neighbourhoods.
- the subscript 0, -1, 1 referes to neighbours reachable be following any, in or out node respectively
- the superscript 1, 1.5 and 2 refer to the the number of hops from a focal node the neighbour is.

- AA
_{x}^{h}(s,t) = sum_{n ∈ Cxh(s,t)}RLD_{0}(n)