# Machine Learning/Kaggle Social Network Contest/Features

(Difference between revisions)

## TODO

• Precisely define the listed features

## Possible Features

• Node Features
• nodeid
• outdegree
• indegree
• local clustering coefficient
• reciprocation of inbound probability (num of edges returned / num of inbound edges)
• reciprocation of outbound probability (num of edges returned / num of outbound edges)
• Edge Features
• nodetofollowid
• shortest distance nodeid to nodetofollowid
• density? (median path length)
• does reverse edge exist? (aka is nodetofollowid following nodeid?)
• number of common friends
• indegrees & outdegrees of nodetofollowid
• Clustering

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future

## Joe's attempt

I'm planning on collecting features based on an edge. Then sample the features over existing and randomly created edges and fit a logistic regression model to it.

For an edge from node s to node t I will calculate:

1. the in-degree of s
2. the out-degree of s
3. the in-degree of t
4. the out-degree of t
5. RLD-1(s)
6. RLD1(s)
7. RLD0(s)
8. RLD-1(t)
9. RLD1(t)
10. RLD0(t)
11. AA01(s,t)
12. AA01.5(s,t)
13. AA02(s,t)
14. AA-11(s,t)
15. AA-11.5(s,t)
16. AA-12(s,t)
17. AA11(s,t)
18. AA11.5(s,t)
19. AA12(s,t)

where

• RLDx(n) is 1 / log(the x-degree of node n), where -1 = in, 1 = out and 0 = any. (RLD = reciprocal log of degree )

I define Nxh(n) to be the nodes reachable from n in h hops along either any edge (x = 0), edges from t towards s (x = -1) or edges from s towards t (x = 1).

I define Cxh(s,t) as the set of common neighbours of s and t a distance of h hops from s and t, excluding nodes in a closer common neighbourhood ie

• Cxh(s,t) = (Nxh(s) ∩ N-xh(t)) \ ∪h' < h (Nxh'(s) ∩ N-xh'(t))
• h = 1.5 corresponds to nodes which are one hop from either s or t and two hops from either t or s
• The sets Cxh(s,t) are distinct for different h.
• It is directional, ie sometimes Cxh(s,t)≠Cxh(t,s)
• AA is the Adamic-Adar score calculated over different common neighbourhoods.
• the subscript 0, -1, 1 referes to neighbours reachable be following any, in or out node respectively
• the superscript 1, 1.5 and 2 refer to the the number of hops from a focal node the neighbour is.
• AAxh(s,t) = sumn ∈ Cxh(s,t) RLD0(n)