[ml] SocialPageRank
Shahin Saneinejad
ssaneine at gmail.com
Wed Dec 1 23:36:55 PST 2010
I've posted a bare-bones pagerank script in R here:
https://github.com/shahin/kaggle-socialnetwork/blob/master/pagerank.r
I ran it like this:
>> cat pagerank.r | R64 --vanilla
Can anyone verify my implementation? I'm pretty confident that it converges
as it should for irreducible graphs, but I'm not sure that I implemented the
dampening factor correctly. The number of iterations is hard-coded -- no
convergence detection here yet.
Also, I couldn't find a link to the group source repository on the wiki. Let
me know if I should be putting this stuff somewhere else.
Thanks.
Shahin
On Fri, Nov 26, 2010 at 8:08 PM, Shahin Saneinejad <ssaneine at gmail.com>wrote:
> Cool, thanks a lot for the writeup Jared. I'll do it in R and we can
> check our work.
>
> It's not clear to me that we'll get the convergence we want if we
> remove the dampening factor. Our transition matrix won't be
> irreducible (Joe found 27 disconnected subgraphs, yeah?) so I don't
> think we're guaranteed a unique stationary distribution [1]. It might
> be hard to verify the implementation without this. I haven't gone
> through the proofs, so if this isn't usually a problem for convergence
> then I hope someone will correct me.
>
> In any case, I think we do want our random walk to have a small
> non-zero chance of crossing disconnected subgraphs because their
> number falls from 27 to 22 when the test edges are added in. Hopefully
> a constant dampening factor won't slow it down much. There's probably
> no extra I/O involved.
>
> Comparing PR(t=i) and PR(t=i-1) to check for convergence makes sense
> to me. I might also check to see how much the relative rank order
> (rather than the absolute probabilities) changes between iterations.
> [1] says they used 25 iterations, although their initial state was
> weird.
>
> Shahin
>
> [1] "Topic Sensitive PageRank", T. Haveliwala,
>
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.110.1183&rep=rep1&type=pdf
>
>
> On Friday, November 26, 2010, Jared Dunne <jareddunne at gmail.com> wrote:
> > Shahin-
> >
> > Quasi-PageRank meaning:
> > - Semantically, It's not the actual PageRank algorithmn a la Google,
> since that's patented and so its doubtful that I've implemented exactly
> their PageRank alg. Also, PageRank is a strange name for this feature in
> our domain, so I think SocialRank makes more sense.
> > - Semantics aside, I also removed the dampening factor, since I suspect
> that's only needed when your random walk will not visit all nodes. Since I
> can update all Nodes' values with each iteration, I left it out since it
> would make each iteration have a slower effect in updating values. Or in
> other words, I've set d=1.0. For nodes with no inbound edges, this means
> that their PR is 0.0 after one iteration, instead of approaching zero over
> time.
> >
> > It's not in R. It's in Ruby and dumping the output to csv after each
> iteration. I'd be happy to reformat my output to something yummy for R to
> gobble up if csv is not ideal.
> >
> > I'm not taking the approach Mike presented (power method), but rather an
> iterative method that should result in same calculations. More or less
> described in the "Iterative" section here:
> http://en.wikipedia.org/wiki/PageRank#Computation (but like I said with
> dampening removed). "The basic mathematical operations performed in the
> iterative method and the power method are identical."
> >
> > Another caveat is that I'm equally initializing PR(t=0) for each node at
> equal values (1/N where N=number of nodes) instead of something related to
> number of inbound edges versus outbound as Mike's talk suggested. This is
> me keeping things simple to start, so I can verify correctness more easier.
> I can always rerun with different initial values if we determine that's
> better.
> >
> > I've calculated 50 iterations deep so far, which isn't computationally
> difficult with my code (under an hour without any dumping, under two hours
> if dumping all iterations). However, I think I have a correctness issue
> that I need to fix and then re-run. Namely, I was keeping my PR values in
> one hash and updating it as I iterate but i think I need to keep separate
> hashes, one for the current timestep and another for the previous timestep.
> This would unfairly diminish the PR values of the nodes that are iterated on
> last. In theory since, my order changes with each iteration, this won't
> have a major effect, but it's easy enough for me to correct, so I will.
> >
> > More to follow later this week once I fix my bug and rerun... At that
> time I'll post my code and sample data. That said, in the meantime if you
> want to try your hand at it in R using power method, go for it! We can
> always compare results to confirm we are finding the same nodes to be most
> "important". (So far node #146725 has the highest SocialRank value in our
> network, while #887961 is the lowest (non-zero) PR)
> >
> > One thing I could use help with is knowing when I can stop iterating....
> Any ideas on that? I'll be able to compare PR(t=i) and PR(t=i-1) for a
> given node in constant time, if that helps...
> >
> > Jared-
> >
> > On Fri, Nov 26, 2010 at 1:42 PM, Shahin Saneinejad <ssaneine at gmail.com>
> wrote:
> >
> > Hey Jared,
> > Is this in R? Are you using power iteration? I'm considering both these
> approaches for memory footprint and code simplicity reasons, respectively
> (my laptop is a little long in the tooth).
> >
> >
> > Also, what's "quasi" about your quasi-PageRank feature?
> > Shahin
> >
> > On Thu, Nov 25, 2010 at 4:11 PM, Jared Dunne <jareddunne at gmail.com>
> wrote:
> > I've implemented an iterative approach to quasi-PageRank feature
> > generation for all nodes. Each iteration across all nodes is under a
> > minute each time. I still need to fix some things with my approach and
> > possibly get some help sanity-checking the correctness of my algorithm.
> > I'll post more later, but I just wanted to let everyone know that I'm
> > making good progress on the PageRank feature generation, so that we
> > aren't duplicating our efforts. Gonna pick this back up later this
> > weekend. Enjoy your feasts.
> >
> > Jared-
> >
> > On Wed, Nov 24, 2010 at 6:23 PM, <mike at mindmech.com> wrote:
> >
> >
> >
> >
> >
> >
> > I've shared SocialPageRank <
> https://docs.google.com/present/edit?id=0AbXOdbbcuXxAZGNidmNnNjJfMGRkNmg4OGR2&hl=en&invite=CO_7wcEI
> >
> >
> >
> >
> > Message from mike at mindmech.com:
> > Here's the presentation I'll be giving tonight on Random Walks, PageRank,
> and Social Net Analysis.
> >
> >
> > Click to open:
> > SocialPageRank <
> https://docs.google.com/present/edit?id=0AbXOdbbcuXxAZGNidmNnNjJfMGRkNmg4OGR2&hl=en&invite=CO_7wcEI
> >
> >
> >
> >
> >
> > Google Docs makes it easy to create, store and share online documents,
> spreadsheets and presentations.
> >
> > _______________________________________________
> > ml mailing list
> > ml at lists.noisebridge.net
> > https://www.noisebridge.net/mailman/listinfo/ml
> >
> >
> >
> > _______________________________________________
> > ml mailing list
> > ml at lists.noisebridge.net
> > https://www.noisebridge.net/mailman/listinfo/ml
> >
> >
> >
> >
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.noisebridge.net/pipermail/ml/attachments/20101201/3635f4e0/attachment.htm
More information about the ml
mailing list