[ml] new kaggle competition

Jared Dunne jareddunne at gmail.com
Fri Nov 12 00:14:41 PST 2010


Hi everyone-

I've posted an adjacency list built from the training data:
http://dl.dropbox.com/u/14895843/social-network-kaggle/adj_list.out.csv
First column: outbound vertex
Remaining columns: list of vertices to which it points

I also created a reversed adjacency list (for tracing backwards along the
edges):
http://dl.dropbox.com/u/14895843/social-network-kaggle/reverse_adj_list.out.csv
First column: inbound vertex
Remaining columns: list of vertices which point to it
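
For anyone who wants to get started quickly, here is a minimal Python sketch of reading either file into a dictionary. It assumes each row is a plain comma-separated list of integer vertex ids with no header row, which may not hold for every row of the real files. The name `load_adj_list` and the sample rows are made up for illustration.

```python
import csv
import io


def load_adj_list(lines):
    """Parse rows of 'vertex,neighbor,neighbor,...' into {vertex: [neighbors]}.

    Assumes integer ids and no header row (assumptions, adjust as needed).
    """
    adj = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        vertex = int(row[0])
        # Ignore empty cells such as those left by a trailing comma.
        adj.setdefault(vertex, []).extend(int(v) for v in row[1:] if v)
    return adj


# Tiny made-up sample in the same layout as adj_list.out.csv.
sample = io.StringIO("1,2,3\n2,3\n4,1\n")
adj = load_adj_list(sample)
print(adj)
```

For the real files, something like `with open("adj_list.out.csv", newline="") as f: adj = load_adj_list(f)` should work, since the function accepts any iterable of lines. The same function reads the reversed file, where the values are the vertices pointing at the key.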

Jared-

On Thu, Nov 11, 2010 at 11:51 AM, Jared Dunne <jareddunne at gmail.com> wrote:

> There seemed to be a lot of interest in the competition last night.  We
> sorta splintered off into group discussions about the competition, but never
> really reconvened before Erin's talk started.  Maybe we should report back
> over the mailing list on what thoughts everyone had about the competition?
>
> Theo and I discussed two main areas...
>
> Process:
> - We shouldn't have a single approach to solving the problem.  If people
> have ideas they should run with them and report back their success/failure
> to the group.  The collaboration between our diverse
> ideas/approaches/experiences will be our strength in working together.
> - Since this is throwaway code for this competition only, we need not get
> hung up on efficiency or elegant implementations.  That said, if we hit a
> point where our code is not able to perform fast enough then we can address
> it at that point, instead of overengineering from the get-go.
> - Theo suggested that we start by using things like python/ruby scripts to
> massage the starting data set into something more useful (with more
> features), then analyse and visualize that using things like R.
> - I'm wondering if people think it's legit to use the mailing list for
> discussion or if we should create a discussion list for the competition to
> avoid spamming the main list with competition collaboration?
> - Also, as we transform the dataset into different views, we are going to
> end up with some large files that we will be passing around to each other.
> Any suggestions on how to best do that? ML git repo?
>
> Strategy (these are just brainstorming-level ideas):
> - The dataset forms a graph of directed edges between vertices.  At the
> core of this problem will be performing analysis on that graph.  The first
> intuitive approach that came to mind was that the shorter the distance
> between two vertices using existing edges, the more likely it would be that
> an edge could/should exist between those vertices.
> - After the talk, Erin, Theo, and I stumbled on the idea that some vertices
> might be uber-followers (meaning more outbound edges than the average
> vertex) and that some vertices might be uber-followees (meaning more inbound
> edges than average).  This reminded me of PageRank for link graphs, so
> perhaps we can draw from techniques in that vein.  The application of this
> in our problem might be in weighting, since people who follow lots of people
> might be more likely to follow someone further out in their "network", whereas
> someone who doesn't follow many people might be less likely to follow someone
> outside their "network".
> - Since the edges are directional, we know that it's possible for people to
> "follow" someone without that person "following back".  At first glance it
> might make sense that the reverse edges would be likely in cases like this.
> However, consider a "hub" user with lots of followers who doesn't reciprocate
> with edges back to his followers: the information on who follows him is
> less important in determining who he would follow.  Conversely, for a user
> who commonly reciprocates with followbacks, the information on who
> follows her might be useful in suggesting who she would follow.
>
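
The distance and followback ideas above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the graph is held as two dictionaries (`adj` for outbound edges, `rev` for inbound edges), and the function names, the `max_hops` cutoff, and the toy graph are all hypothetical rather than anything the group has agreed on.

```python
from collections import deque


def hop_distance(adj, source, target, max_hops=3):
    """Breadth-first search along outbound edges.

    Returns the number of edges on the shortest directed path from source
    to target, or None if target is not reachable within max_hops.
    """
    if source == target:
        return 0
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        vertex, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adj.get(vertex, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return None


def followback_rate(adj, rev, vertex):
    """Fraction of a vertex's followers that it follows back (None if no followers)."""
    followers = rev.get(vertex, ())
    if not followers:
        return None
    following = set(adj.get(vertex, ()))
    return sum(1 for f in followers if f in following) / len(followers)


# Made-up toy graph: adj maps follower -> followees, rev is the reverse.
adj = {1: [2], 2: [3, 1], 3: [4], 4: []}
rev = {2: [1], 3: [2], 1: [2], 4: [3]}
print(hop_distance(adj, 1, 4))       # 3 hops: 1 -> 2 -> 3 -> 4
print(followback_rate(adj, rev, 2))  # 1.0: vertex 2 follows back its only follower
print(followback_rate(adj, rev, 3))  # 0.0: vertex 3 does not follow vertex 2 back
```

With the same two dictionaries, out-degree is just `len(adj.get(v, ()))` and in-degree is `len(rev.get(v, ()))`, which may be enough to spot the uber-followers and uber-followees before trying anything PageRank-like.
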
> Update:
> - Last night I started thinking about this as a graph theory problem and
> started researching techniques.  This section seemed useful for getting
> started:
> http://en.wikipedia.org/wiki/Graph_theory#Graph-theoretic_data_structures
> - The data provided by kaggle is basically an "incidence list".  Theo and I
> discussed converting the provided data into a form that maps outbound vertices
> to their list of inbound/target vertices, which it turns out is called an
> "adjacency list".
> - I wrote some ruby code last night to generate an adjacency list from the
> original training data.  I dumped it to CSV format where the first column in
> a row is the outbound vertex, and all following columns for a given row are
> the list of inbound vertices pointed to by the outbound vertex's edges.  I
> can upload that somewhere once we figure out the best spot to hand off
> things like this...
>
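
A rough Python sketch of the conversion described above, for anyone who wants to reproduce it. This is not the Ruby script mentioned above. It assumes the training data can be read as (source, target) pairs, and the function names and sample edges are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def build_adj_lists(edges):
    """Return (adj, rev) dicts built from an iterable of (source, target) pairs."""
    adj = defaultdict(list)
    rev = defaultdict(list)
    for source, target in edges:
        adj[source].append(target)   # outbound vertex -> vertices it points to
        rev[target].append(source)   # inbound vertex -> vertices pointing to it
    return dict(adj), dict(rev)


def write_adj_csv(adj, out):
    """Write one row per vertex: the vertex first, then its neighbors."""
    writer = csv.writer(out, lineterminator="\n")
    for vertex in sorted(adj):
        writer.writerow([vertex] + adj[vertex])


# Made-up edges standing in for the training data.
edges = [(1, 2), (1, 3), (2, 3), (4, 1)]
adj, rev = build_adj_lists(edges)
buf = io.StringIO()
write_adj_csv(adj, buf)
print(buf.getvalue())
```

Rows come out ragged (one column per neighbor), so whatever reads the file back should not expect a fixed number of columns.
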
> So what wonderful ideas were happening on the other side of the room prior
> to Erin's talk?
>
> Jared-
>
>
> On Wed, Nov 10, 2010 at 2:32 PM, Joe Hale <joe at jjhale.com> wrote:
>
>> Hey,
>>
>> I'll be going along to Noisebridge at 7.30 and will start having a look at
>> the social network data in the 45 min before Erin's talk.
>>
>> Laters,
>>
>> Joe
>>
>>
>> On 10 November 2010 13:19, Mike Schachter <mike at mindmech.com> wrote:
>>
>>> Awesome everyone! Just so you know, I won't be in tonight or
>>> next week, please keep me informed via email list and wiki about
>>> what's going on if you can,
>>>
>>>   mike
>>>
>>>
>>>
>>> On Wed, Nov 10, 2010 at 11:44 AM, Shahin Saneinejad <ssaneine at gmail.com> wrote:
>>>
>>>> Hey, I'd really like to help but there's no way I can make it to the
>>>> meeting tonight. My schedule's otherwise flexible in case everyone's open to
>>>> meeting at a different time this week for the competition. If not, maybe I
>>>> can catch up via project wiki notes or something.
>>>>
>>>> Shahin
>>>>
>>>>
>>>> On Wed, Nov 10, 2010 at 11:11 AM, mnsqerr <mnsqerr at webmail.co.za> wrote:
>>>>
>>>>> Mike,
>>>>> This sounds really fun.  Let's do it!
>>>>>
>>>>>
>>>>> The link you posted is not working for me, here is a working link:
>>>>> http://kaggle.com/component/taskmaster/?view=competition&task_id=2464
>>>>>
>>>>>
>>>>>
>>>>> -Erin
>>>>>
>>>>> _______________________________________________
>>>>> ml mailing list
>>>>> ml at lists.noisebridge.net
>>>>> https://www.noisebridge.net/mailman/listinfo/ml
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>