[ml] new kaggle competition
jareddunne at gmail.com
Fri Nov 12 00:14:41 PST 2010
I've posted an adjacency list built from the training data:
First column: outbound vertex
Remaining columns: list of vertices to which it points
I also created a reversed adjacency list (for tracing backwards along the edges):
First column: inbound vertex
Remaining columns: list of vertices which point to it
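A minimal ruby sketch of how both lists can be built from the raw edge list and dumped in the CSV layout described above. The `[[follower, followee], ...]` input shape and the file paths are assumptions; the real training file may differ:

```ruby
require 'csv'

# Build forward and reversed adjacency lists from an array of
# [from, to] directed edges.
def build_adjacency(edges)
  forward = Hash.new { |h, k| h[k] = [] }  # vertex => vertices it points to
  reverse = Hash.new { |h, k| h[k] = [] }  # vertex => vertices pointing to it
  edges.each do |from, to|
    forward[from] << to
    reverse[to] << from
  end
  [forward, reverse]
end

# Dump in the layout above: first column is the vertex, remaining
# columns are its neighbors.
def dump_adjacency(adj, path)
  CSV.open(path, 'w') do |csv|
    adj.each { |v, neighbors| csv << [v, *neighbors] }
  end
end
```

Usage would be something like `forward, reverse = build_adjacency(CSV.read('train.csv'))` followed by two `dump_adjacency` calls.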
On Thu, Nov 11, 2010 at 11:51 AM, Jared Dunne <jareddunne at gmail.com> wrote:
> There seemed to be a lot of interest in the competition last night. We
> sorta splintered off into group discussions about the competition, but never
> really reconvened before Erin's talk started. Maybe we should report back
> over the mailing list on what thoughts everyone had about the competition?
> Theo and I discussed two main areas...
> - We shouldn't have a single approach to solving the problem. If people
> have ideas they should run with them and report back their success/failure
> to the group. The collaboration between our diverse
> ideas/approaches/experiences will be our strength in working together.
> - Since this is throw away code for this competition only, we need not get
> hung up on efficiency or elegant implementations. That said, if we hit a
> point where our code is not able to perform fast enough then we can address
> it at that point, instead of overengineering from the get-go.
> - Theo suggested that we start by using things like python/ruby scripts to
> massage the starting data set into something more useful (with more
> features), then analyse and visualize that using things like R.
> - I'm wondering if people think it's legit to use the mailing list for
> discussion, or if we should create a separate discussion list for the
> competition to prevent spamming the main list with competition collaboration?
> - Also, as we transform the dataset into different views, we are going to
> end up with some large files that we will be passing around to each other.
> Any suggestions on how to best do that? ML git repo?
> Strategy (these are just brainstorming-level ideas so far):
> - The dataset forms a graph of directed edges between vertices. At the
> core of this problem will be performing analysis on that graph. The first
> intuitive approach that came to mind was that the shorter the distance
> between two vertices using existing edges, the more likely it is that
> an edge could/should exist between those vertices.
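The distance intuition above could be scored with a plain BFS over the adjacency list, ranking candidate pairs by how few hops already separate them. This is only a sketch (the depth cutoff of 4 is an arbitrary assumption to keep the search cheap):

```ruby
# Directed hop distance from source to target via breadth-first search.
# `adjacency` maps each vertex to the array of vertices it points to.
# Returns nil if target is unreachable within max_depth hops.
def hop_distance(adjacency, source, target, max_depth = 4)
  return 0 if source == target
  seen = { source => true }
  frontier = [source]
  (1..max_depth).each do |depth|
    next_frontier = []
    frontier.each do |v|
      (adjacency[v] || []).each do |w|
        return depth if w == target
        next if seen[w]
        seen[w] = true
        next_frontier << w
      end
    end
    frontier = next_frontier
  end
  nil
end
```

A lower return value would then translate into a higher predicted likelihood of an edge.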
> - After the talk, Erin, Theo, and I stumbled on the idea that some vertices
> might be uber-followers (meaning more outbound edges than the average
> vertex) and that some vertices might be uber-followees (meaning more inbound
> edges than average). This reminded me of PageRank for link graphs, so
> perhaps we can draw from techniques in that vein. The application of this
> to our problem might be in weighting, since people who follow lots of people
> might be more likely to follow someone further out in their "network", whereas
> someone who doesn't follow many people might be less likely to follow someone
> outside their "network".
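The uber-follower/uber-followee idea starts with per-vertex in- and out-degree counts, which fall straight out of the edge list. How the degrees feed into a weighting scheme is still an open question; this just computes the raw counts:

```ruby
# Count outbound and inbound edges per vertex from an array of
# [from, to] directed edges.
def degrees(edges)
  out_deg = Hash.new(0)  # candidate "uber-followers" have high out-degree
  in_deg  = Hash.new(0)  # candidate "uber-followees" have high in-degree
  edges.each do |from, to|
    out_deg[from] += 1
    in_deg[to]    += 1
  end
  [out_deg, in_deg]
end
```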
> - Since the edges are directional, we know that it's possible for people to
> "follow" someone without that person "following back". At first glance it
> might make sense that the reverse edges would be likely in cases like this.
> However, consider a "hub" user with lots of followers who doesn't reciprocate
> with edges back to his followers: the information of who follows him is
> less important in determining whom he would follow. Conversely, for a user
> who commonly reciprocates with followbacks, the information on who
> follows her might be useful in suggesting whom she should follow.
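One way to quantify the hub-vs-reciprocator distinction sketched above: for each vertex, measure what fraction of its inbound edges it reciprocates. A high rate would suggest its follower list is predictive of who it follows next; a low rate marks a non-reciprocating hub. A rough sketch:

```ruby
require 'set'

# For each vertex with at least one follower, the fraction of its
# followers it follows back.
def reciprocation_rate(edges)
  edge_set = edges.map { |from, to| [from, to] }.to_set
  followers = Hash.new { |h, k| h[k] = [] }
  edges.each { |from, to| followers[to] << from }
  rate = {}
  followers.each do |v, fs|
    back = fs.count { |f| edge_set.include?([v, f]) }
    rate[v] = back.to_f / fs.size
  end
  rate
end
```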
> - Last night I started thinking about this as a graph theory problem and
> started researching techniques. This section seemed useful for getting
> - The data provided by kaggle is basically an "incidence list". Theo and I
> discussed converting the provided data into a form that maps outbound vertices
> to their list of inbound/target vertices, which it turns out is called an
> "adjacency list".
> - I wrote some ruby code last night to generate an adjacency list from the
> original training data. I dumped it to CSV format where the first column in
> a row is the outbound vertex, and all following columns for a given row are
> the list of inbound vertices pointed to by the outbound vertex's edges. I
> can upload that somewhere once we figure out the best spot to hand off
> things like this...
> So what wonderful ideas were happening on the other side of the room prior
> to Erin's talk?
> On Wed, Nov 10, 2010 at 2:32 PM, Joe Hale <joe at jjhale.com> wrote:
>> I'll be going along to Noisebridge at 7.30 and will start having a look at
>> the social network data in the 45 min before Erin's talk.
>> On 10 November 2010 13:19, Mike Schachter <mike at mindmech.com> wrote:
>>> Awesome everyone! Just so you know, I won't be in tonight or
>>> next week, please keep me informed via email list and wiki about
>>> what's going on if you can,
>>> On Wed, Nov 10, 2010 at 11:44 AM, Shahin Saneinejad <ssaneine at gmail.com>wrote:
>>>> Hey, I'd really like to help but there's no way I can make it to the
>>>> meeting tonight. My schedule's otherwise flexible in case everyone's open to
>>>> meeting at a different time this week for the competition. If not, maybe I
>>>> can catch up via project wiki notes or something.
>>>> On Wed, Nov 10, 2010 at 11:11 AM, mnsqerr <mnsqerr at webmail.co.za>wrote:
>>>>> This sounds really fun. Lets do it!
>>>>> The link you posted is not working for me, here is a working link:
>>>>> ml mailing list
>>>>> ml at lists.noisebridge.net