[ml] Volunteer needed: computing Chance feature values using EC2
thomas.lotze at gmail.com
Sat Jun 5 14:10:56 PDT 2010
With proper indexing, I think we can do this in approximately 508,912 +
1,259,273 lookups (rather than their product). Which is to say, I think I
can figure out how to put this together; do you have SQL dumps available or
an SQL server I can access?
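To make the indexing idea concrete, here is a minimal sketch using sqlite3 in memory. The table and column names (atest, achance, step) are assumptions, since the real schemas aren't shown in the thread; the sample rows are taken from the quoted message below.

```python
import sqlite3

# Sketch of the indexed join, with hypothetical schemas for atest/achance.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE atest  (step TEXT, data TEXT);
    CREATE TABLE achance(step TEXT, chance REAL, strength INTEGER);
    INSERT INTO atest   VALUES ('step1', 'blah'), ('step99', 'blubb');
    INSERT INTO achance VALUES ('step1', 0.92, 260), ('step99', 0.25, 44);
    -- The index lets each atest row probe achance in O(log n) instead of a
    -- full scan, so the join is roughly one pass over each table.
    CREATE INDEX idx_achance_step ON achance(step);
""")
rows = con.execute("""
    SELECT t.step, t.data, c.chance, c.strength
    FROM atest t JOIN achance c ON t.step = c.step
    ORDER BY t.step
""").fetchall()
print(rows)  # [('step1', 'blah', 0.92, 260), ('step99', 'blubb', 0.25, 44)]
```

Without the index, sqlite may fall back to scanning achance once per atest row, which is exactly the N * M behavior described below.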
P.S. This should not preclude anyone else from working on it, *especially*
if they want to put together a Hadoop/EC2 solution.
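For the Hadoop/EC2 route, the standard pattern would be a reduce-side join, structurally close to the wordcount example: the mapper tags each record with its source, the framework groups by step name, and the reducer pairs them up. A rough sketch follows; the record layouts are guessed from the samples quoted below, and in a real Streaming job these functions would read stdin and print to stdout rather than return values.

```python
# Reduce-side join sketch for Hadoop Streaming (assumed record layouts:
# step rows "step,<data...>" and chance rows "step,<chance>,<strength>").

def map_line(line, source):
    """Emit key<TAB>tag,payload; source is 'chance' or 'steps'.
    The tag makes chance records sort before step records per key."""
    key, rest = line.split(",", 1)
    tag = "0" if source == "chance" else "1"
    return f"{key.strip()}\t{tag},{rest.strip()}"

def reduce_group(key, tagged_values):
    """Join one key's records: remember the chance pair seen first,
    then emit each step row with that pair appended."""
    chance = None
    out = []
    for v in sorted(tagged_values):        # "0,..." sorts before "1,..."
        tag, payload = v.split(",", 1)
        if tag == "0":
            chance = payload               # e.g. "0.25, 44"
        elif chance is not None:           # skip steps with no chance value
            out.append(f"{key}, {payload},{chance}")
    return out
```

For example, `reduce_group("step99", ["0,0.25, 44", "1,more,data,blubb"])` yields the merged line in the format shown at the bottom of the quoted message.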
On Sat, Jun 5, 2010 at 2:04 PM, Andreas von Hessling
<vonhessling at gmail.com> wrote:
> I have a task to hand out. This would be a great opportunity for anyone
> to apply what they've learned in Vikram's Hadoop/EC2 sessions. This is
> time-critical and I don't think I can do it before Sunday/Monday, by
> which time it is needed.
> Chance Feature: the percentage of times a unique problem step is solved.
> Chance Strength Feature: the number of times a particular unique
> problemstepname occurs.
> Together they are supposed to represent how easy or hard it is to get
> the step right.
> I have computed all values for Chance and Chance Strength for algebra
> and bridge. The task is now to assign this pair of values back to each
> step (row) in our test and train datasets. The issue is the speed at
> which this happens when I try to use SQL on my machines.
> Here's the order of magnitude of the data we're dealing with.
> The number of steps/rows:
> sqlite> select count(*) from atest;
> sqlite> select count(*) from atrain;
> sqlite> select count(*) from btest;
> sqlite> select count(*) from btrain;
> The number of chance/strength values:
> sqlite> select count(*) from achance;
> sqlite> select count(*) from bchance;
> So for the simplest case, putting chance values into the algebra test
> set would require up to 508,912 * 1,259,273 lookups. I've tried
> splitting the problem into subproblems (smaller tables), but it still
> takes about 24 hours, so SQL is not appropriate.
> It seems that this can be done with EC2 -- the problem is analogous to
> our wordcount (hello-world) Hadoop example. I can provide the data via
> FTP.
> step rows:
> step1, some,data,blah
> step99, more,data,blubb
> step99, evenmore,data,blubb
> chance values:
> step1,0.92, 260
> step2,0.22, 21
> step99,0.25, 44
> merged result:
> step1, some,data,blah,0.92, 260
> step99, more,data,blubb,0.25, 44
> step99, evenmore,data,blubb,0.25, 44
> Who wants to give it a try?
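Here is a minimal in-memory sketch of the join over the sample lines quoted
above: load the smaller chance table into a dict keyed by step name, then
stream the step rows once, so the total work is the sum rather than the
product of the two table sizes.

```python
# Hash-join sketch over the sample rows from the quoted message.
chance_lines = ["step1,0.92, 260", "step2,0.22, 21", "step99,0.25, 44"]
step_lines = ["step1, some,data,blah",
              "step99, more,data,blubb",
              "step99, evenmore,data,blubb"]

# One pass over the chance table: key -> "chance, strength".
chance = {}
for line in chance_lines:
    key, rest = line.split(",", 1)
    chance[key] = rest

# One pass over the step rows, appending the matching pair to each.
joined = [f"{line},{chance[line.split(',', 1)[0]]}" for line in step_lines]
for row in joined:
    print(row)  # e.g. step1, some,data,blah,0.92, 260
```

The same pattern works out of core by reading the step rows from a file
instead of a list; only the chance table has to fit in memory.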