[ml] Volunteer needed: computing Chance feature values using EC2
Andreas von Hessling
vonhessling at gmail.com
Sat Jun 5 14:04:06 PDT 2010
I have a task to hand out. This would be a great opportunity for someone
to apply what they've learned in Vikram's Hadoop/EC2 sessions. The task
is time-critical: the results are needed by Sunday/Monday, and I don't
think I can do it myself before then.
Definitions:
Chance Feature: the percentage of times a particular unique problem step
is solved correctly.
Chance Strength Feature: the number of times a particular unique
problemstepname occurs.
Together they are supposed to represent how easy/hard it is to get the
step right.
Problem:
I have computed all values for Chance and Chance Strength for algebra
and bridge. The task is now to assign both values back to each step
(row) in our test and train datasets. The issue is how slowly this
runs when I try to use SQL on my machines.
Here's the order of magnitude of the data we're dealing with.
The number of steps/rows:
Algebra:
sqlite> select count(*) from atest;
508,912
sqlite> select count(*) from atrain;
8,918,054
Bridge:
sqlite> select count(*) from btest;
756,386
sqlite> select count(*) from btrain;
20,012,498
The number of chance/strength values:
Algebra:
sqlite> select count(*) from achance;
count(*)
1,259,273
Bridge:
sqlite> select count(*) from bchance;
count(*)
566,965
So for the simplest case, putting chance values into algebra test
would require up to 508,912 * 1,259,273 (about 6.4 * 10^11) lookups.
I've tried splitting the problem into subproblems (smaller tables), but
it still takes about 24 hours. So SQL is not appropriate.
It seems that this can be done with EC2: the problem looks analogous
to our wordcount (hello-world) Hadoop example. I can provide the data
via FTP.
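To make the wordcount analogy concrete, here is a minimal local sketch of a reduce-side join in plain Python. A map step keys every record by step name and tags it with its source, a shuffle groups records by key, and a reduce step appends the chance values to each step row. The function names (`map_record`, `reduce_group`, `run_join`) are hypothetical, and the sketch assumes the step name is the first comma-separated field, which may not hold for the real files. On Hadoop the grouping would be done by the framework rather than by the dictionary used here.

```python
from collections import defaultdict


def map_record(source, line):
    """Map step: key each record by step name and tag it with its source."""
    # Assumption: the step name is everything before the first comma.
    key, _, rest = line.partition(",")
    payload = line if source == "step" else rest
    return key.strip(), (source, payload)


def reduce_group(tagged_values):
    """Reduce step: append the chance values to every step row of one key."""
    chance = [value for source, value in tagged_values if source == "chance"]
    rows = [value for source, value in tagged_values if source == "step"]
    for row in rows:
        # Rows without a chance entry are passed through unchanged (assumption).
        yield (row + "," + chance[0]) if chance else row


def run_join(step_lines, chance_lines):
    """Simulate map, shuffle and reduce locally for a small input."""
    groups = defaultdict(list)  # stands in for Hadoop's shuffle/sort phase
    for source, lines in (("chance", chance_lines), ("step", step_lines)):
        for line in lines:
            key, tagged = map_record(source, line)
            groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_group(groups[key]))
    return joined


if __name__ == "__main__":
    demo_steps = ["step1, some,data,blah", "step99, more,data,blubb"]
    demo_chance = ["step1,0.92, 260", "step99,0.25, 44"]
    for row in run_join(demo_steps, demo_chance):
        print(row)
```

As with wordcount, each key is handled independently in the reduce step, so the work can be spread over as many reducers as there are distinct step names.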
Example:
Input:
steps:
step1, some,data,blah
...
step99, more,data,blubb
step99, evenmore,data,blubb
...
chance values:
step1,0.92, 260
step2,0.22, 21
...
step99,0.25, 44
...
Output:
step1, some,data,blah,0.92, 260
...
step99, more,data,blubb,0.25, 44
step99, evenmore,data,blubb,0.25, 44
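The input/output above can also be sketched as a keyed lookup, which is what each worker would do if the chance table is small enough to hold in memory. This is only an illustration under assumptions: the helper names (`load_chance_table`, `annotate_steps`) are made up, the step name is taken to be the first comma-separated field as in the example, and steps without a chance entry are passed through unchanged.

```python
def load_chance_table(chance_lines):
    """Build a dict from step name to its 'chance, strength' suffix."""
    table = {}
    for line in chance_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumption: the step name is everything before the first comma.
        key, _, values = line.partition(",")
        table[key.strip()] = values
    return table


def annotate_steps(step_lines, table):
    """Yield each step row with its chance values appended."""
    for line in step_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key = line.partition(",")[0].strip()
        values = table.get(key)  # one hash lookup per row, no table scan
        # Rows without a chance entry are passed through unchanged (assumption).
        yield line if values is None else line + "," + values


chance_lines = ["step1,0.92, 260", "step2,0.22, 21", "step99,0.25, 44"]
step_lines = [
    "step1, some,data,blah",
    "step99, more,data,blubb",
    "step99, evenmore,data,blubb",
]
annotated = list(annotate_steps(step_lines, load_chance_table(chance_lines)))
for row in annotated:
    print(row)
```

With a lookup table, the number of lookups equals the number of step rows (508,912 for algebra test) rather than the product of the two table sizes.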
Who wants to give it a try?
Andy