[ml] Volunteer needed: computing Chance feature values using EC2

Andreas von Hessling vonhessling at gmail.com
Sat Jun 5 14:39:26 PDT 2010


Thomas,
I've put a link to the dump download on the KDD wiki page -- the
file's md5 hash is 1e42ff64831d60cced16f5330b84f297.  The upload is
currently running, started at 2.36pm, and may take half an hour or so.
It contains sqlite dumps (start with sqlite <dbfilename>, then at the
sqlite prompt type .read <dumpfilename>); the dumps may/should also
load into other SQL engines.  The file contains the Algebra ("a") and
Bridge ("b") train/test sets.
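To check the download against the hash above, something like the
following should do (the local filename is a placeholder, not the real
name of the dump file):

```python
import hashlib

EXPECTED_MD5 = "1e42ff64831d60cced16f5330b84f297"  # from this email

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading in chunks
    so large dumps don't need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# "kdd_dump.sql" is a hypothetical filename for the downloaded dump.
# print(md5_of_file("kdd_dump.sql") == EXPECTED_MD5)
```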

Please make sure the rows in your output are kept in order.

IQ strength values are also already in these files; they represent
the number of steps a student has attempted.
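For anyone going the Hadoop/streaming route instead of SQL, the core
operation is just a hash join: load the (comparatively small) chance
table into memory once, then stream the big test/train file a single
time, which also keeps the rows in order for free. A rough
pure-Python sketch -- the CSV layouts are assumptions based on the
example in my earlier mail below:

```python
def join_chance(step_lines, chance_lines):
    """Attach (chance, chancestrength) to each step row by step name.

    step_lines:   iterable of "stepname,rest,of,row" strings
    chance_lines: iterable of "stepname,chance,strength" strings
    Output preserves the input order of step_lines; steps with no
    chance entry get empty fields rather than being dropped.
    """
    # One pass to build the lookup table; ~1.3M entries for algebra
    # easily fits in memory.
    chance = {}
    for line in chance_lines:
        name, ch, strength = line.split(",", 2)
        chance[name] = (ch.strip(), strength.strip())

    # One pass over the big file; each row costs one dict lookup.
    for line in step_lines:
        name = line.split(",", 1)[0]
        ch, strength = chance.get(name, ("", ""))
        yield "%s,%s,%s" % (line, ch, strength)
```

Used on the toy data from the example below, each output row is the
input row with its chance and strength appended.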



sqlite> .schema atest1
CREATE TABLE 'atest1' (
row INT,
studentid VARCHAR(30),
problemhierarchy TEXT,
problemname TEXT,
problemview INT,
problemstepname TEXT,
cfa INT,
iq REAL,
iqstrength REAL);


sqlite> .schema achance
CREATE TABLE "achance"(
problemstepname TEXT,
chance REAL,
chancestrength REAL
);
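Given those two schemas, Thomas's indexing point boils down to: put an
index on achance.problemstepname, then a single LEFT JOIN ordered by
row does the whole assignment. A minimal sqlite3 sketch in Python --
the inserted rows are made-up sample data, and the columns are trimmed
to the ones the join needs:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Schemas as in the dump, abbreviated to the join-relevant columns.
cur.execute("CREATE TABLE atest1 (row INT, problemstepname TEXT, cfa INT)")
cur.execute("CREATE TABLE achance (problemstepname TEXT, "
            "chance REAL, chancestrength REAL)")
cur.executemany("INSERT INTO atest1 VALUES (?, ?, ?)",
                [(1, "step1", 0), (2, "step99", 1), (3, "step99", 0)])
cur.executemany("INSERT INTO achance VALUES (?, ?, ?)",
                [("step1", 0.92, 260), ("step99", 0.25, 44)])

# The index is what turns each row's lookup into an O(log n) probe
# instead of a full scan of achance.
cur.execute("CREATE INDEX idx_achance_step ON achance (problemstepname)")

# LEFT JOIN keeps every test row even when no chance value exists;
# ORDER BY row keeps the output in the original row order.
rows = cur.execute("""
    SELECT t.row, t.problemstepname, c.chance, c.chancestrength
    FROM atest1 t
    LEFT JOIN achance c ON c.problemstepname = t.problemstepname
    ORDER BY t.row
""").fetchall()
```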


I'll be unavailable till 4pm, then back. Thanks!

Andy



On Sat, Jun 5, 2010 at 2:10 PM, Thomas Lotze <thomas.lotze at gmail.com> wrote:
> Andreas,
>
> With proper indexing, I think we can do this in approximately 508,912 +
> 1,259,273 lookups (rather than *).  Which is to say, I think I can figure
> out how to put this together; do you have SQL dumps available or an SQL
> server I can access?
>
> Cheers,
> Thomas
>
> P.S. This should not preclude anyone else from working on it, *especially*
> if they want to put together a Hadoop/EC2 solution.
>
> On Sat, Jun 5, 2010 at 2:04 PM, Andreas von Hessling <vonhessling at gmail.com>
> wrote:
>>
>> I have a task to hand out.  This would be a great opportunity for
>> folks to apply what they've learned in Vikram's Hadoop/EC2 sessions.
>> This is time-critical and I don't think I can do it before
>> Sunday/Monday, by which time it is needed.
>>
>> Definitions:
>>
>> Chance Feature: the percentage of times a given unique
>> problemstepname is solved correctly.
>> Chance Strength Feature: the number of times a particular unique
>> problemstepname occurs.
>>
>> Together they are supposed to represent how easy/hard it is to get the
>> step right.
>>
>> Problem:
>> I have computed all values for Chance and Chance Strength for algebra
>> and bridge.  The problem/task is now to assign both value pairs back
>> to each step (row) in our (test & train) datasets. The issue here is
>> the speed at which this happens when I try to use SQL on my machines.
>> Here's the order of magnitude of the data we're dealing with.
>>
>> The number of steps/rows:
>>
>> Algebra:
>> sqlite> select count(*) from atest;
>> 508,912
>> sqlite> select count(*) from atrain;
>> 8,918,054
>>
>> Bridge:
>> sqlite> select count(*) from btest;
>> 756,386
>> sqlite> select count(*) from btrain;
>> 20,012,498
>>
>>
>> The number of chance/strength values:
>>
>> Algebra:
>> sqlite> select count(*) from achance;
>> count(*)
>> 1,259,273
>>
>> Bridge:
>> sqlite> select count(*) from bchance;
>> count(*)
>> 566,965
>>
>>
>> So for the simplest case, putting chance values into algebra test
>> would require up to 508,912 * 1,259,273 lookups.  I've tried splitting
>> the problem into subproblems (smaller tables), but it still takes
>> about 24 hours.  So SQL is not appropriate.
>>
>> It seems that this can be done with EC2 -- this seems like an
>> analogous problem to our wordcount (hello-world) Hadoop example.  I
>> can provide the data via FTP.
>>
>> Example:
>> Input:
>>
>> steps:
>> step1, some,data,blah
>> ...
>> step99, more,data,blubb
>> step99, evenmore,data,blubb
>> ...
>> chance values:
>> step1,0.92, 260
>> step2,0.22, 21
>> ...
>> step99,0.25, 44
>> ...
>>
>> Output:
>> step1, some,data,blah,0.92, 260
>> ...
>> step99, more,data,blubb,0.25, 44
>> step99, evenmore,data,blubb,0.25, 44
>>
>>
>> Who wants to give it a try?
>>
>> Andy
>
>
> _______________________________________________
> ml mailing list
> ml at lists.noisebridge.net
> https://www.noisebridge.net/mailman/listinfo/ml
>
>

