[ml] KDD cup submission status
Andreas von Hessling
vonhessling at gmail.com
Sat Jun 5 11:40:52 PDT 2010
Sweet, Mike. Please note that we need the row -> clusterid mapping
for both training AND testing sets. Otherwise it will not help the ML
If I understand correctly, your input is the orthogonalized skills.
So far, the girls only provided these orthogonalizations for the
training files. I'm computing them for the test sets so you can use
them. If I have misunderstood this, please let me
know so I can use my CPU cycles for other tasks.
Ideally you can provide these cluster mappings by about Sunday, which
is when I want to start running classifiers. I will need some time to
actually run the ML algorithms.
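
To make the request concrete, here is a minimal sketch of how a
row -> clusterid mapping could be joined onto the feature rows of either
a training or a test set. The CSV layout (columns "row" and "clusterid"),
the function names, and the -1 placeholder for unmapped rows are all
assumptions for illustration, not an agreed file format.

```python
import csv
import io


def load_cluster_map(fileobj):
    """Read a two-column CSV (row, clusterid) into a dict keyed by row number."""
    reader = csv.DictReader(fileobj)
    return {int(rec["row"]): int(rec["clusterid"]) for rec in reader}


def attach_cluster_ids(rows, cluster_map, missing=-1):
    """Return copies of the feature rows with a 'clusterid' field added.

    Rows without a mapping get `missing` (an assumed placeholder), so gaps
    in a test set stay visible instead of being silently dropped.
    """
    return [dict(r, clusterid=cluster_map.get(r["row"], missing)) for r in rows]


# Toy data standing in for one mapping file and a few test-set rows.
mapping_csv = io.StringIO("row,clusterid\n1,7\n2,3\n")
cluster_map = load_cluster_map(mapping_csv)
test_rows = [{"row": 1, "iq": 0.8}, {"row": 2, "iq": 0.4}, {"row": 3, "iq": 0.6}]
joined = attach_cluster_ids(test_rows, cluster_map)
```

The same two functions would be run once per dataset, which is why the
mapping is needed for the test sets as well as the training sets.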
I now have IQ and IQ strength feature values for all datasets and am
hoping time permits me to compute chance and chance strength values for
Computing # of skills required should not be difficult and I will add
this feature as well. I plan on sharing my datasets as new versions
On Fri, Jun 4, 2010 at 1:42 PM, Mike Schachter <mike at mindmech.com> wrote:
> So it's taking about 9 hours to create a graph from a 4.4GB file. I'm
> going to work on improving the code to make it a bit faster, and also
> am investigating a MapReduce solution.
> Basically the clustering process can be broken down into two stages:
> 1) Construct the graph, apply the clustering algorithm to break graph into
> 2) Apply the clustered graph to the data again to classify each skill set
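
The two stages described above can be sketched roughly as follows. This
is not the code in the project repository. It assumes the graph links
skills that co-occur in the same skill set, and it swaps in plain
connected components as a stand-in for the actual clustering algorithm,
which is not specified in this thread. All function names are
hypothetical.

```python
from collections import defaultdict


def build_graph(skill_sets):
    """Stage 1a: link skills that appear together in the same skill set."""
    graph = defaultdict(set)
    for skills in skill_sets:
        for s in skills:
            # Touching graph[s] also registers skills that occur alone.
            graph[s].update(t for t in skills if t != s)
    return graph


def cluster_graph(graph):
    """Stage 1b: give each skill a cluster id (connected components here)."""
    labels = {}
    next_id = 0
    for start in sorted(graph):
        if start in labels:
            continue
        labels[start] = next_id
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in graph[node]:
                if neighbor not in labels:
                    labels[neighbor] = next_id
                    stack.append(neighbor)
        next_id += 1
    return labels


def classify(skill_sets, labels):
    """Stage 2: assign each skill set the most common cluster of its skills."""
    result = []
    for skills in skill_sets:
        ids = [labels[s] for s in skills]
        result.append(max(set(ids), key=ids.count) if ids else -1)
    return result


# Toy skill sets, one per data row; the last row has no skills.
skill_sets = [["add", "subtract"], ["subtract", "multiply"], ["plot-point"], []]
labels = cluster_graph(build_graph(skill_sets))
row_clusters = classify(skill_sets, labels)
```

Stage 2 only needs the skill -> cluster labels, so it can be rerun over
any dataset without rebuilding the graph.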
> I'll keep working on it and let everyone know how things are going.
> As I mentioned in another email, the source code is in our new SourceForge
> project's git repository.
> On Thu, Jun 3, 2010 at 7:48 PM, Mike Schachter <mike at mindmech.com> wrote:
>> Sounds like you're making great progress! I'll be working on the
>> graph clustering algorithm for the skill set tonight and will keep
>> you posted on how things are going.
>> On Thu, Jun 3, 2010 at 6:17 PM, Andreas von Hessling
>> <vonhessling at gmail.com> wrote:
>>> Doing a few basic tricks, I catapulted the submission into the 50th
>>> percentile. That is without even running any ML algorithm.
>>> I'm planning on running the NaiveBayesUpdateable classifier
>>> (http://weka.wikispaces.com/Classifying+large+datasets) over
>>> discretized IQ/IQ strength/Chance/Chance strength from the command
>>> line to evaluate performance. Another attempt would be to load all
>>> data into memory (<3GB, even for full Bridge Train) and run SVMlib
>>> over it.
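
For the discretization step mentioned above, one simple option is
equal-width binning. The sketch below is an assumption about how the
real-valued features could be discretized before classification. The
thread does not say which scheme or how many bins were used, and the
function name and default bin count are hypothetical.

```python
def equal_width_bins(values, n_bins=10):
    """Map each real value to a bin index in the range 0..n_bins-1.

    Bins split the observed [min, max] range into n_bins equal intervals.
    The maximum value is clamped into the last bin.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Toy IQ values discretized into four bins.
iq_values = [0.0, 0.5, 1.0]
iq_bins = equal_width_bins(iq_values, n_bins=4)
```

The bin boundaries should be computed on the training data and reused
for the test data so that both sets share the same categories.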
>>> If someone wants to try MOA
>>> (http://www.cs.waikato.ac.nz/~abifet/MOA/index.html), this would be
>>> helpful also in the long run (at least a tutorial how to set it up and
>>> The reduced datasets plus the IQ values are linked on the wiki: Features
>>> ...> row INT,
>>> ...> studentid VARCHAR(30),
>>> ...> problemhierarchy TEXT,
>>> ...> problemname TEXT,
>>> ...> problemview INT,
>>> ...> problemstepname TEXT,
>>> ...> cfa INT,
>>> ...> iq REAL
>>> IQ strength (number of attempts per student) should be available soon.
>>> (perhaps add'l features will become available as well)
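
As a sketch of the IQ strength feature described above (number of
attempts per student), the count can be computed in one pass over the
rows. This assumes each row counts as one attempt and uses the studentid
column from the schema above. The function name and the "iqstrength"
field name are hypothetical.

```python
from collections import Counter


def add_iq_strength(rows):
    """Attach to every row the number of rows recorded for its student."""
    attempts = Counter(r["studentid"] for r in rows)
    return [dict(r, iqstrength=attempts[r["studentid"]]) for r in rows]


# Toy rows using a subset of the columns listed above.
rows = [
    {"row": 1, "studentid": "s1", "cfa": 1},
    {"row": 2, "studentid": "s1", "cfa": 0},
    {"row": 3, "studentid": "s2", "cfa": 1},
]
with_strength = add_iq_strength(rows)
```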
>>> I'm still hoping somebody could cluster Erin's normalized skills data
>>> and provide a row -> cluster id mapping for algebra and bridge train
>>> and test sets (I don't have the data any more).
>>> ml mailing list
>>> ml at lists.noisebridge.net