[ml] KDD cup submission status
Thomas Lotze
thomas.lotze at gmail.com
Sun Jun 6 17:41:39 PDT 2010
I love open source software!
The final predicted output (using iq and score as predictors, under a Naive
Bayes model) for algebra and bridge (suitable, I believe, for submission) is
available in http://thomaslotze.com/kdd/output.tgz
The streams.tgz and jarfiles.tgz have been updated with streams for bridge
and my newly-compiled "moa_personal.jar" jarfile.
run_moa.sh should have all the steps needed to duplicate this in MOA
yourself (after creating or importing the SQL tables) -- I've also put up
MOA instructions on the wiki at
https://www.noisebridge.net/wiki/Machine_Learning/moa
Summary: since the MOA code was available on sourceforge, I was able to
create a new ClassificationPerformanceEvaluator (called
BasicLoggingClassificationPerformanceEvaluator) and re-compile the MOA
distribution into moa_personal.jar. This allows us to use the new
evaluator to print out the row number and predicted probability of cfa.
The evaluator is currently pretty hard-coded for the KDD dataset, but I
think I can modify it into a more general task/evaluator for use in the
future (and potentially for inclusion back into the MOA trunk). In any
case, it should work for now.
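
For anyone who wants to see the idea without setting up MOA, here is a
minimal Python sketch of "train Naive Bayes, then log row number and
predicted probability of cfa". It is not the code of
BasicLoggingClassificationPerformanceEvaluator. It assumes a Gaussian
Naive Bayes over two numeric predictors (standing in for iq and score),
uses made-up toy rows, and the function names are hypothetical.

```python
import math


def fit_gaussian_nb(rows):
    """Fit per-class priors, means and variances from (features, label) pairs."""
    stats = {}
    for features, label in rows:
        s = stats.setdefault(
            label,
            {"n": 0, "sum": [0.0] * len(features), "sq": [0.0] * len(features)},
        )
        s["n"] += 1
        for i, x in enumerate(features):
            s["sum"][i] += x
            s["sq"][i] += x * x
    total = sum(s["n"] for s in stats.values())
    model = {}
    for label, s in stats.items():
        means = [t / s["n"] for t in s["sum"]]
        # Floor the variance so a constant feature does not divide by zero.
        variances = [max(q / s["n"] - m * m, 1e-6) for q, m in zip(s["sq"], means)]
        model[label] = (s["n"] / total, means, variances)
    return model


def predict_proba(model, features, positive=1):
    """Return P(label == positive | features) under the fitted model."""
    log_scores = {}
    for label, (prior, means, variances) in model.items():
        ll = math.log(prior)
        for x, m, v in zip(features, means, variances):
            ll += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        log_scores[label] = ll
    if positive not in log_scores:
        return 0.0
    # Subtract the largest log score before exponentiating, for stability.
    top = max(log_scores.values())
    norm = sum(math.exp(s - top) for s in log_scores.values())
    return math.exp(log_scores[positive] - top) / norm


def log_predictions(model, test_rows):
    """Print and return (row number, predicted probability of cfa=1) pairs."""
    out = []
    for row_number, features in test_rows:
        p = predict_proba(model, features)
        print(f"{row_number}\t{p:.6f}")
        out.append((row_number, p))
    return out


if __name__ == "__main__":
    # Toy (iq, score) -> cfa rows, invented purely for illustration.
    toy_train = [((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.2, 0.1), 0), ((0.1, 0.3), 0)]
    toy_test = [(1, (0.85, 0.85)), (2, (0.15, 0.2))]
    log_predictions(fit_gaussian_nb(toy_train), toy_test)
```

The tab-separated "row, probability" output is just one plausible layout;
the actual files in output.tgz may be formatted differently.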
Hooray for open source machine learning!
-Thomas
On Sun, Jun 6, 2010 at 4:42 PM, Andreas von Hessling
<vonhessling at gmail.com> wrote:
> Thomas,
>
> Have you finished joining the chance values into the steps? If so,
> where can I download this joined_tables.sql.gz file?
> (the streams you provide are algebra only -- do you have bridge as
> well?) I would like to concatenate your merged results with the
> number of skills feature I computed; will then provide this dataset.
>
>
> FYI, I'm trying to run one of the incremental classifiers within weka:
> I've started discretizing numeric values for Naive Bayes Updateable
> classifier (
> http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayesUpdateable.html
> ,
> also see http://weka.wikispaces.com/Classifying+large+datasets) using
> something like this: (need a lot of memory!)
>
> java -Xms2048m -Xmx4096m -cp weka.jar
> weka.filters.unsupervised.attribute.Discretize
> -unset-class-temporarily -F -B 10 -i inputfile -o outputfile
>
> Similarly, one can then run the NB algorithm incrementally. I haven't
> done this yet, but Thomas, this may be an alternative if MOA doesn't
> work out.
>
> Andy
>
>
>
> On Sun, Jun 6, 2010 at 1:07 AM, Thomas Lotze <thomas.lotze at gmail.com>
> wrote:
> > All,
> >
> > I've been trying to use MOA to generate a classifier...and while I seem to
> > be able to do that, I'm having trouble getting it to actually output
> > classifications for new examples, so thought I'd share my current status and
> > see if anyone can help.
> >
> > You can download the stream test and train files from
> > http://thomaslotze.com/kdd/streams.tgz
> > You can also download the jarfiles needed for MOA at
> > http://thomaslotze.com/kdd/jarfiles.tgz
> >
> > Unpack these all into the same directory. Then, in that directory, using
> > the following command, you can create a MOA classifier:
> > java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask "LearnModel -l MajorityClass -s (ArffFileStream -f atrain.arff -c -1) -O amodel.moa"
> >
> > You can also summarize the test arff file using the following command:
> > java -cp .:moa.jar:weka.jar weka.core.Instances atest.arff
> >
> > But I cannot find a command for MOA which will input the amodel.moa model
> > and generate predicted classes for atest.arff. The closest I've come is the
> > following, which runs amodel.moa on atest.arff, and must be predicting
> > classes and comparing, because it declares how many it got correct:
> > java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask "EvaluateModel -m file:amodel.moa -s (ArffFileStream -f atest.arff -c -1)"
> >
> > So if anyone can figure it out (I've been using
> > http://www.cs.waikato.ac.nz/~abifet/MOA/Manual.pdf as a guide), I could
> > certainly use some help with this step.
> >
> > Cheers,
> > Thomas
> >
> > P.S. If you'd like to get the SQL loaded yourself, you can download
> > joined_tables.sql.gz (which was created using get_output.sh). I then used
> > run_moa.sh to create the .arff files and try to run MOA.
> >
> > On Sat, Jun 5, 2010 at 2:05 PM, Andreas von Hessling
> > <vonhessling at gmail.com> wrote:
> >>
> >> Mike,
> >> We're working on getting the test dataset orthogonalized. Stay tuned.
> >> Andy
> >>
> >>
> >> > On Sat, Jun 5, 2010 at 1:55 PM, Mike Schachter <mike at mindmech.com>
> >> > wrote:
> >> > Hey Andy, the input to the classifier I'm trying to produce is
> >> > the orthogonalized dataset - i.e. the list of 1000+ columns where
> >> > each column has the value of the opportunity for that skill. The
> >> > dataset was produced by Erin and is broken into several parts,
> >> > for the algebra dataset this looks like:
> >> >
> >> > algebra-output_partaa
> >> > algebra-output_partab
> >> > ..
> >> > algebra-output_partah
> >> >
> >> >
> >> > You're going to have to orthogonalize the test datasets, which
> >> > I don't have a copy of. Erin - are you around? Maybe she can help
> >> > you convert the test datasets?
> >> >
> >> > mike
> >> >
> >> >
> >> >
> >> >
> >> > On Sat, Jun 5, 2010 at 11:40 AM, Andreas von Hessling
> >> > <vonhessling at gmail.com> wrote:
> >> >>
> >> >> Sweet, Mike. Please note that we need the row -> clusterid mapping
> >> >> for both training AND testing sets. Otherwise it will not help the ML
> >> >> algorithms.
> >> >> If I understand correctly, your input is the orthogonalized skills.
> >> >> So far, the girls only provided these orthogonalizations for the
> >> >> training files. I'm computing them for the test sets so you can use
> >> >> them. If I don't understand this assumption correctly, please let me
> >> >> know so I can use my CPU's cycles for other tasks.
> >> >>
> >> >> Ideally you can provide these cluster mappings by about Sunday, which
> >> >> is when I want to start running classifiers. I will need some time to
> >> >> actually run the ML algorithms.
> >> >>
> >> >> I have now IQ and IQ strength feature values for all datasets and am
> >> >> hoping time permits to compute chance and chance strength values for
> >> >> rows.
> >> >> Computing # of skills required should not be difficult and I will add
> >> >> this feature as well. I plan on sharing my datasets as new versions
> >> >> become available.
> >> >>
> >> >> Andy
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Fri, Jun 4, 2010 at 1:42 PM, Mike Schachter <mike at mindmech.com>
> >> >> wrote:
> >> >> > So it's taking about 9 hours to create a graph from a 4.4GB file. I'm
> >> >> > going to work on improving the code to make it a bit faster, and also
> >> >> > am investigating a MapReduce solution.
> >> >> >
> >> >> > Basically the clustering process can be broken down into two stages:
> >> >> >
> >> >> > 1) Construct the graph, apply the clustering algorithm to break graph
> >> >> > into clusters
> >> >> > 2) Apply the clustered graph to the data again to classify each skill
> >> >> > set
> >> >> >
> >> >> > I'll keep working on it and let everyone know how things are going
> >> >> > with it. As I mentioned in another email, the source code is in our
> >> >> > new sourceforge project's git repository.
> >> >> >
> >> >> > mike
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Thu, Jun 3, 2010 at 7:48 PM, Mike Schachter <mike at mindmech.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> Sounds like you're making great progress! I'll be working on the
> >> >> >> graph clustering algorithm for the skill set tonight and will keep
> >> >> >> you posted on how things are going.
> >> >> >>
> >> >> >> mike
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Thu, Jun 3, 2010 at 6:17 PM, Andreas von Hessling
> >> >> >> <vonhessling at gmail.com> wrote:
> >> >> >>>
> >> >> >>> Doing a few basic tricks, I catapulted the submission into the 50th
> >> >> >>> percentile. That is without even running any ML algorithm.
> >> >> >>>
> >> >> >>> I'm planning on running the NaiveBayesUpdateable classifier
> >> >> >>> (http://weka.wikispaces.com/Classifying+large+datasets) over
> >> >> >>> discretized IQ/IQ strength/Chance/Chance strength from the command
> >> >> >>> line to evaluate performance. Another attempt would be to load all
> >> >> >>> data into memory (<3GB, even for full Bridge Train) and run SVMlib
> >> >> >>> over it.
> >> >> >>>
> >> >> >>> If someone wants to try MOA
> >> >> >>> (http://www.cs.waikato.ac.nz/~abifet/MOA/index.html), this would be
> >> >> >>> helpful also in the long run (at least a tutorial on how to set it up
> >> >> >>> and run).
> >> >> >>>
> >> >> >>> The reduced datasets plus the IQ values are linked on the wiki.
> >> >> >>> Features are:
> >> >> >>> ...> row INT,
> >> >> >>> ...> studentid VARCHAR(30),
> >> >> >>> ...> problemhierarchy TEXT,
> >> >> >>> ...> problemname TEXT,
> >> >> >>> ...> problemview INT,
> >> >> >>> ...> problemstepname TEXT,
> >> >> >>> ...> cfa INT,
> >> >> >>> ...> iq REAL
> >> >> >>>
> >> >> >>> IQ strength (number of attempts per student) should be available
> >> >> >>> soon (perhaps add'l features will become available as well).
> >> >> >>>
> >> >> >>> I'm still hoping somebody could cluster Erin's normalized skills
> >> >> >>> data and provide a row -> cluster id mapping for algebra and bridge
> >> >> >>> train and test sets (I don't have the data any more).
> >> >> >>>
> >> >> >>> Andy
> >> >> >>
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
> >
> >
>
