Mon Feb 22 19:56:25 PST 2010


Too big (imposed serious computational challenges, limited the types
of methods that could be applied)
Adequate (the computational load was easy to handle)
Was the time constraint imposed by the challenge a difficulty, or did
you feel you had enough time to understand the data, prepare it, and
train models?

Not enough time
Enough time
It was enough time to do something decent, but there was a lot left to
explore. With more time, performance could have been significantly
improved.
How likely are you to keep working on this problem?

It is my main research area.
It was a very interesting problem. I'll keep working on it.
This data is a good fit for the data mining methods I am
using/developing. I will use it in the future for empirical
evaluation.
Maybe I'll try some ideas, but it is not high priority.
Not likely to keep working on it.
Comments on the problem (What aspects of the problem did you find most
interesting? Did it inspire you to develop new techniques?)


On Mon, Jun 7, 2010 at 3:59 PM, Andreas von Hessling
<vonhessling at gmail.com> wrote:
> Oops, the previous dataset I announced was in .csv format and the
> commas messed up the data.  I've relinked a new zip file in tab
> separated format from the wiki for download.  Uploading now.
> MD5 (4skillsAddedNoDiscretization.zip) = dd6da9163dff5a570a80ec9bc8eaaedd
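As a sanity check, the posted MD5 can be verified with a short script; the file name below is taken from the announcement and assumed to be in the current directory:

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks so large zips
    don't need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the digest from the announcement, e.g.:
# md5sum("4skillsAddedNoDiscretization.zip") == "dd6da9163dff5a570a80ec9bc8eaaedd"
```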
>
>
>
> On Mon, Jun 7, 2010 at 1:23 PM, Andreas von Hessling
> <vonhessling at gmail.com> wrote:
>> I've added the latest datasets to the wiki (uploading for about
>> another half an hour).  It contains step success chance and # of
>> skills values.  Numeric values are not discretized.  (I have started
>> discretizing them for the Naive Bayes algorithm though)
>>
>> MD5 (4skillsAddedNoDiscretization.zip) = bb70e584f729b0b0c1edba14eff45b73
>>
>> If we can do so in time, we will add the clustered skills feature as
>> well, but that's it.  Let the algorithms run free!
>>
>> BTW, the evaluation website seems to be slowing down under the
>> increased load just before the deadline.  Something to consider.
>>
>> Andy
>>
>>
>>
>> On Sun, Jun 6, 2010 at 5:41 PM, Thomas Lotze <thomas.lotze at gmail.com> wrote:
>>> I love open source software!
>>>
>>> The final predicted output (using iq and score as predictors, under a Naive
>>> Bayes model) for algebra and bridge (suitable, I believe, for submission) is
>>> available in http://thomaslotze.com/kdd/output.tgz
>>>
>>> The streams.tgz and jarfiles.tgz have been updated with streams for bridge
>>> and my newly-compiled "moa_personal.jar" jarfile.
>>>
>>> run_moa.sh should have all the steps needed to duplicate this in MOA
>>> yourself (after creating or importing the SQL tables) -- I've also put up
>>> MOA instructions on the wiki at
>>> https://www.noisebridge.net/wiki/Machine_Learning/moa
>>>
>>> Summary: since the moa code was available on sourceforge, I was able to
>>> create a new ClassificationPerformanceEvaluator (called
>>> BasicLoggingClassificationPerformanceEvaluator) and re-compile the MOA
>>> distribution into moa_personal.jar.  This allows us to use the new
>>> evaluator to print out the row number and predicted probability of cfa.
>>> The evaluator is currently pretty hard-coded for the KDD dataset, but
>>> I think I can modify it into a more general task/evaluator for use in the
>>> future (and potentially for inclusion back into the MOA trunk).  In any
>>> case, it should work for now.
>>>
>>> Hooray for open source machine learning!
>>>
>>> -Thomas
>>>
>>> On Sun, Jun 6, 2010 at 4:42 PM, Andreas von Hessling <vonhessling at gmail.com>
>>> wrote:
>>>>
>>>> Thomas,
>>>>
>>>> Have you finished joining the chance values into the steps?  If so,
>>>> where can I download this joined_tables.sql.gz file?
>>>> (the streams you provide are algebra only -- do you have bridge as
>>>> well?)  I would like to concatenate your merged results with the
>>>> number of skills feature I computed; will then provide this dataset.
>>>>
>>>>
>>>> FYI, I'm trying to run one of the incremental classifiers within weka:
>>>> I've started discretizing numeric values for the NaiveBayesUpdateable
>>>> classifier
>>>> (http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayesUpdateable.html,
>>>> also see http://weka.wikispaces.com/Classifying+large+datasets) using
>>>> something like this (needs a lot of memory!):
>>>>
>>>> java -Xms2048m -Xmx4096m -cp weka.jar
>>>> weka.filters.unsupervised.attribute.Discretize
>>>> -unset-class-temporarily -F -B 10 -i inputfile -o outputfile
>>>>
>>>> Similarly, one can then run the NB algorithm incrementally.  I haven't
>>>> done this yet, but Thomas, this may be an alternative if MOA doesn't
>>>> work out.
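For reference, the effect of weka's Discretize filter with -F -B 10 (equal-frequency binning into 10 bins) can be approximated in a few lines of pure Python; this is a sketch of the idea, and weka's exact cut-point choices may differ:

```python
def equal_frequency_bins(values, n_bins=10):
    """Cut points that put roughly the same number of values into
    each bin (weka's -F option; -B sets the number of bins).
    Approximation only: weka's exact quantile rule may differ."""
    ordered = sorted(values)
    n = len(ordered)
    # Interior cut points at the 1/n_bins, 2/n_bins, ... quantiles.
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]

def discretize(value, edges):
    """Map a numeric value to a bin index in 0..len(edges)."""
    return sum(1 for e in edges if value >= e)

# Example: bin a small iq-like feature into 4 bins.
iq = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
edges = equal_frequency_bins(iq, n_bins=4)
print([discretize(v, edges) for v in iq])  # -> [0, 0, 1, 1, 2, 2, 3, 3]
```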
>>>>
>>>> Andy
>>>>
>>>>
>>>>
>>>> On Sun, Jun 6, 2010 at 1:07 AM, Thomas Lotze <thomas.lotze at gmail.com>
>>>> wrote:
>>>> > All,
>>>> >
>>>> > I've been trying to use MOA to generate a classifier... and while I seem
>>>> > to be able to do that, I'm having trouble getting it to actually output
>>>> > classifications for new examples, so I thought I'd share my current
>>>> > status and see if anyone can help.
>>>> >
>>>> > You can download the stream test and train files from
>>>> > http://thomaslotze.com/kdd/streams.tgz
>>>> > You can also download the jarfiles needed for MOA at
>>>> > http://thomaslotze.com/kdd/jarfiles.tgz
>>>> >
>>>> > Unpack these all into the same directory.  Then, in that directory,
>>>> > using the following command, you can create a MOA classifier:
>>>> > java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask
>>>> > "LearnModel
>>>> > -l MajorityClass -s (ArffFileStream -f atrain.arff -c -1) -O amodel.moa"
>>>> >
>>>> > You can also summarize the test arff file using the following command:
>>>> > java -cp .:moa.jar:weka.jar weka.core.Instances atest.arff
>>>> >
>>>> > But I cannot find a command for MOA which will input the amodel.moa
>>>> > model and generate predicted classes for atest.arff.  The closest I've
>>>> > come is the following, which runs amodel.moa on atest.arff, and must be
>>>> > predicting classes and comparing, because it declares how many it got
>>>> > correct:
>>>> > java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask
>>>> > "EvaluateModel -m file:amodel.moa -s (ArffFileStream -f atest.arff -c -1)"
>>>> >
>>>> > So if anyone can figure it out (I've been using
>>>> > http://www.cs.waikato.ac.nz/~abifet/MOA/Manual.pdf as a guide), I could
>>>> > certainly use some help with this step.
>>>> >
>>>> > Cheers,
>>>> > Thomas
>>>> >
>>>> > P.S. If you'd like to get the SQL loaded yourself, you can download
>>>> > joined_tables.sql.gz (which was created using get_output.sh).  I then
>>>> > used run_moa.sh to create the .arff files and try to run MOA.
>>>> >
>>>> > On Sat, Jun 5, 2010 at 2:05 PM, Andreas von Hessling
>>>> > <vonhessling at gmail.com>
>>>> > wrote:
>>>> >>
>>>> >> Mike,
>>>> >> We're working on getting the test dataset orthogonalized.  Stay tuned.
>>>> >> Andy
>>>> >>
>>>> >>
>>>> >> On Sat, Jun 5, 2010 at 1:55 PM, Mike Schachter <mike at mindmech.com>
>>>> >> wrote:
>>>> >> > Hey Andy, the input to the classifier I'm trying to produce is
>>>> >> > the orthogonalized dataset - i.e. the list of 1000+ columns where
>>>> >> > each column has the value of the opportunity for that skill. The
>>>> >> > dataset was produced by Erin and is broken into several parts,
>>>> >> > for the algebra dataset this looks like:
>>>> >> >
>>>> >> > algebra-output_partaa
>>>> >> > algebra-output_partab
>>>> >> > ..
>>>> >> > algebra-output_partah
>>>> >> >
>>>> >> >
>>>> >> > You're going to have to orthogonalize the test datasets, which
>>>> >> > I don't have a copy of. Erin - are you around? Maybe she can help
>>>> >> > you convert the test datasets?
>>>> >> >
>>>> >> >   mike
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
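The orthogonalization Mike describes (one column per skill, holding that skill's opportunity count) can be sketched like this; the input format is an assumption about what Erin's script consumes, and the real files will have 1000+ columns rather than two:

```python
def orthogonalize(rows, all_skills):
    """Expand each row's {skill: opportunity} mapping into a
    fixed-width vector with one column per skill (0 where the
    skill is absent).  Column order here is sorted skill name;
    the actual generated files may order columns differently."""
    index = {skill: i for i, skill in enumerate(sorted(all_skills))}
    out = []
    for skill_opps in rows:
        vec = [0] * len(index)
        for skill, opp in skill_opps.items():
            vec[index[skill]] = opp
        out.append(vec)
    return out

# Toy example with two skills; columns are [fractions, slope].
rows = [{"fractions": 2}, {"fractions": 3, "slope": 1}]
print(orthogonalize(rows, {"fractions", "slope"}))  # -> [[2, 0], [3, 1]]
```

The same function has to be run over the test rows with the *training* skill vocabulary, which is exactly why the test sets need converting too.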
>>>> >> > On Sat, Jun 5, 2010 at 11:40 AM, Andreas von Hessling
>>>> >> > <vonhessling at gmail.com> wrote:
>>>> >> >>
>>>> >> >> Sweet, Mike.  Please note that we need the row -> clusterid mapping
>>>> >> >> for both training AND testing sets.  Otherwise it will not help
>>>> >> >> the ML algorithms.
>>>> >> >> If I understand correctly, your input is the orthogonalized skills.
>>>> >> >> So far, the girls only provided these orthogonalizations for the
>>>> >> >> training files.  I'm computing them for the test sets so you can
>>>> >> >> use them.  If I don't understand this assumption correctly, please
>>>> >> >> let me know so I can use my CPU's cycles for other tasks.
>>>> >> >>
>>>> >> >> Ideally you can provide these cluster mappings by about Sunday,
>>>> >> >> which is when I want to start running classifiers.  I will need
>>>> >> >> some time to actually run the ML algorithms.
>>>> >> >>
>>>> >> >> I now have IQ and IQ strength feature values for all datasets and
>>>> >> >> am hoping time permits to compute chance and chance strength values
>>>> >> >> for rows.
>>>> >> >> Computing # of skills required should not be difficult and I will
>>>> >> >> add this feature as well.  I plan on sharing my datasets as new
>>>> >> >> versions become available.
>>>> >> >>
>>>> >> >> Andy
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> On Fri, Jun 4, 2010 at 1:42 PM, Mike Schachter <mike at mindmech.com>
>>>> >> >> wrote:
>>>> >> >> > So it's taking about 9 hours to create a graph from a 4.4GB file.
>>>> >> >> > I'm going to work on improving the code to make it a bit faster,
>>>> >> >> > and am also investigating a MapReduce solution.
>>>> >> >> >
>>>> >> >> > Basically the clustering process can be broken down into two
>>>> >> >> > stages:
>>>> >> >> >
>>>> >> >> > 1) Construct the graph, apply the clustering algorithm to break
>>>> >> >> > the graph into clusters
>>>> >> >> > 2) Apply the clustered graph to the data again to classify each
>>>> >> >> > skill set
>>>> >> >> >
>>>> >> >> > I'll keep working on it and let everyone know how things are going
>>>> >> >> > with it. As I mentioned in another email, the source code is in
>>>> >> >> > our new sourceforge project's git repository.
>>>> >> >> >
>>>> >> >> >  mike
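The two stages above can be illustrated with a toy version that uses a skill co-occurrence graph and connected components as a stand-in for the real clustering algorithm (whose details aren't in this thread):

```python
from collections import defaultdict

def build_graph(skill_sets):
    """Stage 1a: link skills that co-occur in the same skill set."""
    adj = defaultdict(set)
    for skills in skill_sets:
        for a in skills:
            adj[a].update(b for b in skills if b != a)
    return adj

def connected_components(adj):
    """Stage 1b: break the graph into clusters.  Connected
    components is only a placeholder for the actual algorithm."""
    seen, cluster_of, cid = set(), {}, 0
    for start in adj:
        if start in seen:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            cluster_of[node] = cid
            stack.extend(adj[node])
        cid += 1
    return cluster_of

def classify(skill_set, cluster_of):
    """Stage 2: map a row's skill set to the cluster ids it touches."""
    return sorted({cluster_of[s] for s in skill_set if s in cluster_of})

# "a" and "c" end up in one cluster via "b"; "d" is its own cluster.
clusters = connected_components(build_graph([{"a", "b"}, {"b", "c"}, {"d"}]))
```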
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > On Thu, Jun 3, 2010 at 7:48 PM, Mike Schachter <mike at mindmech.com>
>>>> >> >> > wrote:
>>>> >> >> >>
>>>> >> >> >> Sounds like you're making great progress! I'll be working on the
>>>> >> >> >> graph clustering algorithm for the skill set tonight and will
>>>> >> >> >> keep you posted on how things are going.
>>>> >> >> >>
>>>> >> >> >>   mike
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >> On Thu, Jun 3, 2010 at 6:17 PM, Andreas von Hessling
>>>> >> >> >> <vonhessling at gmail.com> wrote:
>>>> >> >> >>>
>>>> >> >> >>> Doing a few basic tricks, I catapulted the submission into the
>>>> >> >> >>> 50th percentile.  That is not even running any ML algorithm.
>>>> >> >> >>>
>>>> >> >> >>> I'm planning on running the NaiveBayesUpdateable classifier
>>>> >> >> >>> (http://weka.wikispaces.com/Classifying+large+datasets) over
>>>> >> >> >>> discretized IQ/IQ strength/chance/chance strength from the
>>>> >> >> >>> command line to evaluate performance.  Another attempt would be
>>>> >> >> >>> to load all data into memory (<3GB, even for full Bridge Train)
>>>> >> >> >>> and run SVMlib over it.
>>>> >> >> >>>
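Loading everything into memory isn't strictly necessary for NB: the updateable variant just maintains running counts. A stripped-down sketch of that idea (pure Python over already-discretized features; this is not weka's implementation, and the Laplace smoothing is an arbitrary choice):

```python
from collections import defaultdict

class IncrementalNB:
    """Categorical Naive Bayes trained one row at a time, in the
    spirit of weka's NaiveBayesUpdateable: only counts are kept,
    so the full dataset never has to fit in memory."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feat_counts = defaultdict(int)   # (cls, i, value) -> count
        self.values = defaultdict(set)        # feature i -> seen values

    def update(self, features, cls):
        """Absorb one labeled row."""
        self.class_counts[cls] += 1
        for i, v in enumerate(features):
            self.feat_counts[(cls, i, v)] += 1
            self.values[i].add(v)

    def score(self, features, cls):
        """Unnormalized P(cls) * prod_i P(feature_i | cls)."""
        total = sum(self.class_counts.values())
        p = self.class_counts[cls] / total
        for i, v in enumerate(features):
            # Laplace smoothing so unseen values don't zero out p.
            num = self.feat_counts[(cls, i, v)] + 1
            den = self.class_counts[cls] + len(self.values[i])
            p *= num / den
        return p

    def predict(self, features):
        return max(self.class_counts, key=lambda c: self.score(features, c))

# Tiny example with hypothetical discretized feature values.
nb = IncrementalNB()
for features, cls in [(("low", "a"), 0), (("low", "a"), 0), (("high", "b"), 1)]:
    nb.update(features, cls)
print(nb.predict(("low", "a")))  # -> 0
```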
>>>> >> >> >>> If someone wants to try MOA
>>>> >> >> >>> (http://www.cs.waikato.ac.nz/~abifet/MOA/index.html), this would
>>>> >> >> >>> be helpful also in the long run (at least a tutorial on how to
>>>> >> >> >>> set it up and run it).
>>>> >> >> >>>
>>>> >> >> >>> The reduced datasets plus the IQ values are linked on the wiki.
>>>> >> >> >>> Features are:
>>>> >> >> >>>    ...> row INT,
>>>> >> >> >>>    ...> studentid VARCHAR(30),
>>>> >> >> >>>    ...> problemhierarchy TEXT,
>>>> >> >> >>>    ...> problemname TEXT,
>>>> >> >> >>>    ...> problemview INT,
>>>> >> >> >>>    ...> problemstepname TEXT,
>>>> >> >> >>>    ...> cfa INT,
>>>> >> >> >>>    ...> iq REAL
>>>> >> >> >>>
>>>> >> >> >>> IQ strength (number of attempts per student) should be available
>>>> >> >> >>> soon.  (Perhaps add'l features will become available as well.)
>>>> >> >> >>>
>>>> >> >> >>> I'm still hoping somebody could cluster Erin's normalized skills
>>>> >> >> >>> data and provide a row -> cluster id mapping for algebra and
>>>> >> >> >>> bridge train and test sets (I don't have the data any more).
>>>> >> >> >>>
>>>> >> >> >>> Andy
>>>> >> >> >>> _______________________________________________
>>>> >> >> >>> ml mailing list
>>>> >> >> >>> ml at lists.noisebridge.net
>>>> >> >> >>> https://www.noisebridge.net/mailman/listinfo/ml
>>>> >> >> >>
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >
>>>> >> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>
>>>
>>>
>>>
>>
>

