KDD Competition 2010: Difference between revisions

From Noisebridge

Revision as of 17:51, 22 May 2010

We're interested in working on the KDD Competition as a way to focus our machine learning exploration -- and maybe even find some interesting aspects of the data! If you're interested, drop us a note or show up at a weekly Machine Learning meeting. We'll use this space to keep track of our ideas.

Resources

TODOs

  • Vikram -- will help set up Hadoop for the rest of us and create a guide for Mahout setup
  • Thomas -- will get libsvm working on the data and put together a "how to" guide for doing so
    • put together a perl script which will take random samples from the data, for working on smaller instances
    • put together a simple R script for loading the data
  • Andy --
  • Erin -- Will put meeting notes of 5/19 on https://www.noisebridge.net/wiki/Machine_Learning; will work on data transformations and ways to create better representations of the data; will provide the orthogonalized data sets
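
The random-sampling step from the TODO list can be sketched as follows. This is not the perl script linked from the wiki; it is a minimal Python sketch, assuming the data file is plain text with one header line followed by one record per line. The function name `sample_lines` and the fixed default seed are illustrative assumptions.

```python
import random
from typing import Iterable, List


def sample_lines(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Return the header plus k randomly chosen data lines (reservoir sampling).

    Assumes the first line is a header. The input is read once, so this also
    works on files that are too large to hold in memory.
    """
    rng = random.Random(seed)
    it = iter(lines)
    header = next(it, None)
    if header is None:
        return []
    reservoir: List[str] = []
    for i, line in enumerate(it):
        if i < k:
            # Fill the reservoir with the first k data lines.
            reservoir.append(line)
        else:
            # Keep the new line with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return [header] + reservoir
```

Passing an open file handle as `lines` (for example the training file) yields a sample that can be written straight back out with `writelines`. The sampled lines are not kept in their original order.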


Notes

  • For KDD submission: to zip the submission file on OS X, use the command line; otherwise the archive contains a __MACOSX entry that the submission will complain about. E.g.: zip asdf.zip algebra_2008_2009_submission.txt
  • We will need to make sure we don't get disqualified for people belonging to multiple teams! Do not sign up anybody else for the competition without asking first.
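
If the command line is not handy, the same kind of archive can be built from a script. The following is a minimal sketch, assuming only that the submission is a single text file; the function name `zip_submission` is an illustrative assumption, and nothing here is specific to the KDD submission system.

```python
import os
import zipfile


def zip_submission(submission_path: str, archive_path: str) -> None:
    """Write a zip archive containing only the submission file.

    The file is stored under its base name, so the archive has no
    directories and no __MACOSX entry.
    """
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(submission_path, arcname=os.path.basename(submission_path))
```

For example, `zip_submission("algebra_2008_2009_submission.txt", "asdf.zip")` produces an archive like the one from the zip command in the note above.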

Ideas

  • Add new features by computing their values from existing columns -- e.g. correlation between skills based on their co-occurrence within problems. Could use a decision tree to define the boundaries of a new categorical feature, e.g. "good student, medium student, bad student"
  • Dimensionality reduction -- transform into numerical values appropriate for consumption by SVM
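
As a starting point for the skill co-occurrence idea, here is a minimal sketch that counts how often two skills show up in the same problem. It assumes each row has already been parsed into a problem name and a list of skill names; the input shape and the function name `skill_cooccurrence` are assumptions made for illustration, not the layout of the KDD files. The result is raw pair counts, which would still need normalizing before being read as a correlation.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def skill_cooccurrence(rows: Iterable[Tuple[str, List[str]]]) -> Counter:
    """Count how often each pair of skills appears in the same problem.

    rows: (problem_name, skills) pairs, one per step. Both fields are
    assumed to have been parsed from the training file already.
    """
    skills_by_problem: Dict[str, Set[str]] = {}
    for problem, skills in rows:
        skills_by_problem.setdefault(problem, set()).update(skills)

    pair_counts: Counter = Counter()
    for skills in skills_by_problem.values():
        # Sort so that each pair has one canonical (a, b) ordering.
        for pair in combinations(sorted(skills), 2):
            pair_counts[pair] += 1
    return pair_counts
```

Each pair count could then be attached to a row as a new numeric column, which also fits the goal of feeding purely numerical values to an SVM.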


Who we are

  • Andy; Machine Learning
  • Thomas; Statistics
  • Erin; Maths
  • Vikram; Hadoop

(insert your name/contact info/expertise here)


How to run Weka (quick 'n dirty tutorial)

  • Download and install Weka
  • Get your KDD data
  • Preprocess your data: the command below takes the first 1000 lines of the given training data set and converts them into a .csv file
  • Attention: in the last sed command, the whitespace between the first two slashes must be a single tab character, not a space. In the OS X terminal, you type a tab by pressing CONTROL+V and then TAB. (Copying and pasting the command below won't work, since the tab is pasted as a space. Replacing the last sed with tr '\t' ',' avoids the problem.)
  • head -n 1000 algebra_2006_2007_train.txt | sed -e 's/[",]/ /g' | sed 's/ /,/g' > algebra_2006_2007_train_1kFormatted.csv
  • The following screencast shows you how to do these steps:
  • In Weka's Explorer, remove some unwanted attributes (I leave this up to your judgment), inspect the dataset.
  • Then you can run an ML algorithm over it, e.g. a neural network, to predict student performance.
  • Screencast1
  • Screencast2
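
The preprocessing pipeline above can also be written as a short script, which sidesteps the literal-tab problem. This is a minimal sketch that mirrors the head and sed steps, assuming the training file is tab-separated text; the function name `to_csv_lines` is an illustrative assumption.

```python
import itertools
from typing import Iterable, Iterator


def to_csv_lines(lines: Iterable[str], limit: int = 1000) -> Iterator[str]:
    """Yield the first `limit` lines converted from tab-separated to CSV.

    Mirrors the shell pipeline: double quotes and commas inside fields are
    replaced by spaces, then every tab becomes a comma.
    """
    for line in itertools.islice(lines, limit):
        cleaned = line.replace('"', " ").replace(",", " ")
        yield cleaned.replace("\t", ",")
```

To use it, open the training file for reading and the .csv file for writing, then pass the result of `to_csv_lines` on the input handle to `writelines` on the output handle.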

How to run SVM