KDD Competition 2010

From Noisebridge
(Difference between revisions)
(Undo revision 26677 by 91.121.27.33 (Talk))
 
(9 intermediate revisions by 9 users not shown)
Line 20: Line 20:
 
* [[Machine_Learning/SqliteImport | Importing data into Sqlite]] for SQL'ing the data
 
 
* [[Machine_Learning/OmniscopeVisualization | Visualizing Sqlite data in Omniscope]] for understanding the data
 
 
+
* [http://swarmfinancial.com/ec2mapping.zip Chance mapping dataset for Vikram's EC2 presentation]
 
+
==TODOs==
+
 
+
* Vikram -- will create a guide for Mahout setup
+
* Thomas -- Attempt clustering skills (subskills, traced skills and rules) using Mahout
+
** put together a [[Machine_Learning/kdd_sample | perl script]] which will take random samples from the data, for working on smaller instances
+
** put together a [[Machine_Learning/kdd_r | simple R script]] for loading the data
+
* Andy -- define features for sub-problems (student IQ, step difficulty); do the remaining feature transforms: replace step name with unique step name; remove given features; add features: step success chance, student IQ, complexity
+
* Erin --
+
* Paul -- Create overview of the data: histograms, notable features etc. Visualization?
+
  
 
== Notes ==
 
Line 43: Line 33:
 
== Who we are ==
 
 
* Andy; Machine Learning
 
* Paul; Machine Learning
 
 
* Thomas; Statistics
 
 
* Erin; Maths
 
Line 61: Line 50:
 
* [http://swarmfinancial.com/screencasts/nb/kddWekaUsage1.swf Screencast1]
 
 
* [http://swarmfinancial.com/screencasts/nb/kddWekaUsage2.swf Screencast2]
 
 +
 +
== A more step-by-step weka example ==
 +
* [[Machine Learning/weka]]
  
 
== How to run libSVM ==
 
 
* See the notes at [[Machine Learning/SVM]]
 
 +
 +
== How to run MOA ==
 +
* See the notes at [[Machine Learning/moa]]

Latest revision as of 11:40, 28 July 2012

We're interested in working on the KDD Competition as a way to focus our machine learning exploration, and maybe even to find some interesting aspects of the data! If you're interested, drop us a note or show up at a weekly Machine Learning meeting. We'll use this space to keep track of our ideas.


Resources

Notes

  • For the KDD submission: on OS X, zip the submission file from the command line; otherwise the upload will complain about a __MACOSX file in the archive. E.g.: zip asdf.zip algebra_2008_2009_submission.txt
  • We will need to make sure we don't get disqualified for people belonging to multiple teams! Do not sign up anybody else for the competition without asking first.
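
If you would rather not depend on the zip tool, the same packaging step can be sketched in Python. This is only a minimal sketch: the function name zip_submission is made up for illustration, and the file names are the ones from the note above. It writes an archive that contains the submission file and nothing else.

```python
import zipfile
from pathlib import Path


def zip_submission(submission_path, archive_path):
    """Write a zip archive that contains only the submission file.

    zip_submission is a hypothetical helper name, not part of any
    competition tooling.
    """
    submission = Path(submission_path)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # arcname drops any directory prefix, so the file sits at the archive root.
        archive.write(submission, arcname=submission.name)
    return archive_path


# Example call (file names taken from the note above):
# zip_submission("algebra_2008_2009_submission.txt", "asdf.zip")
```

Because the archive is built entry by entry, no extra metadata entries are added along the way.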

Ideas

  • Add new features by computing their values from existing columns -- e.g. correlation between skills based on their co-occurrence within problems. We could use a decision tree to define the boundaries of, e.g., a new "good student, medium student, bad student" feature.
  • Dimensionality reduction -- transform the features into numerical values suitable for an SVM
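
As a starting point for the co-occurrence idea, here is a rough Python sketch that counts how often two skills show up in the same problem. It rests on several assumptions that should be checked against the real files: each row is reduced to a (problem name, skills field) pair, multiple skills in one field are separated by "~~", and the skill and problem names in the sample are invented. The raw pair counts would still need to be normalised before calling them a correlation.

```python
from collections import Counter, defaultdict
from itertools import combinations


def skill_cooccurrence(rows, separator="~~"):
    """Count how often each pair of skills appears within the same problem.

    rows is an iterable of (problem_name, skills_field) tuples; the "~~"
    separator is an assumption about how the skills column is encoded.
    """
    skills_by_problem = defaultdict(set)
    for problem, skills_field in rows:
        for skill in skills_field.split(separator):
            skill = skill.strip()
            if skill:
                skills_by_problem[problem].add(skill)

    pair_counts = Counter()
    for skills in skills_by_problem.values():
        # sorted() gives every unordered pair one canonical (a, b) key.
        for pair in combinations(sorted(skills), 2):
            pair_counts[pair] += 1
    return pair_counts


# Invented sample data, only to show the expected input shape.
sample_rows = [
    ("PROBLEM-1", "Find X~~Remove constant"),
    ("PROBLEM-1", "Remove constant"),
    ("PROBLEM-2", "Find X~~Remove constant~~Combine terms"),
    ("PROBLEM-3", "Combine terms"),
]
sample_counts = skill_cooccurrence(sample_rows)
```

Each problem contributes at most one count per skill pair, no matter how many of its steps mention the skills.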

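For the "numerical values for an SVM" part of the ideas above, one common option is one-hot encoding of the categorical columns, written out in the sparse "label index:value" text format that libSVM reads. The sketch below is only that, a sketch: the helper names and the column names in the example are hypothetical, and one-hot encoding increases the number of dimensions rather than reducing it, so it covers the encoding half of the idea only.

```python
def build_index(rows, categorical_columns):
    """Assign a 1-based feature index to every (column, value) pair seen."""
    index = {}
    for row in rows:
        for column in categorical_columns:
            key = (column, row[column])
            if key not in index:
                index[key] = len(index) + 1
    return index


def to_libsvm_line(row, label_column, categorical_columns, index):
    """Format one row as '<label> <index>:1 ...' with ascending indices."""
    active = sorted(
        index[(column, row[column])]
        for column in categorical_columns
        if (column, row[column]) in index  # unseen values are simply skipped
    )
    features = " ".join(f"{i}:1" for i in active)
    return f"{row[label_column]} {features}".rstrip()


# Invented rows; "student", "step" and "label" are placeholder column names.
example_rows = [
    {"student": "s1", "step": "a", "label": 1},
    {"student": "s2", "step": "a", "label": 0},
]
example_index = build_index(example_rows, ["student", "step"])
example_lines = [
    to_libsvm_line(row, "label", ["student", "step"], example_index)
    for row in example_rows
]
```

The index has to be built once on the training data and reused for the test data, so that the same value always maps to the same feature number.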

Who we are

  • Andy; Machine Learning
  • Thomas; Statistics
  • Erin; Maths
  • Vikram; Hadoop

(insert your name/contact info/expertise here)


How to run Weka (quick 'n very dirty tutorial)

  • Download and install Weka
  • Get your KDD data and preprocess it:

This command takes the first 1000 lines of the given training data set and converts them into a .csv file. The sed step replaces double quotes and commas inside the data with spaces, and tr then turns the tab separators into commas. (An earlier version of this command used a second sed with a literal tab character, typed in the OS X terminal by pressing CONTROL+V and then TAB; that version could not be copied and pasted, because the tab gets pasted as spaces.)

head -n 1000 algebra_2006_2007_train.txt | sed -e 's/[",]/ /g' | tr '\t' ',' > algebra_2006_2007_train_1kFormatted.csv
  • In Weka's Explorer, inspect the dataset and remove any unwanted attributes (I leave this up to your judgment).
  • Then you can run an ML algorithm over it, e.g. a neural network, to predict student performance.
  • The following screencasts show how to do these steps:
  • Screencast1: http://swarmfinancial.com/screencasts/nb/kddWekaUsage1.swf
  • Screencast2: http://swarmfinancial.com/screencasts/nb/kddWekaUsage2.swf
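
If the terminal quoting gets in the way, the head/sed preprocessing step above can also be sketched in Python. This is a rough equivalent, not a drop-in replacement: it assumes the training file is tab-separated UTF-8 text, and the function name tsv_head_to_csv is made up for illustration. Like the shell command, it keeps the header line as the first of the 1000 lines and replaces double quotes and commas in the data with spaces.

```python
import re
from itertools import islice


def tsv_head_to_csv(src_path, dst_path, max_lines=1000):
    """Convert the first max_lines lines of a tab-separated file to CSV.

    Double quotes and commas inside fields are replaced by spaces, then the
    fields are joined with commas. Returns the number of lines written.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in islice(src, max_lines):
            fields = line.rstrip("\n").split("\t")
            cleaned = [re.sub(r'[",]', " ", field) for field in fields]
            dst.write(",".join(cleaned) + "\n")
            written += 1
    return written


# Example call (file names taken from the command above):
# tsv_head_to_csv("algebra_2006_2007_train.txt",
#                 "algebra_2006_2007_train_1kFormatted.csv")
```

Splitting on the tab character in code avoids having to type a literal tab in the terminal.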

A more step-by-step Weka example

  • Machine Learning/weka

How to run libSVM

  • See the notes at Machine Learning/SVM

How to run MOA

  • See the notes at Machine Learning/moa
