KDD Competition 2010

From Noisebridge
(Difference between revisions)
Line 3: Line 3:
 
==Resources==
 
==Resources==
 
* [https://pslcdatashop.web.cmu.edu/KDDCup/rules_data_format.jsp KDD Rules and Data Format]
 
* [https://pslcdatashop.web.cmu.edu/KDDCup/rules_data_format.jsp KDD Rules and Data Format]
* [http://cran.r-project.org/ R]
+
* [http://cran.r-project.org/ R language]
 
* [http://www.csie.ntu.edu.tw/~cjlin/libsvm/ libsvm]
 
* [http://www.csie.ntu.edu.tw/~cjlin/libsvm/ libsvm]
 
* [http://www.cs.waikato.ac.nz/ml/weka/ Weka]
 
* [http://www.cs.waikato.ac.nz/ml/weka/ Weka]
Line 9: Line 9:
 
* [[Machine Learning/Hadoop | Hadoop]]
 
* [[Machine Learning/Hadoop | Hadoop]]
 
* [http://lucene.apache.org/mahout/ Mahout -- machine learning libraries for Hadoop]
 
* [http://lucene.apache.org/mahout/ Mahout -- machine learning libraries for Hadoop]
 +
* [http://hadoop.apache.org/pig/ Pig language]
 +
* [http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html Pig Latin Manual]
 +
* [http://www.cloudera.com/ Cloudera -- see videos for Hadoop intro]
 +
* [http://github.com/voberoi/hadoop-mrutils Vikram's awesome Hadoop/EC2 scripts]
 +
* [https://www.noisebridge.net/mailman/listinfo/ml Our mailing list]
  
 
==TODOs==
 
==TODOs==
Line 17: Line 22:
 
** put together a [[Machine_Learning/kdd_r | simple R script]] for loading the data
 
** put together a [[Machine_Learning/kdd_r | simple R script]] for loading the data
 
* Andy -- will get Weka working on the data and put together a "how to" guide for doing so
 
* Andy -- will get Weka working on the data and put together a "how to" guide for doing so
* Erin -- will work on data transformations and ways to create better representations of the data; will provide the orthogonalized data sets
+
* Erin -- Will put meeting notes of 5/19 on https://www.noisebridge.net/wiki/Machine_Learning; will work on data transformations and ways to create better representations of the data; will provide the orthogonalized data sets
  
 
* We will need to make sure we don't get disqualified for people belonging to multiple teams! Do not sign up anybody else for the competition without asking first.
 
* We will need to make sure we don't get disqualified for people belonging to multiple teams! Do not sign up anybody else for the competition without asking first.
  
 
== Notes ==
 
== Notes ==
* to zip the file on OSX: use command line, otherwise will complain about __MACOSX file: e.g.:  zip asdf.zip algebra_2008_2009_submission.txt
+
* For KDD submission: to zip the submission file on OSX: use command line, otherwise will complain about __MACOSX file: e.g.:  zip asdf.zip algebra_2008_2009_submission.txt
  
  
Line 28: Line 33:
 
* Add new features by computing their values from existing columns -- e.g. correlation between skills based on their co-occurrence within problems. Could use Decision tree to define boundaries between e.g. new "good student, medium student, bad student" feature
 
* Add new features by computing their values from existing columns -- e.g. correlation between skills based on their co-occurrence within problems. Could use Decision tree to define boundaries between e.g. new "good student, medium student, bad student" feature
 
* Dimensionality reduction -- transform into numerical values appropriate for consumption by SVM
 
* Dimensionality reduction -- transform into numerical values appropriate for consumption by SVM
 +
 +
 +
== Who we are ==
 +
* Andy
 +
* Thomas
 +
* Erin
 +
* Vikram
 +
(insert your name/contact info here)

Revision as of 23:24, 19 May 2010

We're interested in working on the KDD Competition as a way to focus our machine learning exploration -- and maybe even find some interesting aspects of the data! If you're interested, drop us a note or show up at a weekly Machine Learning meeting. We'll use this space to keep track of our ideas.


Resources

TODOs

  • Vikram -- will help set up Hadoop for the rest of us and create a guide for Mahout setup
  • Thomas -- will get libsvm working on the data and put together a "how to" guide for doing so
    • put together a Perl script which will take random samples from the data, for working on smaller instances
    • put together a simple R script for loading the data
  • Andy -- will get Weka working on the data and put together a "how to" guide for doing so
  • Erin -- will post the 5/19 meeting notes at https://www.noisebridge.net/wiki/Machine_Learning; will work on data transformations and ways to create better representations of the data; will provide the orthogonalized data sets
  • We will need to make sure we don't get disqualified for people belonging to multiple teams! Do not sign up anybody else for the competition without asking first.
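
The random-sampling script in Thomas's item could be approached with reservoir sampling, which picks a uniform sample in one pass without loading the whole file. The following is only a sketch, written in Python rather than the Perl the TODO mentions, and it assumes the data is plain text with one header line followed by one record per line. The function name sample_lines is hypothetical.

```python
import random


def sample_lines(lines, k, seed=None):
    """Return (header, sample): the first line, plus k lines drawn uniformly
    at random from the remaining lines (all of them if there are fewer than k).

    Assumption: the first line of the input is a header row.
    """
    rng = random.Random(seed)
    it = iter(lines)
    header = next(it, None)
    reservoir = []
    for i, line in enumerate(it):
        if i < k:
            # Fill the reservoir with the first k data lines.
            reservoir.append(line)
        else:
            # Keep this line with probability k / (i + 1),
            # replacing a uniformly chosen earlier pick.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return header, reservoir
```

The function accepts any iterable of lines, so an open file object can be passed directly; it never holds more than k data lines in memory.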

Notes

  • For the KDD submission: on OS X, zip the submission file from the command line; otherwise the upload will complain about a __MACOSX file. E.g.: zip asdf.zip algebra_2008_2009_submission.txt
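
As an alternative to the zip command, the same archive could be built with Python's standard zipfile module, which writes only the files it is given and so adds no __MACOSX entry. This is a sketch: zip_submission is a hypothetical helper name, and whether the competition site accepts the result has not been checked here.

```python
import os
import zipfile


def zip_submission(txt_path, zip_path):
    """Write txt_path into a new zip archive at zip_path and return the
    list of entry names, so the caller can confirm nothing extra was added."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Store the file under its bare name, without directory components.
        zf.write(txt_path, arcname=os.path.basename(txt_path))
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()
```

For the file named in the note above, the returned list should contain only algebra_2008_2009_submission.txt.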


Ideas

  • Add new features by computing their values from existing columns -- e.g. correlation between skills based on their co-occurrence within problems. Could use a decision tree to define the boundaries of a new feature, e.g. "good student, medium student, bad student"
  • Dimensionality reduction -- transform into numerical values appropriate for consumption by SVM
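
One possible starting point for the skill co-occurrence idea is to count how often each pair of skills appears in the same problem. The sketch below produces raw pair counts only, not a correlation, and it assumes the data has already been reduced to (problem, skill) pairs. The function name skill_cooccurrence and that input shape are assumptions, not part of the KDD data format.

```python
from collections import Counter, defaultdict
from itertools import combinations


def skill_cooccurrence(rows):
    """Count, for every unordered pair of skills, the number of problems
    in which both skills appear.

    rows: iterable of (problem, skill) pairs (assumed input shape).
    Returns a Counter keyed by (skill_a, skill_b) with skill_a < skill_b.
    """
    skills_by_problem = defaultdict(set)
    for problem, skill in rows:
        skills_by_problem[problem].add(skill)

    counts = Counter()
    for skills in skills_by_problem.values():
        # Sorting makes each pair appear in one canonical order.
        for a, b in combinations(sorted(skills), 2):
            counts[(a, b)] += 1
    return counts
```

Dividing each count by the number of problems containing either skill would turn it into a similarity score between 0 and 1, which may be easier to feed into an SVM or a decision tree than raw counts.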


Who we are

  • Andy
  • Thomas
  • Erin
  • Vikram

(insert your name/contact info here)
