KDD Competition 2010

We're interested in working on the KDD Competition as a way to focus our machine learning exploration -- and maybe even find some interesting aspects of the data! If you're interested, drop us a note, show up at a weekly Machine Learning meeting, and we'll use this space to keep track of our ideas.


==Resources==
* [[Machine Learning]]
* [https://pslcdatashop.web.cmu.edu/KDDCup/rules_data_format.jsp KDD Rules and Data Format]
* [http://cran.r-project.org/ R language]
* [http://www.csie.ntu.edu.tw/~cjlin/libsvm/ libsvm]
* [http://www.cs.waikato.ac.nz/ml/weka/ Weka]
* [http://www.kdnuggets.com/datasets/competitions.html List of other competitions in which we could engage]
* [[Machine Learning/Hadoop | Hadoop]]
* [http://lucene.apache.org/mahout/ Mahout -- machine learning libraries for Hadoop]
* [http://www.cloudera.com/videos/introduction_to_pig So-so intro to Pig Video]
* [http://s3.amazonaws.com/awsVideos/AmazonElasticMapReduce/ElasticMapReduce-PigTutorial.html An AWESOME intro to Pig on Elastic Map Reduce!]
* [http://hadoop.apache.org/pig/ Pig language]
* [http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html Pig Latin Manual]
* [http://www.cloudera.com/ Cloudera -- see videos for Hadoop intro]
* [http://github.com/voberoi/hadoop-mrutils Vikram's awesome Hadoop/EC2 scripts]
* [https://www.noisebridge.net/mailman/listinfo/ml Our mailing list]
* [http://www.s3fox.net/ S3Fox]
* [[Machine_Learning/SqliteImport | Importing data into Sqlite]] for SQL'ing the data
* [[Machine_Learning/OmniscopeVisualization | Visualizing Sqlite data in Omniscope]] for understanding the data
* [http://swarmfinancial.com/ec2mapping.zip Chance mapping dataset for Vikram's EC2 presentation]


== Notes ==
* For the KDD submission, zip the submission file from the command line on OS X -- zipping from the Finder adds a __MACOSX entry that the submission site will complain about. E.g.: zip asdf.zip algebra_2008_2009_submission.txt
* We will need to make sure we don't get disqualified for people belonging to multiple teams! Do not sign anybody else up for the competition without asking them first.

== Plan for Next Week ==
Next week, after the Hadoop presentation, we'll show each other how to get the tools working on the data (what you need to download, any data transformations needed, and how to produce submission output) and share any insights on the data gleaned so far.
* Vikram -- will present on [[Machine Learning/Hadoop | Hadoop]] next week!
* Thomas -- will get libsvm working on the data and put together a "how to" guide for doing so
** put together a [[Machine_Learning/kdd_sample | perl script]] which will take random samples from the data, for working on smaller instances (a Python sketch of the same idea appears after the Ideas list below)
** put together a [[Machine_Learning/kdd_r | simple R script]] for loading the data
* Andy -- will get Weka working on the data and put together a "how to" guide for doing so
* Erin -- will work on data transformations and ways to create better representations of the data

== Ideas ==
* Add new features by computing their values from existing columns -- e.g. the correlation between skills, based on their co-occurrence within problems. A decision tree could be used to define the boundaries for a new categorical feature such as "good student / medium student / bad student".
* Dimensionality reduction -- transform the categorical columns into numerical values appropriate for consumption by SVM (see the encoding sketch below)
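Since Thomas's perl sampler lives on its own page, here is a minimal Python sketch of the same idea -- the filename kdd_sample.py, the 1% rate, and keeping the header row are our assumptions here, not a description of his script:

 # kdd_sample.py -- hypothetical stand-in for the perl sampler linked above:
 # keeps the header row and a random ~1% of the remaining rows.
 import random
 import sys
 
 RATE = 0.01  # assumed sampling rate; adjust to taste
 
 with open(sys.argv[1]) as src:
     sys.stdout.write(next(src))  # always keep the header row
     for line in src:
         if random.random() < RATE:
             sys.stdout.write(line)

Run it as, e.g.: python kdd_sample.py algebra_2006_2007_train.txt > train_sample.txt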
 
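To make the dimensionality-reduction idea above concrete, here is an illustrative Python sketch that one-hot encodes categorical columns into the sparse "label index:value" format libsvm consumes -- the column names and rows are made up, not taken from the actual KDD schema:

 # to_libsvm.py -- illustrative only: one-hot encodes two made-up categorical
 # columns and prints libsvm's sparse "label index:value" format.
 feature_ids = {}  # (column, value) -> feature index
 
 def feat(column, value):
     return feature_ids.setdefault((column, value), len(feature_ids) + 1)
 
 rows = [
     # (label, student, skill) -- hypothetical stand-ins for KDD columns
     (1, 'stu42', 'subtraction'),
     (0, 'stu42', 'fractions'),
     (1, 'stu07', 'fractions'),
 ]
 
 for label, student, skill in rows:
     indices = sorted([feat('student', student), feat('skill', skill)])
     print('%d %s' % (label, ' '.join('%d:1' % i for i in indices)))

Each distinct (column, value) pair gets its own feature index, so every categorical column becomes a block of 0/1 features, and libsvm only needs the nonzero ones listed.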
== Who we are ==
* Andy; Machine Learning
* Thomas; Statistics
* Erin; Maths
* Vikram; Hadoop
(insert your name/contact info/expertise here)
 
 
== How to run Weka (quick 'n very dirty tutorial) ==
* Download and install Weka
* Get your KDD data & preprocess your data:
This command takes the first 1000 lines of the given training data set and converts them into a .csv file. Attention: in the last sed command, the whitespace between the first two slashes must be a literal tab character. In the OS X Terminal you type one by pressing CTRL+V and then Tab (copying and pasting the command below won't work, since the tab gets pasted as plain spaces):
 head -n 1000 algebra_2006_2007_train.txt | sed -e 's/[",]/ /g' | sed 's/      /,/g' > algebra_2006_2007_train_1kFormatted.csv
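If typing a literal tab is a hassle, here is a small Python sketch that does the same conversion -- the script name sample_to_csv.py is made up, and it assumes the training file really is tab-separated:

 # sample_to_csv.py -- hypothetical equivalent of the head/sed pipeline above:
 # take the first 1000 rows of the tab-separated training file, scrub quotes
 # and commas out of the fields (as the sed commands do), and write CSV.
 import csv
 
 with open('algebra_2006_2007_train.txt') as src, \
      open('algebra_2006_2007_train_1kFormatted.csv', 'w', newline='') as dst:
     writer = csv.writer(dst)
     for i, line in enumerate(src):
         if i >= 1000:
             break
         fields = line.rstrip('\n').split('\t')
         writer.writerow(f.replace('"', ' ').replace(',', ' ') for f in fields)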
* In Weka's Explorer, remove some unwanted attributes (I leave this up to your judgment) and inspect the dataset.
* Then you can run an ML algorithm over it, e.g. Neural Networks, to predict the student performance.
* The following screencasts show you how to do these steps:
** [http://swarmfinancial.com/screencasts/nb/kddWekaUsage1.swf Screencast1]
** [http://swarmfinancial.com/screencasts/nb/kddWekaUsage2.swf Screencast2]
 
== A more step-by-step Weka example ==
* [[Machine Learning/weka]]
 
== How to run libSVM ==
* See the notes at [[Machine Learning/SVM]]
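Until those notes are filled in, here is a minimal sketch using the Python interface (svmutil) that ships with libsvm. It assumes you have already converted the data into libsvm's sparse format (e.g. with the encoding sketch in the Ideas section) and saved it as train.libsvm, a made-up filename:

 # minimal libsvm run via its bundled Python interface (svmutil);
 # train.libsvm is an assumed filename in libsvm's sparse format.
 from svmutil import svm_read_problem, svm_train, svm_predict
 
 y, x = svm_read_problem('train.libsvm')  # labels, sparse feature dicts
 model = svm_train(y, x, '-t 0 -c 1')     # linear kernel, C = 1
 # sanity check: predict back on the training set
 p_labels, p_acc, p_vals = svm_predict(y, x, model)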
 
== How to run MOA ==
* See the notes at [[Machine Learning/moa]]