
== Converting the Data ==
As with most (if not all) data problems, choosing and formatting the data is the most time-consuming step but also one of the most important.

One way to reduce the data is to take a subset. Thomas' perl script [[Machine_Learning/kdd_sample | sample_training.pl]] samples the training set and test set by choosing a random subset of the students and keeping only the lines that refer to them. To use it, run:
 perl sample_training.pl -numitems 100 ~/kdd/algebra_2008_2009_train.txt
(assuming your data is located in ~/kdd)
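For readers who want to see the idea behind the sampling step, here is a minimal Python sketch of it. It is not Thomas' script. It assumes the data file is tab-separated with a header row and that the student id lives in a column called "Anon Student Id"; both are assumptions, and the function names are made up for illustration. It also loads the whole file into memory, which may not be practical for the full training set.

```python
import csv
import random


def sample_by_student(rows, num_students, student_col="Anon Student Id", seed=0):
    """Keep only the rows belonging to a random subset of students.

    The column name is an assumption about the data file's header.
    """
    rows = list(rows)
    # Sort the distinct ids so the draw is reproducible for a given seed.
    students = sorted({row[student_col] for row in rows})
    rng = random.Random(seed)
    chosen = set(rng.sample(students, min(num_students, len(students))))
    return [row for row in rows if row[student_col] in chosen]


def sample_file(in_path, out_path, num_students, seed=0):
    """Read a tab-separated file with a header row and write the sampled rows."""
    with open(in_path, newline="") as src:
        reader = csv.DictReader(src, delimiter="\t")
        fieldnames = reader.fieldnames
        kept = sample_by_student(reader, num_students, seed=seed)
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(kept)
```

Sampling whole students, rather than individual lines, keeps every line for each chosen student together.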

For SVM, we ultimately need the data in two files: a training file and a test file. Each line in these files has a numeric class followed by several numeric predictors. The general format is as follows:
 &lt;class&gt; 1:&lt;value&gt; 2:&lt;value&gt; 3:&lt;value&gt; ...
with an entry (1:, 2:, 3:,...) for each numeric predictor.  For example,
 0 1:0 2:0 3:0 4:0 5:0 6:1 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:0 18:0
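As a small illustration of the format above, the following Python sketch builds one such line from a class and a list of predictor values. The function name format_svm_line is hypothetical and is not part of any script mentioned on this page.

```python
def format_svm_line(label, values):
    """Return one line: the class, then index:value pairs numbered from 1."""
    pairs = " ".join(f"{i}:{v}" for i, v in enumerate(values, start=1))
    return f"{label} {pairs}"


# Rebuild the example line: class 0, 18 predictors, only the sixth one set.
values = [0] * 18
values[5] = 1
line = format_svm_line(0, values)
print(line)
```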

Thomas created a [[Machine Learning/convert_features.pl | perl script]] that takes a training set and converts it (and the corresponding test set) into the correct format. It uses "correct on first attempt" as the output class and converts student and problem id into a series of binary flag variables: one for each student and one for each problem, indicating whether the line concerns that student or that problem. However, this produces a very large number of predictor variables, even on a stripped-down dataset, so there is almost certainly a better way. If you don't have one, you can download this script and run
 perl convert_features.pl ~/kdd/algebra_2008_2009_train.txt_sample_100_random_students.csv

Assuming your data files are in ~/kdd, this will generate the output files ~/kdd/algebra_2008_2009_train.txt_sample_100_random_students.csv_converted.txt and ~/kdd/algebra_2008_2009_train.txt_sample_100_random_students.csv_converted.t in the appropriate format.
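The binary-flag conversion described above can be sketched in Python as follows. This is only an illustration of the idea, not the contents of convert_features.pl. The column names "Anon Student Id", "Problem Name" and "Correct First Attempt" are assumptions about the data file, and build_index and encode_row are hypothetical helper names.

```python
def build_index(rows, student_col="Anon Student Id", problem_col="Problem Name"):
    """Assign a 1-based predictor index to every distinct student and problem."""
    index = {}
    for row in rows:
        for key in (("student", row[student_col]), ("problem", row[problem_col])):
            if key not in index:
                index[key] = len(index) + 1
    return index


def encode_row(row, index, student_col="Anon Student Id",
               problem_col="Problem Name", label_col="Correct First Attempt"):
    """Turn one row into '<class> 1:<value> 2:<value> ...' with binary flags."""
    flags = [0] * len(index)
    for key in (("student", row[student_col]), ("problem", row[problem_col])):
        pos = index.get(key)
        if pos is not None:  # ids never seen in the training set get no flag
            flags[pos - 1] = 1
    pairs = " ".join(f"{i}:{v}" for i, v in enumerate(flags, start=1))
    return f"{row[label_col]} {pairs}"


# Tiny made-up example: two students, one problem, so three predictors.
train = [
    {"Anon Student Id": "s1", "Problem Name": "p1", "Correct First Attempt": "1"},
    {"Anon Student Id": "s2", "Problem Name": "p1", "Correct First Attempt": "0"},
]
index = build_index(train)
encoded = [encode_row(row, index) for row in train]
```

The index is built from the training set only and then reused for the test set, so both files share the same predictor numbering. The number of predictors equals the number of distinct students plus the number of distinct problems, which is why it grows so quickly.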