Machine Learning/weka
Here are the Weka commands I used for discretization, obfuscation (to reduce the size of the files) and classification of the KDD set. Note: I'm running on a latest-generation MacBook that I've upgraded to 8GB of RAM, which was needed (-Xms4096m -Xmx8192m) during processing, even for the obfuscated files.
Discretize: the class has to be unset temporarily so that the class attribute is treated the same as all other attributes. Not all filters support this, which makes those filters painful to apply; it is a small detail in Weka that makes it much less usable in many cases.
java -Xms4096m -Xmx8192m -cp weka.jar weka.filters.unsupervised.attribute.Discretize -unset-class-temporarily -F -B 10 -i aUnified.csv -o aUnifiedDiscretized.csv
(To see what the command-line options mean, run: java -cp weka.jar weka.filters.unsupervised.attribute.Discretize -h)
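In the command above, -F switches Discretize to equal-frequency binning and -B 10 asks for 10 bins. To make that concrete, here is a rough plain-Java sketch of the idea. It is not Weka's implementation: the class name, the method names and the rule of placing cut points halfway between neighbouring values are all assumptions made for illustration.

```java
import java.util.Arrays;

// Plain-Java illustration of equal-frequency binning; NOT Weka's implementation.
class EqualFrequencyBinning {

    // Returns (bins - 1) cut points so that each bin receives roughly the same
    // number of values. Placing a cut halfway between neighbours is an assumption.
    static double[] cutPoints(double[] values, int bins) {
        if (bins < 2 || values.length < bins) {
            throw new IllegalArgumentException("need at least 2 bins and one value per bin");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] cuts = new double[bins - 1];
        for (int i = 1; i < bins; i++) {
            // Index of the first sorted value that falls into bin i.
            int idx = (int) ((long) i * sorted.length / bins);
            cuts[i - 1] = (sorted[idx - 1] + sorted[idx]) / 2.0;
        }
        return cuts;
    }

    // Maps a value to its bin index (0-based) by counting the cut points below it.
    static int binIndex(double value, double[] cuts) {
        int bin = 0;
        while (bin < cuts.length && value > cuts[bin]) {
            bin++;
        }
        return bin;
    }

    public static void main(String[] args) {
        // Made-up values with one large outlier.
        double[] values = {1, 2, 3, 4, 5, 6, 7, 8, 9, 100};
        double[] cuts = cutPoints(values, 5);
        System.out.println(Arrays.toString(cuts)); // prints [2.5, 4.5, 6.5, 8.5]
        System.out.println(binIndex(100, cuts));   // prints 4
    }
}
```

With equal-width bins the outlier 100 would stretch the range and push almost every other value into the first bin; with equal-frequency bins each of the five bins still gets two values. Duplicate values can make neighbouring cut points coincide, which this sketch does not handle.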
Obfuscating:
java -Xms4096m -Xmx8192m -cp weka.jar weka.filters.unsupervised.attribute.Obfuscate -i atest.arff -o atestObf.arff
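The reason obfuscation shrinks the files is that the Obfuscate filter renames the relation, the attribute names and the nominal values, so long labels repeated on every row are replaced by short ones. The sketch below shows the effect on a single column in plain Java. It does not call Weka; the class name, the "V1, V2, ..." naming scheme and the sample values are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of value renaming; NOT Weka's Obfuscate filter.
class NominalObfuscator {

    // Remembers the short code assigned to each distinct original value.
    private final Map<String, String> codes = new LinkedHashMap<>();

    // Returns the code for a value, assigning the next free one on first sight.
    // The "V<n>" scheme is an assumption made for this sketch.
    String encode(String value) {
        String code = codes.get(value);
        if (code == null) {
            code = "V" + (codes.size() + 1);
            codes.put(value, code);
        }
        return code;
    }

    public static void main(String[] args) {
        NominalObfuscator obfuscator = new NominalObfuscator();
        // Made-up nominal values standing in for one column of a data set.
        String[] column = {"private_network", "public_network", "private_network"};
        int before = 0;
        int after = 0;
        for (String value : column) {
            String code = obfuscator.encode(value);
            before += value.length();
            after += code.length();
            System.out.println(value + " -> " + code);
        }
        System.out.println("characters before: " + before + ", after: " + after);
    }
}
```

The mapping is one-way unless it is saved somewhere, so predictions made on obfuscated files come back with obfuscated labels.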
Classification: this is the command I tried for producing predictions, but I wasn't able to get the labels for the test data out of it:
java -Xms4096m -Xmx8192m -cp weka.jar weka.classifiers.bayes.NaiveBayesUpdateable -t atrainObf.arff -T atestObf.arff -p last > aOut1.txt
So instead I wrote a small Java class to do this. It uses an updateable classifier and loads the file one instance at a time, so the whole data set never has to fit into memory at once.
import java.io.File;
import java.io.FileWriter;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

// The statements below go inside a method declared "throws Exception";
// log is any logger that has an info(String) method.
log.info("Loading data...");
NaiveBayesUpdateable nb;
{
    ArffLoader loader = new ArffLoader();
    loader.setFile(new File("atrain.arff"));
    Instances structure = loader.getStructure();
    // the class is the last attribute
    structure.setClassIndex(structure.numAttributes() - 1);

    // train NaiveBayes incrementally, one instance at a time
    nb = new NaiveBayesUpdateable();
    nb.buildClassifier(structure);
    Instance current;
    while ((current = loader.getNextInstance(structure)) != null) {
        nb.updateClassifier(current);
    }
}

log.info("Now classifying...");
{
    // second argument true = append to an existing predictions file
    FileWriter fw = new FileWriter("aPredictions.txt", true);
    ArffLoader loader = new ArffLoader();
    loader.setFile(new File("atest.arff"));
    Instances structure = loader.getStructure();
    structure.setClassIndex(structure.numAttributes() - 1);

    // classify using NaiveBayes, again one instance at a time
    Instance current;
    while ((current = loader.getNextInstance(structure)) != null) {
        double clsLabel = nb.classifyInstance(current);
        // distribution[1] is the probability of the second class value
        double[] distribution = nb.distributionForInstance(current);
        // here I tried to cap the probability predictions at mean +- one standard
        // deviation of the iq; could instead also just predict the distribution[1] value
        double estimate = Math.max(Math.min(distribution[1], 0.92d), 0.80d); // iq mean: 0.86, standard dev 0.06
        log.info("ClassLabel: " + clsLabel + ", estimate: " + estimate);
        fw.write("" + estimate + "\r\n");
    }
    fw.close();
}
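The capping step in the code above is easy to get backwards, so here it is pulled out on its own. The bounds 0.80 and 0.92 in the original line are 0.86 minus and plus 0.06, which is the mean plus or minus one standard deviation. The class and method names below are made up for this sketch.

```java
// Stand-alone version of the capping step; names are illustrative only.
class EstimateClamp {

    // Restricts a predicted probability to [mean - stdDev, mean + stdDev].
    static double clamp(double probability, double mean, double stdDev) {
        double lower = mean - stdDev;
        double upper = mean + stdDev;
        return Math.max(lower, Math.min(upper, probability));
    }

    public static void main(String[] args) {
        double mean = 0.86;   // value taken from the comment in the original code
        double stdDev = 0.06; // implied by the 0.80 and 0.92 bounds
        System.out.println(clamp(0.99, mean, stdDev)); // capped at the upper bound
        System.out.println(clamp(0.10, mean, stdDev)); // raised to the lower bound
        System.out.println(clamp(0.85, mean, stdDev)); // already inside, left unchanged
    }
}
```

Capping like this trades sharpness for safety: a confidently wrong prediction can never be further from the truth than the bounds allow, but a confidently right one is held back by the same amount.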