Machine Learning/weka

From Noisebridge

Here are the weka commands I used for discretization, obfuscation (to reduce the size of the files) and classification of the KDD set. Note: I'm running on a latest-gen MacBook that I've upgraded to 8GB of RAM, all of which was needed (-Xms4096m -Xmx8192m) during processing, even for the obfuscated files.

Discretize: the class has to be unset temporarily so that the class attribute is treated the same as all the other attributes. Not all filters support this option, and the ones that don't are a lot of pain to apply. It's a small detail, but it makes weka much less usable in many cases.

java -Xms4096m -Xmx8192m -cp weka.jar weka.filters.unsupervised.attribute.Discretize -unset-class-temporarily -F -B 10 -i aUnified.csv -o aUnifiedDiscretized.csv

(To see the command-line options, run java -cp weka.jar weka.filters.unsupervised.attribute.Discretize -h)
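
To make the -F -B 10 options in the Discretize command above more concrete: -B sets the number of bins and -F asks for equal-frequency rather than equal-width binning. The following is only a rough sketch of the idea in plain Java, not weka's actual implementation; the class EqualFrequencyBinner and its methods are made up for illustration, and it assumes a non-empty numeric column.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of weka): equal-frequency binning of one numeric column. */
final class EqualFrequencyBinner {

    /** Returns bins-1 cut points so that each bin holds roughly the same number of values. */
    static double[] cutPoints(double[] values, int bins) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] cuts = new double[bins - 1];
        for (int i = 1; i < bins; i++) {
            // index of the first sorted value that falls into bin i
            int idx = (int) ((long) i * sorted.length / bins);
            cuts[i - 1] = sorted[idx];
        }
        return cuts;
    }

    /** Maps a value to its bin index, from 0 to cuts.length inclusive. */
    static int binOf(double value, double[] cuts) {
        int bin = 0;
        while (bin < cuts.length && value >= cuts[bin]) {
            bin++;
        }
        return bin;
    }

    public static void main(String[] args) {
        // One outlier (100): equal-width bins would put almost everything in the first bin.
        double[] column = {1, 2, 3, 4, 5, 6, 7, 8, 9, 100};
        double[] cuts = cutPoints(column, 5);
        System.out.println(Arrays.toString(cuts)); // [3.0, 5.0, 7.0, 9.0]
        System.out.println(binOf(100, cuts));      // 4
    }
}
```

With equal-frequency cut points each of the five bins ends up with two of the ten values, which is why this mode copes better with skewed attributes than equal-width binning.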

Obfuscating:

java -Xms4096m -Xmx8192m -cp weka.jar weka.filters.unsupervised.attribute.Obfuscate -i atest.arff -o atestObf.arff

Classification: this is the command I tried for producing predictions, but I wasn't able to get the labels for the test data.

java -Xms4096m -Xmx8192m -cp weka.jar weka.classifiers.bayes.NaiveBayesUpdateable -t atrainObf.arff -T atestObf.arff -p last > aOut1.txt

So instead I wrote a small Java class to do this. It uses an updateable classifier, which loads the file one line (instance) at a time, so the data fits into memory.


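import java.io.File;
import java.io.FileWriter;
import java.util.logging.Logger;

import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

// The class name and the java.util.logging logger are placeholders;
// the original snippet only showed the body of the method.
public class KddNaiveBayes {
    private static final Logger log = Logger.getLogger(KddNaiveBayes.class.getName());

    public static void main(String[] args) throws Exception {
        log.info("Loading data...");
        NaiveBayesUpdateable nb;
        {
            ArffLoader loader = new ArffLoader();
            loader.setFile(new File("atrain.arff"));
            // read only the header; the class attribute is the last one
            Instances structure = loader.getStructure();
            structure.setClassIndex(structure.numAttributes() - 1);

            // train NaiveBayes incrementally, one instance at a time
            nb = new NaiveBayesUpdateable();
            nb.buildClassifier(structure);
            Instance current;
            while ((current = loader.getNextInstance(structure)) != null) {
                nb.updateClassifier(current);
            }
        }

        log.info("Now classifying...");
        // the predictions file is opened in append mode, as in the original
        try (FileWriter fw = new FileWriter("aPredictions.txt", true)) {
            ArffLoader loader = new ArffLoader();
            loader.setFile(new File("atest.arff"));
            Instances structure = loader.getStructure();
            structure.setClassIndex(structure.numAttributes() - 1);

            // classify using NaiveBayes
            Instance current;
            while ((current = loader.getNextInstance(structure)) != null) {
                double clsLabel = nb.classifyInstance(current);
                double[] distribution = nb.distributionForInstance(current);
                // here I tried to cap the probability predictions at mean +- one
                // standard deviation of the iq; could instead also just predict
                // the distribution[1] value
                // iq mean: 0.86, standard dev 0.06 (0.86 +- 0.06 gives the bounds below)
                double estimate = Math.max(Math.min(distribution[1], 0.92d), 0.80d);
                log.info("ClassLabel: " + clsLabel + ", estimate: " + estimate);
                fw.write("" + estimate + "\r\n");
            }
        }
    }
}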
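
The cap applied to distribution[1] in the code above is a plain clamp to the interval [0.80, 0.92]. A minimal standalone sketch of just that step, so it can be tried without weka; the class ProbabilityCap and the method clamp are hypothetical names, and the bounds are the ones used above.

```java
/** Hypothetical helper (not part of weka): clamps a predicted probability into [lower, upper]. */
final class ProbabilityCap {

    static double clamp(double p, double lower, double upper) {
        // same expression as in the classification loop, with the bounds as parameters
        return Math.max(Math.min(p, upper), lower);
    }

    public static void main(String[] args) {
        System.out.println(clamp(0.50, 0.80, 0.92)); // 0.8  (raised to the lower bound)
        System.out.println(clamp(0.85, 0.80, 0.92)); // 0.85 (already inside the interval)
        System.out.println(clamp(0.99, 0.80, 0.92)); // 0.92 (lowered to the upper bound)
    }
}
```

Any prediction outside the interval is replaced by the nearest bound, so every written estimate ends up between 0.80 and 0.92.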