
= Hands-on Machine Learning =

The goal of this presentation is to give a very applied, hands-on introduction to a range of Machine Learning techniques.  There will be a quick discussion of what kinds of data you can have, what kinds of tasks machine learning techniques allow you to perform, and, finally, a survey of common techniques.  At the end, if you know what kinds of data you have and what your goal is, you should be able to narrow things down to a short list of techniques that could be appropriate for your task.

Ideally, each technique will include a hands-on section for just that technique.  This should cover any tools that implement the algorithm, how to get your data into those tools, and how to extract the model from those tools and incorporate it into your own code.


== The General Machine Learning Process ==

In general, applying machine learning techniques goes something like this:

* Collect your data
* Import that data into some training tool
* Train a "model" on that data
* Tweak your data, tweak model parameters, etc., and repeat training
* Eventually you get an output model
* Take that model and integrate it with your codebase



== Input Data ==

There are basically just two sorts of input data: nominal and numeric.

Nominal values are things like "Red", "Orange", "Poltergeist", etc.  They're a closed set of discrete values, typically strings instead of numbers.  However, you can use numbers as nominal values, so long as they're from a closed set.  Nominal values usually come from human judgment: there's not necessarily a good line between red and orange, but a human makes the call and records the value.  Alternatively, a human decides where the arbitrary line is, and writes a bit of code that makes the decision based on that line.

Numeric values are exactly what they sound like: numbers!  They can take many, many values, and are generally based on direct measurements.  For example, the weight of a fruit a robot is holding, or the number of occurrences of a given word in a document.
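To make the distinction concrete, here is a small sketch of the "human decides where the arbitrary line is" case: a bit of code that turns a numeric measurement into a nominal value.  The attribute names, the set of colors, and the hue cut-offs are all made up for illustration.

```python
# Hypothetical closed set of nominal values for a "color" attribute.
COLORS = ("red", "orange", "yellow")


def color_from_hue(hue_degrees: float) -> str:
    """Turn a numeric hue (in degrees) into a nominal color.

    The cut-off points are arbitrary, illustrative choices.
    """
    if hue_degrees < 20:
        return "red"
    if hue_degrees < 45:
        return "orange"
    return "yellow"


# One record mixing both kinds of data: a nominal color and a numeric weight.
fruit = {"color": color_from_hue(30.0), "weight_grams": 131.5}

# A nominal value must always come from the closed set.
assert fruit["color"] in COLORS
```

Once the line has been drawn in code like this, the color is recorded as a nominal value and the original measurement is no longer needed downstream.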


== Machine Learning Tasks ==

Generally speaking, you can ask a Machine Learning algorithm to do one of three things:

=== Numeric prediction ===

Description: "Given the input you've seen in the past, and this set of current values, what numeric value should I expect?"

A trivial example: if you have a dataset that's pairs of (Yesterday's high temperature, today's high temperature), you could train a numeric predictor that would give you an estimate of today's high temperature given yesterday's.  Or, given the highs for the last week, it could predict the highs for the next 3 days.

=== Labelling/Classification ===

Description: "Given the input you've seen in the past, and this set of current values, what would you label this?"

The best-known example of this is the spam filter.  After labelling a bunch of email as spam or not-spam, you train a classifier.  Then, you can use that classifier on new email to decide whether the machine believes it to be spam or not.  Of course, this generalizes, and you can use exactly the same technique to separate personal, work, and hobby email.

=== Clustering ===

Description: "Given the input you've seen in the past, and this set of current values, what previous inputs is it most like?"

This task is very similar to classification, with one big difference: you don't have labels.  For example, if you have a bunch of measurements of flowers, you can use clustering to discover if there are underlying patterns you've missed out on, perhaps representing growing conditions or a difference in (sub)species.


== Specific Techniques ==

=== Decision Trees ===

Task: Labelling

Input data types: nominal (or numeric, with conditionals)

Description: A decision tree is something like a flow chart.  It's a tree of decision boxes; you start at the root and, based on your data, follow decisions down to leaf nodes.  At the leaf nodes, you'll typically have a label.

Training: Get your data into Weka Explorer by hook or crook, then go to the Classify tab and choose Classifier -> trees -> J48.  Select the nominal value you want to use as your label in the dropdown.  Make sure you've got cross-validation selected, ideally with 10 folds or so.

Hit "Start" and stand back.  You'll get output like:

<pre>
(using the iris.arff sample data)

=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     iris
Instances:    150
Attributes:   5
              sepallength
              sepalwidth
              petallength
              petalwidth
              class
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves  : 	5

Size of the tree : 	9


Time taken to build model: 0.03 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         144               96      %
Incorrectly Classified Instances         6                4      %
Kappa statistic                          0.94  
Mean absolute error                      0.035 
Root mean squared error                  0.1586
Relative absolute error                  7.8705 %
Root relative squared error             33.6353 %
Total Number of Instances              150     

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   Class
  0.98      0          1         0.98      0.99     Iris-setosa
  0.94      0.03       0.94      0.94      0.94     Iris-versicolor
  0.96      0.03       0.941     0.96      0.95     Iris-virginica

=== Confusion Matrix ===

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica
</pre>


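Evaluation: If you train as above, you're using 10-fold cross-validation: the data is split into 10 parts, and each part in turn is held out for testing while a model is trained on the other 9.  This gives a reasonably good estimate of how the tree will do on data it hasn't seen.  Otherwise, the usual measures for labelling algorithms (accuracy, precision, recall, and the confusion matrix, all of which appear in the output above) can be computed on a separate test set.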
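The summary numbers in the Weka output can be recomputed by hand from the confusion matrix, which is a handy check that you're reading it the right way round.  The sketch below copies the matrix from the output above; the variable and function names are just illustrative.

```python
# Confusion matrix copied from the Weka output above:
# rows are the true class, columns are the class the tree predicted.
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
confusion = [
    [49, 1, 0],
    [0, 47, 3],
    [0, 2, 48],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total


def precision(i):
    """Of everything predicted as class i, the fraction that really is class i."""
    predicted_as_i = sum(row[i] for row in confusion)
    return confusion[i][i] / predicted_as_i


def recall(i):
    """Of everything that really is class i, the fraction predicted as class i."""
    return confusion[i][i] / sum(confusion[i])


for i, label in enumerate(labels):
    print(f"{label}: precision={precision(i):.3f} recall={recall(i):.3f}")
print(f"accuracy={accuracy:.2f}")
```

The diagonal holds the correctly classified instances, so accuracy is the diagonal sum over the total; recall is what Weka also reports as "TP Rate".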

Application: To apply the values you've got out of the above, you want to turn this section
<pre>

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
</pre>

into code in whatever language you use.  This is, unfortunately, a manual process.  However, the result is very compact and needs nothing beyond a handful of comparisons, so you can run these decision trees on any computer, no matter how big or small.
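As an example of that manual translation, here is one way the iris tree above might look in Python.  Each level of indentation in the Weka output becomes a nested conditional; the function and parameter names are made up, and the instance counts in parentheses are dropped since they aren't needed to classify.

```python
def classify_iris(petal_length: float, petal_width: float) -> str:
    """Hand translation of the J48 pruned tree above into conditionals."""
    if petal_width <= 0.6:
        return "Iris-setosa"
    # From here on, petal_width > 0.6.
    if petal_width <= 1.7:
        if petal_length <= 4.9:
            return "Iris-versicolor"
        # petal_length > 4.9
        if petal_width <= 1.5:
            return "Iris-virginica"
        return "Iris-versicolor"
    # petal_width > 1.7
    return "Iris-virginica"
```

Note that the tree never looks at sepallength or sepalwidth, so the code doesn't need them as inputs at all.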

=== Naive Bayes Classifier ===

Task: Labelling

Input data types: nominal

Description: Naive Bayes is a statistical technique for predicting the probability of all labels given a set of inputs.  For instance, let's assume we've trained a naive Bayes system on (color, kind of fruit) pairs.  Then, we can ask it for the probability distribution of "kind of fruit" given the color "yellow."  This will tell us that it's almost certainly a banana or lemon, but it could be an apple, and might occasionally be an orange, etc.  That is, it returns a list of labels with an associated probability.
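A minimal sketch of the fruit example, in plain Python.  The training pairs are invented for illustration, the function names are hypothetical, and the add-one (Laplace) smoothing is an assumption added here so that a color never seen with a fruit doesn't get a probability of exactly zero.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows):
    """rows: list of (attributes, label), where attributes is a tuple of nominal values."""
    label_counts = Counter(label for _, label in rows)
    # value_counts[(position, label)][value] = how often that value was seen with that label
    value_counts = defaultdict(Counter)
    values_seen = defaultdict(set)
    for attributes, label in rows:
        for position, value in enumerate(attributes):
            value_counts[(position, label)][value] += 1
            values_seen[position].add(value)
    return label_counts, value_counts, values_seen


def predict_distribution(model, attributes):
    """Return {label: probability} for the given attribute values."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total  # prior: how common this label is overall
        for position, value in enumerate(attributes):
            seen = value_counts[(position, label)][value]
            # Add-one smoothing (an assumption of this sketch).
            score *= (seen + 1) / (count + len(values_seen[position]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


# Made-up (color, kind of fruit) training pairs.
rows = (
    [(("yellow",), "banana")] * 4
    + [(("yellow",), "lemon")] * 3
    + [(("yellow",), "apple")] * 1
    + [(("red",), "apple")] * 3
    + [(("orange",), "orange")] * 4
)

model = train_naive_bayes(rows)
distribution = predict_distribution(model, ("yellow",))
for label, probability in sorted(distribution.items(), key=lambda item: -item[1]):
    print(f"{label}: {probability:.3f}")
```

With more than one attribute per row, the "naive" part is that each attribute's probability is multiplied in as if the attributes were independent given the label.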

Training:

Evaluation:

Application:
 


=== Support Vector Machines ===

Task: Labelling

Input data types: numeric

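Description: Support Vector Machines work by finding a line (or, with more than two input values, a plane or hyperplane) that separates data points with different labels, leaving as wide a margin as possible on either side.  Their input values are labelled points in a high-dimensional space.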
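To give a feel for the idea, here is a toy linear SVM in plain Python, trained by simple subgradient descent on the hinge loss.  This is only a sketch: real SVM tools use more sophisticated solvers and also support non-linear kernels, and the data, function names, and parameter values below are all made up.

```python
def train_linear_svm(points, labels, epochs=200, learning_rate=0.01, reg=0.01):
    """Find weights w and bias b for a separating line w.x + b = 0.

    labels must be +1 or -1.  Uses subgradient descent on the
    regularized hinge loss (a simplification chosen for this sketch).
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization: shrink w a little on every step (favors a wide margin).
            w = [wi - learning_rate * reg * wi for wi in w]
            if margin < 1:
                # Point is misclassified or inside the margin: move the line away from it.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b


def predict(w, b, x):
    """Label a point by which side of the line it falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Made-up, linearly separable 2-D points.
points = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(points, labels)
print("weights:", w, "bias:", b)
print("prediction for (3, 3):", predict(w, b, (3.0, 3.0)))
```

The trained model is just the weights and the bias, so applying it elsewhere only requires the few lines in predict.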

Training: Two commonly used tools are libsvm and svmlight.

Evaluation:

Application:
 


=== Polynomial Regression ===

Task: Numeric Prediction

Input data types: numeric

Description: Polynomial regression fits a polynomial curve through your data points.  This arguably isn't machine learning so much as a classical statistical inference technique, but it's often a good technique to try as a baseline.
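As a rough illustration, a least-squares polynomial fit can be done in plain Python by solving the normal equations.  This is a sketch under the assumption that the degree is small and the data is well behaved; the function names and the data points are made up, and a numerical library would normally be a better choice for real work.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares fit; returns [c0, c1, ...] for c0 + c1*x + c2*x**2 + ..."""
    size = degree + 1
    # Normal equations (A^T A) c = A^T y, where A[i][j] = xs[i] ** j.
    ata = [[sum(x ** (r + c) for x in xs) for c in range(size)] for r in range(size)]
    aty = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(size)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    m = [row + [rhs] for row, rhs in zip(ata, aty)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for k in range(col, size + 1):
                m[r][k] -= factor * m[col][k]
    # Back substitution.
    coeffs = [0.0] * size
    for r in range(size - 1, -1, -1):
        known = sum(m[r][k] * coeffs[k] for k in range(r + 1, size))
        coeffs[r] = (m[r][size] - known) / m[r][r]
    return coeffs


def evaluate(coeffs, x):
    """Predict y for a new x using the fitted coefficients."""
    return sum(c * x ** power for power, c in enumerate(coeffs))


# Made-up data that lies exactly on y = 1 + 2x + 3x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 6.0, 17.0, 34.0, 57.0]

coeffs = fit_polynomial(xs, ys, degree=2)
print("coefficients:", coeffs)
print("prediction at x=5:", evaluate(coeffs, 5.0))
```

The "model" you take away is just the list of coefficients, which makes this about as easy to integrate into your own code as it gets.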

Training:

Evaluation:

Application:
 


=== Neural Networks ===

Task: Numeric Prediction

Input data types: numeric

Description: A neural network allows you to predict a number of continuous numeric values based on other continuous values.  "A Neural Network is the second best way to solve any problem."

Training:

Evaluation:

Application:

=== k-Means Clustering ===

Task: Clustering

Input data types: numeric or nominal

Description: k-Means clustering allows you to take a set of feature vectors and split it into k groups, so that each vector ends up in the group whose members it's most similar to.  In a fruit-market universe, this will cluster all the "round, red, dense" things together, separate from the "orange, round, dense" things.
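A bare-bones sketch of the usual k-means loop for numeric feature vectors: assign every point to its nearest center, move each center to the mean of its points, and repeat.  Using the first k points as the starting centers is a simplification made here (real tools typically pick starting centers more carefully), and the data and function names are invented.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, iterations=10):
    """Return (centers, assignment); assignment[i] is the cluster index of points[i]."""
    # Simplification: the first k points serve as the initial centers.
    centers = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        for i, p in enumerate(points):
            assignment[i] = min(range(k), key=lambda c: squared_distance(p, centers[c]))
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(values) / len(members) for values in zip(*members)]
    return centers, assignment


# Made-up 2-D measurements forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]

centers, assignment = k_means(points, k=2)
print("centers:", centers)
print("assignment:", assignment)
```

To place a new feature vector afterwards, you only need the centers: it belongs to whichever center it is closest to.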

Training: 

Evaluation:

Application:

=== Technique ===

Task:

Input data types:

Description: 

Training:

Evaluation:

Application: