Machine Learning Meetup Notes: 2010-07-07

From Noisebridge
  • Col-1: patient ID
  • Col-2: responder status ("1" for patients who improved and "0" otherwise)
  • Col-3: Protease nucleotide sequence (if available)
  • Col-4: Reverse Transcriptase nucleotide sequence (if available)
  • Col-5: viral load at the beginning of therapy (log-10 units)
  • Col-6: CD4 count at the beginning of therapy

Helpful WEKA Videos

example dataset: molecular weight and length of "PR Sequence" and "RT Sequence", computed from the training data

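  1. start WEKA
  2. open mweight.csv
  3. remove the patient attribute
  4. select the resp attribute
  5. choose Filter->unsupervised->attribute->NumericToNominal
  6. click the filter and change attributeIndices from "first-last" to "first" only, so that only resp is converted
  7. click Apply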
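
For anyone working outside the WEKA GUI, the same preparation can be sketched in plain Python. This is only an illustration: the notes name the "patient" and "resp" columns, but the other column names and every value in SAMPLE are made up, and the real mweight.csv may be laid out differently.

```python
import csv
import io

# Hypothetical stand-in for mweight.csv. Only "patient" and "resp" are named
# in the notes; the other column names and all values are invented.
SAMPLE = """patient,resp,pr_weight,pr_length
1,0,91234.5,297
2,1,90876.1,297
3,0,91002.7,294
"""

def load_for_classification(handle):
    """Drop the patient ID and turn the numeric resp column into a label."""
    rows = []
    for record in csv.DictReader(handle):
        record.pop("patient")                   # an ID should not be used as a feature
        label = str(int(record.pop("resp")))    # "0"/"1" as a category, not a number
        features = {name: float(value) for name, value in record.items()}
        rows.append((features, label))
    return rows

rows = load_for_classification(io.StringIO(SAMPLE))
```

Storing resp as a string plays the role of NumericToNominal: the classifier then sees two categories rather than a number to regress on.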

neural network: Classify->functions->MultilayerPerceptron

  1. select resp as the class attribute
  2. click Start
  • 738 correct predictions of a=0 (no improvement)
  • 66 correct predictions of b=1 (improvement)
  • 56 cases of no improvement classified as improvement
  • 140 cases of improvement classified as no improvement

how well did it do? 80.4% accuracy: (738 + 66) correct out of 1000

  • rows of the confusion matrix tell you what really happened
  • columns tell you what was predicted
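
The accuracy figure can be recomputed directly from the four counts above. The sketch below also derives two numbers the notes do not state, recall for the improvement class and the accuracy of always guessing the majority class, purely as arithmetic on the same matrix.

```python
# Confusion matrix from the notes: rows are the actual class,
# columns are the predicted class (index 0 = no improvement, 1 = improvement).
confusion = [
    [738, 56],   # actual 0: 738 predicted 0, 56 predicted 1
    [140, 66],   # actual 1: 140 predicted 0, 66 predicted 1
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Share of patients who really improved that the model found (recall for class 1).
recall_improved = confusion[1][1] / sum(confusion[1])

# Accuracy of always guessing the most common actual class.
baseline = max(sum(row) for row in confusion) / total

print(f"accuracy={accuracy:.3f} recall(1)={recall_improved:.3f} baseline={baseline:.3f}")
# prints: accuracy=0.804 recall(1)=0.320 baseline=0.794
```

Since 794 of the 1000 patients did not improve, always predicting "no improvement" would already score 79.4%, so the 80.4% is only a small gain over that baseline.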

clustering: Cluster->SimpleKMeans

  1. change numClusters to 5
  2. click OK, then Start
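
As a rough picture of what a k-means run does, here is a minimal sketch of the standard Lloyd iteration in plain Python. It is not WEKA's SimpleKMeans implementation; the function name, the random initialisation and the `seed` parameter are assumptions made for the example.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (centroids, cluster index of each point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)       # k of the input points as starting centroids
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:    # converged: no point changed cluster
            break
        assignment = new_assignment
        # Move each centroid to the mean of its members; leave it alone if it has none.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Two well-separated groups of 2-D points, clustered with k=2.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
```

With the meetup settings the call would use k=5 on the numeric feature columns.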

scipy: in scipy.cluster.hierarchy the main function is called linkage. ldist takes the Levenshtein distance between each pair of items in the set; the result is a distance matrix, which is the input to hierarchical clustering.
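
The notes do not show ldist itself, so the following is a stand-in sketch of the same idea in plain Python: a textbook dynamic-programming Levenshtein distance, plus a helper that fills a symmetric distance matrix. The names `levenshtein` and `distance_matrix` are hypothetical, and the three short strings are toy data, not sequences from the dataset.

```python
def levenshtein(a, b):
    """Edit distance: fewest insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                    # delete ch_a
                current[j - 1] + 1,                 # insert ch_b
                previous[j - 1] + (ch_a != ch_b),   # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def distance_matrix(items):
    """Symmetric matrix of pairwise Levenshtein distances."""
    n = len(items)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = levenshtein(items[i], items[j])
    return matrix

matrix = distance_matrix(["ACGT", "ACGA", "TTTT"])
```

To hand a square matrix like this to scipy.cluster.hierarchy.linkage, it first has to be converted to condensed form, for example with scipy.spatial.distance.squareform.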

single linkage clustering: start with n clusters (one per point), take the two that have the shortest distance between them and merge them into one cluster. then keep going until you have 1 cluster.

  • when you join two points into a cluster, the distance from that cluster to any other point is the smaller of the two members' distances to that point

complete linkage: the distance between two clusters is the largest distance between their members instead

  • there is also a variant that takes the average (average linkage)
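
The three rules differ only in how the distances between members of two clusters are combined. The sketch below is a deliberately naive illustration of that, not scipy's implementation; the function name `agglomerate` and its return format are made up for the example.

```python
def agglomerate(dist, linkage="single"):
    """Naive agglomerative clustering over a square distance matrix.

    Returns the merges in order as (members_a, members_b, distance).
    """
    combine = {
        "single": min,                              # smallest pairwise distance
        "complete": max,                            # largest pairwise distance
        "average": lambda ds: sum(ds) / len(ds),    # mean pairwise distance
    }[linkage]

    clusters = [(i,) for i in range(len(dist))]     # start with n singleton clusters
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                pairwise = [dist[i][j] for i in clusters[x] for j in clusters[y]]
                d = combine(pairwise)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges

# Four points on a line at positions 0, 1, 3 and 7.
positions = [0, 1, 3, 7]
dist = [[abs(a - b) for b in positions] for a in positions]
single = agglomerate(dist, "single")
complete = agglomerate(dist, "complete")
```

On this toy input the merge distances are 1, 2, 4 for single linkage and 1, 3, 7 for complete linkage. In scipy the same choice is made through the `method` argument of linkage ("single", "complete" or "average").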