Machine Learning Meetup Notes: 2010-07-07

From Noisebridge
  • Col-1: patient ID
  • Col-2: responder status ("1" for patients who improved and "0" otherwise)
  • Col-3: Protease nucleotide sequence (if available)
  • Col-4: Reverse Transcriptase nucleotide sequence (if available)
  • Col-5: viral load at the beginning of therapy (log-10 units)
  • Col-6: CD4 count at the beginning of therapy

Helpful WEKA Videos http://sentimentmining.net/weka/

Features used: molecular weight and length of "PR Sequence" and "RT Sequence" from the training data

  1. start weka
  2. open mweight.csv
  3. remove patient
  4. select resp
  5. filter->unsupervised->attribute->NumericToNominal
  6. click the filter name and change attributeIndices to first only
  7. apply

neural network: classify->functions->MultilayerPerceptron

  1. resp
  2. start
  • 738 correct predictions a=0 no improvement
  • 66 correct predictions b=1 improvement
  • 56 no improvement classified as improvement
  • 140 improvement classified as no improvement

how well did it do? 80.4% accuracy: (738 + 66) correct out of 1000 instances

  • rows tell you what really happened
  • columns tell you what was predicted
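
The accuracy figure can be re-derived from the four counts above. This is a minimal sketch, not WEKA output; the `accuracy` and `recall` helper names are made up for illustration, and the matrix layout follows the notes (rows are actual, columns are predicted).

```python
# Confusion matrix from the notes: rows = actual class, columns = predicted class.
# Class order is [0 (no improvement), 1 (improvement)].
confusion = [
    [738, 56],   # actual 0: 738 predicted as 0, 56 predicted as 1
    [140, 66],   # actual 1: 140 predicted as 0, 66 predicted as 1
]


def accuracy(matrix):
    """Fraction of instances on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall(matrix, cls):
    """Fraction of actual `cls` instances that were predicted as `cls`."""
    return matrix[cls][cls] / sum(matrix[cls])


print(f"accuracy: {accuracy(confusion):.1%}")
print(f"recall for class 1: {recall(confusion, 1):.1%}")
```

The per-class number is worth a look: only 66 of the 206 patients who actually improved were predicted as improved, so most of the 80.4% comes from the majority class.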

cluster->SimpleKMeans

  1. change numClusters to 5
  2. ok->start
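
For reference, the idea behind k-means can be sketched in a few lines of plain Python. This is an illustration of the general algorithm (Lloyd's iteration), not WEKA's implementation; the `kmeans` function name, the toy points, and the choice to seed with the first k points are all assumptions made for the example.

```python
import math


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    # Assumption: seed centroids with the first k points so the sketch is
    # deterministic; WEKA's SimpleKMeans may choose its seeds differently.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Made-up 2-D points, two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```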

scipy.cluster.hierarchy: the main function is called linkage. ldist takes the Levenshtein distance between each pair of elements in the set; the result is a distance matrix, which is the input to hierarchical clustering.
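
A rough sketch of the distance step in plain Python. The `levenshtein` function here is a stand-in written for this example (the `ldist` used at the meetup is not shown in these notes), `condensed_distances` is a hypothetical helper name, and the sequences are made up.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # delete ca
                current[j - 1] + 1,             # insert cb
                previous[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def condensed_distances(items):
    """Pairwise distances in the order (0,1), (0,2), ..., (n-2,n-1)."""
    n = len(items)
    return [levenshtein(items[i], items[j])
            for i in range(n) for j in range(i + 1, n)]


sequences = ["ACGT", "ACGA", "TTGT", "ACG"]  # made-up example strings
condensed = condensed_distances(sequences)
print(condensed)
```

A flat list in this pair order is the "condensed" form that `scipy.cluster.hierarchy.linkage` accepts as its first argument, e.g. `linkage(condensed, method="single")`.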

single linkage clustering: start with n clusters (one per point), take the two clusters with the shortest distance between them and merge them into one cluster. then keep going until you have 1 cluster.

  • when you join two points into a cluster, the cluster's distance to any other point is the smaller of the two members' distances to that point

complete linkage: you take the largest distance instead

  • there is also one that takes the average (average linkage)
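
The three linkage rules differ only in how the distance between two clusters is computed from the point-to-point distances. A naive sketch, assuming a hypothetical `agglomerate` helper and made-up points on a line; it is far slower than scipy's `linkage` and is only meant to show the merge loop.

```python
def agglomerate(dist, n, method="single"):
    """Naive agglomerative clustering over points 0..n-1.

    dist(i, j) returns the distance between original points i and j.
    Returns the merges in order as (members_a, members_b, distance).
    """
    combine = {
        "single": min,                          # closest pair of members
        "complete": max,                        # farthest pair of members
        "average": lambda d: sum(d) / len(d),   # mean over all member pairs
    }[method]

    def cluster_distance(a, b):
        return combine([dist(i, j) for i in a for j in b])

    clusters = [(i,) for i in range(n)]  # start with one cluster per point
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage rule.
        a, b = min(
            ((x, y) for idx, x in enumerate(clusters) for y in clusters[idx + 1:]),
            key=lambda pair: cluster_distance(*pair),
        )
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


positions = [0, 1, 3, 7]  # made-up 1-D points


def point_distance(i, j):
    return abs(positions[i] - positions[j])


for method in ("single", "complete", "average"):
    heights = [d for _, _, d in agglomerate(point_distance, len(positions), method)]
    print(method, heights)
```

The order in which points join is the same for these toy points under all three rules, but the merge distances differ, which is what changes where you would cut the tree.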