Machine Learning Meetup Notes: 2010-06-30

From Noisebridge

Mike's bio overview:

amino acids build proteins

20 amino acids

  • a protein is a sequence of amino acids (three bases encode one amino acid)
  • DNA is made up of 4 bases: A, T, C, G
  • RNA is made up of 4 bases: A, U, C, G
  • A pairs with T (with U in RNA)
  • C pairs with G
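The pairing rules above can be written down as a small lookup table. This is only a sketch; the names `DNA_PAIRS`, `complement`, and `dna_to_rna` are made up for illustration and are not from the meetup.

```python
# Base pairing from the notes: A with T, C with G.
DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(dna):
    """Return the base-by-base complement of a DNA string."""
    return "".join(DNA_PAIRS[base] for base in dna.upper())


def dna_to_rna(dna):
    """Rewrite a DNA string in the RNA alphabet (T becomes U)."""
    return dna.upper().replace("T", "U")
```

For example, `complement("ATCG")` gives `"TAGC"` and `dna_to_rna("ATCG")` gives `"AUCG"`.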

every three bases form a codon; Dave wrote a script that takes the codons and maps them to their amino acids
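Dave's script is not reproduced in these notes. A minimal sketch of the same idea might look like the following, assuming the standard genetic code, one-letter amino acid codes, `*` for a stop codon, and `X` for any codon that cannot be resolved. The names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
# Map each of the 64 codons (e.g. "ATG") to a one-letter amino acid code.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a DNA or RNA string codon by codon.

    Trailing bases that do not fill a whole codon are ignored, and
    codons containing anything other than A, C, G, T/U become "X".
    """
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )
```

With this table, a 297-base protease sequence translates to 99 amino acids, since each codon consumes three bases.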

  • protease - a type of protein (an enzyme) that cleaves other proteins
  • reverse transcriptase - takes viral RNA and transcribes it into DNA
  • the (bad) viral mRNA then gets sent to the ribosomes
  • the viruses replicate very fast in your immune cells, and that's how they kill them

99 amino acids in protease (99 × 3 = 297 DNA bases)

reverse transcriptase is not predictable: each sequence has a different length

Possible Features:

  • for the OR acids, perhaps create all possible combinations and weight each one by 1/(number of combinations); normal rows get weight = 1
  • find most probable sequences (T)
  • correlating permutations (T)
  • molecular weight/length (E/Th)
  • acidity/charge
  • edit distance (differences between the sequences), use to cluster (A)
  • list of known resistant mvt sites (M)
  • find out which sites are most variable
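The first idea in the list (expanding the OR acids) could be sketched as below. The input format is an assumption, not something fixed at the meetup: each row is a list with one string per site, and a site with alternatives simply lists every candidate letter (so `"QK"` means Q or K). The function name `expand_or_positions` is made up.

```python
from itertools import product


def expand_or_positions(positions):
    """Expand a row with ambiguous sites into weighted concrete rows.

    positions: list of strings, one per site; each string holds the
    candidate amino acids at that site (assumed format).
    Returns (sequence, weight) pairs where every weight is
    1 / (number of combinations), so the weights of one row sum to 1.
    """
    combos = ["".join(choice) for choice in product(*positions)]
    weight = 1.0 / len(combos)
    return [(seq, weight) for seq in combos]
```

A row with no ambiguity comes back unchanged with weight 1, while `["P", "QK", "I", "TS"]` becomes four rows of weight 0.25 each. The number of rows is the product of the number of alternatives at each site, so rows with many ambiguous sites can grow quickly.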


for each site, look at the frequency of each amino acid
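A rough sketch of the per-site frequency count, assuming the sequences are already aligned to the same length (as the 99-residue protease sequences would be). Scoring variability with Shannon entropy is one possible choice, not something decided at the meetup, and all function names here are hypothetical. Site indices are 0-based.

```python
from collections import Counter
from math import log2


def site_frequencies(sequences):
    """Return one {amino_acid: frequency} dict per site."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    freqs = []
    for column in zip(*sequences):  # one column per site
        counts = Counter(column)
        freqs.append({aa: n / len(column) for aa, n in counts.items()})
    return freqs


def site_entropy(freq):
    """Shannon entropy in bits; 0 means the site never varies."""
    return -sum(p * log2(p) for p in freq.values())


def most_variable_sites(sequences, top=5):
    """Return the indices of the `top` highest-entropy sites."""
    scores = [
        (site_entropy(f), i)
        for i, f in enumerate(site_frequencies(sequences))
    ]
    # Highest entropy first; ties broken by site index.
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scores[:top]]
```

On the toy input `["PQIT", "PQVT", "PKIT", "PQLT"]`, sites 0 and 3 never vary, and site 2 (I, V, or L) ranks as more variable than site 1 (Q or K).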

could put this into a tree classifier