[ml] Hidden Markov Models
Glen Jarvis
glen at glenjarvis.com
Tue Jul 13 14:16:23 PDT 2010
I was invited to this list for questions I have about Hidden Markov Models.
Please forgive the extra chatter.
I work at a Phylogenomics lab and we do HMM stuff all the time -- with
either the SAM software tool (http://compbio.soe.ucsc.edu/sam.html), or the
newer tool HMMR (http://hmmer.janelia.org/). Conceptually I have 'enough'
knowledge -- like a user to a pc has enough to be able to point and click to
read their email.
But, the fundamental "could I write my own toy example" always escapes me.
Here's what I know (from the ground up):
Conditional Probability:
Although I never reviewed this Wiki page, it's probably a fairly good
introduction:
http://en.wikipedia.org/wiki/Conditional_probability
Baye's Theorem:
This wiki page doesn't explain it in a clear way like I would to
someone just starting off. I'd start with a conditional tree (classic text
book examples are - base decision (root of the tree) is "if the owner
purchased a warranty or not (either yes (left branch) or no (right branch).
And, then a second condition -- let's say the radio breaks (another fork of
the left tree), for example. Then, we can say things like "what's the
probability that, given the condition that someone purchased a warranty,
what's the probability that they have a broken radio)... Obviously this
would be much clear with a nice whiteboard and a decision tree. Baye's
theorem has always twisted a mind or two in probability class - but it's
actually not that hard once you get your mind around it (although I need a
refresher).
http://en.wikipedia.org/wiki/Bayes'_theorem
The Simplest Baysean Network
I'm obviously working myself up to the Wiki page for an HMM (
http://en.wikipedia.org/wiki/Hidden_Markov_model). This explains that an HMM
is "A HMM can be considered as the simplest dynamic Bayesian network." [1]).
But, the simples dynamic Bayesian network Wiki page (
http://en.wikipedia.org/wiki/Dynamic_Bayesian_network) used to talk about
grass, water and sprinklers (as an example). Although I didn't get it 100%,
I always felt comfortable there was a real-world example.
With all of that said, I still couldn't "do it." How was this different
than a Baysean decision tree? If I had a simple exercise that I could work
through, with an 'answer to compare to', it would help.
Best of all, I'd like to write a very small toy program. I'm most
comfortable with Protein Multiple Sequence Alignments (MSA), so if I could
make a very basic hmm from a small MSA, it'd really help..
Can anyone explain to me, like they were speaking to a 6-year old, how a
simple baysean network (HMM) can be created. If you want an example, imagine
the following:
* There are 20 letters of an alphabet (all letters except JOBZUX
(interesting way to remember it :))
* An example of a set of sequences that have been aligned follows:
DLITPLHTYMITGN-VCVHVIKKIIELGGDMDMKCV
NLITPLHSYLRRDELISASVLKKVIELGADRNLRCC
HLITPLHSYLRRDESISASVLKKVIELGADRNLRCC
HLITPLHSYLRRDESISASVLKKVIELGADRNLRCC
HLITPLHSYLRRDESISASVLKKVIELGADRNLRCC
DLITPLHTYMITGN-VCVHVIKKIIELGGDMDMKCV
NLITPLHTYMITGN-VCVHVIKKIIELGGDMDMKCI
NLITPLHTYMITGN-VCVDVIKKIIELGGDMDMKCV
DLITPLHTYTITGN-VCAYVIKKIIELGGDMDMKCV
DLITPLHTYMITGN-VCVHVIKKIIELG--------
So, in this example, the first column is more variable than the second or
the fourteenth.
I should be able to score an HMM and see if it matches another sequence
(notice lower case letters mean an HMM insert state):
masltehaivnvrkliystcledfdnristnarinnydpddgycsdgdiysynhtvrykhikvfkkkyyg
idnrqrqqytdsktalidiigsmilmlkadrknkslvdqykkfvkyiikdnksktanhvfdipnngdmdi
lytyfnsprtrcikldlikymvdvgivnlnyvckktgygilhaylgnmnvdidilewlcnngvdvnlqns
......................................................................
...............NLITPLHTYMITGN.VCVDVIKKIIELGGDMDMKCVngmspimtymtnidnvnpe
itnayiesldgdkvknipmilhsyitlarnidisvvysflqpgvklhykdsagrtclhqyilrhnistni
ikllheygndvnepdnigntvlhtylsmlsvvhildpetdndirldviqcllslgaditavnclgytplt
syictaqnymyydiidclisdkvlnmvkhrilqdllirvddtpciihhiiakyniptdlytdeyepydst
dihdvyhcaiierynnavcetsgmtplhvsiishtnanivmdsfvyllsiqaniniptkngvdplmltme
nnmlsghqwylvknildkrpnvdivisfldkcyaagkfpslllseddiikptlrlalmlagldycnkcie
ymerdiaildnshamflafdklvsirdnidkltklhinsrsnisiydilvskcykediithrenhnlvac
chgndplydiinkyitdarsmyyiandisryimdmypvmripvpllfsciigifrltyfkkiiidrhhds
finarltdea
I know we can use a substitution matrix (i.e., blosum62;
http://www.uky.edu/Classes/BIO/520/BIO520WWW/blosum62.htm) to say how likely
it is for one of the letters to change to another letter, but how do I
incorporate that likelihood given that likelihood of the previous letter
(the likelihood that the second column is an L, given the first column is a
D (obviously pretty high in the above example).
Here's a snippet of what was generated for this example:
HMM A C D E F G H
I K L M N P Q R S
T V W Y
m->m m->i m->d i->m i->i d->m d->d
COMPO 2.95757 3.22921 2.84828 2.93689 4.02506 2.84493 3.33255
2.33133 2.65067 2.12155 3.45566 3.32303 3.36052 3.85947 3.13260
2.99602 2.82345 2.43542 5.63061 3.48580
2.68618 4.42225 2.77519 2.73123 3.46354 2.40513 3.72494
3.29354 2.67741 2.69355 4.24690 2.90347 2.73739 3.18146 2.89801
2.37887 2.77519 2.98518 4.58477 3.61503
0.01686 4.48732 5.20967 0.61958 0.77255 0.00000 *
1 3.06925 5.62515 1.55218 2.36308 4.87352 3.48242 2.02657
4.44300 2.84967 3.92317 4.73346 1.82577 4.04654 3.07283 3.40453
2.97597 3.33762 4.00967 6.04161 4.55323 1 x -
2.68618 4.42225 2.77519 2.73123 3.46354 2.40513 3.72494
3.29354 2.67741 2.69355 4.24690 2.90347 2.73739 3.18146 2.89801
2.37887 2.77519 2.98518 4.58477 3.61503
0.01686 4.48732 5.20967 0.61958 0.77255 0.48576 0.95510
Err.. at this point I'm so confused on what the actual product of an HMM
actually is, I forget all I did know...
Has anyone worked with HMMs in a different context? Are there toy examples
we could write? Do you have any input on this?
Cheers,
Glen
[1] http://en.wikipedia.org/wiki/Hidden_Markov_model
--
Whatever you can do or imagine, begin it;
boldness has beauty, magic, and power in it.
-- Goethe
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.noisebridge.net/pipermail/ml/attachments/20100713/6c8ae9b4/attachment.htm
More information about the ml
mailing list