Hidden Markov Models BIOL 7711 Computational Bioscience
![Page 1: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/1.jpg)
Biochemistry and Molecular Genetics
Computational Bioscience Program
Consortium for Comparative Genomics
University of Colorado School of Medicine
Hidden Markov Models
BIOL 7711 Computational Bioscience
![Page 2: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/2.jpg)
Why a Hidden Markov Model?
Data elements are often linked by a string of connectivity, a linear sequence
Secondary structure prediction (Goldman, Thorne, Jones)
CpG islands
Models of exons, introns, regulatory regions, genes
Mutation rates along genome
![Page 3: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/3.jpg)
Occasionally Dishonest Casino
e_Fair: 1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6
e_Loaded: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2
a_Fair=>Loaded
a_Loaded=>Fair
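A minimal sketch of how such a casino could be simulated. The emission probabilities come from the slide; the switching probabilities in `SWITCH`, the choice to start with the fair die, and the function name `simulate` are assumptions made only for illustration, since the slide names the transitions without giving values.

```python
import random

# Emission probabilities for faces 1..6, as given on the slide.
EMISSIONS = {
    "Fair": [1 / 6] * 6,
    "Loaded": [1 / 10] * 5 + [1 / 2],
}
# ASSUMPTION: the slide gives no transition values; these are placeholders.
SWITCH = {"Fair": 0.05, "Loaded": 0.10}


def simulate(n_rolls, seed=0):
    """Return (hidden states, observed rolls) for n_rolls of the casino."""
    rng = random.Random(seed)
    state = "Fair"  # ASSUMPTION: the casino starts with the fair die
    states, rolls = [], []
    for _ in range(n_rolls):
        states.append(state)
        rolls.append(rng.choices(range(1, 7), weights=EMISSIONS[state])[0])
        # Switch dice with the state's switching probability.
        if rng.random() < SWITCH[state]:
            state = "Loaded" if state == "Fair" else "Fair"
    return states, rolls


states, rolls = simulate(20)
print("".join(s[0] for s in states))   # F = fair, L = loaded (hidden)
print("".join(str(r) for r in rolls))  # what an observer sees
```

Only the second printed line is observable in practice; recovering the first is the inference problem the following slides address.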
![Page 4: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/4.jpg)
Posterior Probability of Dice
![Page 5: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/5.jpg)
Sequence Alignment Profiles
Mouse TCR Va
![Page 6: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/6.jpg)
Hidden Markov Models: Bugs and Features
Memoryless
Sum of states is conserved (rowsums = 1)
Complications?
Insertion and deletion of states (indels)
Long-distance interactions
Benefits
Flexible probabilistic framework
E.g., compared to regular expressions
![Page 7: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/7.jpg)
Profiles: an Example
A .1, C .05, D .2, E .08, F .01
Gap
A .04, C .1, D .01, E .2, F .02
Gap
A .2, C .01, D .05, E .1, F .06
Arcs: insert (×3), delete, continue (×2)
![Page 8: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/8.jpg)
Profiles, an Example: States
State #1 State #2 State #3
![Page 9: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/9.jpg)
Profiles, an Example: Emission
Sequence Elements
(possibly emitted by a state)
![Page 10: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/10.jpg)
Profiles, an Example: Emission
Emission Probabilities
![Page 11: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/11.jpg)
Profiles, an Example: Arcs
Arcs (transitions): insert, delete, continue
![Page 12: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/12.jpg)
Profiles, an Example: Special States
Self => Self Loop
No Delete “State”
![Page 13: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/13.jpg)
![Page 14: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/14.jpg)
A Simpler not very Hidden MM
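Nucleotides, no Indels, Unambiguous Path
G .1, C .1, A .7, T .1 (emits A, 0.7)
G .1, C .3, A .2, T .4 (emits T, 0.4)
G .3, C .3, A .1, T .3 (emits T, 0.3)
Arcs: 1.0, 1.0, 1.0
$$P(D \mid M) = 0.7 \times 1.0 \times 0.4 \times 1.0 \times 0.3 \times 1.0 = 0.084$$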
![Page 15: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/15.jpg)
A Simpler not very Hidden MM
Nucleotides, no Indels, Unambiguous Path
$$\ln P(D \mid M) = \sum_{\text{states}} \ln P(E_D \mid \text{state}) + \sum_{\text{arcs}} \ln P(x \to y)$$
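As a quick check, the log form can be evaluated for the three-state example (emissions 0.7, 0.4, 0.3 and arcs of 1.0). This is only a sketch; the variable names are illustrative.

```python
import math

# Values read from the slide: one emission per state, one arc per step.
emissions = [0.7, 0.4, 0.3]  # P(A | state 1), P(T | state 2), P(T | state 3)
arcs = [1.0, 1.0, 1.0]

# ln P(D | M) = sum of log emissions + sum of log arc probabilities
log_p = sum(math.log(e) for e in emissions) + sum(math.log(a) for a in arcs)
p = math.exp(log_p)
print(round(log_p, 4), round(p, 3))
```

Exponentiating the sum recovers the product 0.7 * 0.4 * 0.3 = 0.084, so the two forms agree.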
![Page 16: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/16.jpg)
A Toy not-Hidden MM
Nucleotides, no Indels, Unambiguous but Variable Path
All arcs out are equal
Example sequences: GATC, ATC, GC, GAGAGC, AGATTTC
States: Begin, Emit G, Emit A, Emit C, Emit T, End
$$P(\mathrm{AGATTTC} \mid M) = (0.5 \times 1.0)^{l=7}$$
Here 0.5 is the arc probability and 1.0 is the emission probability.
![Page 17: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/17.jpg)
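A Simple HMM
CpG Islands; Methylation Suppressed in Promoter Regions; States are Really Hidden Now
CpG: G .3, C .3, A .2, T .2
Non-CpG: G .1, C .1, A .4, T .4
Transitions: CpG => CpG 0.8, CpG => Non-CpG 0.2, Non-CpG => Non-CpG 0.9, Non-CpG => CpG 0.1
Fractional likelihood:
$$P(\text{state}_y^{i} \mid D_{\le i}) = \Big[ \sum_x P(\text{state}_x^{i-1}) \, P(x \to y) \Big] \, P(E_D \mid \text{state}_y^{i})$$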
![Page 20: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/20.jpg)
The Forward Algorithm
Probability of a Sequence is the Sum of All Paths that Can Produce It
CpG: G .3, C .3, A .2, T .2
Non-CpG: G .1, C .1, A .4, T .4
Transitions: CpG => CpG 0.8, CpG => Non-CpG 0.2, Non-CpG => Non-CpG 0.9, Non-CpG => CpG 0.1
Sequence: G C G A A
G: CpG .3; Non-CpG .1
C: CpG .3*(.3*.8 + .1*.1) = .075; Non-CpG .1*(.3*.2 + .1*.9) = .015
G: CpG .3*(.075*.8 + .015*.1) = .0185; Non-CpG .1*(.075*.2 + .015*.9) = .0029
A: CpG .2*(.0185*.8 + .0029*.1) = .003; Non-CpG .4*(.0185*.2 + .0029*.9) = .0025
A: CpG .2*(.003*.8 + .0025*.1) = .0005; Non-CpG .4*(.003*.2 + .0025*.9) = .0011
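The column-by-column arithmetic above can be sketched in a few lines. The parameters are the CpG / Non-CpG values from the slides; the names `EMIT`, `TRANS` and `forward` are illustrative, and, following the slides, the first column uses the emission probability alone (no separate start distribution).

```python
STATES = ("CpG", "Non-CpG")
EMIT = {
    "CpG": {"G": 0.3, "C": 0.3, "A": 0.2, "T": 0.2},
    "Non-CpG": {"G": 0.1, "C": 0.1, "A": 0.4, "T": 0.4},
}
TRANS = {  # TRANS[from][to]
    "CpG": {"CpG": 0.8, "Non-CpG": 0.2},
    "Non-CpG": {"CpG": 0.1, "Non-CpG": 0.9},
}


def forward(seq):
    """Return one dict of forward values per position of seq."""
    # First column: emission only, as on the slides.
    cols = [{k: EMIT[k][seq[0]] for k in STATES}]
    for symbol in seq[1:]:
        prev = cols[-1]
        # Sum over all predecessor states, then multiply by the emission.
        cols.append({
            k: EMIT[k][symbol] * sum(prev[j] * TRANS[j][k] for j in STATES)
            for k in STATES
        })
    return cols


cols = forward("GCGAA")
for symbol, col in zip("GCGAA", cols):
    print(symbol, {k: round(v, 4) for k, v in col.items()})
total = sum(cols[-1].values())  # probability of the whole sequence
print("P(D | M) =", total)
```

Summing the last column gives the probability of the sequence over all paths, which is the quantity the slide title refers to.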
![Page 22: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/22.jpg)
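The Viterbi Algorithm
Most Likely Path
CpG: G .3, C .3, A .2, T .2
Non-CpG: G .1, C .1, A .4, T .4
Transitions: CpG => CpG 0.8, CpG => Non-CpG 0.2, Non-CpG => Non-CpG 0.9, Non-CpG => CpG 0.1
Sequence: G C G A A (max replaces the sum of the forward algorithm; values are carried unrounded between steps)
G: CpG .3; Non-CpG .1
C: CpG .3*max(.3*.8, .1*.1) = .072; Non-CpG .1*max(.3*.2, .1*.9) = .009
G: CpG .3*max(.072*.8, .009*.1) = .0173; Non-CpG .1*max(.072*.2, .009*.9) = .0014
A: CpG .2*max(.0173*.8, .0014*.1) = .0028; Non-CpG .4*max(.0173*.2, .0014*.9) = .0014
A: CpG .2*max(.0028*.8, .0014*.1) = .00044; Non-CpG .4*max(.0028*.2, .0014*.9) = .00050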
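A minimal Viterbi sketch for the same CpG model, with a traceback to recover the path itself. The parameters are those of the slides; the names `EMIT`, `TRANS` and `viterbi` are illustrative, and the first column again uses the emission probability alone, as the slides do.

```python
STATES = ("CpG", "Non-CpG")
EMIT = {
    "CpG": {"G": 0.3, "C": 0.3, "A": 0.2, "T": 0.2},
    "Non-CpG": {"G": 0.1, "C": 0.1, "A": 0.4, "T": 0.4},
}
TRANS = {  # TRANS[from][to]
    "CpG": {"CpG": 0.8, "Non-CpG": 0.2},
    "Non-CpG": {"CpG": 0.1, "Non-CpG": 0.9},
}


def viterbi(seq):
    """Return (most likely state path, probability of that path)."""
    v = {k: EMIT[k][seq[0]] for k in STATES}
    back = []  # back[i][k] = best predecessor of state k at position i + 1
    for symbol in seq[1:]:
        ptr, nxt = {}, {}
        for k in STATES:
            # Keep only the best predecessor instead of summing over all.
            best = max(STATES, key=lambda j: v[j] * TRANS[j][k])
            ptr[k] = best
            nxt[k] = EMIT[k][symbol] * v[best] * TRANS[best][k]
        back.append(ptr)
        v = nxt
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):  # follow the pointers back to position 1
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]


path, prob = viterbi("GCGAA")
print(path, prob)
```

For GCGAA the traceback assigns the leading GCG to the CpG state and the trailing AA to Non-CpG.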
![Page 23: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/23.jpg)
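Forwards and Backwards
Probability of a State at a Position
CpG: G .3, C .3, A .2, T .2
Non-CpG: G .1, C .1, A .4, T .4
Transitions: CpG => CpG 0.8, CpG => Non-CpG 0.2, Non-CpG => Non-CpG 0.9, Non-CpG => CpG 0.1
Forward, position 4 (A): CpG .2*(.0185*.8 + .0029*.1) = .003; Non-CpG .4*(.0185*.2 + .0029*.9) = .0025
Forward, position 5 (A): CpG .2*(.003*.8 + .0025*.1) = .0005; Non-CpG .4*(.003*.2 + .0025*.9) = .0011
Forward × backward, position 4: CpG .003*(.2*.8 + .4*.2) = .0007; Non-CpG .0025*(.2*.1 + .4*.9) = .0009
$$L_k^i = f_k(i) \, b_k(i)$$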
![Page 24: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/24.jpg)
Forwards and BackwardsProbability of a State at a Position
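Sequence: G C G A A
Forward × backward, position 4: CpG .003*(.2*.8 + .4*.2) = .0007; Non-CpG .0025*(.2*.1 + .4*.9) = .0009
$$P(\mathrm{CpG} \mid i = 4, D) = \frac{P(\mathrm{CpG})}{P(\mathrm{CpG}) + P(\text{non-CpG})} = \frac{0.0007}{0.0007 + 0.0009} = 0.432$$
The value 0.432 comes from the unrounded products; the rounded figures shown give about 0.44.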
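The posterior at a position can be sketched by combining a forward and a backward pass. The parameters are those of the CpG model on the slides; the function names are illustrative, the first forward column uses emissions only (as on the slides), and the backward pass is assumed to end with a value of 1 for every state.

```python
STATES = ("CpG", "Non-CpG")
EMIT = {
    "CpG": {"G": 0.3, "C": 0.3, "A": 0.2, "T": 0.2},
    "Non-CpG": {"G": 0.1, "C": 0.1, "A": 0.4, "T": 0.4},
}
TRANS = {  # TRANS[from][to]
    "CpG": {"CpG": 0.8, "Non-CpG": 0.2},
    "Non-CpG": {"CpG": 0.1, "Non-CpG": 0.9},
}


def forward(seq):
    cols = [{k: EMIT[k][seq[0]] for k in STATES}]
    for symbol in seq[1:]:
        prev = cols[-1]
        cols.append({
            k: EMIT[k][symbol] * sum(prev[j] * TRANS[j][k] for j in STATES)
            for k in STATES
        })
    return cols


def backward(seq):
    # ASSUMPTION: no explicit end state, so the last column is all ones.
    cols = [{k: 1.0 for k in STATES}]
    for symbol in reversed(seq[1:]):
        nxt = cols[0]
        cols.insert(0, {
            k: sum(TRANS[k][l] * EMIT[l][symbol] * nxt[l] for l in STATES)
            for k in STATES
        })
    return cols


def posterior(seq, position, state):
    """P(state at 1-based position | whole sequence)."""
    f, b = forward(seq), backward(seq)
    total = sum(f[-1].values())  # P(D | M)
    i = position - 1
    return f[i][state] * b[i][state] / total


post = posterior("GCGAA", 4, "CpG")
print(round(post, 3))
```

With unrounded intermediate values this reproduces the slide's 0.432 for the CpG state at position 4.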
![Page 25: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/25.jpg)
Homology HMM
Gene recognition, identify distant homologs
Common Ancestral Sequence
Match, site-specific emission probabilities
Insertion (relative to ancestor), global emission probs
Delete, emit nothing
Global transition probabilities
![Page 26: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/26.jpg)
Homology HMM
[Diagram states: start; match, insert, and delete at each position; end]
![Page 27: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/27.jpg)
![Page 28: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/28.jpg)
Multiple Sequence Alignment HMM
Defines predicted homology of positions (sites)
Recognize region within longer sequence
Model domains or whole proteins
Structural alignment
Compare alternative models
Can modify model for sub-families
Ideally, use phylogenetic tree
Often not much back and forth
Indels a problem
![Page 29: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/29.jpg)
Model Comparison
Based on $P(D \mid \theta, M)$
For ML, take $P_{\max}(D \mid \theta, M)$
Usually $-\ln P_{\max}(D \mid \theta, M)$ to avoid numeric error
For heuristics, “score” is $-\log_2 P(D \mid \theta_{\text{fixed}}, M)$
For Bayesian, calculate
$$P_{\max}(\theta, M \mid D) = \frac{P(D \mid \theta, M) \, P(\theta) \, P(M)}{\sum P(D \mid \theta, M) \, P(\theta) \, P(M)}$$
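The numeric error mentioned above is easy to demonstrate. In this sketch the per-site probabilities are hypothetical (400 sites at 0.1 each); the direct product underflows in double precision while the negative log-likelihood stays well behaved.

```python
import math

probs = [0.1] * 400  # hypothetical per-site probabilities

# Direct product: 1e-400 is below the smallest double, so this becomes 0.0.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity on a scale that does not underflow.
neg_log_like = -sum(math.log(p) for p in probs)
bits = neg_log_like / math.log(2)  # the same score in base 2

print(product)        # 0.0
print(neg_log_like)   # about 921.03 (400 * ln 10)
print(bits)
```

The base-2 version corresponds to the heuristic "score" on the slide, which is simply the negative log-likelihood in bits.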
![Page 30: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/30.jpg)
Parameters, $\theta$
Types of parameters
Amino acid distributions for positions
Global AA distributions for insert states
Order of match states
Transition probabilities
Tree topology and branch lengths
Hidden states (integrate or augment)
Wander parameter space (search)
Maximize, or move according to posterior probability (Bayes)
![Page 31: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/31.jpg)
Expectation Maximization (EM)
Classic algorithm to fit probabilistic model parameters with unobservable states
Or missing data
Two Stages, iterate
Maximize
If know hidden variables (states), maximize model parameters with respect to that knowledge
Expectation
If know model parameters, find expected values of the hidden variables (states)
Works well even with e.g., Bayesian to find near-equilibrium space
![Page 32: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/32.jpg)
Homology HMM EM
Start with heuristic (e.g., ClustalW)
Maximize
Match states are residues aligned in most sequences
Amino acid frequencies observed in columns
Expectation
Realign all the sequences given model
Repeat until convergence
Problems: Local, not global optimization
Use procedures to check how it worked
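The maximization step described above (amino acid frequencies observed in columns) might be sketched as follows. The five-letter alphabet, the toy alignment, the uniform pseudocount of 1, and the function name `column_frequencies` are all assumptions for illustration; a real profile would use the full amino acid alphabet and could use pseudocounts based on global frequencies instead.

```python
from collections import Counter

ALPHABET = "ACDEF"  # toy alphabet; ASSUMPTION for illustration


def column_frequencies(alignment, pseudocount=1.0):
    """Estimate emission probabilities for each alignment column."""
    n_cols = len(alignment[0])
    profile = []
    for c in range(n_cols):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(row[c] for row in alignment if row[c] != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append(
            {a: (counts[a] + pseudocount) / total for a in ALPHABET}
        )
    return profile


alignment = ["ACD", "AED", "A-F"]  # hypothetical aligned sequences
profile = column_frequencies(alignment)
for column in profile:
    print({a: round(p, 3) for a, p in column.items()})
```

In a full EM loop the sequences would then be realigned to this profile (the expectation step) and the frequencies re-estimated until the alignment stops changing.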
![Page 33: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/33.jpg)
Model Comparison
Determining significance depends on comparing two models
Usually null model, H0, and test model, H1
Models are nested if H0 is a subset of H1
If not nested
Akaike Information Criterion (AIC) [similar to empirical Bayes] or Bayes Factor (BF) [but be careful]
Generating a null distribution of statistic
Z-factor, bootstrapping, $\chi^2_\nu$, parametric bootstrapping, posterior predictive
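For non-nested models, a small AIC comparison could look like the sketch below. It uses the standard definition AIC = 2k - 2 ln L, where k is the number of free parameters; the log-likelihoods and parameter counts are hypothetical numbers chosen only to show the calculation.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower values are preferred."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical maximized log-likelihoods and parameter counts.
models = {
    "H0": {"log_likelihood": -1250.0, "n_params": 4},
    "H1": {"log_likelihood": -1240.0, "n_params": 9},
}

scores = {name: aic(m["log_likelihood"], m["n_params"])
          for name, m in models.items()}
best = min(scores, key=scores.get)
print(scores, "preferred:", best)
```

Here H1 fits better by 10 log-likelihood units but pays a penalty of 10 for its five extra parameters, so the two models tie and `min` simply reports the first; a real comparison would need more than AIC alone in such a case.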
![Page 34: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/34.jpg)
Z Test Method
Database of known negative controls
E.g., non-homologous (NH) sequences
Assume NH scores $\sim N(\mu, \sigma)$
i.e., you are modeling known NH sequence scores as a normal distribution
Set appropriate significance level for multiple comparisons (more below)
Problems
Is homology certain?
Is it the appropriate null model?
Normal distribution often not a good approximation
Parameter control hard: e.g., length distribution
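A minimal sketch of the Z test as described: fit a normal distribution to scores of known non-homologous sequences and ask how extreme a query score is. The scores below are hypothetical, and the one-sided upper-tail p-value is computed from the normal approximation that the slide itself flags as often poor.

```python
import math
import statistics

# Hypothetical alignment scores for known non-homologous (NH) sequences.
nh_scores = [10.2, 11.5, 9.8, 10.9, 12.1, 9.5, 10.4, 11.0]
query_score = 15.0  # hypothetical score of the sequence being tested

mu = statistics.mean(nh_scores)
sigma = statistics.stdev(nh_scores)  # sample standard deviation
z = (query_score - mu) / sigma

# One-sided upper-tail probability under N(mu, sigma).
p_value = 0.5 * math.erfc(z / math.sqrt(2))
print(round(z, 2), p_value)
```

Any significance threshold applied to `p_value` would still need adjusting for the number of database comparisons made.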
![Page 35: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/35.jpg)
Bootstrapping and Parametric Models
Random sequence sampled from the same set of emission probability distributions
Same length is easy
Bootstrapping is re-sampling columns
Parametric models use estimated frequencies, may include variance, tree, etc.
More flexible, can have more complex null
Allows you to consider carefully what the null means, and what null is appropriate to use!
Pseudocounts of global frequencies if data limit
Insertions relatively hard to model
What frequencies for insert states? Global?
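Re-sampling columns can be sketched as below: columns of an alignment are drawn with replacement until the replicate has the same length as the original. The alignment and the function name `bootstrap_columns` are hypothetical.

```python
import random


def bootstrap_columns(alignment, rng):
    """Return one bootstrap replicate with columns drawn with replacement."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    # Apply the same column choices to every sequence so columns stay intact.
    return ["".join(row[c] for c in picks) for row in alignment]


alignment = ["ACDE", "ACEE", "GCDF"]  # hypothetical aligned sequences
rng = random.Random(0)                # fixed seed for a repeatable example
replicate = bootstrap_columns(alignment, rng)
print(replicate)
```

Scoring many such replicates gives an empirical null distribution; a parametric bootstrap would instead draw new columns from estimated frequencies.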
![Page 36: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/36.jpg)
Homology HMM Resources
UCSC (Haussler)
SAM: align, secondary structure predictions, HMM parameters, etc.
WUSTL/Janelia (Eddy)
Pfam: database of pre-computed HMM alignments for various proteins
HMMER: program for building HMMs
![Page 37: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/37.jpg)
![Page 38: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/38.jpg)
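Increasing Asymmetry with Increasing Single Strandedness
e.g., P(A => G) = c + t
t = (DssH * Slope) + Intercept
Rate matrix (rows and columns in the order A, C, T, G):
$$
\begin{bmatrix}
- & \lambda_{AC}\pi_C & \lambda_{AT}\pi_T & \lambda_{AG}\pi_G \\
\lambda_{CA}\pi_A & - & \lambda_{CT}\pi_T & \lambda_{CG}\pi_G \\
\lambda_{TA}\pi_A & \lambda_{TC}\pi_C & - & \lambda_{TG}\pi_G \\
\lambda_{GA}\pi_A & \lambda_{GC}\pi_C & \lambda_{GT}\pi_T & -
\end{bmatrix}
$$
[Second matrix over A, C, T, G with off-diagonal parameters a to f; layout not recoverable]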
![Page 39: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/39.jpg)
2x Redundant Sites
![Page 40: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/40.jpg)
4x Redundant Sites
![Page 41: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/41.jpg)
Beyond HMMs
Neural nets
Dynamic Bayesian nets
Factorial HMMs
Boltzmann Trees
Kalman filters
Hidden Markov random fields
![Page 42: Hidden Markov Models BIOL 7711 Computational Bioscience](https://reader035.fdocuments.net/reader035/viewer/2022062408/56813a47550346895da23ab8/html5/thumbnails/42.jpg)
COI Functional Regions
[Figure labels: D, Water, KK, HD, H, Water, Oxygen, Electron, H (alt)]
O2 + protons + electrons = H2O + secondary proton pumping (=ATP)