CS498-EA Reasoning in AI Lecture #15 Instructor: Eyal Amir Fall Semester 2011.
![Page 1: 1 Applications of Hidden Markov Models (Lecture for CS498-CXZ Algorithms in Bioinformatics) Nov. 12, 2005 ChengXiang Zhai Department of Computer Science.](https://reader038.fdocuments.net/reader038/viewer/2022100505/5a4d1b947f8b9ab0599c2d5e/html5/thumbnails/1.jpg)
Applications of Hidden Markov Models
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Nov. 12, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Today’s Lecture
• HMM Applications
– Profile HMMs (Classification)
– HMMs for Multiple Sequence Alignment (Pattern discovery)
– HMMs for Gene Finding (Segmentation)
• Special issues in HMMs
– Local maxima
– Model construction
– Weighting training sequences
HMM Applications
• Classification (e.g., Profile HMMs)
– Build an HMM for each class (profile HMMs)
– Classify a sequence using Bayes rule
• Multiple sequence alignment
– Build an HMM based on a set of sequences
– Decode each sequence to find a multiple alignment
• Segmentation (e.g., gene finding)
– Use different states to model different regions
– Decode a sequence to reveal the region boundaries
HMMs for Classification
$$p(C \mid X) = \frac{p(X \mid C)\, p(C)}{p(X)}$$

$$C^{*} = \arg\max_{C \in \{C_1, \ldots, C_k\}} p(X \mid C)\, p(C)$$
p(X|C) is modeled by a profile HMM built specifically for C
Assuming example sequences are available for C
E.g., Protein families
Assign a family to X
(Profile HMM will be covered in the next lecture)
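A minimal sketch of the classification rule above, assuming each class model is a small discrete HMM given as plain dictionaries. The helper names (`forward_log_likelihood`, `classify`, `toy_model`), the two toy classes, and all probabilities are made up for illustration; a real profile HMM has a match/insert/delete architecture that is not shown here. The forward algorithm gives log p(X|C), the log prior is added, and the class with the highest score is returned.

```python
import math

def _log(p):
    # Log of a probability, with log(0) mapped to -inf.
    return math.log(p) if p > 0 else float("-inf")

def _log_sum_exp(values):
    # Numerically stable log(sum(exp(v))) over a list of log values.
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(seq, start, trans, emit):
    """Return log p(X | HMM) for a non-empty sequence via the forward algorithm."""
    states = list(start)
    alpha = {s: _log(start[s]) + _log(emit[s][seq[0]]) for s in states}
    for x in seq[1:]:
        alpha = {
            t: _log_sum_exp([alpha[s] + _log(trans[s][t]) for s in states])
            + _log(emit[t][x])
            for t in states
        }
    return _log_sum_exp(list(alpha.values()))

def classify(seq, models, priors):
    """Return (C*, scores) where scores[C] = log p(X|C) + log p(C)."""
    scores = {
        c: forward_log_likelihood(seq, *models[c]) + _log(priors[c])
        for c in models
    }
    return max(scores, key=scores.get), scores

def toy_model(p_a, p_b):
    """Hypothetical two-state HMM; p_a and p_b are the emission
    probabilities of A (and of T) in states "a" and "b"."""
    def emit(p):
        return {"A": p, "T": p, "C": 0.5 - p, "G": 0.5 - p}
    start = {"a": 0.5, "b": 0.5}
    trans = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
    return start, trans, {"a": emit(p_a), "b": emit(p_b)}

# Two made-up classes with equal priors.
models = {"AT_rich": toy_model(0.4, 0.3), "GC_rich": toy_model(0.1, 0.2)}
priors = {"AT_rich": 0.5, "GC_rich": 0.5}

print(classify("ATATTA", models, priors)[0])  # AT_rich
```

p(X) is the same for every class, so it can be dropped from the argmax, which is why the sketch never computes it.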
HMMs for Motif Finding
• Given a set of sequences S={X1, …,Xk}
• Design an HMM with two kinds of states
– Background states: For outside a motif
– Motif states: For modeling a motif
• Train the HMM, e.g., using Baum-Welch (finding the HMM that maximizes the probability of S)
• The “motif part” of the HMM gives a motif model (e.g., a PWM)
• The HMM can be used to scan any sequence (including Xi) to figure out where the motif is.
• We may also decode each sequence Xi to obtain a set of subsequences matched by the motif (e.g., a multiset of k-mers)
HMMs for Multiple Alignment
• Given a set of sequences S={X1, …,Xk}
• Train an HMM, e.g., using Baum-Welch (finding the HMM that maximizes the probability of S)
• Decode each sequence Xi
• Assemble the Viterbi paths to form a multiple alignment
– The symbols belonging to the same state will be aligned to each other
• To be covered in the next lecture…
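The assembly step can be sketched as follows, under a strong simplifying assumption: the decoded paths use only match-like states numbered 0..n_states-1, and each state emits at most one symbol per sequence. Insert states, which a profile HMM would also have, are left out. The function name and the example paths are hypothetical.

```python
def assemble_alignment(sequences, paths, n_states):
    """Align symbols that were emitted by the same state.

    paths[i][j] is the index of the state that emitted sequences[i][j].
    A state that a sequence skips becomes a gap ("-") in that row.
    """
    rows = []
    for seq, path in zip(sequences, paths):
        row = ["-"] * n_states
        for symbol, state in zip(seq, path):
            row[state] = symbol
        rows.append("".join(row))
    return rows

# Made-up Viterbi paths for three short sequences.
sequences = ["ACG", "AG", "ACGT"]
paths = [[0, 1, 2], [0, 2], [0, 1, 2, 3]]
for row in assemble_alignment(sequences, paths, n_states=4):
    print(row)
# ACG-
# A-G-
# ACGT
```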
HMM-based Gene Finding
• Design two types of states
– “Within Gene” States
– “Outside Gene” States
• Use known genes to estimate the HMM
• Decode a new sequence to reveal which part is a gene
• Example software:
– GENSCAN (Burge 1997)
– FGENESH (Solovyev 1997)
– HMMgene (Krogh 1997)
– GENIE (Kulp 1996)
– GENMARK (Borodovsky & McIninch 1993)
– VEIL (Henderson, Salzberg, & Fasman 1997)
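As a toy illustration of the decoding step, here is a Viterbi sketch for a two-state HMM with one "gene" state and one "inter" (outside gene) state. The state names, transition and emission probabilities are invented for the example and are not estimated from known genes; none of the gene finders listed above is this simple.

```python
import math
from itertools import groupby

def viterbi(seq, states, start, trans, emit):
    """Return the most probable state path for a non-empty sequence."""
    lg = lambda p: math.log(p) if p > 0 else float("-inf")
    score = {s: lg(start[s]) + lg(emit[s][seq[0]]) for s in states}
    back = []  # back[i][t] = best predecessor of state t at position i + 1
    for x in seq[1:]:
        prev, score, pointers = score, {}, {}
        for t in states:
            best = max(states, key=lambda s: prev[s] + lg(trans[s][t]))
            score[t] = prev[best] + lg(trans[best][t]) + lg(emit[t][x])
            pointers[t] = best
        back.append(pointers)
    path = [max(states, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def segments(path):
    """Collapse a state path into (label, start, end) runs, end exclusive."""
    out, pos = [], 0
    for label, run in groupby(path):
        n = len(list(run))
        out.append((label, pos, pos + n))
        pos += n
    return out

# Made-up parameters: the "gene" state is GC-rich, "inter" is AT-rich.
states = ["gene", "inter"]
start = {"gene": 0.5, "inter": 0.5}
trans = {"gene": {"gene": 0.9, "inter": 0.1},
         "inter": {"gene": 0.1, "inter": 0.9}}
emit = {"gene": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
        "inter": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}}

seq = "ATATAT" + "GCGCGCGC" + "ATATAT"
print(segments(viterbi(seq, states, start, trans, emit)))
# [('inter', 0, 6), ('gene', 6, 14), ('inter', 14, 20)]
```

The region boundaries are read off where the decoded state changes.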
VEIL: Viterbi Exon-Intron Locator
[Figure: VEIL Architecture, with states Upstream, Start Codon, Exon, Stop Codon, Downstream, 3’ Splice Site, Intron, 5’ Splice Site, Poly-A Site]
[Figure: Exon HMM Model]
• Enter: start codon or intron (3’ Splice Site)
• Exit: 5’ Splice Site or one of the three stop codons (taa, tag, tga)
(Slide from N. F. Samatova’s lecture)
GenScan Architecture
• Based on a generalized HMM (GHMM)
• Models both strands at once
– Other models: predict on one strand first, then on the other strand
– Avoids prediction of overlapping genes on the two strands (rare)
• Each state may output a string of symbols (according to some probability distribution).
• Explicit intron/exon length modeling
• Special sensors for Cap-site and TATA-box
• Advanced splice site sensors
Fig. 3, Burge and Karlin 1997
Special Issues
• Local maxima
• Optimal model construction
• Weighting training sequences
Solutions to the Local Maxima Problem
• Repeat with different initializations
• Start with the most reasonable initial model
• Simulated annealing (slow down the convergence speed)
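The first remedy, repeating with different initializations, can be sketched on a toy problem. Here a one-dimensional function with two peaks stands in for the likelihood surface, and a simple hill climber stands in for Baum-Welch. Both are assumptions made to keep the example short: this is not HMM training, only the restart pattern around a local optimizer.

```python
import math
import random

def objective(x):
    # Toy "likelihood": local peak near x = -2, higher global peak near x = 3.
    return math.exp(-(x + 2) ** 2) + 2 * math.exp(-(x - 3) ** 2)

def hill_climb(x, step=0.01, iters=5000):
    """Greedy local search: move to the better neighbor until none improves."""
    for _ in range(iters):
        best = max((x - step, x, x + step), key=objective)
        if best == x:
            break
        x = best
    return x

def random_restarts(n_restarts, seed=0):
    """Run the local search from several random starts and keep the best."""
    rng = random.Random(seed)
    starts = [rng.uniform(-5, 6) for _ in range(n_restarts)]
    return max((hill_climb(x0) for x0 in starts), key=objective)

print(round(hill_climb(-4.0), 1))       # bad start: stuck near the local peak
print(round(hill_climb(5.0), 1))        # good start: reaches the global peak
print(round(random_restarts(20), 1))    # best of 20 random starts
```

With an HMM, each restart would be a full Baum-Welch run from different initial parameters, and the run with the highest training likelihood would be kept.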
Local Maxima: Illustration
[Figure labels: Global maximum, Local maximum, Good starting point, Bad starting point]
Optimal Model Construction
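$$p(\mathrm{HMM} \mid X) = \frac{p(X \mid \mathrm{HMM})\, p(\mathrm{HMM})}{p(X)}$$

$$\mathrm{HMM}^{*} = \arg\max_{\mathrm{HMM}} p(\mathrm{HMM} \mid X) = \arg\max_{\mathrm{HMM}} p(X \mid \mathrm{HMM})\, p(\mathrm{HMM})$$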
Bayesian model selection:
– p(HMM) should prefer simpler models (i.e., more constrained, fewer states, fewer transitions)
– p(HMM) could reflect our prior on the parameters
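A minimal sketch of this selection rule, assuming one particular prior that is not from the lecture: log p(HMM) decreases linearly with the number of states and transitions, controlled by a hypothetical `penalty` parameter. The candidate models and their log-likelihoods are invented numbers, not results.

```python
def log_prior(n_states, n_transitions, penalty=1.0):
    # Assumed unnormalized prior: simpler models get higher log p(HMM).
    return -penalty * (n_states + n_transitions)

def select_model(candidates, penalty=1.0):
    """Return the candidate maximizing log p(X|HMM) + log p(HMM)."""
    def log_posterior(c):
        return c["log_likelihood"] + log_prior(
            c["n_states"], c["n_transitions"], penalty
        )
    return max(candidates, key=log_posterior)

# Hypothetical candidates: the larger model fits slightly better.
candidates = [
    {"name": "small", "log_likelihood": -120.0, "n_states": 2, "n_transitions": 4},
    {"name": "large", "log_likelihood": -118.0, "n_states": 5, "n_transitions": 25},
]

print(select_model(candidates)["name"])               # small
print(select_model(candidates, penalty=0.0)["name"])  # large
```

With `penalty=0.0` the prior is flat and the rule reduces to maximum likelihood, which favors the larger model.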
Sequence Weighting
• Avoid over-counting similar sequences from the same organisms
• Typically compute a weight for a sequence based on an evolutionary tree
• Many ways to incorporate the weights, e.g.,
– Unequal likelihood
– Unequal weight contribution in parameter estimation
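The second option, unequal weight contribution in parameter estimation, can be sketched for a single emission distribution. The weights here are given by hand; how they would be derived from an evolutionary tree is not shown. The function name, the pseudocount parameter, and the example column are assumptions for illustration.

```python
def weighted_emission_probs(column, weights, alphabet="ACGT", pseudocount=1.0):
    """Estimate emission probabilities for one state from weighted counts.

    column[i] is the symbol that sequence i contributes to this state,
    and weights[i] is that sequence's weight.
    """
    counts = {a: pseudocount for a in alphabet}
    for symbol, w in zip(column, weights):
        counts[symbol] += w
    total = sum(counts.values())
    return {a: counts[a] / total for a in alphabet}

# Three near-identical sequences share one unit of weight; the fourth is distinct.
column = ["A", "A", "A", "G"]
weights = [1 / 3, 1 / 3, 1 / 3, 1.0]

unweighted = weighted_emission_probs(column, [1.0] * 4, pseudocount=0.0)
weighted = weighted_emission_probs(column, weights, pseudocount=0.0)
print(unweighted["A"], weighted["A"])
```

Without weights the three similar sequences dominate the estimate for A. With weights they count roughly as one sequence, so A and G come out about equally likely.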
HMMs in Real Applications
• SAM-T98 Tutorial:
– http://www.cse.ucsc.edu/research/compbio/ismb99.tutorial.html
• Pfam
– http://www.sanger.ac.uk/Software/Pfam/
What You Should Know
• How an HMM can be used to classify sequences
• How an HMM can be used to align sequences and discover motifs
• How an HMM can be used to segment sequences (e.g., gene finding)
• The problem of local maxima and possible solutions