1 Applications of Hidden Markov Models (Lecture for CS498-CXZ Algorithms in Bioinformatics) Nov. 12,...

Page 1: 1 Applications of Hidden Markov Models (Lecture for CS498-CXZ Algorithms in Bioinformatics) Nov. 12, 2005 ChengXiang Zhai Department of Computer Science.


Applications of Hidden Markov Models

(Lecture for CS498-CXZ Algorithms in Bioinformatics)

Nov. 12, 2005

ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign


Today’s Lecture

• HMM Applications
– Profile HMMs (Classification)
– HMMs for Multiple Sequence Alignment (Pattern discovery)
– HMMs for Gene Finding (Segmentation)

• Special issues in HMMs
– Local maxima
– Model construction
– Weighting training sequences


HMM Applications

• Classification (e.g., Profile HMMs)
– Build an HMM for each class (profile HMMs)
– Classify a sequence using Bayes rule

• Multiple sequence alignment
– Build an HMM based on a set of sequences
– Decode each sequence to find a multiple alignment

• Segmentation (e.g., gene finding)
– Use different states to model different regions
– Decode a sequence to reveal the region boundaries


HMMs for Classification

$$p(C \mid X) = \frac{p(X \mid C)\, p(C)}{p(X)}$$

$$C^* = \arg\max_{C \in \{C_1, \ldots, C_k\}} p(X \mid C)\, p(C)$$

p(X|C) is modeled by a profile HMM built specifically for C

Assuming example sequences are available for C

E.g., Protein families

Assign a family to X

(Profile HMM will be covered in the next lecture)
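The Bayes-rule classifier above can be sketched in a few lines. This is a minimal illustration, not a profile HMM: each class is given a tiny hand-made two-state HMM, p(X|C) is computed with the scaled forward algorithm, and the class with the largest log p(X|C) + log p(C) is returned. The class names, parameter values, and function names are all assumptions made for the example.

```python
import math

def forward_log_likelihood(seq, init, trans, emit):
    """Return log p(seq | HMM) using the scaled forward algorithm."""
    n_states = len(init)
    alpha = [init[s] * emit[s][seq[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(seq)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][seq[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)            # rescale to avoid numerical underflow
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik

def classify(seq, models, priors):
    """Pick the class C maximising log p(X|C) + log p(C)."""
    scores = {
        c: forward_log_likelihood(seq, *models[c]) + math.log(priors[c])
        for c in models
    }
    return max(scores, key=scores.get), scores

# Hypothetical toy models: (initial probs, transition matrix, emission tables).
TRANS = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "AT_rich": ([0.5, 0.5], TRANS, [
        {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    ]),
    "GC_rich": ([0.5, 0.5], TRANS, [
        {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    ]),
}
priors = {"AT_rich": 0.5, "GC_rich": 0.5}

label, _ = classify("ATTATAAT", models, priors)
print(label)
```

The denominator p(X) is the same for every class, so it can be dropped when only the arg max is needed.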


HMMs for Motif Finding

• Given a set of sequences S = {X1, …, Xk}

• Design an HMM with two kinds of states
– Background states: For outside a motif
– Motif states: For modeling a motif

• Train the HMM, e.g., using Baum-Welch (finding the HMM that maximizes the probability of S)

• The “motif part” of the HMM gives a motif model (e.g., a PWM)

• The HMM can be used to scan any sequence (including Xi) to figure out where the motif is.

• We may also decode each sequence Xi to obtain a set of subsequences matched by the motif (e.g., a multiset of k-mers)
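The scanning step can be pictured with a simplified sketch. Instead of decoding with the full HMM, it takes the PWM read off the motif states and scores every window of a sequence by its log-odds against a background distribution. The PWM values, the uniform background, and the name `best_motif_hit` are illustrative assumptions, not part of the lecture.

```python
import math

def best_motif_hit(seq, pwm, background):
    """Return (start, log-odds score) of the best-scoring window in seq."""
    width = len(pwm)
    best = None
    for start in range(len(seq) - width + 1):
        score = sum(
            math.log(pwm[i][seq[start + i]] / background[seq[start + i]])
            for i in range(width)
        )
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Hypothetical PWM for a TATA-like motif: 0.7 on the consensus base, 0.1 elsewhere.
pwm = [{b: (0.7 if b == c else 0.1) for b in "ACGT"} for c in "TATA"]
background = {b: 0.25 for b in "ACGT"}

start, score = best_motif_hit("GCGCTATAGCGC", pwm, background)
print(start, round(score, 3))
```

A full HMM decode would additionally account for the transitions between background and motif states, which this window-scoring shortcut ignores.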


HMMs for Multiple Alignment

• Given a set of sequences S={X1, …,Xk}

• Train an HMM, e.g., using Baum-Welch (finding the HMM that maximizes the probability of S)

• Decode each sequence Xi

• Assemble the Viterbi paths to form a multiple alignment
– The symbols belonging to the same state will be aligned to each other

• To be covered in the next lecture…
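As a rough sketch of the assembly step only, suppose the Viterbi paths have already been computed and each emitted symbol is labelled with a state such as ("M", j) for match state j or ("I", j) for an insert state. Symbols emitted by the same match state go into the same column, and a skipped match state becomes a gap. The state labelling, the function name, and the choice to drop insert-state symbols are simplifying assumptions for illustration.

```python
def assemble_alignment(seqs, paths, n_match):
    """Place symbols emitted by match state j into column j; '-' marks a skipped state."""
    rows = []
    for seq, path in zip(seqs, paths):
        if len(seq) != len(path):
            raise ValueError("each emitted symbol needs exactly one state label")
        row = ["-"] * n_match
        for sym, (kind, j) in zip(seq, path):
            if kind == "M":          # insert-state symbols are omitted in this sketch
                row[j] = sym
        rows.append("".join(row))
    return rows

# Hypothetical decoded paths for three short sequences and four match states.
seqs = ["ACGT", "AGT", "ACGGT"]
paths = [
    [("M", 0), ("M", 1), ("M", 2), ("M", 3)],
    [("M", 0), ("M", 2), ("M", 3)],
    [("M", 0), ("M", 1), ("M", 2), ("I", 2), ("M", 3)],
]
for row in assemble_alignment(seqs, paths, n_match=4):
    print(row)
```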


HMM-based Gene Finding

• Design two types of states
– “Within Gene” States
– “Outside Gene” States

• Use known genes to estimate the HMM

• Decode a new sequence to reveal which part is a gene

• Example software:
– GENSCAN (Burge 1997)
– FGENESH (Solovyev 1997)
– HMMgene (Krogh 1997)
– GENIE (Kulp 1996)
– GENMARK (Borodovsky & McIninch 1993)
– VEIL (Henderson, Salzberg, & Fasman 1997)
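The decode-to-segment idea can be illustrated with a deliberately tiny two-state HMM and log-space Viterbi. Real gene finders use many more states; here the "gene" state is simply assumed to be GC-rich so that the toy sequence has an obvious segmentation. All parameter values and function names below are assumptions for the sketch.

```python
import math
from itertools import groupby

def viterbi(seq, states, init, trans, emit):
    """Return the most probable state path for seq (log-space Viterbi)."""
    score = {s: math.log(init[s]) + math.log(emit[s][seq[0]]) for s in states}
    back = []
    for sym in seq[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            score[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][sym])
            ptr[s] = best
        back.append(ptr)
    last = max(states, key=lambda s: score[s])
    path = [last]
    for ptr in reversed(back):       # follow back-pointers from the end
        path.append(ptr[path[-1]])
    return path[::-1]

def segments(path):
    """Collapse a state path into (state, start, end) regions, end exclusive."""
    out, pos = [], 0
    for state, run in groupby(path):
        n = len(list(run))
        out.append((state, pos, pos + n))
        pos += n
    return out

# Hypothetical toy parameters.
states = ["other", "gene"]
init = {"other": 0.5, "gene": 0.5}
trans = {"other": {"other": 0.9, "gene": 0.1}, "gene": {"other": 0.1, "gene": 0.9}}
emit = {
    "other": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "gene": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
seq = "ATATAT" + "GCGCGCGC" + "ATATAT"
print(segments(viterbi(seq, states, init, trans, emit)))
```

The region boundaries fall wherever the decoded path switches state, which is exactly what the segmentation application relies on.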


VEIL: Viterbi Exon-Intron Locator

[Figure: VEIL Architecture, with states Upstream, Start Codon, Exon, Stop Codon, Downstream, 3’ Splice Site, Intron, 5’ Poly-A Site, 5’ Splice Site]

Exon HMM Model

• Enter: start codon or intron (3’ Splice Site)

• Exit: 5’ Splice site or three stop codons (taa, tag, tga)

(Slide from N. F. Samatova’s lecture)


GenScan Architecture

• It is based on Generalized HMM (GHMM)

• Model both strands at once
– Other models: Predict on one strand first, then on the other strand
– Avoids prediction of overlapping genes on the two strands (rare)

• Each state may output a string of symbols (according to some probability distribution).

• Explicit intron/exon length modeling

• Special sensors for Cap-site and TATA-box

• Advanced splice site sensors

Fig. 3, Burge and Karlin 1997


Special Issues

• Local maxima

• Optimal model construction

• Weighting training sequences


Solutions to the Local Maxima Problem

• Repeat with different initializations

• Start with the most reasonable initial model

• Simulated annealing (slow down the convergence speed)
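The first remedy, repeating with different initializations, can be sketched on a toy problem. The one-dimensional `objective` below is only a stand-in for the training likelihood, and `hill_climb` is a stand-in for Baum-Welch in the sense that both only climb to a nearby peak; the function shapes, step size, and search range are assumptions chosen for illustration.

```python
import math
import random

def objective(x):
    # Toy stand-in for the likelihood: local peak near x = -2, global peak near x = 3.
    return math.exp(-(x + 2) ** 2) + 2 * math.exp(-(x - 3) ** 2)

def hill_climb(x, step=0.01, max_iter=10000):
    """Greedy local search: move to a better neighbour until none exists."""
    for _ in range(max_iter):
        best = max((x - step, x, x + step), key=objective)
        if best == x:
            break
        x = best
    return x

def random_restarts(n_restarts, seed=0):
    """Run the local search from several random starts and keep the best result."""
    rng = random.Random(seed)
    starts = [rng.uniform(-5, 6) for _ in range(n_restarts)]
    return max((hill_climb(s) for s in starts), key=objective)

print(round(hill_climb(-3.0), 2))   # a bad start ends at the local peak
print(round(random_restarts(20), 2))
```

With an HMM, each restart would correspond to training from a different random parameter setting and keeping the model with the highest likelihood on S.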


Local Maxima: Illustration

[Figure labels: Global maximum, Local maximum, Good starting point, Bad starting point]


Optimal Model Construction

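$$p(\mathrm{HMM} \mid X) = \frac{p(X \mid \mathrm{HMM})\, p(\mathrm{HMM})}{p(X)}$$

$$\mathrm{HMM}^* = \arg\max_{\mathrm{HMM}} p(\mathrm{HMM} \mid X) = \arg\max_{\mathrm{HMM}} p(X \mid \mathrm{HMM})\, p(\mathrm{HMM})$$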

Bayesian model selection:
– p(HMM) should prefer simpler models (i.e., more constrained, fewer states, fewer transitions)
– p(HMM) could reflect our prior on the parameters
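A small sketch of how such a prior changes the choice of model: each candidate is scored by log p(X|HMM) + log p(HMM), where the prior used here is a hypothetical unnormalised one that charges a fixed penalty per state and per transition. The candidate log-likelihoods are made-up illustrative numbers, not results from any real training run.

```python
def log_prior(n_states, n_transitions, penalty=2.0):
    # Hypothetical prior: log p(HMM) = -penalty * (states + transitions), up to a constant.
    return -penalty * (n_states + n_transitions)

def select_model(candidates):
    """Return the candidate maximising log p(X|HMM) + log p(HMM)."""
    return max(
        candidates,
        key=lambda m: m["log_lik"] + log_prior(m["n_states"], m["n_transitions"]),
    )

# Illustrative numbers only: larger models fit better but pay a larger prior penalty.
candidates = [
    {"name": "small", "log_lik": -120.0, "n_states": 3, "n_transitions": 5},
    {"name": "medium", "log_lik": -105.0, "n_states": 5, "n_transitions": 9},
    {"name": "large", "log_lik": -101.0, "n_states": 10, "n_transitions": 30},
]
print(select_model(candidates)["name"])
```

Without the prior term the largest model would always win on likelihood alone, which is the overfitting that the preference for simpler models is meant to counter.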


Sequence Weighting

• Avoid over-counting similar sequences from the same organisms

• Typically compute a weight for a sequence based on an evolutionary tree

• Many ways to incorporate the weights, e.g.,
– Unequal likelihood
– Unequal weight contribution in parameter estimation
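The second option, unequal contribution in parameter estimation, can be sketched for a single emission distribution: each observed symbol adds its sequence's weight to the count instead of adding 1. The weights below are supplied by hand as an assumption (three near-identical sequences share the weight of one); in practice they would come from a tree-based or similar weighting scheme.

```python
from collections import defaultdict

def weighted_emission_probs(symbols, weights, alphabet="ACGT", pseudocount=0.0):
    """Estimate emission probabilities from weighted symbol counts."""
    counts = defaultdict(float)
    for sym, w in zip(symbols, weights):
        counts[sym] += w              # each sequence contributes its weight, not 1
    total = sum(counts[a] + pseudocount for a in alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

# One alignment column: three near-duplicate sequences emit A, one distinct sequence emits G.
column = ["A", "A", "A", "G"]
print(weighted_emission_probs(column, [1, 1, 1, 1]))          # unweighted counts
print(weighted_emission_probs(column, [1/3, 1/3, 1/3, 1.0]))  # down-weighted duplicates
```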


HMMs in Real Applications

• SAM-T98 Tutorial:
– http://www.cse.ucsc.edu/research/compbio/ismb99.tutorial.html

• Pfam
– http://www.sanger.ac.uk/Software/Pfam/


What You Should Know

• How an HMM can be used to classify sequences

• How an HMM can be used to align sequences and discover motifs

• How an HMM can be used to segment sequences (e.g., gene finding)

• The problem of local maxima and possible solutions