Hidden Markov Models
Lecture #5

Prepared by Dan Geiger.

Background Readings: Chapter 3 in the textbook (Durbin et al.).

2

Dependencies along the genome

In previous classes we assumed every letter in a sequence is sampled randomly from some distribution q() over the alphabet {A,C,T,G}.

This is not the case in real genomes.

1. Genomic sequences come in triplets, the codons, which encode amino acids via the genetic code.

2. There are special subsequences in the genome, like TATA within the regulatory region upstream of a gene.

3. The pair C followed by G is less common than expected under random sampling.

We will focus on analyzing the third example, using a model called a Hidden Markov Model.

3

Example: CpG Island

In human genomes the pair CG often transforms to (methyl-C)G, which in turn often transforms to TG.

Hence the pair CG appears less often than expected from the independent frequencies of C and G alone.

Due to biological reasons, this process is sometimes suppressed in short stretches of genomes such as in the start regions of many genes.

These areas are called CpG islands (the "p" denotes the phosphodiester bond between the C and the G).

4

Example: CpG Island (Cont.)

We consider two questions (and some variants):

Question 1: Given a short stretch of genomic data, does it come from a CpG island?

We use Markov Chains.

Question 2: Given a long piece of genomic data, does it contain CpG islands, and if so, where and of what length?

We use Hidden Markov Models.

5

(Stationary) Markov Chains

X1 X2 XL-1 XL

• Every variable Xi has a domain. For example, suppose the domain is the set of letters {a, c, t, g}.
• Every variable is associated with a local (transition) probability table p(Xi = xi | Xi-1 = xi-1), and p(X1 = x1) for the first variable.
• The joint distribution is given by

p(X1 = x1,…,XL = xL) = p(X1 = x1) p(X2 = x2 | X1 = x1) ··· p(XL = xL | XL-1 = xL-1)

                     = p(X1 = x1) ∏i=2..L p(Xi = xi | Xi-1 = xi-1)

In short: p(x1,…,xL) = p(x1) ∏i=2..L p(xi | xi-1)

Stationary means that the transition probability tables do not depend on i.
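As an aside that is not part of the slides, this factorization can be turned directly into code. The following is a minimal Python sketch for scoring a sequence under one stationary Markov chain; all numerical values are illustrative placeholders (only the A row copies the example table on the next slide).

```python
# A minimal sketch: scoring a DNA sequence under a stationary Markov chain,
# p(x1,...,xL) = p(x1) * prod_{i>=2} p(xi | xi-1).
# The initial distribution and the C/T/G rows are made-up placeholders.
import math

initial = {"A": 0.25, "C": 0.25, "T": 0.25, "G": 0.25}   # p(X1 = x1)
transition = {                                            # p(Xi = b | Xi-1 = a), rows sum to 1
    "A": {"A": 0.2, "C": 0.3, "T": 0.4, "G": 0.1},
    "C": {"A": 0.4, "C": 0.3, "T": 0.2, "G": 0.1},
    "T": {"A": 0.1, "C": 0.3, "T": 0.3, "G": 0.3},
    "G": {"A": 0.3, "C": 0.2, "T": 0.2, "G": 0.3},
}

def log_prob(seq):
    """log p(x1,...,xL); logs are used to avoid underflow on long sequences."""
    lp = math.log(initial[seq[0]])
    for prev, curr in zip(seq, seq[1:]):
        lp += math.log(transition[prev][curr])
    return lp

print(log_prob("ACGTGCA"))
```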

6

Question 1: Using two Markov chains

X1 X2 XL-1 XL

For CpG islands:

We need to specify pI(xi | xi-1) where I stands for CpG Island.

Transition table pI(Xi | Xi-1), rows indexed by Xi-1 and columns by Xi:

Xi-1 \ Xi     A       C         T         G
A             0.2     0.3       0.4       0.1      (row sums to 1)
C             0.4     p(C|C)    p(T|C)    high
T             0.1     p(C|T)    p(T|T)    p(G|T)
G             0.3     p(C|G)    p(T|G)    p(G|G)

Each row must add up to one; columns need not.

7

Question 1: Using two Markov chains

X1 X2 XL-1 XL

For non-CpG islands:

We need to specify pN(xi | xi-1) where N stands for Non CpG island.

Transition table pN(Xi | Xi-1), rows indexed by Xi-1 and columns by Xi:

Xi-1 \ Xi     A       C         T         G
A             0.2     0.3       0.25      0.25
C             0.4     p(C|C)    p(T|C)    low
T             0.1     p(C|T)    p(T|T)    high
G             0.3     p(C|G)    p(T|G)    p(G|G)

Some entries may or may not change compared to pI(xi | xi-1).

8

Question 1: Log Odds-Ratio test

Comparing the two options via a log odds-ratio test yields

log Q = log [ pI(x1,…,xL) / pN(x1,…,xL) ]
      = log [ pI(x1) / pN(x1) ] + Σi=2..L log [ pI(xi | xi-1) / pN(xi | xi-1) ]

If log Q > 0, then a CpG island is more likely. If log Q < 0, then a non-CpG island is more likely.
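A minimal Python sketch of this decision rule (not from the slides): the two transition tables below are placeholders rather than the textbook's estimates, chosen only so that p(G | C) is high in the island model and low in the non-island model; the first-letter term log pI(x1)/pN(x1) is dropped for brevity, which amounts to assuming equal initial distributions.

```python
# A sketch of the log odds-ratio test with two placeholder Markov chains.
import math

p_I = {  # CpG-island model pI(xi | xi-1); p(G | C) is high
    "A": {"A": 0.2, "C": 0.3, "T": 0.4, "G": 0.1},
    "C": {"A": 0.2, "C": 0.3, "T": 0.2, "G": 0.3},
    "T": {"A": 0.2, "C": 0.3, "T": 0.2, "G": 0.3},
    "G": {"A": 0.2, "C": 0.3, "T": 0.2, "G": 0.3},
}
p_N = {  # non-island model pN(xi | xi-1); p(G | C) is low
    "A": {"A": 0.3, "C": 0.2, "T": 0.3, "G": 0.2},
    "C": {"A": 0.3, "C": 0.3, "T": 0.3, "G": 0.1},
    "T": {"A": 0.2, "C": 0.2, "T": 0.3, "G": 0.3},
    "G": {"A": 0.3, "C": 0.2, "T": 0.3, "G": 0.2},
}

def log_odds(seq):
    """log Q = sum_i log( pI(xi | xi-1) / pN(xi | xi-1) ), first-letter term omitted."""
    return sum(math.log(p_I[a][b]) - math.log(p_N[a][b]) for a, b in zip(seq, seq[1:]))

seq = "CGCGTATACGCG"
print("CpG island" if log_odds(seq) > 0 else "non-CpG island")
```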

9

Maximum Likelihood Estimate (MLE) of the parameters (with a teacher, i.e., from labeled data)

The needed parameters are:

pI(x1), pI (xi | xi-1), pN(x1), pN(xi | xi-1)

The ML estimates are given by:

Using MLE is justified when we have a large sample. The numbers appearing in the textbook are based on 60,000 sequences. When only small samples are available, Bayesian learning is an attractive alternative, which we will cover soon.

X1 X2 XL-1 XL

pI(X1 = a) = Na,I / Σa' Na',I

where Na,I is the number of times letter a appears in CpG islands in the dataset.

pI(Xi = b | Xi-1 = a) = Nab,I / Σb' Nab',I

where Nab,I is the number of times letter b appears immediately after letter a in CpG islands in the dataset.
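A minimal sketch of these counting estimates in Python (not from the slides); the labeled "island" stretches are made up, and the optional pseudocount is a simple stand-in for the Bayesian alternative mentioned above.

```python
# ML estimation of pI(b | a) from labeled sequences: count pairs, normalize rows.
from collections import defaultdict

island_seqs = ["ACGCGCGT", "GCGCGA", "CGCGTACG"]   # hypothetical labeled CpG-island stretches

def mle_transitions(seqs, alphabet="ACGT", pseudo=1.0):
    """Return table[a][b] = (N_ab + pseudo) / sum_b' (N_ab' + pseudo)."""
    counts = {a: defaultdict(float) for a in alphabet}
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1.0
    return {a: {b: (counts[a][b] + pseudo) / sum(counts[a][c] + pseudo for c in alphabet)
                for b in alphabet}
            for a in alphabet}

p_I_hat = mle_transitions(island_seqs)
print(p_I_hat["C"]["G"])    # estimated pI(G | C)
```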

10

Hidden Markov Models (HMMs)

S1 → S2 → S3 → … → Si-1 → Si → Si+1 → …        (hidden states)
R1   R2   R3   …    Ri-1   Ri   Ri+1            (observations)

This HMM depicts the factorization:

p(s1,…,sL, r1,…,rL) = p(s1) p(r1 | s1) ∏i=2..L p(si | si-1) p(ri | si)

Application in communication: the message sent is (s1,…,sm) but we receive (r1,…,rm). Compute the most likely message sent.

Applications in Computational Biology: discussed in this and next few classes (CpG islands, Gene finding, Genetic linkage analysis).

k × k transition matrix

11

HMM for finding CpG islands

Question 2: The input is a long sequence, parts of which come from CpG islands and parts of which do not. We wish to find the most likely assignment of the two labels {I, N} to each letter in the sequence.

We define a variable Hi that encodes the letter at location i and the (hidden) label at that location. Namely,

Domain(Hi) = {I, N} × {A,C,T,G}   (8 states/values)

These hidden variables Hi are assumed to form a Markov chain:

H1 H2 HL-1 HL

The transition matrix is of size 8 × 8.

12

HMM for finding CpG islands (Cont)

The HMM:

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Domain(Hi) = {I, N} × {A,C,T,G}   (8 values)
Domain(Xi) = {A,C,T,G}   (4 values)

In this representation p(xi | hi) = 0 or 1, depending on whether xi is consistent with hi. E.g., xi = G is consistent with hi = (I,G) and with hi = (N,G), but not with any other state of hi. The size of the local probability table p(xi | hi) is 8 × 4.
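The construction of this slide can be sketched in a few lines of Python (not from the slides): hidden states are (label, letter) pairs, the emission table is the 0/1 consistency check, and the 8 × 8 transition table is assembled from two letter chains plus a label-switching probability. The letter tables and the `stay` probability below are placeholders, not estimated values.

```python
# A sketch of the 8-state CpG HMM: Domain(Hi) = {I, N} x {A,C,T,G}.
from itertools import product

LETTERS = "ACGT"
STATES = [(label, x) for label in "IN" for x in LETTERS]        # 8 hidden states

def emission(x, h):
    """p(xi | hi) is 0 or 1: the observed letter must match the letter inside hi."""
    return 1.0 if h[1] == x else 0.0

def make_transition(p_I, p_N, stay=0.9):
    """8x8 table: keep the current label with probability `stay` (placeholder),
    switch otherwise; the letter part follows the chain of the next label."""
    T = {}
    for (l1, a), (l2, b) in product(STATES, STATES):
        letter_part = (p_I if l2 == "I" else p_N)[a][b]
        label_part = stay if l1 == l2 else 1.0 - stay
        T[(l1, a), (l2, b)] = label_part * letter_part
    return T

# Uniform placeholder letter chains, just to make the sketch runnable:
p_I = {a: {b: 0.25 for b in LETTERS} for a in LETTERS}
p_N = {a: {b: 0.25 for b in LETTERS} for a in LETTERS}
T = make_transition(p_I, p_N)
print(sum(T[("I", "A"), s] for s in STATES))    # each row sums to 1 (up to floating point)
```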

13

Queries of interest (MAP)

H1 H2 HL-1 HL

X1 X2 XL-1 XL

The Maximum A Posteriori query:

(h1*,…,hL*) = argmax(h1,…,hL) p(h1,…,hL | x1,…,xL)

An efficient solution, assuming local probability tables (“the parameters”) are known, is called the Viterbi Algorithm.

This is the same problem as maximizing the joint distribution p(h1,…,hL,x1,…,xL), since the two objectives differ only by the constant factor p(x1,…,xL).

An answer to this query gives the most probable N/I labeling for all locations.

14

Queries of interest (Belief Update) Posterior Decoding

1. Compute the posterior belief in Hi (for a specific i) given the evidence {x1,…,xL}, for each of Hi's values hi; namely, compute p(hi | x1,…,xL).

2. Do the same computation for every Hi but without repeating the first task L times.

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi

Local probability tables are assumed to be known. An answer to this query gives the probability of having label I or N at an arbitrary location.

15

Learning the parameters (EM algorithm)

A common algorithm to learn the parameters from unlabeled sequences is called the Expectation-Maximization (EM) algorithm. We will devote several classes to it. In the current context, we just say that it is an iterative algorithm repeating E-step and M-step until convergence.

The E-step uses the algorithms we develop in this class.

16

Decomposing the computation of Belief update (Posterior decoding)

P(x1,…,xL,hi) = P(x1,…,xi,hi) P(xi+1,…,xL | x1,…,xi,hi)

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi

Equality due to Ind({xi+1,…,xL}, {x1,…,xi} | Hi):

= P(x1,…,xi,hi) P(xi+1,…,xL | hi) ≡ f(hi) b(hi)

Belief update: P(hi | x1,…,xL) = (1/K) P(x1,…,xL,hi)

where K = Σhi P(x1,…,xL,hi).

17

The forward algorithm

P(x1,x2,h2) = Σh1 P(x1,h1,h2,x2)   {second step}

            = Σh1 P(x1,h1) P(h2 | x1,h1) P(x2 | x1,h1,h2)

            = Σh1 P(x1,h1) P(h2 | h1) P(x2 | h2)

Last equality due to conditional independence.

H1 H2

X1 X2

Hi

Xi

The task: Compute f(hi) = P(x1,…,xi,hi) for i=1,…,L (namely, considering evidence up to time slot i).

P(x1, h1) = P(h1) P(x1|h1) {Basis step}

P(x1,…,xi,hi) = Σhi-1 P(x1,…,xi-1,hi-1) P(hi | hi-1) P(xi | hi)   {step i}
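A minimal Python sketch of the forward pass (not from the slides). For brevity it uses a generic two-state HMM with made-up tables rather than the 8-state CpG model; `init`, `trans`, and `emit` are placeholder names for p(h1), p(hi | hi-1), and p(xi | hi).

```python
# Forward algorithm: f[i][h] = P(x1,...,x_{i+1}, H_{i+1} = h) (0-based list index i).
def forward(xs, states, init, trans, emit):
    f = [{h: init[h] * emit[h][xs[0]] for h in states}]              # basis step
    for x in xs[1:]:                                                 # step i
        prev = f[-1]
        f.append({h: emit[h][x] * sum(prev[g] * trans[g][h] for g in states)
                  for h in states})
    return f

# Tiny placeholder model (two hidden labels, letter emissions):
states = ["I", "N"]
init  = {"I": 0.5, "N": 0.5}
trans = {"I": {"I": 0.9, "N": 0.1}, "N": {"I": 0.1, "N": 0.9}}
emit  = {"I": {"A": 0.1, "C": 0.4, "T": 0.1, "G": 0.4},
         "N": {"A": 0.3, "C": 0.2, "T": 0.3, "G": 0.2}}
f = forward("CGCGA", states, init, trans, emit)
print(sum(f[-1].values()))    # summing out h_L gives the likelihood P(x1,...,xL)
```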

18

The backward algorithm

The task: Compute b(hi) = P(xi+1,…,xL|hi) for i=L-1,…,1 (namely, considering evidence after time slot i).

HL-1 HL

XL-1 XL

Hi Hi+1

Xi+1

b(hL-1) = P(xL | hL-1) = ΣhL P(xL,hL | hL-1) = ΣhL P(hL | hL-1) P(xL | hL-1,hL)

Last equality due to conditional independence:

        = ΣhL P(hL | hL-1) P(xL | hL)   {first step}

b(hi) = P(xi+1,…,xL | hi) = Σhi+1 P(hi+1 | hi) P(xi+1 | hi+1) P(xi+2,…,xL | hi+1)

                          = Σhi+1 P(hi+1 | hi) P(xi+1 | hi+1) b(hi+1)   {step i}
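The matching backward pass, again as a minimal sketch with the same placeholder two-state model as in the forward sketch above (none of these numbers come from the lecture).

```python
# Backward algorithm: b[i][h] = P(x_{i+2},...,xL | H_{i+1} = h); b[L-1] is all ones.
def backward(xs, states, trans, emit):
    L = len(xs)
    b = [None] * L
    b[L - 1] = {h: 1.0 for h in states}                          # b(hL) = 1 for every hL
    for i in range(L - 2, -1, -1):                               # step i
        b[i] = {h: sum(trans[h][g] * emit[g][xs[i + 1]] * b[i + 1][g] for g in states)
                for h in states}
    return b

# Same placeholder model as in the forward sketch:
states = ["I", "N"]
trans = {"I": {"I": 0.9, "N": 0.1}, "N": {"I": 0.1, "N": 0.9}}
emit  = {"I": {"A": 0.1, "C": 0.4, "T": 0.1, "G": 0.4},
         "N": {"A": 0.3, "C": 0.2, "T": 0.3, "G": 0.2}}
b = backward("CGCGA", states, trans, emit)
print(b[0])    # P(x2,...,xL | H1 = h) for each h
```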

19

The combined answer

1. To compute the posterior belief in Hi (for a specific i) given the evidence {x1,…,xL}: run the forward algorithm to compute f(hi) = P(x1,…,xi,hi), run the backward algorithm to compute b(hi) = P(xi+1,…,xL|hi); the product f(hi)b(hi) is the answer (for every possible value hi).

2. To compute the posterior belief for every Hi, simply run the forward and backward algorithms once, storing f(hi) and b(hi) for every i (and every value hi). Then compute f(hi)b(hi) for every i.

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi
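Combining the two passes gives posterior decoding. This sketch assumes the `forward()` and `backward()` functions (and the placeholder model) from the two sketches above.

```python
# Posterior decoding: P(hi | x1,...,xL) = f(hi) b(hi) / K, with K = sum_{hi} f(hi) b(hi).
def posteriors(f, b):
    out = []
    for fi, bi in zip(f, b):
        K = sum(fi[h] * bi[h] for h in fi)          # K = P(x1,...,xL), same for every i
        out.append({h: fi[h] * bi[h] / K for h in fi})
    return out

# post = posteriors(forward(xs, states, init, trans, emit),
#                   backward(xs, states, trans, emit))
# post[i]["I"] would then be the probability that position i carries label I.
```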

20

Consequence I: The E-step

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi

Recall that belief update has been computed via P(x1,…,xL,hi) = P(x1,…,xi,hi) P(xi+1,…,xL | hi) ≡ f(hi) b(hi)

Now we wish to compute (for the E-step):

p(x1,…,xL, hi, hi+1) = p(x1,…,xi,hi) p(hi+1 | hi) p(xi+1 | hi+1) p(xi+2,…,xL | hi+1)

                     = f(hi) p(hi+1 | hi) p(xi+1 | hi+1) b(hi+1)
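As a sketch (not from the slides), this pairwise quantity can be assembled directly from the `forward()`/`backward()` outputs of the earlier sketches, with `trans` and `emit` the same placeholder tables.

```python
# E-step ingredient: p(x1,...,xL, hi, hi+1) = f(hi) p(hi+1|hi) p(xi+1|hi+1) b(hi+1).
def pairwise_joint(i, f, b, xs, trans, emit):
    """Dict over (hi, hi+1) pairs; normalize by P(x1,...,xL) to get a posterior."""
    return {(g, h): f[i][g] * trans[g][h] * emit[h][xs[i + 1]] * b[i + 1][h]
            for g in f[i] for h in b[i + 1]}
```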

21

Consequence II: Likelihood of evidence

1. To compute the likelihood of evidence P(x1,…,xL), do one more step in the forward algorithm, namely,

P(x1,…,xL) = ΣhL f(hL) = ΣhL P(x1,…,xL,hL)

2. Alternatively, do one more step in the backward algorithm, namely,

P(x1,…,xL) = Σh1 b(h1) P(h1) P(x1|h1) = Σh1 P(x2,…,xL|h1) P(h1) P(x1|h1)

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi
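As a quick check (a sketch building on the forward/backward sketches above), the two ways of computing the likelihood give the same number up to rounding.

```python
# Likelihood of evidence, computed from the forward pass and from the backward pass.
def likelihood_forward(f):
    return sum(f[-1].values())                                    # sum over hL of f(hL)

def likelihood_backward(b, xs, init, emit):
    return sum(b[0][h] * init[h] * emit[h][xs[0]] for h in b[0])  # sum over h1
```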

22

Time and Space Complexity of the forward/backward algorithms

Time complexity is linear in the length of the chain, provided the number of states of each variable is a constant. More precisely, time complexity is O(k²L), where k is the maximum domain size of each variable.

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi

Space complexity is also O(k²L).

23

The MAP query in HMM

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi

1. Recall that the likelihood-of-evidence query is to compute

P(x1,…,xL) = Σ(h1,…,hL) P(x1,…,xL, h1,…,hL)

2. Now we wish to compute a similar quantity:

P*(x1,…,xL) = max(h1,…,hL) P(x1,…,xL, h1,…,hL)

And, of course, we wish to find a MAP assignment (h1*,…,hL*) that brings about this maximum.

24

Example: Revisiting likelihood of evidence

H1 H2

X1 X2

H3

X3

P(x1,x2,x3) = Σh1 P(h1)P(x1|h1) Σh2 P(h2|h1)P(x2|h2) Σh3 P(h3|h2)P(x3|h3)

            = Σh1 P(h1)P(x1|h1) Σh2 P(h2|h1)P(x2|h2) b(h2)

            = Σh1 P(h1)P(x1|h1) b(h1)

25

Example: Computing the MAP assignment

H1 H2

X1 X2

H3

X3

Replace sums with maximization:

maximum = maxh1 P(h1)P(x1|h1) maxh2 P(h2|h1)P(x2|h2) maxh3 P(h3|h2)P(x3|h3)

        = maxh1 P(h1)P(x1|h1) maxh2 P(h2|h1)P(x2|h2) b(h2)

        = maxh1 P(h1)P(x1|h1) b(h1)   {finding the maximum}

where b(h2) = maxh3 P(h3|h2)P(x3|h3) and b(h1) = maxh2 P(h2|h1)P(x2|h2) b(h2).

Finding the MAP assignment:

h1* = argmaxh1 P(h1)P(x1|h1) b(h1)

h2* = x*(h1*), where x*(h1) = argmaxh2 P(h2|h1)P(x2|h2) b(h2) stores the best h2 for each value of h1

h3* = x*(h2*), where x*(h2) = argmaxh3 P(h3|h2)P(x3|h3) stores the best h3 for each value of h2

26

Viterbi’s algorithm

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi

Backward phase (storing, for each value of the parent hi, the best value and the best choice of hi+1):

b(hL) = 1

For i = L-1 downto 1 do

    b(hi) = maxhi+1 P(hi+1 | hi) P(xi+1 | hi+1) b(hi+1)

    x*(hi) = argmaxhi+1 P(hi+1 | hi) P(xi+1 | hi+1) b(hi+1)

Forward phase (tracing the MAP assignment):

h1* = argmaxh1 P(h1) P(x1 | h1) b(h1)

For i = 1 to L-1 do

    hi+1* = x*(hi*)
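A minimal Python sketch of Viterbi's algorithm as written above: a backward max-product phase that stores, per state, the best value and the best next state, followed by a forward trace of the MAP assignment. The two-state model is the same made-up placeholder used in the earlier sketches, not the lecture's 8-state CpG model.

```python
# Viterbi: backward phase computes b(hi) and x*(hi); forward phase traces h1*,...,hL*.
def viterbi(xs, states, init, trans, emit):
    L = len(xs)
    b = [{h: 1.0 for h in states}]                  # b(hL) = 1
    best = []                                       # best[i][h] = x*(hi): best h_{i+1} given hi = h
    for i in range(L - 2, -1, -1):                  # backward phase
        scores = {h: {g: trans[h][g] * emit[g][xs[i + 1]] * b[0][g] for g in states}
                  for h in states}
        best.insert(0, {h: max(scores[h], key=scores[h].get) for h in states})
        b.insert(0, {h: max(scores[h].values()) for h in states})
    start = {h: init[h] * emit[h][xs[0]] * b[0][h] for h in states}
    path = [max(start, key=start.get)]              # h1* = argmax P(h1) P(x1|h1) b(h1)
    for i in range(L - 1):                          # forward phase: h_{i+1}* = x*(hi*)
        path.append(best[i][path[-1]])
    return path

states = ["I", "N"]
init  = {"I": 0.5, "N": 0.5}
trans = {"I": {"I": 0.9, "N": 0.1}, "N": {"I": 0.1, "N": 0.9}}
emit  = {"I": {"A": 0.1, "C": 0.4, "T": 0.1, "G": 0.4},
         "N": {"A": 0.3, "C": 0.2, "T": 0.3, "G": 0.2}}
print(viterbi("CGCGAATTTT", states, init, trans, emit))
```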

27

Summary of HMM

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi

1. Belief update = posterior decoding
   • Forward-Backward algorithm

2. Maximum A Posteriori assignment
   • Viterbi algorithm

3. Learning parameters
   • The EM algorithm
   • Viterbi training