
CISC 636 Computational Biology & Bioinformatics (Fall 2016)

Hidden Markov Models (III)

- Viterbi training

- Baum-Welch algorithm

- Maximum Likelihood

- Expectation Maximization


Model building

- Topology
  - Requires domain knowledge
- Parameters
  - When states are labeled for sequences of observables:
    simple counting (Maximum Likelihood), see the counting sketch below:
    $a_{kl} = A_{kl} / \sum_{l'} A_{kl'}$ and $e_k(b) = E_k(b) / \sum_{b'} E_k(b')$

- When states are not labeled

Method 1 (Viterbi training)

1. Assign random parameters.

2. Use the Viterbi algorithm for labeling/decoding.

3. Do counting to collect new $a_{kl}$ and $e_k(b)$.

4. Repeat steps 2 and 3 until a stopping criterion is met.

Method 2 (Baum-Welch algorithm)
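
A minimal counting sketch in Python may help make the re-estimation concrete (the function name count_ml_estimates and the data layout are my own, not from the slides). Given labeled state paths it computes the Maximum Likelihood estimates above; with Viterbi-decoded paths it is exactly the counting step of Viterbi training.

```python
import numpy as np

def count_ml_estimates(seqs, paths, n_states, alphabet):
    """ML estimation of HMM parameters by simple counting.

    seqs  : list of observation sequences (e.g. strings)
    paths : list of state paths (same lengths), either true labels or
            Viterbi-decoded labels (step 2 of Viterbi training)
    """
    sym = {b: i for i, b in enumerate(alphabet)}
    A = np.zeros((n_states, n_states))        # transition counts A_kl
    E = np.zeros((n_states, len(alphabet)))   # emission counts E_k(b)

    for x, pi in zip(seqs, paths):
        for i in range(len(x)):
            E[pi[i], sym[x[i]]] += 1          # state pi_i emits x_i
            if i + 1 < len(x):
                A[pi[i], pi[i + 1]] += 1      # transition pi_i -> pi_{i+1}

    # a_kl = A_kl / sum_l' A_kl',  e_k(b) = E_k(b) / sum_b' E_k(b')
    # (add pseudocounts to A and E first if some state is never visited)
    a = A / A.sum(axis=1, keepdims=True)
    e = E / E.sum(axis=1, keepdims=True)
    return a, e

# Tiny made-up example: 2 states over the alphabet "HT"
a, e = count_ml_estimates(["HTHHT"], [[0, 0, 1, 1, 0]], 2, "HT")
```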


Baum-Welch algorithm (Expectation-Maximization)

• An iterative procedure similar to Viterbi training

• Probability that transition $a_{kl}$ is used at position i in sequence $x^j$:

$P(\pi_i = k, \pi_{i+1} = l \mid x^j, \theta) = f_k^j(i)\, a_{kl}\, e_l(x^j_{i+1})\, b_l^j(i+1) \,/\, P(x^j)$

• Calculate the expected number of times that $a_{kl}$ is used by summing over all positions and over all training sequences:

$A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x^j_{i+1})\, b_l^j(i+1)$

• Similarly, calculate the expected number of times that symbol b is emitted in state k:

$E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{\{i \mid x^j_i = b\}} f_k^j(i)\, b_k^j(i)$
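
To make these formulas concrete, here is a minimal E-step sketch in Python for a single training sequence. It is only a sketch: it assumes a uniform initial state distribution, omits begin/end states, and uses no scaling (so it will underflow on long sequences); the name expected_counts and the array layout are my own choices.

```python
import numpy as np

def expected_counts(x, a, e, sym_index):
    """One Baum-Welch E-step for a single sequence x.

    a : (K, K) transition probabilities, e : (K, M) emission probabilities.
    Returns expected transition counts A and expected emission counts E.
    """
    K = a.shape[0]
    L = len(x)
    obs = [sym_index[c] for c in x]

    # Forward: f[k, i] = P(x_1..x_i, pi_i = k); uniform start assumed
    f = np.zeros((K, L))
    f[:, 0] = e[:, obs[0]] / K
    for i in range(1, L):
        f[:, i] = e[:, obs[i]] * (f[:, i - 1] @ a)

    # Backward: b[k, i] = P(x_{i+1}..x_L | pi_i = k)
    b = np.zeros((K, L))
    b[:, L - 1] = 1.0
    for i in range(L - 2, -1, -1):
        b[:, i] = a @ (e[:, obs[i + 1]] * b[:, i + 1])

    Px = f[:, L - 1].sum()                      # P(x)

    # A_kl = (1/P(x)) * sum_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1)
    A = np.zeros((K, K))
    for i in range(L - 1):
        A += np.outer(f[:, i], e[:, obs[i + 1]] * b[:, i + 1]) * a / Px

    # E_k(b) = (1/P(x)) * sum over positions i where x_i = b of f_k(i) b_k(i)
    E = np.zeros_like(e)
    for i in range(L):
        E[:, obs[i]] += f[:, i] * b[:, i] / Px

    return A, E
```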


Maximum Likelihood

Define $L(\theta) = P(x \mid \theta)$.

Estimate $\theta$ such that the distribution with the estimated $\theta$ best agrees with, or supports, the data observed so far:

$\theta_{ML} = \arg\max_\theta L(\theta)$

E.g., there are black and white balls in a box. What is the probability p of picking up a black ball? Do sampling (with replacement).

Maximum Likelihood (continued)

$\theta_{ML} = \arg\max_\theta L(\theta)$. When $L(\theta)$ is differentiable, solve $\left.\frac{\partial L(\theta)}{\partial \theta}\right|_{\theta = \theta_{ML}} = 0$.

Example: we want to know the ratio of black balls to white balls, in other words the probability p of picking up a black ball. Sampling (with replacement) 100 times gives counts: white ball 91, black ball 9.

$P(\text{data} \mid p) = p^9 (1-p)^{91}$ (i.i.d. draws), so the likelihood is $L(p) = p^9 (1-p)^{91}$.

$\frac{\partial L(p)}{\partial p} = 9\, p^8 (1-p)^{91} - 91\, p^9 (1-p)^{90} = 0$

$\Rightarrow p_{ML} = 9/100 = 9\%$. The ML estimate of p is just the observed frequency.
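
A quick numerical check of this result (a throwaway sketch, not part of the slides): maximize the log-likelihood on a grid.

```python
import numpy as np

# L(p) = p^9 (1 - p)^91; work with the log-likelihood to avoid tiny numbers.
p = np.linspace(1e-6, 1 - 1e-6, 100001)
log_L = 9 * np.log(p) + 91 * np.log(1 - p)
print(p[np.argmax(log_L)])   # ~0.09, i.e. the observed frequency 9/100
```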


A proof that the observed frequencies give the ML estimate of the probabilities for a multinomial distribution

Let $n_i$ be the count for outcome $i$, and let the observed frequencies be $\hat\theta_i = n_i / N$, where $N = \sum_i n_i$.

Claim: if $\theta_i^{ML} = n_i / N$, then $P(n \mid \theta^{ML}) > P(n \mid \theta)$ for any $\theta \neq \theta^{ML}$.

Proof:

$\log \frac{P(n \mid \theta^{ML})}{P(n \mid \theta)} = \log \frac{\prod_i (\theta_i^{ML})^{n_i}}{\prod_i \theta_i^{n_i}} = \sum_i n_i \log \frac{\theta_i^{ML}}{\theta_i} = N \sum_i \theta_i^{ML} \log \frac{\theta_i^{ML}}{\theta_i} = N\, H(\theta^{ML} \,\|\, \theta) \ge 0$

where $H(\theta^{ML} \| \theta)$ is the relative entropy (Kullback-Leibler divergence), which is non-negative and zero only when $\theta = \theta^{ML}$.
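
The claim can also be illustrated numerically (a small sketch; the counts are the ball example from the previous slide, and the random comparison vectors are my own addition):

```python
import numpy as np

# The multinomial log-likelihood sum_i n_i log(theta_i) is highest at the
# observed frequencies theta_i = n_i / N.
n = np.array([9, 91])                       # 9 black balls, 91 white balls
theta_ml = n / n.sum()                      # observed frequencies

def log_lik(theta):
    return np.sum(n * np.log(theta))

rng = np.random.default_rng(0)
for _ in range(5):
    theta = rng.dirichlet(np.ones(len(n)))  # some other probability vector
    assert log_lik(theta_ml) >= log_lik(theta)
print("observed frequencies win:", theta_ml)
```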


Maximum Likelihood: pros and cons

- Consistent, i.e., in the limit of a large amount of data, the ML estimate converges to the true parameters by which the data were created.
- Simple.
- Poor estimate when data are insufficient. E.g., if you roll a die fewer than 6 times, the ML estimate for some numbers would be zero.

Pseudocounts:

$\theta_i = \frac{n_i + \alpha_i}{N + A}$, where $A = \sum_i \alpha_i$ and $\alpha_i$ are the pseudocounts added to the observed counts $n_i$.
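
A small numerical illustration of the pseudocount formula, using made-up counts for the die example above and $\alpha_i = 1$ for every face (Laplace smoothing):

```python
import numpy as np

counts = np.array([2, 1, 1, 0, 0, 0])            # n_i: faces 4-6 never observed
alpha = np.ones_like(counts, dtype=float)        # pseudocounts alpha_i
theta_ml = counts / counts.sum()                 # ML: zero probability for faces 4-6
theta_pc = (counts + alpha) / (counts.sum() + alpha.sum())
print(theta_ml)   # [0.5  0.25 0.25 0.   0.   0.  ]
print(theta_pc)   # [0.3  0.2  0.2  0.1  0.1  0.1 ]
```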


Conditional Probability and Joint Probability


Example (counting over a set of 13 objects): P(one) = 5/13

P(square) = 8/13

P(one, square) = 3/13

P(one | square) = 3/8 = P(one, square) / P(square)

In general, $P(D, M) = P(D \mid M)\, P(M) = P(M \mid D)\, P(D)$

$\Rightarrow$ Bayes' rule: $P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D)}$

Conditional Probability and Conditional Independence


Bayes' rule: $P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D)}$

Example: disease diagnosis/inference.
P(Leukemia | Fever) = ?
Given: P(Fever | Leukemia) = 0.85, P(Fever) = 0.9, P(Leukemia) = 0.005.
P(Leukemia | Fever) = P(F | L) P(L) / P(F) = 0.85 * 0.005 / 0.9 ≈ 0.0047
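
The arithmetic, as a one-line check in Python (nothing here beyond the numbers on the slide):

```python
# Bayes' rule: P(Leukemia | Fever) = P(Fever | Leukemia) P(Leukemia) / P(Fever)
p_f_given_l, p_f, p_l = 0.85, 0.9, 0.005
print(round(p_f_given_l * p_l / p_f, 4))   # 0.0047
```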

Bayesian Inference

Maximum a posteriori (MAP) estimate:

$\theta_{MAP} = \arg\max_\theta P(\theta \mid x)$

By Bayes' rule, $P(\theta \mid x) \propto P(x \mid \theta)\, P(\theta)$, so the MAP estimate weights the likelihood by a prior over $\theta$.


Expectation Maximization

Let x be the observed data, y the hidden (missing) data, and $\theta$ the parameters.

$P(x, y \mid \theta) = P(y \mid x, \theta)\, P(x \mid \theta)$, hence $\log P(x \mid \theta) = \log P(x, y \mid \theta) - \log P(y \mid x, \theta)$.

Taking the expectation of both sides with respect to $P(y \mid x, \theta^t)$ (which sums to 1 over y):

$\log P(x \mid \theta) = \sum_y P(y \mid x, \theta^t) \log P(x, y \mid \theta) - \sum_y P(y \mid x, \theta^t) \log P(y \mid x, \theta)$

Define $Q(\theta \mid \theta^t) = \sum_y P(y \mid x, \theta^t) \log P(x, y \mid \theta)$.

Subtracting the same identity evaluated at $\theta = \theta^t$:

$\log P(x \mid \theta) - \log P(x \mid \theta^t) = Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t) + \sum_y P(y \mid x, \theta^t) \log \frac{P(y \mid x, \theta^t)}{P(y \mid x, \theta)} \;\ge\; Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t)$

since the last sum is a relative entropy and hence non-negative.

- E-step (Expectation): compute $P(y \mid x, \theta^t)$ and hence $Q(\theta \mid \theta^t)$.
- M-step (Maximization): $\theta^{t+1} = \arg\max_\theta Q(\theta \mid \theta^t)$.

Therefore $\log P(x \mid \theta^{t+1}) \ge \log P(x \mid \theta^t)$: each iteration can only increase the likelihood.

EM explanation of the Baum-Welch algorithm

We would like to maximize $P(x \mid \theta) = \sum_\pi P(x, \pi \mid \theta)$ by choosing $\theta$, but the state path $\pi$ is a hidden variable. Thus, EM, with

$Q(\theta \mid \theta^t) = \sum_\pi P(\pi \mid x, \theta^t) \log P(x, \pi \mid \theta)$

EM explanation of the Baum-Welch algorithm (continued)

With the expected counts $A_{kl}$ and $E_k(b)$ defined earlier, $Q(\theta \mid \theta^t)$ separates into a transition part (the A-term) and an emission part (the E-term).

The A-term is maximized if

$a_{kl}^{EM} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$

The E-term is maximized if

$e_k^{EM}(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$
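
Put together with the E-step sketch shown earlier, the M-step is just this normalization (again a sketch; the function name m_step is my own):

```python
import numpy as np

def m_step(A, E):
    """Baum-Welch M-step: normalize expected counts into probabilities."""
    a = A / A.sum(axis=1, keepdims=True)   # a_kl = A_kl / sum_l' A_kl'
    e = E / E.sum(axis=1, keepdims=True)   # e_k(b) = E_k(b) / sum_b' E_k(b')
    return a, e
```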
