Transcript of Slide3 HMM

Slide 1/51

Hidden Markov Models

Dr. Nguyễn Văn Vinh
Department of Computer Science, University of Engineering and Technology, Vietnam National University, Hanoi

Slide 2/51

Why Learn?

Machine learning is programming computers to optimize a performance criterion using example data or past experience.

There is no need to learn to calculate payroll.

Learning is used when:
Human expertise does not exist (navigating on Mars)
Humans are unable to explain their expertise (speech recognition)
The solution changes over time (routing on a computer network)
The solution needs to be adapted to particular cases (user biometrics)

Slide 3/51

What We Talk About When We Talk About Learning

Learning general models from data of particular examples.

Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.

Example in retail, from customer transactions to consumer behavior:
People who bought "The Da Vinci Code" also bought "The Five People You Meet in Heaven" (www.amazon.com)

Build a model that is a good and useful approximation to the data.

Slide 4/51

What is Machine Learning?

Optimize a performance criterion using example data or past experience.

Role of Statistics: inference from a sample.
Role of Computer Science: efficient algorithms to solve the optimization problem, and to represent and evaluate the model for inference.

Slide 5/51

Slide 6/51

Part-of-Speech Tagging

Input:
Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

Output:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.

N = Noun
V = Verb
P = Preposition
ADV = Adverb
ADJ = Adjective

Slide 7/51

Face Recognition

Training examples of a person; test images.

AT&T Laboratories, Cambridge UK: http://www.uk.research.att.com/facedatabase.html

Slide 8/51

Introduction

Modeling dependencies in input:

Sequences:
Temporal: in speech, phonemes in a word (dictionary), words in a sentence (syntax, semantics of the language); in handwriting, pen movements.
Spatial: in a DNA sequence, base pairs.

Slide 9/51

Discrete Markov Process

N states: S_1, S_2, ..., S_N
State at time t: q_t = S_i

First-order Markov:
P(q_{t+1} = S_j | q_t = S_i, q_{t-1} = S_k, \ldots) = P(q_{t+1} = S_j | q_t = S_i)

Transition probabilities:
a_{ij} = P(q_{t+1} = S_j | q_t = S_i), with a_{ij} \ge 0 and \sum_{j=1}^N a_{ij} = 1

Initial probabilities:
\pi_i = P(q_1 = S_i), with \sum_{i=1}^N \pi_i = 1

Slide 10/51

Stochastic Automaton

P(O = Q | A, \Pi) = P(q_1) \prod_{t=2}^T P(q_t | q_{t-1}) = \pi_{q_1} a_{q_1 q_2} \cdots a_{q_{T-1} q_T}

Slide 11/51

Example: Balls and Urns

Three urns, each full of balls of one color:
S_1: red, S_2: blue, S_3: green

\Pi = [0.5, 0.2, 0.3]^T

A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

O = {S_1, S_1, S_3, S_3}

P(O | A, \Pi) = P(S_1) \cdot P(S_1 | S_1) \cdot P(S_3 | S_1) \cdot P(S_3 | S_3)
             = \pi_1 \cdot a_{11} \cdot a_{13} \cdot a_{33}
             = 0.5 \cdot 0.4 \cdot 0.3 \cdot 0.8 = 0.048
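The worked example above is small enough to check directly. Below is a minimal Python sketch (state indices 0..2 stand for S_1..S_3) that multiplies \pi_{q_1} by the transition probabilities along the sequence; it reproduces the 0.048 computed above.

```python
pi = [0.5, 0.2, 0.3]                   # initial probabilities pi_i
A = [[0.4, 0.3, 0.3],                  # transition probabilities a_ij
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def sequence_probability(states, pi, A):
    """P(Q | A, Pi) = pi_{q1} * prod_t a_{q_{t-1} q_t}."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

O = [0, 0, 2, 2]                       # S1, S1, S3, S3
print(sequence_probability(O, pi, A))  # 0.5 * 0.4 * 0.3 * 0.8 = 0.048
```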

Slide 12/51

Balls and Urns: Learning

Given K example sequences of length T, the maximum-likelihood estimates are:

\hat{\pi}_i = (# sequences starting in S_i) / (# sequences)
           = \sum_k 1(q_1^k = S_i) / K

\hat{a}_{ij} = (# transitions from S_i to S_j) / (# transitions from S_i)
            = \sum_k \sum_{t=1}^{T-1} 1(q_t^k = S_i and q_{t+1}^k = S_j) / \sum_k \sum_{t=1}^{T-1} 1(q_t^k = S_i)
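A sketch of these counting estimates in Python, continuing the state-index convention of the previous snippet. The three example sequences are hypothetical, made up just to exercise the code.

```python
def estimate(sequences, n_states):
    """MLE of pi and A from fully observed state sequences."""
    pi = [0.0] * n_states
    A = [[0.0] * n_states for _ in range(n_states)]
    from_count = [0] * n_states
    for seq in sequences:
        pi[seq[0]] += 1                      # count sequence starts
        for prev, cur in zip(seq, seq[1:]):
            A[prev][cur] += 1                # count transitions prev -> cur
            from_count[prev] += 1            # count transitions out of prev
    pi = [c / len(sequences) for c in pi]
    A = [[c / from_count[i] if from_count[i] else 0.0 for c in row]
         for i, row in enumerate(A)]
    return pi, A

seqs = [[0, 0, 2, 2], [0, 1, 1, 2], [2, 2, 2, 0]]   # hypothetical data
pi_hat, A_hat = estimate(seqs, n_states=3)
```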

Slide 13/51

Hidden Markov Models

States are not observable.

Discrete observations {v_1, v_2, ..., v_M} are recorded; a probabilistic function of the state.

Emission probabilities: b_j(m) = P(O_t = v_m | q_t = S_j)

Example: In each urn, there are balls of different colors, but with different probabilities.

For each observation sequence, there are multiple possible state sequences.

Slide 14/51

Hidden Markov Model (HMM)

HMMs allow you to estimate probabilities of unobserved events.

Given the observed surface data (e.g., plain text), infer which underlying parameters generated it.

E.g., in speech recognition, the observed data is the acoustic signal and the words are the hidden parameters.

Slide 15/51

HMMs and their Usage

HMMs are very common in Computational Linguistics:
Speech recognition (observed: acoustic signal, hidden: words)
Handwriting recognition (observed: image, hidden: words)
Part-of-speech tagging (observed: words, hidden: part-of-speech tags)
Machine translation (observed: foreign words, hidden: words in target language)

Slide 16/51

Noisy Channel Model

In speech recognition you observe an acoustic signal (A = a_1, ..., a_n) and you want to determine the most likely sequence of words (W = w_1, ..., w_n): P(W | A).

Problem: A and W are too specific for reliable counts on observed data, and are very unlikely to occur in unseen data.

Slide 17/51

Noisy Channel Model

Assume that the acoustic signal (A) is already segmented with respect to word boundaries.

P(W | A) could be computed as

P(W | A) \approx \max_W \prod_i P(w_i | a_i)

Problem: Finding the most likely word corresponding to an acoustic representation depends on the context.
E.g., /'pre-z&ns/ could mean "presents" or "presence" depending on the context.

Slide 18/51

Noisy Channel Model

Given a candidate sequence W we need to compute P(W) and combine it with P(W | A).

Applying Bayes' rule:

\argmax_W P(W | A) = \argmax_W P(A | W) P(W) / P(A)

The denominator P(A) can be dropped, because it is constant for all W.

Slide 19/51

Noisy Channel in a Picture

Slide 20/51

Decoding

The decoder combines evidence from:

The likelihood P(A | W), which can be approximated as
P(A | W) \approx \prod_{i=1}^n P(a_i | w_i)

The prior P(W), which can be approximated as
P(W) \approx P(w_1) \prod_{i=2}^n P(w_i | w_{i-1})

Slide 21/51

Search Space

Given a word-segmented acoustic sequence, list all candidates and compute the most likely path.

'bot      ik-'spen-siv    'pre-z&ns
boat      excessive       presidents
bald      expensive       presence
bold      expressive      presents
bought    inactive        press

Transitions are scored by the prior, e.g. P(inactive | bald); emissions by the likelihood, e.g. P('bot | bald).
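As a toy illustration, the sketch below scores one candidate path through such a search space by multiplying the likelihood P(a_i | w_i) with the bigram prior P(w_i | w_{i-1}). All probability values here are invented for illustration; a real recognizer would estimate them from data.

```python
# Invented likelihoods P(a | w) and bigram priors P(w | prev).
likelihood = {("'bot", "bought"): 0.3, ("ik-'spen-siv", "expensive"): 0.4,
              ("'pre-z&ns", "presents"): 0.3}
bigram = {("<s>", "bought"): 0.1, ("bought", "expensive"): 0.05,
          ("expensive", "presents"): 0.02}

def score(acoustic, words):
    """P(A | W) * P(W) under the independence/bigram approximations."""
    p = 1.0
    prev = "<s>"
    for a, w in zip(acoustic, words):
        p *= likelihood[(a, w)] * bigram[(prev, w)]
        prev = w
    return p

print(score(["'bot", "ik-'spen-siv", "'pre-z&ns"],
            ["bought", "expensive", "presents"]))
```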

Slide 22/51

Markov Assumption

The Markov assumption states that the probability of the occurrence of word w_i at time t depends only on the occurrence of word w_{i-1} at time t-1.

Chain rule:
P(w_1, ..., w_n) = P(w_1) \prod_{i=2}^n P(w_i | w_1, ..., w_{i-1})

Markov assumption:
P(w_1, ..., w_n) \approx P(w_1) \prod_{i=2}^n P(w_i | w_{i-1})
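A tiny sketch of the bigram approximation above: the probability of a sentence is P(w_1) times the product of P(w_i | w_{i-1}), with both estimated from counts. The toy corpus is invented for illustration.

```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def sentence_probability(words):
    p = unigrams[words[0]] / len(corpus)            # P(w_1)
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]  # P(w_i | w_{i-1})
    return p

print(sentence_probability(["the", "cat", "sat"]))  # 0.3 * (1/3) * 1 = 0.1
```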

Slide 23/51

The Trellis

Slide 24/51

Parameters of an HMM

States: a set of states S = {s_1, ..., s_n}.
Transition probabilities: A = a_{1,1}, a_{1,2}, ..., a_{n,n}; each a_{i,j} represents the probability of transitioning from state s_i to s_j.
Emission probabilities: a set B of functions of the form b_i(o_t), the probability of observation o_t being emitted by s_i.
Initial state distribution: \pi_i is the probability that s_i is a start state.
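A minimal container for \lambda = (A, B, \pi) as listed on this slide, reusing the urn example's \pi and A. The emission matrix B is an assumed example, not from the slides: row i gives b_i(v_k) for colors v_0 = red, v_1 = blue, v_2 = green. Later sketches build on this object.

```python
class HMM:
    def __init__(self, pi, A, B):
        self.pi = pi      # initial state distribution, pi_i
        self.A = A        # transition probabilities, a_ij
        self.B = B        # emission probabilities, b_i(k)
        self.n = len(pi)  # number of states

hmm = HMM(pi=[0.5, 0.2, 0.3],
          A=[[0.4, 0.3, 0.3], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]],
          B=[[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6]])
```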

Slide 25/51

The Three Basic HMM Problems

Problem 1 (Evaluation): Given the observation sequence O = o_1, ..., o_T and an HMM model \lambda = (A, B, \pi), how do we compute the probability of O given the model?

Problem 2 (Decoding): Given the observation sequence O = o_1, ..., o_T and an HMM model \lambda = (A, B, \pi), how do we find the state sequence that best explains the observations?

Slide 26/51

The Three Basic HMM Problems

Problem 3 (Learning): How do we adjust the model parameters \lambda = (A, B, \pi) to maximize P(O | \lambda)?

Slide 27/51

Problem 1: Probability of an Observation Sequence

What is P(O | \lambda)?

The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.

Naive computation is very expensive. Given T observations and N states, there are N^T possible state sequences.

Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.

The solution to this and Problem 2 is to use dynamic programming.

Slide 28/51

Forward Probabilities

What is the probability that, given an HMM \lambda, at time t the state is i and the partial observation o_1, ..., o_t has been generated?

\alpha_t(i) = P(o_1 ... o_t, q_t = s_i | \lambda)

Slide 29/51

Forward Probabilities

\alpha_t(i) = P(o_1 ... o_t, q_t = s_i | \lambda)

\alpha_t(j) = [ \sum_{i=1}^N \alpha_{t-1}(i) a_{ij} ] b_j(o_t)

Slide 30/51

Forward Algorithm

Initialization:
\alpha_1(i) = \pi_i b_i(o_1),  1 \le i \le N

Induction:
\alpha_t(j) = [ \sum_{i=1}^N \alpha_{t-1}(i) a_{ij} ] b_j(o_t),  2 \le t \le T, 1 \le j \le N

Termination:
P(O | \lambda) = \sum_{i=1}^N \alpha_T(i)
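A direct transcription of the three steps above into Python, assuming the HMM container sketched earlier. Observations are symbol indices into the columns of B.

```python
def forward(hmm, obs):
    """alpha[t][i] = P(o_1..o_t, q_t = s_i | lambda)."""
    T, N = len(obs), hmm.n
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):                                # initialization
        alpha[0][i] = hmm.pi[i] * hmm.B[i][obs[0]]
    for t in range(1, T):                             # induction
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * hmm.A[i][j]
                              for i in range(N)) * hmm.B[j][obs[t]]
    return alpha

obs = [0, 0, 2, 2]                                    # red, red, green, green
alpha = forward(hmm, obs)
print(sum(alpha[-1]))                                 # termination: P(O | lambda)
```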

Slide 31/51

Forward Algorithm Complexity

In the naive approach to solving Problem 1, it takes on the order of 2T \cdot N^T computations.

The forward algorithm takes on the order of N^2 T computations.

Slide 32/51

Backward Probabilities

Analogous to the forward probability, just in the other direction.

What is the probability that, given an HMM \lambda and given that the state at time t is i, the partial observation o_{t+1}, ..., o_T is generated?

\beta_t(i) = P(o_{t+1} ... o_T | q_t = s_i, \lambda)

Slide 33/51

Backward Probabilities

\beta_t(i) = P(o_{t+1} ... o_T | q_t = s_i, \lambda)

\beta_t(i) = \sum_{j=1}^N a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)

Slide 34/51

Backward Algorithm

Initialization:
\beta_T(i) = 1,  1 \le i \le N

Induction:
\beta_t(i) = \sum_{j=1}^N a_{ij} b_j(o_{t+1}) \beta_{t+1}(j),  t = T-1, ..., 1,  1 \le i \le N

Termination:
P(O | \lambda) = \sum_{i=1}^N \pi_i b_i(o_1) \beta_1(i)
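The backward counterpart, again following the initialization/induction/termination steps above; its termination value agrees with the forward algorithm's P(O | \lambda).

```python
def backward(hmm, obs):
    """beta[t][i] = P(o_{t+1}..o_T | q_t = s_i, lambda)."""
    T, N = len(obs), hmm.n
    beta = [[0.0] * N for _ in range(T)]
    beta[T-1] = [1.0] * N                             # initialization
    for t in range(T-2, -1, -1):                      # induction, t = T-1..1
        for i in range(N):
            beta[t][i] = sum(hmm.A[i][j] * hmm.B[j][obs[t+1]] * beta[t+1][j]
                             for j in range(N))
    return beta

beta = backward(hmm, obs)
print(sum(hmm.pi[i] * hmm.B[i][obs[0]] * beta[0][i]   # termination: same
          for i in range(hmm.n)))                     # P(O | lambda) as forward
```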

Slide 35/51

Problem 2: Decoding

The solution to Problem 1 (Evaluation) gives us the sum over all paths through an HMM efficiently.

For Problem 2, we want to find the path with the highest probability.

We want to find the state sequence Q = q_1, ..., q_T such that

Q = \argmax_{Q'} P(Q' | O, \lambda)

Slide 36/51

Viterbi Algorithm

Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum.

Forward:
\alpha_t(j) = [ \sum_{i=1}^N \alpha_{t-1}(i) a_{ij} ] b_j(o_t)

Viterbi recursion:
\delta_t(j) = [ \max_{1 \le i \le N} \delta_{t-1}(i) a_{ij} ] b_j(o_t)

Slide 37/51

Viterbi Algorithm

Initialization:
\delta_1(i) = \pi_i b_i(o_1),  1 \le i \le N

Induction:
\delta_t(j) = [ \max_{1 \le i \le N} \delta_{t-1}(i) a_{ij} ] b_j(o_t)
\psi_t(j) = \argmax_{1 \le i \le N} \delta_{t-1}(i) a_{ij},  2 \le t \le T, 1 \le j \le N

Termination:
p^* = \max_{1 \le i \le N} \delta_T(i)
q_T^* = \argmax_{1 \le i \le N} \delta_T(i)

Read out path:
q_t^* = \psi_{t+1}(q_{t+1}^*),  t = T-1, ..., 1
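A sketch of the full recursion above: identical in shape to the forward pass, with max in place of sum, plus backpointers \psi to read out the best path.

```python
def viterbi(hmm, obs):
    """Most likely state sequence for obs under hmm."""
    T, N = len(obs), hmm.n
    delta = [[0.0] * N for _ in range(T)]
    psi = [[0] * N for _ in range(T)]
    for i in range(N):                                # initialization
        delta[0][i] = hmm.pi[i] * hmm.B[i][obs[0]]
    for t in range(1, T):                             # induction
        for j in range(N):
            best = max(range(N), key=lambda i: delta[t-1][i] * hmm.A[i][j])
            psi[t][j] = best                          # backpointer
            delta[t][j] = delta[t-1][best] * hmm.A[best][j] * hmm.B[j][obs[t]]
    path = [max(range(N), key=lambda i: delta[T-1][i])]   # termination
    for t in range(T-1, 0, -1):                           # read out path
        path.append(psi[t][path[-1]])
    return list(reversed(path))

print(viterbi(hmm, obs))
```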

Slide 38/51

Problem 3: Learning

Up to now we've assumed that we know the underlying model \lambda = (A, B, \pi).

Often these parameters are estimated on annotated training data, which has two drawbacks:
Annotation is difficult and/or expensive.
Training data is different from the current data.

We want to maximize the parameters with respect to the current data, i.e., we're looking for a model \lambda' such that

\lambda' = \argmax_\lambda P(O | \lambda)

Slide 39/51

Problem 3: Learning

Unfortunately, there is no known way to analytically find a global maximum, i.e., a model \lambda' such that

\lambda' = \argmax_\lambda P(O | \lambda)

But it is possible to find a local maximum.

Given an initial model \lambda, we can always find a model \lambda' such that

P(O | \lambda') \ge P(O | \lambda)

Slide 40/51

Parameter Re-estimation

Use the forward-backward (or Baum-Welch) algorithm, which is a hill-climbing algorithm.

Using an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters and improves the probability that the given observations are generated by the new parameters.

Slide 41/51

Parameter Re-estimation

Three parameters need to be re-estimated:
Initial state distribution: \pi_i
Transition probabilities: a_{i,j}
Emission probabilities: b_i(o_t)

Slide 42/51

Re-estimating Transition Probabilities

What's the probability of being in state s_i at time t and going to state s_j, given the current model and parameters?

\xi_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, \lambda)

Slide 43/51

Re-estimating Transition Probabilities

\xi_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, \lambda)

\xi_t(i, j) = \alpha_t(i) a_{i,j} b_j(o_{t+1}) \beta_{t+1}(j) / [ \sum_{i=1}^N \sum_{j=1}^N \alpha_t(i) a_{i,j} b_j(o_{t+1}) \beta_{t+1}(j) ]

Slide 44/51

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is:

\hat{a}_{i,j} = (expected number of transitions from state s_i to state s_j) / (expected number of transitions from state s_i)

Formally:

\hat{a}_{i,j} = \sum_{t=1}^{T-1} \xi_t(i, j) / \sum_{t=1}^{T-1} \sum_{j'=1}^N \xi_t(i, j')

Slide 45/51

Re-estimating Transition Probabilities

Defining

\gamma_t(i) = \sum_{j=1}^N \xi_t(i, j)

as the probability of being in state s_i, given the complete observation O, we can say:

\hat{a}_{i,j} = \sum_{t=1}^{T-1} \xi_t(i, j) / \sum_{t=1}^{T-1} \gamma_t(i)
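A sketch of \xi and \gamma built from the forward and backward tables computed earlier; the last two lines show the re-estimation of one transition probability, \hat{a}_{0,1}.

```python
def xi_gamma(hmm, obs, alpha, beta):
    """xi[t][i][j] = P(q_t = s_i, q_{t+1} = s_j | O, lambda);
    gamma[t][i] = sum_j xi[t][i][j]."""
    T, N = len(obs), hmm.n
    p_obs = sum(alpha[T-1])                           # P(O | lambda), normalizer
    xi = [[[alpha[t][i] * hmm.A[i][j] * hmm.B[j][obs[t+1]] * beta[t+1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T-1)]
    gamma = [[sum(xi[t][i]) for i in range(N)] for t in range(T-1)]
    return xi, gamma

xi, gamma = xi_gamma(hmm, obs, alpha, beta)
a_01 = sum(xi[t][0][1] for t in range(len(obs)-1)) / \
       sum(gamma[t][0] for t in range(len(obs)-1))
```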

Slide 46/51

Review of Probabilities

Forward probability \alpha_t(i): the probability of being in state s_i, given the partial observation o_1, ..., o_t.

Backward probability \beta_t(i): the probability of being in state s_i, given the partial observation o_{t+1}, ..., o_T.

Transition probability \xi_t(i, j): the probability of going from state s_i to state s_j, given the complete observation o_1, ..., o_T.

State probability \gamma_t(i): the probability of being in state s_i, given the complete observation o_1, ..., o_T.

Slide 47/51

Re-estimating Initial State Probabilities

Initial state distribution: \pi_i is the probability that s_i is a start state.

Re-estimation is easy:
\hat{\pi}_i = expected number of times in state s_i at time 1

Formally:
\hat{\pi}_i = \gamma_1(i)

Slide 48/51

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as:

\hat{b}_i(k) = (expected number of times in state s_i observing symbol v_k) / (expected number of times in state s_i)

Formally:

\hat{b}_i(k) = \sum_{t=1}^T \delta(o_t, v_k) \gamma_t(i) / \sum_{t=1}^T \gamma_t(i)

where \delta(o_t, v_k) = 1 if o_t = v_k, and 0 otherwise.

Note that \delta here is the Kronecker delta function and is not related to the \delta in the discussion of the Viterbi algorithm!

Slide 49/51

The Updated Model

Coming from \lambda = (A, B, \pi) we get to \lambda' = (\hat{A}, \hat{B}, \hat{\pi}) by the following update rules:

\hat{\pi}_i = \gamma_1(i)

\hat{a}_{i,j} = \sum_{t=1}^{T-1} \xi_t(i, j) / \sum_{t=1}^{T-1} \gamma_t(i)

\hat{b}_i(k) = \sum_{t=1}^T \delta(o_t, v_k) \gamma_t(i) / \sum_{t=1}^T \gamma_t(i)
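The three update rules can be combined into a single re-estimation step, sketched below using the earlier forward, backward, and xi_gamma helpers. Note that the \gamma used for \hat{b}_i(k) must run over all T time steps, so it is recomputed here as \alpha_t(i) \beta_t(i) / P(O | \lambda) rather than from \xi. In practice one iterates this step until P(O | \lambda) stops improving.

```python
def baum_welch_step(hmm, obs):
    """One forward-backward (Baum-Welch) update of pi, A, B in place."""
    T, N = len(obs), hmm.n
    alpha, beta = forward(hmm, obs), backward(hmm, obs)
    p_obs = sum(alpha[T-1])
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]                            # all T steps
    xi, _ = xi_gamma(hmm, obs, alpha, beta)
    hmm.pi = gamma[0][:]                                   # pi_i = gamma_1(i)
    for i in range(N):
        denom = sum(gamma[t][i] for t in range(T-1))
        for j in range(N):                                 # a_ij update
            hmm.A[i][j] = sum(xi[t][i][j] for t in range(T-1)) / denom
        total = sum(gamma[t][i] for t in range(T))
        for k in range(len(hmm.B[i])):                     # b_i(k) update
            hmm.B[i][k] = sum(gamma[t][i] for t in range(T)
                              if obs[t] == k) / total
    return p_obs

baum_welch_step(hmm, obs)   # repeat until P(O | lambda) stops improving
```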

Slide 50/51

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithm.

The E step: compute the forward and backward probabilities for a given model.

The M step: re-estimate the model parameters.

Slide 51/51

Exercise

Programming with the Viterbi algorithm.

Apply HMM for part-of-speech tagging.