
A Revealing Introduction to Hidden Markov Models

Mark Stamp


Hidden Markov Models

What is a hidden Markov model (HMM)?
o A machine learning technique, and…
o …a discrete hill climb technique
o Two for the price of one!

Where are HMMs used?
o Speech recognition, malware detection, IDS, and many, many more applications

Why is it useful?
o Easy to apply, and the algorithms are efficient


Markov Chain

Markov chain
o A "memoryless" random process
o Transitions depend only on the current state (Markov chain of order 1)…
o …and on the transition probability matrix

Example?
o See next slide…


Markov Chain

Suppose we're interested in the average annual temperature
o Only consider Hot (H) and Cold (C)

From recorded history, we obtain these probabilities
o That is, from thermometer readings for "recent" years

[State diagram: H stays at H with probability 0.7 and moves to C with probability 0.3; C stays at C with probability 0.6 and moves to H with probability 0.4]


Markov Chain

Transition probability matrix (rows are the current state, columns the next state):

        H    C
  H   0.7  0.3
  C   0.4  0.6

The matrix is denoted as A
Note that A is "row stochastic" (each row sums to 1)


Markov Chain

Can also include begin and end states
The matrix for the begin state is π
o In this example, π = (0.6, 0.4)
o That is, we begin in state H with probability 0.6 and in state C with probability 0.4
Note that π is also row stochastic

[State diagram as before, with a begin state that enters H with probability 0.6 and C with probability 0.4, and an end state]


Hidden Markov Model

An HMM includes a Markov chain
o But the Markov process is "hidden"
o So we can't observe the Markov process
o Instead, we observe things that are probabilistically related to the hidden states
o It's as if there is a "curtain" between the Markov chain and the observations

Example on next few slides…


HMM Example

Consider the H/C annual temperature example
Suppose we want to know whether the annual temperature was H or C in the distant past
o Before thermometers were invented
o We only distinguish between H and C

We assume transitions between Hot and Cold years were the same as today
o Then the A matrix is known


HMM Example

Temperature in the past was determined by a Markov process
o But we cannot observe the temperature in the past

We find evidence that tree ring size is related to temperature
o Look at historical data to find the connection

We only consider 3 tree ring sizes
o Small, Medium, Large (S, M, L, respectively)

Measure tree ring sizes and recorded temperatures to determine the relationship


HMM Example

We find that tree ring sizes and temperature are related by

        S    M    L
  H   0.1  0.4  0.5
  C   0.7  0.2  0.1

This is known as the B matrix
The matrix B is also row stochastic


HMM Example

Can we now find the H/C temperatures in the past?
We cannot measure (observe) the temperatures
But we can measure tree ring sizes…
…and tree ring sizes are related to temperatures
o By the probabilities in the B matrix

Can we say something intelligent about temperatures in the distant past?


HMM Notation

A lot of notation is required
o Notation may be the most difficult part…

T = length of the observation sequence
N = number of states in the model
M = number of observation symbols
Q = {q0, q1, …, qN-1} = the states of the Markov process
V = {0, 1, …, M-1} = the set of possible observations
A = state transition probability matrix
B = observation probability matrix
π = initial state distribution
O = (O0, O1, …, OT-1) = observation sequence


HMM Notation

Note that for simplicity, observations are taken from V = {0,1,…,M-1}
o That is, Ot ∈ V for t = 0,1,…,T-1

The matrix A = {aij} is N x N, where
o aij = P(state qj at t+1 | state qi at t)

The matrix B = {bj(k)} is N x M, where
o bj(k) = P(observation k at t | state qj at t)


HMM Example

Consider our temperature example…

What are the observations?
o V = {0,1,2}, corresponding to S, M, L

What are the states of the Markov process?
o Q = {H, C}

What are A, B, π, and T?
o A, B, π as on the previous slides
o T is the number of tree rings measured

What are N and M?
o N = 2 and M = 3
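To make the later algorithm sketches concrete, here is the example model written out as plain Python lists. The list-of-lists representation is a choice made for these sketches, not something prescribed by the slides; later sketches reuse these values.

```python
pi = [0.6, 0.4]                # initial distribution over [H, C]
A  = [[0.7, 0.3],              # row = current state, column = next state
      [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5],         # row = state (H, C), column = symbol (S, M, L)
      [0.7, 0.2, 0.1]]
O  = [0, 1, 0, 2]              # observed tree rings: S, M, S, L
N, M, T = 2, 3, len(O)
```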


Generic HMM

Generic view of an HMM:

[Figure: the hidden Markov process X0 → X1 → … → XT-1 (driven by A), with each hidden state Xt emitting the observation Ot according to B]

The HMM is defined by A, B, and π
We denote the HMM "model" as λ = (A,B,π)


HMM Example

Suppose that we observe tree ring sizes
o For some 4-year period of interest: S, M, S, L
o Then O = (0, 1, 0, 2)

What is the most likely (hidden) state sequence?
o That is, the most likely X = (x0, x1, x2, x3)

Let πx0 be the prob. of starting in state x0
o Note that bx0(O0) is the prob. of the initial observation
o And ax0,x1 is the prob. of the transition from x0 to x1
o And so on…


HMM Example

Bottom line? We can compute P(X) for any X

For X = (x0, x1, x2, x3) we have
P(X) = πx0 bx0(O0) ax0,x1 bx1(O1) ax1,x2 bx2(O2) ax2,x3 bx3(O3)

Suppose we observe (0,1,0,2); then what is the probability of, say, HHCC?
Plug into the formula above to find
P(HHCC) = 0.6 (0.1)(0.7)(0.4)(0.3)(0.7)(0.6)(0.1) ≈ .000212
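Continuing the plain-Python model sketch above (reusing pi, A, B, and O), the same computation can be written as a short helper for any state sequence; joint_prob is a name introduced here, not from the slides.

```python
def joint_prob(X):
    """P(X) for a state sequence X: pi_x0 b_x0(O0) a_x0,x1 b_x1(O1) ..."""
    p = pi[X[0]] * B[X[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[X[t-1]][X[t]] * B[X[t]][O[t]]
    return p

H, C = 0, 1
print(joint_prob([H, H, C, C]))    # approximately .000212, as computed above
```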


HMM Example

Do the same for all 2^4 = 16 four-state sequences

We find that the winner is…
o CCCH

Not so fast my friend!


HMM Example

The path CCCH scores the highest
In dynamic programming (DP), we find the highest-scoring path
But in the HMM we maximize the expected number of correct states
o Sometimes called the "EM algorithm"
o For "Expectation Maximization"

How does the HMM work in this example?


HMM Example

For the first position…
o Sum the probabilities for all paths that have H in the 1st position, compare to the sum of probabilities for paths with C in the 1st position; the biggest wins

Repeat for each position and we find CHCH (see next slide); a brute-force sketch of this computation follows below
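A brute-force sketch of the two notions of "best" being compared here, continuing the earlier sketches (reusing pi, A, B, O, H, C, and joint_prob): enumerate all 2^4 state sequences, take the single highest-probability sequence (the DP answer), and separately take, at each position, the state with the larger summed probability (the HMM answer). This is only illustrative; the efficient algorithms come later.

```python
from itertools import product

probs = {X: joint_prob(X) for X in product((H, C), repeat=len(O))}

# (a) DP-style answer: the single best-scoring state sequence
dp_best = max(probs, key=probs.get)

# (b) HMM-style answer: at each position, the state with the larger summed probability
hmm_best = [max((H, C), key=lambda s: sum(p for X, p in probs.items() if X[t] == s))
            for t in range(len(O))]

print("".join("HC"[s] for s in dp_best))    # CCCH, per the slides
print("".join("HC"[s] for s in hmm_best))   # CHCH, per the slides
```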


HMM Example

So, the HMM solution gives us CHCH
While the DP solution is CCCH
Which solution is better? Neither!
o Just using different definitions of "best"


HMM Paradox?

The HMM maximizes the expected number of correct states
o Whereas DP chooses the "best" overall path

It is possible for the HMM to choose a "path" that is impossible
o It could include a transition of probability 0

We cannot get an impossible path with DP
Is this a flaw with the HMM?
o No, it's a feature


Probability of Observations

Table of P(X) for all 16 state sequences X, computed for O = (0,1,0,2)

For this sequence,
P(O) = .000412 + .000035 + .000706 + … + .000847 = left to the reader

Similarly for other observations


HMM Model

An HMM is defined by the three matrices, A, B, and π

Note that M and N are implied, since they are the dimensions of the matrices

So, we denote an HMM “model” as λ = (A,B,π)


The Three Problems

HMMs are used to solve 3 problems

Problem 1: Given a model λ = (A,B,π) and an observation sequence O, find P(O|λ)
o That is, we can score an observation sequence to see how well it fits a given model

Problem 2: Given λ = (A,B,π) and O, find an optimal state sequence (in the HMM sense)
o Uncover the hidden part (as in the previous example)

Problem 3: Given O, N, and M, find the model λ that maximizes the probability of O
o That is, train a model to fit the observations


HMMs in Practice

Typically, HMMs are used as follows:

Given an observation sequence…
o Assume a (hidden) Markov process exists

Train a model based on the observations
o This is Problem 3 (optimal N found by trial and error)

Then, given a new sequence of observations, score it versus the model
o This is Problem 1: a high score implies it is similar to the training data, a low score implies it's not


HMMs in Practice

The previous slide gives the sense in which an HMM is a "machine learning" technique
o To train the model, we do not need to specify anything except the parameter N
o The "best" N is often found by trial and error

That is, we don't think too much
o Just train the HMM and then use it
o Best of all, there are efficient algorithms for HMMs


The Three Solutions

We give detailed solutions to the 3 problems
o Note: We must have efficient solutions

The three problems:
o Problem 1: Score an observation sequence versus a given model
o Problem 2: Given a model, "uncover" the hidden part
o Problem 3: Given an observation sequence, train a model

Recall that we considered examples of Problems 2 and 1
o But the direct solutions are very inefficient


Solution 1

Score observations versus a given model
o Given the model λ = (A,B,π) and an observation sequence O = (O0,O1,…,OT-1), find P(O|λ)

Denote the hidden states as X = (x0, x1, …, xT-1)

Then from the definition of B,
P(O|X,λ) = bx0(O0) bx1(O1) … bxT-1(OT-1)

And from the definition of A and π,
P(X|λ) = πx0 ax0,x1 ax1,x2 … axT-2,xT-1


Solution 1

Elementary conditional probability fact:
P(O,X|λ) = P(O|X,λ) P(X|λ)

Sum over all possible state sequences X:
P(O|λ) = Σ P(O,X|λ) = Σ P(O|X,λ) P(X|λ)
       = Σ πx0 bx0(O0) ax0,x1 bx1(O1) … axT-2,xT-1 bxT-1(OT-1)

This "works", but it is way too costly
o Requires about 2T·N^T multiplications
o Why? There are N^T possible state sequences, and each term requires about 2T multiplications

There had better be a better way…


Forward Algorithm

Instead, use the forward algorithm
o Or "alpha pass"

For t = 0,1,…,T-1 and i = 0,1,…,N-1, let
αt(i) = P(O0,O1,…,Ot, xt=qi | λ)

That is, the probability of the partial observation sequence up to time t, with the Markov process in state qi at step t

Can be computed recursively and efficiently


Forward Algorithm

Let α0(i) = πi bi(O0) for i = 0,1,…,N-1

For t = 1,2,…,T-1 and i = 0,1,…,N-1, let
αt(i) = (Σ αt-1(j) aji) bi(Ot)
o Where the sum is from j = 0 to N-1

From the definition of αt(i) we see that
P(O|λ) = Σ αT-1(i)
o Where the sum is from i = 0 to N-1

This requires only about N^2·T multiplications; a sketch follows below
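A minimal sketch of the alpha pass, written as plain Python loops that mirror the formulas above, using the temperature example (no scaling yet; that comes later).

```python
pi = [0.6, 0.4]                          # example model from the earlier slides
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
O  = [0, 1, 0, 2]
N, T = len(pi), len(O)

alpha = [[0.0] * N for _ in range(T)]
for i in range(N):                                   # t = 0
    alpha[0][i] = pi[i] * B[i][O[0]]
for t in range(1, T):                                # t = 1, ..., T-1
    for i in range(N):
        alpha[t][i] = sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]

print(sum(alpha[T-1]))                               # P(O | lambda)
```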


Solution 2

Given a model, find the hidden states
o Given λ = (A,B,π) and O, find an optimal state sequence
o Recall that optimal means "maximize the expected number of correct states"
o In contrast, DP finds the best-scoring path

For the temperature/tree ring example, we solved this by brute force
o But that is a hopelessly inefficient approach

A better way: the backward algorithm
o Or "beta pass"


Backward Algorithm

For t = 0,1,…,T-1 and i = 0,1,…,N-1, let
βt(i) = P(Ot+1,Ot+2,…,OT-1 | xt=qi, λ)

That is, the probability of the partial observation sequence from t+1 to the end, given that the Markov process is in state qi at step t

Analogous to the forward algorithm
As with the forward algorithm, this can be computed recursively and efficiently


Backward Algorithm

Let βT-1(i) = 1 for i = 0,1,…,N-1

For t = T-2,T-3,…,0 and i = 0,1,…,N-1, let
βt(i) = Σ aij bj(Ot+1) βt+1(j)
o Where the sum is from j = 0 to N-1
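A matching sketch of the beta pass (unscaled), in the same style. The last line uses the identity P(O|λ) = Σ πi bi(O0) β0(i) as a cross-check against the forward algorithm.

```python
pi = [0.6, 0.4]                          # example model from the earlier slides
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
O  = [0, 1, 0, 2]
N, T = len(pi), len(O)

beta = [[1.0] * N for _ in range(T)]                 # beta_{T-1}(i) = 1
for t in range(T-2, -1, -1):                         # t = T-2, ..., 0
    for i in range(N):
        beta[t][i] = sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))

print(sum(pi[i] * B[i][O[0]] * beta[0][i] for i in range(N)))   # equals P(O | lambda)
```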


Solution 2

For t = 0,1,…,T-1 and i = 0,1,…,N-1 define
γt(i) = P(xt=qi | O, λ)
o The most likely state at time t is the qi that maximizes γt(i)

Note that γt(i) = αt(i)βt(i)/P(O|λ)
o And recall that P(O|λ) = Σ αT-1(i)

The bottom line?
o The forward algorithm solves Problem 1
o The forward/backward algorithms together solve Problem 2 (see the sketch below)
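A minimal sketch of Solution 2 for the temperature example: run both passes, form the gammas, and pick the state that maximizes γt(i) at each t.

```python
pi = [0.6, 0.4]                          # example model from the earlier slides
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
O  = [0, 1, 0, 2]
N, T = len(pi), len(O)

alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]                  # alpha pass
for t in range(1, T):
    alpha.append([sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
                  for i in range(N)])

beta = [[1.0] * N for _ in range(T)]                              # beta pass
for t in range(T-2, -1, -1):
    beta[t] = [sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
               for i in range(N)]

P_O = sum(alpha[T-1])
gamma = [[alpha[t][i] * beta[t][i] / P_O for i in range(N)] for t in range(T)]
best = [max(range(N), key=lambda i: gamma[t][i]) for t in range(T)]
print("".join("HC"[i] for i in best))                             # CHCH
```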


Solution 3

Train a model: Given O, N, and M, find the λ that maximizes the probability of O

Here, we iteratively adjust λ = (A,B,π) to better fit the given observations O
o The sizes of the matrices are fixed (by N and M)
o But the elements of the matrices can change

It is amazing that this works!
o And even more amazing that it's efficient


Solution 3

For t = 0,1,…,T-2 and i,j in {0,1,…,N-1}, define the "di-gammas" as
γt(i,j) = P(xt=qi, xt+1=qj | O, λ)

Note that γt(i,j) is the probability of being in state qi at time t and transiting to state qj at time t+1

Then γt(i,j) = αt(i) aij bj(Ot+1) βt+1(j) / P(O|λ)
And γt(i) = Σ γt(i,j)
o Where the sum is from j = 0 to N-1


Model Re-estimation

Given the di-gammas and gammas…

For i = 0,1,…,N-1, let πi = γ0(i)

For i = 0,1,…,N-1 and j = 0,1,…,N-1,
aij = Σ γt(i,j) / Σ γt(i)
o Where both sums are from t = 0 to T-2

For j = 0,1,…,N-1 and k = 0,1,…,M-1,
bj(k) = Σ γt(j) / Σ γt(j)
o Both sums are from t = 0 to T-2, but only the t for which Ot = k are counted in the numerator

Why does this work?
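A minimal sketch of one re-estimation step, following the formulas above (sums run over t = 0 to T-2, as on the slide; unscaled alphas and betas are fine for this tiny example).

```python
pi = [0.6, 0.4]                          # example model from the earlier slides
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
O  = [0, 1, 0, 2]
N, M, T = len(pi), len(B[0]), len(O)

alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]                  # alpha pass
for t in range(1, T):
    alpha.append([sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
                  for i in range(N)])
beta = [[1.0] * N for _ in range(T)]                              # beta pass
for t in range(T-2, -1, -1):
    beta[t] = [sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
               for i in range(N)]
P_O = sum(alpha[T-1])

# di-gammas and gammas
digamma = [[[alpha[t][i] * A[i][j] * B[j][O[t+1]] * beta[t+1][j] / P_O
             for j in range(N)] for i in range(N)] for t in range(T-1)]
gamma = [[sum(digamma[t][i]) for i in range(N)] for t in range(T-1)]

# re-estimated model
new_pi = gamma[0][:]
new_A = [[sum(digamma[t][i][j] for t in range(T-1)) /
          sum(gamma[t][i] for t in range(T-1)) for j in range(N)] for i in range(N)]
new_B = [[sum(gamma[t][j] for t in range(T-1) if O[t] == k) /
          sum(gamma[t][j] for t in range(T-1)) for k in range(M)] for j in range(N)]
print(new_pi, new_A, new_B, sep="\n")
```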


Solution 3

To summarize…
1. Initialize λ = (A,B,π)
2. Compute αt(i), βt(i), γt(i,j), γt(i)
3. Re-estimate the model λ = (A,B,π)
4. If P(O|λ) increases, goto 2


Solution 3

Some fine points…

Model initialization
o If we have a good guess for λ = (A,B,π), then we can use it for initialization
o If not, let πi ≈ 1/N, aij ≈ 1/N, bj(k) ≈ 1/M
o Subject to the row stochastic conditions
o But do not initialize to exactly uniform values

Stopping conditions
o Stop after some number of iterations, and/or…
o Stop if the increase in P(O|λ) is "small"


HMM as Discrete Hill Climb

The algorithm on the previous slides shows that the HMM is a "discrete hill climb"

The HMM consists of discrete parameters
o Specifically, the elements of the matrices

The re-estimation process improves the model by modifying the parameters
o So the process "climbs" toward an improved model
o This happens in a high-dimensional space


Dynamic Programming

Brief detour…

For λ = (A,B,π) as above, it's easy to define a dynamic program (DP)

Executive summary:
o DP is the forward algorithm, with "sum" replaced by "max"

Precise details on the next few slides


Dynamic Programming

Let δ0(i) = πi bi(O0) for i = 0,1,…,N-1

For t = 1,2,…,T-1 and i = 0,1,…,N-1, compute
δt(i) = max (δt-1(j) aji) bi(Ot)
o Where the max is over j in {0,1,…,N-1}

Note that at each t, the DP computes the best path ending at each state, up to that point

So, the probability of the best path is max δT-1(j)
o This max gives only the best probability
o Not the best path; for that, see the next slide (and the sketch below)
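A minimal sketch of this DP for the temperature example, with backpointers so the best path (not just its probability) can be recovered.

```python
pi = [0.6, 0.4]                          # example model from the earlier slides
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
O  = [0, 1, 0, 2]
N, T = len(pi), len(O)

delta = [[pi[i] * B[i][O[0]] for i in range(N)]]
back = [[0] * N]                         # back[t][i] = best predecessor of state i at t
for t in range(1, T):
    row, ptr = [], []
    for i in range(N):
        j_best = max(range(N), key=lambda j: delta[t-1][j] * A[j][i])
        row.append(delta[t-1][j_best] * A[j_best][i] * B[i][O[t]])
        ptr.append(j_best)
    delta.append(row)
    back.append(ptr)

state = max(range(N), key=lambda i: delta[T-1][i])   # best final state
path = [state]
for t in range(T-1, 0, -1):                          # trace the pointers back
    state = back[t][state]
    path.append(state)
path.reverse()
print(max(delta[T-1]), "".join("HC"[s] for s in path))   # .002822 and CCCH
```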


Dynamic Programming

To determine the optimal path
o While computing the deltas, keep track of pointers to the previous state
o When finished, construct the optimal path by tracing back the pointers

For example, consider the temperature example; recall that we observe (0,1,0,2)

Probabilities for paths of length 1:
o P(H) = 0.6 (0.1) = .06 and P(C) = 0.4 (0.7) = .28
o These are the only "paths" of length 1


Dynamic Programming

Probabilities for each path of length 2:
o P(HH) = .06 (0.7)(0.4) = .0168
o P(HC) = .06 (0.3)(0.2) = .0036
o P(CH) = .28 (0.4)(0.4) = .0448
o P(CC) = .28 (0.6)(0.2) = .0336

The best path of length 2 ending with H is CH
The best path of length 2 ending with C is CC


Dynamic Program

Continuing, we compute the best path ending at H and at C at each step

And we save the pointers. Why?


Dynamic Program

The best final score is .002822
o And, thanks to the pointers, the best path is CCCH

But what about underflow?
o A serious problem in bigger cases


Underflow Resistant DP

Common trick to prevent underflow:
o Instead of multiplying probabilities…
o …add logarithms of the probabilities

Why does this work?
o Because log(xy) = log x + log y
o Sums of logs do not tend to 0

Note that these logs are negative…
…and we must avoid probabilities of 0


Underflow Resistant DP

Underflow resistant DP algorithm:

Let δ0(i) = log(πi bi(O0)) for i = 0,1,…,N-1

For t = 1,2,…,T-1 and i = 0,1,…,N-1, compute
δt(i) = max (δt-1(j) + log(aji) + log(bi(Ot)))
o Where the max is over j in {0,1,…,N-1}

The score of the best path is max δT-1(j)
As before, we must also keep track of the paths
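The same DP as a short log-space sketch. It assumes there are no zero probabilities (a zero entry would need special handling, e.g. a large negative constant in place of log 0); backpointers would be kept exactly as before.

```python
from math import log

pi = [0.6, 0.4]                          # example model from the earlier slides
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
O  = [0, 1, 0, 2]
N, T = len(pi), len(O)

delta = [[log(pi[i]) + log(B[i][O[0]]) for i in range(N)]]
for t in range(1, T):
    delta.append([max(delta[t-1][j] + log(A[j][i]) for j in range(N)) + log(B[i][O[t]])
                  for i in range(N)])
print(max(delta[T-1]))                   # log of the best path probability
```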


HMM Scaling

It is trickier to prevent underflow in an HMM

We consider Solution 3
o Since it includes Solutions 1 and 2

Recall that for t = 1,2,…,T-1 and i = 0,1,…,N-1,
αt(i) = (Σ αt-1(j) aji) bi(Ot)

The idea is to normalize the alphas so that they sum to 1 at each t
o Algorithm on the next slide


HMM Scaling

Given αt(i) = (Σ αt-1(j) aji) bi(Ot)

Let â0(i) = α0(i) for i = 0,1,…,N-1
Let c0 = 1/Σ â0(j)
For i = 0,1,…,N-1, let â0(i) = c0 â0(i)

This takes care of the t = 0 case
Algorithm continued on the next slide…


HMM Scaling

For t = 1,2,…,T-1, do the following:

For i = 0,1,…,N-1,
ât(i) = (Σ ât-1(j) aji) bi(Ot)

Let ct = 1/Σ ât(j)

For i = 0,1,…,N-1, let ât(i) = ct ât(i)


HMM Scaling

It is easy to show that ât(i) = c0c1…ct αt(i)   (♯)
o Simple proof by induction

So, c0c1…ct is the scaling factor at step t

Also, it is easy to show that
ât(i) = αt(i) / Σ αt(j)

Which implies that Σ âT-1(i) = 1   (♯♯)


HMM Scaling

By combining (♯) and (♯♯), we have
1 = Σ âT-1(i) = c0c1…cT-1 Σ αT-1(i) = c0c1…cT-1 P(O|λ)

Therefore, P(O|λ) = 1 / (c0c1…cT-1)

To avoid underflow, we compute
log P(O|λ) = -Σ log(cj)
o Where the sum is from j = 0 to T-1
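A minimal sketch of the scaled alpha pass, recovering the log-likelihood as -Σ log(ct) exactly as on this slide.

```python
from math import log

pi = [0.6, 0.4]                          # example model from the earlier slides
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.1, 0.4, 0.5], [0.7, 0.2, 0.1]]
O  = [0, 1, 0, 2]
N, T = len(pi), len(O)

alpha = [pi[i] * B[i][O[0]] for i in range(N)]
c = [1.0 / sum(alpha)]
alpha = [c[0] * a for a in alpha]                    # scaled so the alphas sum to 1
for t in range(1, T):
    alpha = [sum(alpha[j] * A[j][i] for j in range(N)) * B[i][O[t]] for i in range(N)]
    c.append(1.0 / sum(alpha))
    alpha = [c[t] * a for a in alpha]

print(-sum(log(ct) for ct in c))                     # log P(O | lambda), no underflow
```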


HMM Scaling

Similarly, scale the betas as ct βt(i)

For re-estimation,
o Compute γt(i,j) and γt(i) using the original formulas, but with the scaled alphas and betas

This gives us new values for λ = (A,B,π)
o It is an "easy exercise" to show that the re-estimate is exact when the scaled alphas and betas are used
o Also, P(O|λ) cancels from the formula
o Use log P(O|λ) = -Σ log(cj) to decide whether an iteration improves the model


All Together Now

Complete pseudo-code for Solution 3 (a Python rendering follows below)

Given: (O0,O1,…,OT-1) and N and M

Initialize: λ = (A,B,π)
o A is NxN, B is NxM, and π is 1xN
o πi ≈ 1/N, aij ≈ 1/N, bj(k) ≈ 1/M, with each matrix row stochastic, but not uniform

Initialize:
o maxIters = maximum number of re-estimation steps
o iters = 0
o oldLogProb = -∞
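A plain-Python rendering of the complete procedure. The slides give language-neutral pseudo-code; this particular rendering, the random near-uniform initialization, and the helper names are assumptions of this sketch. It uses scaled alphas and betas throughout and measures progress with log P(O|λ) = -Σ log(ct). Zero-probability symbols would need smoothing, which is not handled here.

```python
import math, random

def train_hmm(O, N, M, max_iters=100, eps=1e-6, seed=0):
    rng = random.Random(seed)
    T = len(O)

    def rand_row(n):                       # roughly 1/n, row stochastic, not uniform
        row = [1.0 + 0.1 * rng.random() for _ in range(n)]
        s = sum(row)
        return [x / s for x in row]

    pi = rand_row(N)
    A = [rand_row(N) for _ in range(N)]
    B = [rand_row(M) for _ in range(N)]
    old_log_prob = -math.inf

    for _ in range(max_iters):
        # scaled alpha pass
        c = [0.0] * T
        alpha = [[pi[i] * B[i][O[0]] for i in range(N)]]
        c[0] = 1.0 / sum(alpha[0])
        alpha[0] = [c[0] * a for a in alpha[0]]
        for t in range(1, T):
            row = [sum(alpha[t-1][j] * A[j][i] for j in range(N)) * B[i][O[t]]
                   for i in range(N)]
            c[t] = 1.0 / sum(row)
            alpha.append([c[t] * a for a in row])

        # scaled beta pass (same scale factors as the alphas)
        beta = [[0.0] * N for _ in range(T)]
        beta[T-1] = [c[T-1]] * N
        for t in range(T-2, -1, -1):
            beta[t] = [c[t] * sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j]
                                  for j in range(N)) for i in range(N)]

        # di-gammas and gammas; P(O|lambda) cancels when scaled values are used
        digamma = [[[alpha[t][i] * A[i][j] * B[j][O[t+1]] * beta[t+1][j]
                     for j in range(N)] for i in range(N)] for t in range(T-1)]
        gamma = [[sum(digamma[t][i]) for i in range(N)] for t in range(T-1)]

        # re-estimate pi, A, B (sums over t = 0 to T-2, as on the slides)
        pi = gamma[0][:]
        for i in range(N):
            denom = sum(gamma[t][i] for t in range(T-1))
            A[i] = [sum(digamma[t][i][j] for t in range(T-1)) / denom
                    for j in range(N)]
        for j in range(N):
            denom = sum(gamma[t][j] for t in range(T-1))
            B[j] = [sum(gamma[t][j] for t in range(T-1) if O[t] == k) / denom
                    for k in range(M)]

        log_prob = -sum(math.log(ct) for ct in c)
        if log_prob <= old_log_prob + eps:     # stop when the increase is "small"
            break
        old_log_prob = log_prob

    return pi, A, B, old_log_prob

# e.g., a toy run on repeated tree ring observations, N = 2 states, M = 3 symbols:
# pi, A, B, log_prob = train_hmm([0, 1, 0, 2] * 25, N=2, M=3)
```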


Forward Algorithm

Forward algorithm
o With scaling


Backward Algorithm

Backward algorithm, or "beta pass"
o With scaling

Note: the same scaling factors are used as for the alphas


Gammas

Using the scaled alphas and betas
So the formulas are unchanged


Re-Estimation

Again, using the scaled gammas
So the formulas are unchanged


Stopping Criteria

Check that the probability increases
o In practice, want logProb > oldLogProb + ε

And don't exceed the maximum number of iterations


English Text Example

Suppose a Martian arrives on earth
o Sees written English text
o Wants to learn something about it
o Martians know about HMMs

So, strip out all non-letters and make all letters lower-case
o 27 symbols (26 letters, plus word-space)
o Train an HMM on a long sequence of symbols (see the sketch below)
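A minimal sketch of the preprocessing step just described. The particular symbol mapping ('a'..'z' → 0..25, word-space → 26) and the collapsing of runs of non-letters into a single space are assumptions of this sketch; the corpus file name is only a placeholder.

```python
def text_to_observations(text):
    """Map raw text to symbols 0..26: letters become 0..25, word-space becomes 26."""
    obs = []
    prev_space = True                  # collapse runs of non-letters to one space
    for ch in text.lower():
        if "a" <= ch <= "z":
            obs.append(ord(ch) - ord("a"))
            prev_space = False
        elif not prev_space:
            obs.append(26)             # word-space symbol
            prev_space = True
    return obs

# e.g.: O = text_to_observations(open("corpus.txt").read())[:50000]
```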


English Text

For the first training case, initialize:
o N = 2 and M = 27
o Elements of A and π are all approximately 1/2
o Elements of B are each approximately 1/27

We use 50,000 symbols for training
o After the 1st iteration: log P(O|λ) ≈ -165097
o After the 100th iteration: log P(O|λ) ≈ -137305


English Text

The matrices A and π converge to:
[Converged A and π omitted from this transcript; π puts essentially all of its weight on state 1]

What does this tell us?
o We started in hidden state 1 (not state 0)
o And we know the transition probabilities between the hidden states

Nothing too interesting here
o We don't (yet) know anything about the hidden states


English Text

What about the B matrix?
[Converged B matrix omitted from this transcript]

This is very interesting
o Why???


A Security Application

Suppose we want to detect metamorphic computer viruses
o Such viruses vary their internal structure
o But the function of the malware stays the same
o If sufficiently variable, standard signature detection will fail

Can we use an HMM for detection?
o What should we use as the observation sequence?
o Is there really a "hidden" Markov process?
o What about N, M, and T?
o How many observation sequences are needed for training and scoring?


HMM for Metamorphic Detection

Split a set of "family" viruses into 2 subsets
Extract the opcodes from each virus
Append the opcodes from subset 1 to make one long sequence
o Train an HMM on this opcode sequence (Problem 3)
o Obtain a model λ = (A,B,π)

Set a threshold: score the opcodes from the files in subset 2 and from "normal" files (Problem 1)
o Can you set a threshold that separates the two sets of scores?
o If so, we may have a viable detection method


HMM for Metamorphic Detection

Virus detection results from a recent paper
o [Results figure omitted from this transcript]
o Note the separation
o This is good!


HMM Generalizations

Here, we assumed a Markov process of order 1
o The current state depends only on the previous state and the transition matrix A

Can use a higher-order Markov process
o The current state depends on the n previous states
o Higher order vs larger N? "Depth" vs "width"

Can have the A and B matrices depend on t

HMMs are often combined with other techniques (e.g., neural nets)


Generalizations

In some cases, a limitation of HMMs is that position information is not used
o In many applications this is OK or even desirable
o In some applications, this is a serious limitation

Bioinformatics applications
o DNA sequencing, protein alignment, etc.
o Sequence alignment is crucial
o Can use "profile HMMs" instead of standard HMMs


References

M. Stamp, A revealing introduction to hidden Markov models

L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition

R.L. Cave & L.P. Neuwirth, Hidden Markov models for English
