Hidden Markov Models Part 1: Introduction
CSE 6363 – Machine Learning Vassilis Athitsos
Computer Science and Engineering Department University of Texas at Arlington
Modeling Sequential Data
• Suppose that we have weather data for several days.
– 𝑥1, 𝑥2, … , 𝑥𝑁
• Each 𝑥𝑛 is a binary value.
– 𝑥𝑛 = 1 if it rains on day 𝑛.
– 𝑥𝑛 = 0 if it does not rain on day 𝑛 (we call that a "sunny day").
• We want to learn a model that predicts if it is going to rain or not on a certain day, based on this data.
• What options do we have?
– Lots, as usual in machine learning.
Predicting Rain – Assuming Independence
• One option is to assume that the weather in any day is independent of the weather in any previous day.
• Thus, 𝑝(𝑥𝑛 | 𝑥1, … , 𝑥𝑛−1) = 𝑝(𝑥𝑛).
• Then, how can we compute 𝑝(𝑥)?
• If the weather of the past tells us nothing about the weather of the next day, then we can simply use the data to calculate how often it rains.
𝑝(𝑥 = 1) = (1/𝑁) ∑[𝑛=1 to 𝑁] 𝑥𝑛
• So, the probability that it rains on any day is simply the fraction of days in the training data when it rained.
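Under the independence assumption, this estimate is a one-liner. A minimal Python sketch (the rain sequence below is made-up illustrative data, not from the slides):

```python
# Hypothetical rain data: x[n] = 1 if it rained on day n, 0 if it was sunny.
x = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0]

# The only parameter: the fraction of rainy days in the training data.
p_rain = sum(x) / len(x)
print(p_rain)  # 0.4
```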
Predicting Rain – Assuming Independence
• If the weather of the past tells us nothing about the weather of the next day, then we can simply use the data to calculate how often it rains.
𝑝(𝑥 = 1) = (1/𝑁) ∑[𝑛=1 to 𝑁] 𝑥𝑛
• So, the probability that it rains on any day is simply the fraction of days in the training data when it rained.
• Advantages of this approach:
– Easy to apply. Only one parameter is estimated.
• Disadvantages:
– Not using all the information in the data. Past weather does correlate with the weather of the next day.
Predicting Rain – Modeling Dependence
• The other extreme is to assume that the weather of any day depends on the weather of the 𝐾 previous days.
• Thus, we have to learn the whole probability distribution 𝑝(𝑥𝑛 | 𝑥1, … , 𝑥𝑛−1) = 𝑝(𝑥𝑛 | 𝑥𝑛−𝐾, … , 𝑥𝑛−1).
• Advantages of this approach:
– Builds a more complex model that can capture more information about how past weather influences the weather of the next day.
• Disadvantages:
– The amount of data that is needed to reliably learn such a distribution is exponential in 𝐾.
– Even for relatively small values of 𝐾, like 𝐾 = 5, you may need thousands of training examples to learn the probabilities reliably.
Predicting Rain – Markov Chain
𝑝(𝑥𝑛 | 𝑥1, … , 𝑥𝑛−1) = 𝑝(𝑥𝑛 | 𝑥𝑛−𝐾, … , 𝑥𝑛−1)
• This probabilistic model, where an observation depends on the preceding 𝐾 observations, is called a 𝑲-th order Markov Chain.
• 𝐾 = 0 leads to a model that is too simple and inaccurate (the weather of any day does not depend on the weather of the previous days).
• A large value of 𝐾 may require more training data than we have.
• Choosing a good value of 𝐾 depends on the application, and on the amount of training data.
Predicting Rain – 1st Order Markov Chain
• It is very common to use 1st Order Markov Chains to model temporal dependencies.
𝑝(𝑥𝑛 | 𝑥1, … , 𝑥𝑛−1) = 𝑝(𝑥𝑛 | 𝑥𝑛−1)
• For the rain example, learning this model consists of estimating four values:
– 𝑝(𝑥𝑛 = 0 | 𝑥𝑛−1 = 0): probability of a sunny day after a sunny day.
– 𝑝(𝑥𝑛 = 1 | 𝑥𝑛−1 = 0): probability of a rainy day after a sunny day.
– 𝑝(𝑥𝑛 = 0 | 𝑥𝑛−1 = 1): probability of a sunny day after a rainy day.
– 𝑝(𝑥𝑛 = 1 | 𝑥𝑛−1 = 1): probability of a rainy day after a rainy day.
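These four values can be estimated by counting consecutive-day pairs in the training data. A Python sketch (the rain sequence and the helper `p` are illustrative inventions, not from the slides):

```python
from collections import Counter

# Hypothetical rain sequence: 1 = rain, 0 = sunny.
x = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0]

# Count (previous day, next day) pairs.
pairs = Counter(zip(x, x[1:]))

def p(next_w, prev_w):
    """Maximum-likelihood estimate of p(x_n = next_w | x_{n-1} = prev_w)."""
    total = pairs[(prev_w, 0)] + pairs[(prev_w, 1)]
    return pairs[(prev_w, next_w)] / total

print(p(1, 0))  # p(rain after sun)
print(p(0, 1))  # p(sun after rain)
```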
Visualizing a 1st Order Markov Chain
• This is called a state transition diagram.
• There are two states: rain and no rain.
• There are four transition probabilities, defining the probability of the next state given the previous one.
[Figure: state transition diagram with two states, "Rainy day" and "Sunny day", and four labeled transitions: p(sun after sun), p(rain after rain), p(rain after sun), p(sun after rain).]
Hidden States
• In our previous example, a state ("rainy day" or "sunny day") is observable.
• When that day comes, you can observe and find out if that day is rainy or sunny.
– In those cases, the learning problem can be how to predict future states, before we see them.
• There are also cases where the states are hidden.
– We cannot directly observe the value of a state.
– However, we can observe some features that depend on the state, and that can help us estimate the state.
– In those cases, the learning problem can be how to figure out the values of the states, given the observations.
Tree Rings and Temperatures
• Tree growth rings are visible in a cross-section of the tree trunk.
• Every year, the tree grows a new ring on the outside.
• Counting the rings can tell us about the age of the tree.
• The width of each ring contains information about the weather conditions that year (temperature, moisture, …).
Source: Wikipedia
Modeling Tree Rings
• At this point, we stop worrying about the actual science of how exactly tree ring width correlates with climate.
• For the sake of illustration, we will make a simple assumption:
– The tree ring tends to be wider when the average temperature for that year is higher.
• So, the trunk of a 1,000 year-old tree gives us information about the mean temperature for each of the last 1,000 years.
• How do we model that information?
• We have two sequences:
– Sequence of observations: a sequence of widths: 𝑥1, 𝑥2, … , 𝑥𝑁.
– Sequence of hidden states: a sequence of temperatures: 𝑧1, 𝑧2, … , 𝑧𝑁.
• We want to find the most likely sequence of state values 𝑧1, 𝑧2, … , 𝑧𝑁, given the observations 𝑥1, 𝑥2, … , 𝑥𝑁.
Modeling Tree Rings
• We have two sequences:
– Sequence of observations: a sequence of widths: 𝑥1, 𝑥2, … , 𝑥𝑁.
– Sequence of hidden states: a sequence of temperatures: 𝑧1, 𝑧2, … , 𝑧𝑁.
• We want to find the most likely sequence of state values 𝑧1, 𝑧2, … , 𝑧𝑁, given the observations 𝑥1, 𝑥2, … , 𝑥𝑁.
• Assume that we have training data:
– Other sequences of tree ring widths, for which we know the corresponding temperatures.
• What can we learn from this training data?
• One approach is to learn 𝑝(𝑧𝑛| 𝑥𝑛): the probability of the mean temperature 𝑧𝑛 for some year given the ring width 𝑥𝑛 for that year.
• Then, for each 𝑧𝑛 we pick the value maximizing 𝑝(𝑧𝑛| 𝑥𝑛).
• Can we build a better model than this?
Hidden Markov Model
• The previous model simply estimated 𝑝(𝑧 | 𝑥).
– It ignored the fact that the mean temperature in a year depends on the mean temperature of the previous year.
• Taking that dependency into account, we can estimate temperatures with better accuracy.
• We can use the training data to learn a better model, as follows:
– Learn 𝑝(𝑥 | 𝑧): the probability of a tree ring width given the mean temperature for that year.
– Learn 𝑝(𝑧𝑛 |𝑧𝑛−1): the probability of mean temperature for a year given the mean temperature for the previous year.
• Such a model is called a Hidden Markov Model.
Hidden Markov Model
• A Hidden Markov Model (HMM) is a model for how sequential data evolves.
• An HMM makes the following assumptions:
– States are hidden.
– States are modeled as 1st order Markov Chains. That is:
𝑝(𝑧𝑛 | 𝑧1, … , 𝑧𝑛−1) = 𝑝(𝑧𝑛 | 𝑧𝑛−1)
– Observation 𝑥𝑛 is conditionally independent of all other states and observations, given the value of state 𝑧𝑛. That is:
𝑝(𝑥𝑛 | 𝑥1, … , 𝑥𝑛−1, 𝑥𝑛+1, … , 𝑥𝑁, 𝑧1, … , 𝑧𝑁) = 𝑝(𝑥𝑛 | 𝑧𝑛)
Hidden Markov Model
• Given the previous assumptions, an HMM consists of:
– A set of states 𝑠1, … , 𝑠𝐾.
• In the tree ring example, the states can be intervals of temperatures. For example, 𝑠𝑘 can be the state corresponding to the mean temperature (in Celsius) being in the [𝑘, 𝑘 + 1) interval.
Hidden Markov Model
• Given the previous assumptions, an HMM consists of:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘 = 𝑝(𝑧1 = 𝑠𝑘).
• 𝜋𝑘 defines the probability that, when we are given a new set of observations 𝑥1, … , 𝑥𝑁, the initial state 𝑧1 is equal to 𝑠𝑘.
• For the tree ring example, 𝜋𝑘 can be defined as the probability that the mean temperature in the first year is equal to 𝑘.
Hidden Markov Model
• Given the previous assumptions, an HMM consists of:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘 = 𝑝(𝑧1 = 𝑠𝑘).
– A state transition matrix 𝑨, of size 𝐾 × 𝐾, where
𝐴𝑘,𝑗 = 𝑝(𝑧𝑛 = 𝑠𝑗 | 𝑧𝑛−1 = 𝑠𝑘)
• Values 𝐴𝑘,𝑗 are called transition probabilities.
• For the tree ring example, 𝐴𝑘,𝑗 is the conditional probability that the mean temperature for a certain year is 𝑗, given that the mean temperature in the previous year is 𝑘.
Hidden Markov Model
• Given the previous assumptions, an HMM consists of:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘 = 𝑝(𝑧1 = 𝑠𝑘).
– A state transition matrix 𝑨, of size 𝐾 × 𝐾, where
𝐴𝑘,𝑗 = 𝑝(𝑧𝑛 = 𝑠𝑗 | 𝑧𝑛−1 = 𝑠𝑘)
– Observation probability functions, also called emission probabilities, defined as:
𝜑𝑘(𝑥) = 𝑝(𝑥𝑛 = 𝑥 | 𝑧𝑛 = 𝑠𝑘)
• For the tree ring example, 𝜑𝑘(𝑥) is the probability of getting ring width 𝑥 in a specific year, if the temperature for that year is 𝑘.
Visualizing the Tree Ring HMM
• Assumption: temperature discretized to four values, so that we have four state values.
• The vertices show the four states.
• The edges show legal transitions between states.
[Figure: fully connected state transition diagram over four states 𝑠1, 𝑠2, 𝑠3, 𝑠4.]
Visualizing the Tree Ring HMM
• The edges show legal transitions between states.
• Each directed edge has its own probability (not shown here).
• This is a fully connected model, where any state can follow any other state. An HMM does not have to be fully connected.
Joint Probability Model
• A fully specified HMM defines a joint probability function 𝑝(𝑋, 𝑍).
– 𝑋 is the sequence of observations 𝑥1, … , 𝑥𝑁.
– 𝑍 is the sequence of hidden state values 𝑧1, … , 𝑧𝑁.
𝑝(𝑋, 𝑍) = 𝑝(𝑥1, … , 𝑥𝑁, 𝑧1, … , 𝑧𝑁)
= 𝑝(𝑧1, … , 𝑧𝑁) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
– Why? Because of the assumption that 𝑥𝑛 is conditionally independent of all other observations and states, given 𝑧𝑛.
Joint Probability Model
• A fully specified HMM defines a joint probability function 𝑝(𝑋, 𝑍).
𝑝(𝑋, 𝑍) = 𝑝(𝑧1, … , 𝑧𝑁) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
= 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
– Why? Because states are modeled as 1st order Markov Chains, so that 𝑝(𝑧𝑛 | 𝑧1, … , 𝑧𝑛−1) = 𝑝(𝑧𝑛 | 𝑧𝑛−1).
Joint Probability Model
• A fully specified HMM defines a joint probability function 𝑝(𝑋, 𝑍).
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
– 𝑝(𝑧1) is computed using the values 𝜋𝑘.
– 𝑝(𝑧𝑛 | 𝑧𝑛−1) is computed using the transition matrix 𝑨.
– 𝑝(𝑥𝑛 | 𝑧𝑛) is computed using the observation probabilities 𝜑𝑘(𝑥).
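Given 𝜋, 𝑨, and 𝜑, evaluating 𝑝(𝑋, 𝑍) for concrete sequences is a direct product of these three ingredients. A Python sketch with a tiny hypothetical two-state HMM over binary observations (all numbers invented for illustration):

```python
# Hypothetical HMM with K = 2 states and discrete observations {0, 1}.
pi = [0.6, 0.4]                  # pi[k] = p(z1 = s_k)
A = [[0.7, 0.3],                 # A[k][j] = p(z_n = s_j | z_{n-1} = s_k)
     [0.2, 0.8]]
phi = [[0.9, 0.1],               # phi[k][x] = p(x_n = x | z_n = s_k)
       [0.3, 0.7]]

def joint_prob(X, Z):
    """p(X, Z) = p(z1) * prod_{n=2}^{N} p(zn | zn-1) * prod_{n=1}^{N} p(xn | zn)."""
    p = pi[Z[0]]
    for n in range(1, len(Z)):
        p *= A[Z[n - 1]][Z[n]]
    for x, z in zip(X, Z):
        p *= phi[z][x]
    return p

print(joint_prob(X=[0, 1, 1], Z=[0, 0, 1]))  # 0.6 * 0.7 * 0.3 * 0.9 * 0.1 * 0.7
```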
Modeling the Digit 2
• Suppose that we want to model the motion of a hand, as it traces in the air the shape of the digit "2".
• Here is one possible model:
– We represent the shape of the digit "2" as five line segments.
– Each line segment corresponds to a hidden state.
– This gives us five hidden states.
• We will also have a special end state 𝑠6, which signifies "end of observations".
[Figure: the digit "2" traced as five line segments, one per state 𝑠1 through 𝑠5, followed by the end state 𝑠6.]
Modeling the Digit 2
• Suppose that we want to model the motion of a hand, as it traces in the air the shape of the digit "2".
• Here is one possible model:
– We represent the shape of the digit "2" as five line segments.
– Each line segment corresponds to a hidden state.
– We end up with five states, plus the end state.
• This HMM is a forward model:
– If 𝑧𝑛 = 𝑠𝑘, then 𝑧𝑛+1 = 𝑠𝑘 or 𝑧𝑛+1 = 𝑠𝑘+1.
– This is similar to the monotonicity rule in DTW.
Modeling the Digit 2
• This HMM is a forward model:
– If 𝑧𝑛 = 𝑠𝑘, then 𝑧𝑛+1 = 𝑠𝑘 or 𝑧𝑛+1 = 𝑠𝑘+1.
• Therefore, 𝐴𝑘,𝑗 = 0, except when
𝑘 = 𝑗 or 𝑘 + 1 = 𝑗.
– Remember, 𝐴𝑘,𝑗 = 𝑝(𝑧𝑛 = 𝑠𝑗 | 𝑧𝑛−1 = 𝑠𝑘).
• The feature vector at each video frame 𝑛 can be the displacement vector:
– The difference between the pixel location of the hand at frame 𝑛 and the pixel location of the hand at frame 𝑛 − 1.
Modeling the Digit 2
• So, each observation 𝑥𝑛 is a 2D vector.
• It will be convenient to describe each 𝑥𝑛 with these two numbers:
– Its length 𝑙𝑛, measured in pixels.
– Its orientation 𝜃𝑛, measured in degrees.
• Lengths 𝑙𝑛 come from a Gaussian distribution with mean 𝜇𝑙,𝑘 and variance 𝜎𝑙,𝑘 that depend on the state 𝑠𝑘.
• Orientations 𝜃𝑛 come from a Gaussian distribution with mean 𝜇𝜃,𝑘 and variance 𝜎𝜃,𝑘 that also depend on the state 𝑠𝑘.
Modeling the Digit 2
• The decisions we have made so far are typically made by a human designer of the system:
– The number of states.
– The topology of the model (fully connected, forward, or other variations).
– The features that we want to use.
– The way to model observation probabilities (e.g., using Gaussians, Gaussian mixtures, histograms, etc.).
• Once those decisions have been made, the actual probabilities are typically learned using training data.
– The initial state probability function 𝜋𝑘 = 𝑝(𝑧1 = 𝑠𝑘).
– The transition matrix 𝑨, where 𝐴𝑘,𝑗 = 𝑝(𝑧𝑛 = 𝑠𝑗 | 𝑧𝑛−1 = 𝑠𝑘).
– The observation probabilities, 𝜑𝑘(𝑥) = 𝑝(𝑥𝑛 = 𝑥 | 𝑧𝑛 = 𝑠𝑘).
Modeling the Digit 2
• The actual probabilities are typically learned using training data:
– The initial state probability function 𝜋𝑘 = 𝑝(𝑧1 = 𝑠𝑘).
– The transition matrix 𝑨, where
𝐴𝑘,𝑗 = 𝑝(𝑧𝑛 = 𝑠𝑗 | 𝑧𝑛−1 = 𝑠𝑘).
– The observation probabilities 𝜑𝑘(𝑥) = 𝑝(𝑥𝑛 = 𝑥 | 𝑧𝑛 = 𝑠𝑘).
• Before we see the algorithm for learning these probabilities, we will first see how we can use an HMM after it has been trained:
– That is, after all these probabilities have been estimated.
• To do that, we will look at an example where we just specify these probabilities manually.
Defining the Probabilities
• An HMM is defined by specifying:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘.
– A state transition matrix 𝑨.
– Observation probability functions.
• In this case:
– We have five states 𝑠1, … , 𝑠5.
– How do we define 𝜋𝑘?
Defining the Probabilities
• An HMM is defined by specifying:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘.
– A state transition matrix 𝑨.
– Observation probability functions.
• In this case:
– We have five states 𝑠1, … , 𝑠5.
– How do we define 𝜋𝑘?
– 𝜋1 = 1, and 𝜋𝑘 = 0 for 𝑘 > 1.
Defining the Probabilities
• An HMM is defined by specifying:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘.
– A state transition matrix 𝑨.
– Observation probability functions.
• How do we define transition matrix 𝑨?
Defining the Probabilities
• An HMM is defined by specifying:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘.
– A state transition matrix 𝑨.
– Observation probability functions.
• We need to decide on values for each 𝐴𝑘,𝑘.
– In this model, we spend more time on states 𝑠4 and 𝑠5 than on the other states.
– This can be modeled by having higher values for 𝐴4,4 and 𝐴5,5 than for 𝐴1,1, 𝐴2,2, 𝐴3,3.
– This way, if 𝑧𝑛 = 𝑠4, then 𝑧𝑛+1 is more likely to also be 𝑠4, and overall state 𝑠4 lasts longer than states 𝑠1, 𝑠2, 𝑠3.
Defining the Probabilities
• An HMM is defined by specifying:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘.
– A state transition matrix 𝑨.
– Observation probability functions.
• We need to decide on values for each 𝐴𝑘,𝑘.
– In this model, we spend more time on states 𝑠4 and 𝑠5 than on the other states.
– Here is a set of values that can represent that:
𝐴1,1 = 0.4, 𝐴2,2 = 0.4, 𝐴3,3 = 0.4, 𝐴4,4 = 0.8, 𝐴5,5 = 0.7
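One way to quantify "lasts longer" (a consequence of the transition model, not stated explicitly on the slides): with self-transition probability 𝐴𝑘,𝑘, the number of consecutive time steps spent in state 𝑠𝑘 is geometrically distributed, so its expected value is 1/(1 − 𝐴𝑘,𝑘). A quick Python check with the values above:

```python
# Self-transition probabilities from the slide; expected duration in state k
# under a geometric model is 1 / (1 - A_kk).
A_self = {"s1": 0.4, "s2": 0.4, "s3": 0.4, "s4": 0.8, "s5": 0.7}

durations = {state: 1 / (1 - a) for state, a in A_self.items()}
for state, d in durations.items():
    print(state, round(d, 2))  # s4 stays ~5 frames on average, s1-s3 only ~1.67
```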
Defining the Probabilities
• An HMM is defined by specifying:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘.
– A state transition matrix 𝑨.
– Observation probability functions.
• Here is the resulting transition matrix 𝑨:
0.4 0.6 0.0 0.0 0.0 0.0
0.0 0.4 0.6 0.0 0.0 0.0
0.0 0.0 0.4 0.6 0.0 0.0
0.0 0.0 0.0 0.8 0.2 0.0
0.0 0.0 0.0 0.0 0.7 0.3
0.0 0.0 0.0 0.0 0.0 0.0
Defining the Probabilities
• As we said before, each observation 𝑥𝑛 is a 2D vector, described by 𝑙𝑛 and 𝜃𝑛:
– 𝑙𝑛 is the length, measured in pixels.
– 𝜃𝑛 is the orientation, measured in degrees.
• We can model 𝑝(𝑙𝑛) as a Gaussian 𝑁𝑙, with:
– mean 𝜇𝑙 = 10 pixels.
– variance 𝜎𝑙 = 1.5 pixels.
– Neither the mean nor the variance depends on the state.
• We can model 𝜃𝑛 as a Gaussian 𝑁𝜃 with:
– mean 𝜇𝜃,𝑘 that depends on the state 𝒔𝒌. Obviously, each state corresponds to moving at a different orientation.
– variance 𝜎𝜃 = 10 degrees. This way, 𝜎𝜃 does not depend on the state.
Defining the Probabilities
• We define observation probability functions 𝜑𝑘 as:
𝜑𝑘(𝑥𝑛) = [1 / (𝜎𝑙 √(2𝜋))] 𝑒^(−(𝑙𝑛 − 𝜇𝑙)² / (2𝜎𝑙²)) · [1 / (𝜎𝜃 √(2𝜋))] 𝑒^(−(𝜃𝑛 − 𝜇𝜃,𝑘)² / (2𝜎𝜃²))
• For the parameters in the above formula, we (manually) pick these values:
– 𝜇𝜃,1 = 45 degrees.
– 𝜇𝜃,2 = 0 degrees.
– 𝜇𝜃,3 = −60 degrees.
– 𝜇𝜃,4 = −120 degrees.
– 𝜇𝜃,5 = 0 degrees.
– 𝜇𝑙 = 10 pixels.
– 𝜎𝑙 = 1.5 pixels.
– 𝜎𝜃 = 10 degrees.
Defining the Probabilities
• As we said before, each observation 𝑥𝑛 is a 2D vector, described by 𝑙𝑛 and 𝜃𝑛:
– 𝑙𝑛 is the length, measured in pixels.
– 𝜃𝑛 is the orientation, measured in degrees.
• We define observation probability functions 𝜑𝑘 as:
𝜑𝑘(𝑥𝑛) = 𝑁𝑙(𝑙𝑛) · 𝑁𝜃,𝑘(𝜃𝑛)
𝜑𝑘(𝑥𝑛) = [1 / (𝜎𝑙 √(2𝜋))] 𝑒^(−(𝑙𝑛 − 𝜇𝑙)² / (2𝜎𝑙²)) · [1 / (𝜎𝜃 √(2𝜋))] 𝑒^(−(𝜃𝑛 − 𝜇𝜃,𝑘)² / (2𝜎𝜃²))
• Note: in the above formula for 𝜑𝑘(𝑥𝑛), the only part that depends on the state 𝑠𝑘 is 𝜇𝜃,𝑘.
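Evaluating 𝜑𝑘(𝑥𝑛) numerically is straightforward. A Python sketch (function names are illustrative; 1.5 and 10 are treated as variances, which matches the Matlab line randn(1)*sqrt(1.5) + 10 used later in the slides):

```python
import math

def gaussian(x, mu, var):
    """Univariate Gaussian density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Parameters from the slides.
mu_l, var_l = 10.0, 1.5
mu_theta = {1: 45.0, 2: 0.0, 3: -60.0, 4: -120.0, 5: 0.0}
var_theta = 10.0

def phi(k, l_n, theta_n):
    """phi_k(x_n) = N(l_n; mu_l, var_l) * N(theta_n; mu_theta[k], var_theta)."""
    return gaussian(l_n, mu_l, var_l) * gaussian(theta_n, mu_theta[k], var_theta)

# An observation of length 8.4 at 54 degrees is far more likely under s1
# (orientation mean 45 deg) than under s2 (orientation mean 0 deg):
print(phi(1, 8.4, 54) > phi(2, 8.4, 54))  # True
```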
An HMM as a Generative Model
• If we have an HMM whose parameters have already been learned, we can use that HMM to generate data randomly sampled from the joint distribution defined by the HMM:
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
• We will now see how to jointly generate a random observation sequence 𝑥1, … , 𝑥𝑁 and a random hidden state sequence 𝑧1, … , 𝑧𝑁, based on the distribution 𝑝(𝑋, 𝑍) defined by the HMM.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 =
𝑍 =
• Step 1: pick a random 𝑧1, based on initial state probabilities 𝜋𝑘.
• Remember: 𝜋𝑘 = 𝑝(𝑧1 = 𝑠𝑘)
• What values of 𝑧1 are legal in our example?
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 =
𝑍 = 𝑠1
• Step 1: pick a random 𝑧1, based on initial state probabilities 𝜋𝑘.
• Remember: 𝜋𝑘 = 𝑝(𝑧1 = 𝑠𝑘)
• What values of 𝑧1 are legal in our example?
• 𝜋𝑘 > 0 only for 𝑘 = 1.
• Therefore, it has to be that 𝑧1 = 𝑠1.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 =
𝑍 = 𝑠1
• Next step: pick a random 𝑥1, based on observation probabilities 𝜑𝑘(𝑥).
• Which 𝜑𝑘 should we use?
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 =
𝑍 = 𝑠1
• Next step: pick a random 𝑥1, based on observation probabilities 𝜑𝑘(𝑥).
• We choose an 𝑙1 randomly from Gaussian 𝑁𝑙, with mean 10 pixels and variance 1.5 pixels.
• In Matlab, you can do this with this line:
l1 = randn(1)*sqrt(1.5) + 10
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.35, ? )
𝑍 = 𝑠1
• Next step: pick a random 𝑥1, based on observation probabilities 𝜑𝑘(𝑥).
• We choose an 𝑙1 randomly from Gaussian 𝑁𝑙, with mean 10 pixels and variance 1.5 pixels.
• Result (obviously, will differ each time): 8.35 pixels.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4,54)
𝑍 = 𝑠1
• Next step: pick a random 𝑥1, based on observation probabilities 𝜑𝑘(𝑥).
• We choose a 𝜃1 randomly from Gaussian 𝑁𝜃,1, with mean 45 degrees and variance 10 degrees.
• Result (obviously, will differ each time): 54 degrees
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4,54)
𝑍 = 𝑠1
• Next step: pick a random 𝑧2.
• What distribution should we draw 𝑧2 from?
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4,54)
𝑍 = 𝑠1
• Next step: pick a random 𝑧2.
• What distribution should we draw 𝑧2 from?
• We should use 𝑝(𝑧2 | 𝑧1 = 𝑠1). Where is that stored?
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4,54)
𝑍 = 𝑠1
• Next step: pick a random 𝑧2.
• What distribution should we draw 𝑧2 from?
• We should use 𝑝(𝑧2 | 𝑧1 = 𝑠1). Where is that stored?
• On the first row of state transition matrix 𝑨.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4,54)
𝑍 = 𝑠1, 𝑠2
• Next step: pick a random 𝑧2, from distribution 𝑝(𝑧2 | 𝑧1 = 𝑠1).
• The relevant values are:
– 𝐴1,1 = 𝑝(𝑧2 = 𝑠1 | 𝑧1 = 𝑠1) = 0.4
– 𝐴1,2 = 𝑝(𝑧2 = 𝑠2 | 𝑧1 = 𝑠1) = 0.6
• Picking randomly we get 𝑧2 = 𝑠2.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4,54)
𝑍 = 𝑠1, 𝑠2
• Next step?
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, ?)
𝑍 = 𝑠1, 𝑠2
• Next step: pick a random 𝑥2, based on observation density 𝜑2(𝑥).
• We choose an 𝑙2 randomly from Gaussian 𝑁𝑙, with mean 10 pixels and variance 1.5 pixels.
• Result: 10.8 pixels.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2)
𝑍 = 𝑠1, 𝑠2
• Next step: pick a random 𝑥2, based on observation density 𝜑2(𝑥).
• We choose a 𝜃2 randomly from Gaussian 𝑁𝜃,2, with mean 0 degrees and variance 10 degrees.
• Result: 2 degrees
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2)
𝑍 = 𝑠1, 𝑠2
• Next step?
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2)
𝑍 = 𝑠1, 𝑠2, 𝑠2
• Next step: pick a random 𝑧3, from distribution 𝑝(𝑧3 | 𝑧2 = 𝑠2).
• The relevant values are:
– 𝐴2,2 = 𝑝(𝑧3 = 𝑠2 | 𝑧2 = 𝑠2) = 0.4
– 𝐴2,3 = 𝑝(𝑧3 = 𝑠3 | 𝑧2 = 𝑠2) = 0.6
• Picking randomly we get 𝑧3 = 𝑠2.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2), (11.3, ?)
𝑍 = 𝑠1, 𝑠2, 𝑠2
• Next step: pick a random 𝑥3, based on observation density 𝜑2(𝑥).
• We choose an 𝑙3 randomly from Gaussian 𝑁𝑙, with mean 10 pixels and variance 1.5 pixels.
• Result: 11.3 pixels.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2), (11.3, −3)
𝑍 = 𝑠1, 𝑠2, 𝑠2
• Next step: pick a random 𝑥3, based on observation density 𝜑2(𝑥).
• We choose a 𝜃3 randomly from Gaussian 𝑁𝜃,2, with mean 0 degrees and variance 10 degrees.
• Result: −3 degrees.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2), (11.3, −3)
𝑍 = 𝑠1, 𝑠2, 𝑠2, 𝑠3
• Next step: pick a random 𝑧4, from distribution 𝑝(𝑧4 | 𝑧3 = 𝑠2).
• The relevant values are:
– 𝐴2,2 = 𝑝(𝑧4 = 𝑠2 | 𝑧3 = 𝑠2) = 0.4
– 𝐴2,3 = 𝑝(𝑧4 = 𝑠3 | 𝑧3 = 𝑠2) = 0.6
• Picking randomly we get 𝑧4 = 𝑠3.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2), (11.3, −3), …
𝑍 = 𝑠1, 𝑠2, 𝑠2, 𝑠3, …
• Overall, this is an iterative process:
– Pick randomly a new state 𝑧𝑛 = 𝑠𝑘, based on the state transition probabilities.
– Pick randomly a new observation 𝑥𝑛, based on observation density 𝜑𝑘(𝑥).
• When do we stop?
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2), (11.3, −3), …
𝑍 = 𝑠1, 𝑠2, 𝑠2, 𝑠3, … , 𝑠6
• Overall, this is an iterative process:
– Pick randomly a new state 𝑧𝑛 = 𝑠𝑘, based on the state transition probabilities.
– Pick randomly a new observation 𝑥𝑛, based on observation density 𝜑𝑘(𝑥).
• We stop when we get 𝑧𝑛 = 𝑠6, since 𝑠6 is the end state.
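The whole iterative process can be sketched as ancestral sampling in Python. The parameters follow the digit-"2" model from the slides; the function name and the use of Python's random module are illustrative choices, not from the slides:

```python
import random

random.seed(0)  # only for reproducibility

# Parameters from the digit-"2" model: forward topology, end state s6 (index 5).
pi = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
A = [
    [0.4, 0.6, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
mu_theta = [45.0, 0.0, -60.0, -120.0, 0.0]        # per-state orientation means (degrees)
mu_l = 10.0
std_l, std_theta = 1.5 ** 0.5, 10.0 ** 0.5        # stds from variances 1.5 and 10

def sample_sequence():
    """Ancestral sampling: draw z1 from pi, then alternate emitting xn and drawing zn+1."""
    X, Z = [], []
    z = random.choices(range(6), weights=pi)[0]
    while z != 5:                                  # stop at the end state s6
        Z.append(z)
        length = random.gauss(mu_l, std_l)         # l_n ~ N_l
        theta = random.gauss(mu_theta[z], std_theta)  # theta_n ~ N_{theta,k}
        X.append((length, theta))
        z = random.choices(range(6), weights=A[z])[0]
    return X, Z

X, Z = sample_sequence()
print(Z)  # always starts at s1 (index 0) and only moves forward one state at a time
```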
An Example of Synthetic Data
• The textbook shows this figure as an example of how synthetic data is generated.
• The top row shows some of the training images used to train a model of the digit "2".
– Not identical to the model we described before, but along the same lines.
• The bottom row shows three examples of synthetic patterns, generated using the approach we just described.
• What do you notice in the synthetic data?
An Example of Synthetic Data
• The synthetic data is not very realistic.
• The problem is that some states last longer than they should, and some states end sooner than they should.
• For example:
– In the leftmost synthetic example, the top curve is too big relative to the rest of the pattern.
– In the middle synthetic example, the diagonal line at the middle is too long.
– In the rightmost synthetic example, the top curve is too small relative to the bottom horizontal line.
An Example of Synthetic Data
• Why do we get this problem of disproportionate parts?
• As we saw earlier, each next state is chosen randomly, based on transition probabilities.
• There is no "memory" to say that, e.g.,, if the top curve is big (or small), the rest of the pattern should be proportional to that.
• This is the price we pay for the Markovian assumption, that the future is independent of the past, given the current state.
– The benefit of the Markovian assumption is efficient learning and classification algorithms, as we will see.
HMMs: Next Steps
• We have seen how HMMs are defined.
– Set of states.
– Initial state probabilities.
– State transition matrix.
– Observation probabilities.
• We have seen how an HMM defines a probability distribution 𝑝 𝑋, 𝑍 .
– We have also seen how to generate random samples from that distribution.
• Next we will see:
– How to use HMMs for various tasks, like classification.
– How to learn HMMs from training data.