Hidden Markov Models Part 1: Introduction
CSE 6363 – Machine Learning Vassilis Athitsos
Computer Science and Engineering Department University of Texas at Arlington
Modeling Sequential Data
• Suppose that we have weather data for several days.
– 𝑥1, 𝑥2, … , 𝑥𝑁
• Each 𝑥𝑛 is a binary value.
– 𝑥𝑛 = 1 if it rains on day 𝑛.
– 𝑥𝑛 = 0 if it does not rain on day 𝑛 (we call that a "sunny day").
• We want to learn a model that predicts if it is going to rain or not on a certain day, based on this data.
• What options do we have?
– Lots, as usual in machine learning.
Predicting Rain – Assuming Independence
• One option is to assume that the weather in any day is independent of the weather in any previous day.
• Thus, 𝑝(𝑥𝑛 | 𝑥1, … , 𝑥𝑛−1) = 𝑝(𝑥𝑛).
• Then, how can we compute 𝑝(𝑥)?
• If the weather of the past tells us nothing about the weather of the next day, then we can simply use the data to calculate how often it rains.
𝑝(𝑥 = 1) = (1/𝑁) ∑[𝑛=1 to 𝑁] 𝑥𝑛
• So, the probability that it rains on any day is simply the fraction of days in the training data when it rained.
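Under the independence assumption, this estimate is a one-liner. A minimal Python sketch (the rain sequence below is made-up illustrative data, not from the slides):

```python
# Hypothetical rain data: x[n] = 1 if it rained on day n, 0 if it was sunny.
x = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0]

# The only parameter: the fraction of rainy days in the training data.
p_rain = sum(x) / len(x)
print(p_rain)  # 0.4
```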
Predicting Rain – Assuming Independence
• If the weather of the past tells us nothing about the weather of the next day, then we can simply use the data to calculate how often it rains.
𝑝(𝑥 = 1) = (1/𝑁) ∑[𝑛=1 to 𝑁] 𝑥𝑛
• So, the probability that it rains on any day is simply the fraction of days in the training data when it rained.
• Advantages of this approach:
– Easy to apply. Only one parameter is estimated.
• Disadvantages:
– Not using all the information in the data. Past weather does correlate with the weather of the next day.
Predicting Rain – Modeling Dependence
• The other extreme is to assume that the weather of any day depends on the weather of the 𝐾 previous days.
• Thus, we have to learn the whole probability distribution 𝑝(𝑥𝑛 | 𝑥1, … , 𝑥𝑛−1) = 𝑝(𝑥𝑛 | 𝑥𝑛−𝐾, … , 𝑥𝑛−1).
• Advantages of this approach:
– Builds a more complex model that can capture more information about how past weather influences the weather of the next day.
• Disadvantages:
– The amount of data that is needed to reliably learn such a distribution is exponential in 𝐾.
– Even for relatively small values of 𝐾, like 𝐾 = 5, you may need thousands of training examples to learn the probabilities reliably.
Predicting Rain – Markov Chain
𝑝(𝑥𝑛 | 𝑥1, … , 𝑥𝑛−1) = 𝑝(𝑥𝑛 | 𝑥𝑛−𝐾, … , 𝑥𝑛−1)
• This probabilistic model, where an observation depends on the preceding 𝐾 observations, is called a 𝑲-th order Markov Chain.
• 𝐾 = 0 leads to a model that is too simple and inaccurate (the weather of any day does not depend on the weather of the previous days).
• A large value of 𝐾 may require more training data than we have.
• Choosing a good value of 𝐾 depends on the application, and on the amount of training data.
Predicting Rain – 1st Order Markov Chain
• It is very common to use 1st Order Markov Chains to model temporal dependencies.
𝑝(𝑥𝑛 | 𝑥1, … , 𝑥𝑛−1) = 𝑝(𝑥𝑛 | 𝑥𝑛−1)
• For the rain example, learning this model consists of estimating four values:
– 𝑝(𝑥𝑛 = 0 | 𝑥𝑛−1 = 0): probability of a sunny day after a sunny day.
– 𝑝(𝑥𝑛 = 1 | 𝑥𝑛−1 = 0): probability of a rainy day after a sunny day.
– 𝑝(𝑥𝑛 = 0 | 𝑥𝑛−1 = 1): probability of a sunny day after a rainy day.
– 𝑝(𝑥𝑛 = 1 | 𝑥𝑛−1 = 1): probability of a rainy day after a rainy day.
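These four values can be estimated by counting consecutive-day pairs in the training data. A Python sketch (the rain sequence and the helper `p` are illustrative inventions, not from the slides):

```python
from collections import Counter

# Hypothetical rain sequence: 1 = rain, 0 = sunny.
x = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0]

# Count (previous day, next day) pairs.
pairs = Counter(zip(x, x[1:]))

def p(next_w, prev_w):
    """Maximum-likelihood estimate of p(x_n = next_w | x_{n-1} = prev_w)."""
    total = pairs[(prev_w, 0)] + pairs[(prev_w, 1)]
    return pairs[(prev_w, next_w)] / total

print(p(1, 0))  # p(rain after sun)
print(p(0, 1))  # p(sun after rain)
```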
Visualizing a 1st Order Markov Chain
• This is called a state transition diagram.
• There are two states: rain and no rain.
• There are four transition probabilities, defining the probability of the next state given the previous one.
[Figure: state transition diagram with two states, "Rainy day" and "Sunny day", and four labeled transitions: p(sun after sun), p(rain after rain), p(rain after sun), p(sun after rain).]
Hidden States
• In our previous example, a state ("rainy day" or "sunny day") is observable.
• When that day comes, you can observe and find out if that day is rainy or sunny.
– In those cases, the learning problem can be how to predict future states, before we see them.
• There are also cases where the states are hidden.
– We cannot directly observe the value of a state.
– However, we can observe some features that depend on the state, and that can help us estimate the state.
– In those cases, the learning problem can be how to figure out the values of the states, given the observations.
Tree Rings and Temperatures
• Tree growth rings are visible in a cross-section of the tree trunk.
• Every year, the tree grows a new ring on the outside.
• Counting the rings can tell us about the age of the tree.
• The width of each ring contains information about the weather conditions that year (temperature, moisture, …).
Source: Wikipedia
Modeling Tree Rings
• At this point, we stop worrying about the actual science of how exactly tree ring width correlates with climate.
• For the sake of illustration, we will make a simple assumption:
– The tree ring tends to be wider when the average temperature for that year is higher.
• So, the trunk of a 1,000 year-old tree gives us information about the mean temperature for each of the last 1,000 years.
• How do we model that information?
• We have two sequences:
– Sequence of observations: a sequence of widths: 𝑥1, 𝑥2, … , 𝑥𝑁.
– Sequence of hidden states: a sequence of temperatures: 𝑧1, 𝑧2, … , 𝑧𝑁.
• We want to find the most likely sequence of state values 𝑧1, 𝑧2, … , 𝑧𝑁, given the observations 𝑥1, 𝑥2, … , 𝑥𝑁.
Modeling Tree Rings
• We have two sequences:
– Sequence of observations: a sequence of widths: 𝑥1, 𝑥2, … , 𝑥𝑁.
– Sequence of hidden states: a sequence of temperatures: 𝑧1, 𝑧2, … , 𝑧𝑁.
• We want to find the most likely sequence of state values 𝑧1, 𝑧2, … , 𝑧𝑁, given the observations 𝑥1, 𝑥2, … , 𝑥𝑁.
• Assume that we have training data:
– Other sequences of tree ring widths, for which we know the corresponding temperatures.
• What can we learn from this training data?
• One approach is to learn 𝑝(𝑧𝑛| 𝑥𝑛): the probability of the mean temperature 𝑧𝑛 for some year given the ring width 𝑥𝑛 for that year.
• Then, for each 𝑧𝑛 we pick the value maximizing 𝑝(𝑧𝑛| 𝑥𝑛).
• Can we build a better model than this?
Hidden Markov Model
• The previous model simply estimated 𝑝(𝑧 | 𝑥).
– It ignored the fact that the mean temperature in a year depends on the mean temperature of the previous year.
• Taking that dependency into account, we can estimate temperatures with better accuracy.
• We can use the training data to learn a better model, as follows:
– Learn 𝑝(𝑥 | 𝑧): the probability of a tree ring width given the mean temperature for that year.
– Learn 𝑝(𝑧𝑛 |𝑧𝑛−1): the probability of mean temperature for a year given the mean temperature for the previous year.
• Such a model is called a Hidden Markov Model.
Hidden Markov Model
• A Hidden Markov Model (HMM) is a model for how sequential data evolves.
• An HMM makes the following assumptions:
– States are hidden.
– States are modeled as 1st order Markov Chains. That is:
𝑝(𝑧𝑛 | 𝑧1, … , 𝑧𝑛−1) = 𝑝(𝑧𝑛 | 𝑧𝑛−1)
– Observation 𝑥𝑛 is conditionally independent of all other states and observations, given the value of state 𝑧𝑛. That is:
𝑝(𝑥𝑛 | 𝑥1, … , 𝑥𝑛−1, 𝑥𝑛+1, … , 𝑥𝑁, 𝑧1, … , 𝑧𝑁) = 𝑝(𝑥𝑛 | 𝑧𝑛)
Hidden Markov Model
• Given the previous assumptions, an HMM consists of:
– A set of states 𝑠1, … , 𝑠𝐾.
• In the tree ring example, the states can be intervals of temperatures. For example, 𝑠𝑘 can be the state corresponding to the mean temperature (in Celsius) being in the [𝑘, 𝑘 + 1) interval.
Hidden Markov Model
• Given the previous assumptions, an HMM consists of:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘 = 𝑝(𝑧1 = 𝑠𝑘).
• 𝜋𝑘 defines the probability that, when we are given a new set of observations 𝑥1, … , 𝑥𝑁, the initial state 𝑧1 is equal to 𝑠𝑘.
• For the tree ring example, 𝜋𝑘 can be defined as the probability that the mean temperature in the first year is equal to 𝑘.
Hidden Markov Model
• Given the previous assumptions, an HMM consists of:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘 = 𝑝(𝑧1 = 𝑠𝑘).
– A state transition matrix 𝑨, of size 𝐾 × 𝐾, where
𝐴𝑘,𝑗 = 𝑝(𝑧𝑛 = 𝑠𝑗 | 𝑧𝑛−1 = 𝑠𝑘)
• Values 𝐴𝑘,𝑗 are called transition probabilities.
• For the tree ring example, 𝐴𝑘,𝑗 is the conditional probability that the mean temperature for a certain year is 𝑗, given that the mean temperature in the previous year is 𝑘.
Hidden Markov Model
• Given the previous assumptions, an HMM consists of:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘 = 𝑝(𝑧1 = 𝑠𝑘).
– A state transition matrix 𝑨, of size 𝐾 × 𝐾, where
𝐴𝑘,𝑗 = 𝑝(𝑧𝑛 = 𝑠𝑗 | 𝑧𝑛−1 = 𝑠𝑘)
– Observation probability functions, also called emission probabilities, defined as:
𝜑𝑘(𝑥) = 𝑝(𝑥𝑛 = 𝑥 | 𝑧𝑛 = 𝑠𝑘)
• For the tree ring example, 𝜑𝑘(𝑥) is the probability of getting ring width 𝑥 in a specific year, if the temperature for that year is 𝑘.
Visualizing the Tree Ring HMM
• Assumption: temperature discretized to four values, so that we have four state values.
• The vertices show the four states.
• The edges show legal transitions between states.
[Figure: fully connected state transition diagram over four states 𝑠1, 𝑠2, 𝑠3, 𝑠4.]
Visualizing the Tree Ring HMM
• The edges show legal transitions between states.
• Each directed edge has its own probability (not shown here).
• This is a fully connected model, where any state can follow any other state. An HMM does not have to be fully connected.
Joint Probability Model
• A fully specified HMM defines a joint probability function 𝑝(𝑋, 𝑍).
– 𝑋 is the sequence of observations 𝑥1, … , 𝑥𝑁.
– 𝑍 is the sequence of hidden state values 𝑧1, … , 𝑧𝑁.
𝑝(𝑋, 𝑍) = 𝑝(𝑥1, … , 𝑥𝑁, 𝑧1, … , 𝑧𝑁)
= 𝑝(𝑧1, … , 𝑧𝑁) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
– Why? Because of the assumption that 𝑥𝑛 is conditionally independent of all other observations and states, given 𝑧𝑛.
Joint Probability Model
• A fully specified HMM defines a joint probability function 𝑝(𝑋, 𝑍).
𝑝(𝑋, 𝑍) = 𝑝(𝑧1, … , 𝑧𝑁) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
= 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
– Why? Because states are modeled as 1st order Markov Chains, so that 𝑝(𝑧𝑛 | 𝑧1, … , 𝑧𝑛−1) = 𝑝(𝑧𝑛 | 𝑧𝑛−1).
Joint Probability Model
• A fully specified HMM defines a joint probability function 𝑝(𝑋, 𝑍).
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
– 𝑝(𝑧1) is computed using the values 𝜋𝑘.
– 𝑝(𝑧𝑛 | 𝑧𝑛−1) is computed using the transition matrix 𝑨.
– 𝑝(𝑥𝑛 | 𝑧𝑛) is computed using the observation probabilities 𝜑𝑘(𝑥).
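Given 𝜋, 𝑨, and 𝜑, evaluating 𝑝(𝑋, 𝑍) for concrete sequences is a direct product of these three ingredients. A Python sketch with a tiny hypothetical two-state HMM over binary observations (all numbers invented for illustration):

```python
# Hypothetical HMM with K = 2 states and discrete observations {0, 1}.
pi = [0.6, 0.4]                  # pi[k] = p(z1 = s_k)
A = [[0.7, 0.3],                 # A[k][j] = p(z_n = s_j | z_{n-1} = s_k)
     [0.2, 0.8]]
phi = [[0.9, 0.1],               # phi[k][x] = p(x_n = x | z_n = s_k)
       [0.3, 0.7]]

def joint_prob(X, Z):
    """p(X, Z) = p(z1) * prod_{n=2}^{N} p(zn | zn-1) * prod_{n=1}^{N} p(xn | zn)."""
    p = pi[Z[0]]
    for n in range(1, len(Z)):
        p *= A[Z[n - 1]][Z[n]]
    for x, z in zip(X, Z):
        p *= phi[z][x]
    return p

print(joint_prob(X=[0, 1, 1], Z=[0, 0, 1]))  # 0.6 * 0.7 * 0.3 * 0.9 * 0.1 * 0.7
```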
Modeling the Digit 2
• Suppose that we want to model the motion of a hand, as it traces in the air the shape of the digit "2".
• Here is one possible model:
– We represent the shape of the digit "2" as five line segments.
– Each line segment corresponds to a hidden state.
– This gives us five hidden states.
• We will also have a special end state 𝑠6, which signifies "end of observations".
[Figure: the digit "2" traced as five line segments, one per state 𝑠1 through 𝑠5, followed by the end state 𝑠6.]
Modeling the Digit 2
• Suppose that we want to model the motion of a hand, as it traces in the air the shape of the digit "2".
• Here is one possible model:
– We represent the shape of the digit "2" as five line segments.
– Each line segment corresponds to a hidden state.
– We end up with five states, plus the end state.
• This HMM is a forward model:
– If 𝑧𝑛 = 𝑠𝑘, then 𝑧𝑛+1 = 𝑠𝑘 or 𝑧𝑛+1 = 𝑠𝑘+1.
– This is similar to the monotonicity rule in DTW.
Modeling the Digit 2
• This HMM is a forward model:
– If 𝑧𝑛 = 𝑠𝑘, then 𝑧𝑛+1 = 𝑠𝑘 or 𝑧𝑛+1 = 𝑠𝑘+1.
• Therefore, 𝐴𝑘,𝑗 = 0, except when
𝑘 = 𝑗 or 𝑘 + 1 = 𝑗.
– Remember, 𝐴𝑘,𝑗 = 𝑝(𝑧𝑛 = 𝑠𝑗 | 𝑧𝑛−1 = 𝑠𝑘).
• The feature vector at each video frame 𝑛 can be the displacement vector:
– The difference between the pixel location of the hand at frame 𝑛 and the pixel location of the hand at frame 𝑛 − 1.
Modeling the Digit 2
• So, each observation 𝑥𝑛 is a 2D vector.
• It will be convenient to describe each 𝑥𝑛 with these two numbers:
– Its length 𝑙𝑛, measured in pixels.
– Its orientation 𝜃𝑛, measured in degrees.
• Lengths 𝑙𝑛 come from a Gaussian distribution with mean 𝜇𝑙,𝑘 and variance 𝜎𝑙,𝑘 that depend on the state 𝑠𝑘.
• Orientations 𝜃𝑛 come from a Gaussian distribution with mean 𝜇𝜃,𝑘 and variance 𝜎𝜃,𝑘 that also depend on the state 𝑠𝑘.
Modeling the Digit 2
• The decisions we have made so far are typically made by a human designer of the system:
– The number of states.
– The topology of the model (fully connected, forward, or other variations).
– The features that we want to use.
– The way to model observation probabilities (e.g., using Gaussians, Gaussian mixtures, histograms, etc.).
• Once those decisions have been made, the actual probabilities are typically learned using training data.
– The initial state probability function 𝜋𝑘 = 𝑝(𝑧1 = 𝑠𝑘).
– The transition matrix 𝑨, where 𝐴𝑘,𝑗 = 𝑝(𝑧𝑛 = 𝑠𝑗 | 𝑧𝑛−1 = 𝑠𝑘).
– The observation probabilities, 𝜑𝑘(𝑥) = 𝑝(𝑥𝑛 = 𝑥 | 𝑧𝑛 = 𝑠𝑘).
Modeling the Digit 2
• The actual probabilities are typically learned using training data:
– The initial state probability function 𝜋𝑘 = 𝑝(𝑧1 = 𝑠𝑘).
– The transition matrix 𝑨, where
𝐴𝑘,𝑗 = 𝑝(𝑧𝑛 = 𝑠𝑗 | 𝑧𝑛−1 = 𝑠𝑘).
– The observation probabilities 𝜑𝑘(𝑥) = 𝑝(𝑥𝑛 = 𝑥 | 𝑧𝑛 = 𝑠𝑘).
• Before we see the algorithm for learning these probabilities, we will first see how we can use an HMM after it has been trained:
– That is, after all these probabilities have been estimated.
• To do that, we will look at an example where we just specify these probabilities manually.
Defining the Probabilities
• An HMM is defined by specifying:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘.
– A state transition matrix 𝑨.
– Observation probability functions.
• In this case:
– We have five states 𝑠1, … , 𝑠5.
– How do we define 𝜋𝑘?
Defining the Probabilities
• An HMM is defined by specifying:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘.
– A state transition matrix 𝑨.
– Observation probability functions.
• In this case:
– We have five states 𝑠1, … , 𝑠5.
– How do we define 𝜋𝑘?
– 𝜋1 = 1, and 𝜋𝑘 = 0 for 𝑘 > 1.
Defining the Probabilities
• An HMM is defined by specifying:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘.
– A state transition matrix 𝑨.
– Observation probability functions.
• How do we define transition matrix 𝑨?
Defining the Probabilities
• An HMM is defined by specifying:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘.
– A state transition matrix 𝑨.
– Observation probability functions.
• We need to decide on values for each 𝐴𝑘,𝑘.
– In this model, we spend more time on states 𝑠4 and 𝑠5 than on the other states.
– This can be modeled by having higher values for 𝐴4,4 and 𝐴5,5 than for 𝐴1,1, 𝐴2,2, 𝐴3,3.
– This way, if 𝑧𝑛 = 𝑠4, then 𝑧𝑛+1 is more likely to also be 𝑠4, and overall state 𝑠4 lasts longer than states 𝑠1, 𝑠2, 𝑠3.
Defining the Probabilities
• An HMM is defined by specifying:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘.
– A state transition matrix 𝑨.
– Observation probability functions.
• We need to decide on values for each 𝐴𝑘,𝑘.
– In this model, we spend more time on states 𝑠4 and 𝑠5 than on the other states.
– Here is a set of values that can represent that:
𝐴1,1 = 0.4, 𝐴2,2 = 0.4, 𝐴3,3 = 0.4, 𝐴4,4 = 0.8, 𝐴5,5 = 0.7
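One way to quantify "lasts longer" (a consequence of the transition model, not stated explicitly on the slides): with self-transition probability 𝐴𝑘,𝑘, the number of consecutive time steps spent in state 𝑠𝑘 is geometrically distributed, so its expected value is 1/(1 − 𝐴𝑘,𝑘). A quick Python check with the values above:

```python
# Self-transition probabilities from the slide; expected duration in state k
# under a geometric model is 1 / (1 - A_kk).
A_self = {"s1": 0.4, "s2": 0.4, "s3": 0.4, "s4": 0.8, "s5": 0.7}

durations = {state: 1 / (1 - a) for state, a in A_self.items()}
for state, d in durations.items():
    print(state, round(d, 2))  # s4 stays ~5 frames on average, s1-s3 only ~1.67
```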
Defining the Probabilities
• An HMM is defined by specifying:
– A set of states 𝑠1, … , 𝑠𝐾.
– An initial state probability function 𝜋𝑘.
– A state transition matrix 𝑨.
– Observation probability functions.
• Here is the resulting transition matrix 𝑨:
0.4 0.6 0.0 0.0 0.0 0.0
0.0 0.4 0.6 0.0 0.0 0.0
0.0 0.0 0.4 0.6 0.0 0.0
0.0 0.0 0.0 0.8 0.2 0.0
0.0 0.0 0.0 0.0 0.7 0.3
0.0 0.0 0.0 0.0 0.0 0.0
Defining the Probabilities
• As we said before, each observation 𝑥𝑛 is a 2D vector, described by 𝑙𝑛 and 𝜃𝑛:
– 𝑙𝑛 is the length, measured in pixels.
– 𝜃𝑛 is the orientation, measured in degrees.
• We can model 𝑝(𝑙𝑛) as a Gaussian 𝑁𝑙, with:
– mean 𝜇𝑙 = 10 pixels.
– variance 𝜎𝑙 = 1.5 pixels.
– Neither the mean nor the variance depends on the state.
• We can model 𝜃𝑛 as a Gaussian 𝑁𝜃 with:
– mean 𝜇𝜃,𝑘 that depends on the state 𝒔𝒌. Obviously, each state corresponds to moving at a different orientation.
– variance 𝜎𝜃 = 10 degrees. This way, 𝜎𝜃 does not depend on the state.
Defining the Probabilities
• We define observation probability functions 𝜑𝑘 as:
𝜑𝑘(𝑥𝑛) = [1 / (𝜎𝑙 √(2𝜋))] 𝑒^(−(𝑙𝑛 − 𝜇𝑙)² / (2𝜎𝑙²)) · [1 / (𝜎𝜃 √(2𝜋))] 𝑒^(−(𝜃𝑛 − 𝜇𝜃,𝑘)² / (2𝜎𝜃²))
• For the parameters in the above formula, we (manually) pick these values:
– 𝜇𝜃,1 = 45 degrees.
– 𝜇𝜃,2 = 0 degrees.
– 𝜇𝜃,3 = −60 degrees.
– 𝜇𝜃,4 = −120 degrees.
– 𝜇𝜃,5 = 0 degrees.
– 𝜇𝑙 = 10 pixels.
– 𝜎𝑙 = 1.5 pixels.
– 𝜎𝜃 = 10 degrees.
Defining the Probabilities
• As we said before, each observation 𝑥𝑛 is a 2D vector, described by 𝑙𝑛 and 𝜃𝑛:
– 𝑙𝑛 is the length, measured in pixels.
– 𝜃𝑛 is the orientation, measured in degrees.
• We define observation probability functions 𝜑𝑘 as:
𝜑𝑘(𝑥𝑛) = 𝑁𝑙(𝑙𝑛) · 𝑁𝜃,𝑘(𝜃𝑛)
𝜑𝑘(𝑥𝑛) = [1 / (𝜎𝑙 √(2𝜋))] 𝑒^(−(𝑙𝑛 − 𝜇𝑙)² / (2𝜎𝑙²)) · [1 / (𝜎𝜃 √(2𝜋))] 𝑒^(−(𝜃𝑛 − 𝜇𝜃,𝑘)² / (2𝜎𝜃²))
• Note: in the above formula for 𝜑𝑘(𝑥𝑛), the only part that depends on the state 𝑠𝑘 is 𝜇𝜃,𝑘.
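Evaluating 𝜑𝑘(𝑥𝑛) numerically is straightforward. A Python sketch (function names are illustrative; 1.5 and 10 are treated as variances, which matches the Matlab line randn(1)*sqrt(1.5) + 10 used later in the slides):

```python
import math

def gaussian(x, mu, var):
    """Univariate Gaussian density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Parameters from the slides.
mu_l, var_l = 10.0, 1.5
mu_theta = {1: 45.0, 2: 0.0, 3: -60.0, 4: -120.0, 5: 0.0}
var_theta = 10.0

def phi(k, l_n, theta_n):
    """phi_k(x_n) = N(l_n; mu_l, var_l) * N(theta_n; mu_theta[k], var_theta)."""
    return gaussian(l_n, mu_l, var_l) * gaussian(theta_n, mu_theta[k], var_theta)

# An observation of length 8.4 at 54 degrees is far more likely under s1
# (orientation mean 45 deg) than under s2 (orientation mean 0 deg):
print(phi(1, 8.4, 54) > phi(2, 8.4, 54))  # True
```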
An HMM as a Generative Model
• If we have an HMM whose parameters have already been learned, we can use that HMM to generate data randomly sampled from the joint distribution defined by the HMM:
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
• We will now see how to jointly generate a random observation sequence 𝑥1, … , 𝑥𝑁 and a random hidden state sequence 𝑧1, … , 𝑧𝑁, based on the distribution 𝑝(𝑋, 𝑍) defined by the HMM.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 =
𝑍 =
• Step 1: pick a random 𝑧1, based on initial state probabilities 𝜋𝑘.
• Remember: 𝜋𝑘 = 𝑝(𝑧1 = 𝑠𝑘)
• What values of 𝑧1 are legal in our example?
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 =
𝑍 = 𝑠1
• Step 1: pick a random 𝑧1, based on initial state probabilities 𝜋𝑘.
• Remember: 𝜋𝑘 = 𝑝(𝑧1 = 𝑠𝑘)
• What values of 𝑧1 are legal in our example?
• 𝜋𝑘 > 0 only for 𝑘 = 1.
• Therefore, it has to be that 𝑧1 = 𝑠1.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 =
𝑍 = 𝑠1
• Next step: pick a random 𝑥1, based on observation probabilities 𝜑𝑘(𝑥).
• Which 𝜑𝑘 should we use?
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 =
𝑍 = 𝑠1
• Next step: pick a random 𝑥1, based on observation probabilities 𝜑𝑘(𝑥).
• We choose an 𝑙1 randomly from Gaussian 𝑁𝑙, with mean 10 pixels and variance 1.5 pixels.
• In Matlab, you can do this with this line:
l1 = randn(1)*sqrt(1.5) + 10
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.35, ? )
𝑍 = 𝑠1
• Next step: pick a random 𝑥1, based on observation probabilities 𝜑𝑘(𝑥).
• We choose an 𝑙1 randomly from Gaussian 𝑁𝑙, with mean 10 pixels and variance 1.5 pixels.
• Result (obviously, will differ each time): 8.35 pixels.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4,54)
𝑍 = 𝑠1
• Next step: pick a random 𝑥1, based on observation probabilities 𝜑𝑘(𝑥).
• We choose a 𝜃1 randomly from Gaussian 𝑁𝜃,1, with mean 45 degrees and variance 10 degrees.
• Result (obviously, will differ each time): 54 degrees
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4,54)
𝑍 = 𝑠1
• Next step: pick a random 𝑧2.
• What distribution should we draw 𝑧2 from?
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4,54)
𝑍 = 𝑠1
• Next step: pick a random 𝑧2.
• What distribution should we draw 𝑧2 from?
• We should use 𝑝(𝑧2 | 𝑧1 = 𝑠1). Where is that stored?
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4,54)
𝑍 = 𝑠1
• Next step: pick a random 𝑧2.
• What distribution should we draw 𝑧2 from?
• We should use 𝑝(𝑧2 | 𝑧1 = 𝑠1). Where is that stored?
• On the first row of state transition matrix 𝑨.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4,54)
𝑍 = 𝑠1, 𝑠2
• Next step: pick a random 𝑧2, from distribution 𝑝(𝑧2 | 𝑧1 = 𝑠1).
• The relevant values are:
– 𝐴1,1 = 𝑝(𝑧2 = 𝑠1 | 𝑧1 = 𝑠1) = 0.4
– 𝐴1,2 = 𝑝(𝑧2 = 𝑠2 | 𝑧1 = 𝑠1) = 0.6
• Picking randomly we get 𝑧2 = 𝑠2.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4,54)
𝑍 = 𝑠1, 𝑠2
• Next step?
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, ?)
𝑍 = 𝑠1, 𝑠2
• Next step: pick a random 𝑥2, based on observation density 𝜑2(𝑥).
• We choose an 𝑙2 randomly from Gaussian 𝑁𝑙, with mean 10 pixels and variance 1.5 pixels.
• Result: 10.8 pixels.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2)
𝑍 = 𝑠1, 𝑠2
• Next step: pick a random 𝑥2, based on observation density 𝜑2(𝑥).
• We choose a 𝜃2 randomly from Gaussian 𝑁𝜃,2, with mean 0 degrees and variance 10 degrees.
• Result: 2 degrees
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2)
𝑍 = 𝑠1, 𝑠2
• Next step?
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2)
𝑍 = 𝑠1, 𝑠2, 𝑠2
• Next step: pick a random 𝑧3, from distribution 𝑝(𝑧3 | 𝑧2 = 𝑠2).
• The relevant values are:
– 𝐴2,2 = 𝑝(𝑧3 = 𝑠2 | 𝑧2 = 𝑠2) = 0.4
– 𝐴2,3 = 𝑝(𝑧3 = 𝑠3 | 𝑧2 = 𝑠2) = 0.6
• Picking randomly we get 𝑧3 = 𝑠2.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2), (11.3, ?)
𝑍 = 𝑠1, 𝑠2, 𝑠2
• Next step: pick a random 𝑥3, based on observation density 𝜑2(𝑥).
• We choose an 𝑙3 randomly from Gaussian 𝑁𝑙, with mean 10 pixels and variance 1.5 pixels.
• Result: 11.3 pixels.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2), (11.3, −3)
𝑍 = 𝑠1, 𝑠2, 𝑠2
• Next step: pick a random 𝑥3, based on observation density 𝜑2(𝑥).
• We choose a 𝜃3 randomly from Gaussian 𝑁𝜃,2, with mean 0 degrees and variance 10 degrees.
• Result: −3 degrees.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2), (11.3, −3)
𝑍 = 𝑠1, 𝑠2, 𝑠2, 𝑠3
• Next step: pick a random 𝑧4, from distribution 𝑝(𝑧4 | 𝑧3 = 𝑠2).
• The relevant values are:
– 𝐴2,2 = 𝑝(𝑧4 = 𝑠2 | 𝑧3 = 𝑠2) = 0.4
– 𝐴2,3 = 𝑝(𝑧4 = 𝑠3 | 𝑧3 = 𝑠2) = 0.6
• Picking randomly we get 𝑧4 = 𝑠3.
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2), (11.3, −3), …
𝑍 = 𝑠1, 𝑠2, 𝑠2, 𝑠3, …
• Overall, this is an iterative process:
– Pick randomly a new state 𝑧𝑛 = 𝑠𝑘, based on the state transition probabilities.
– Pick randomly a new observation 𝑥𝑛, based on observation density 𝜑𝑘(𝑥).
• When do we stop?
Generating Random Data
𝑝(𝑋, 𝑍) = 𝑝(𝑧1) ∏[𝑛=2 to 𝑁] 𝑝(𝑧𝑛 | 𝑧𝑛−1) ∏[𝑛=1 to 𝑁] 𝑝(𝑥𝑛 | 𝑧𝑛)
𝑋 = (8.4, 54), (10.8, 2), (11.3, −3), …
𝑍 = 𝑠1, 𝑠2, 𝑠2, 𝑠3, … , 𝑠6
• Overall, this is an iterative process:
– Pick randomly a new state 𝑧𝑛 = 𝑠𝑘, based on the state transition probabilities.
– Pick randomly a new observation 𝑥𝑛, based on observation density 𝜑𝑘(𝑥).
• We stop when we get 𝑧𝑛 = 𝑠6, since 𝑠6 is the end state.
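The whole iterative process can be sketched as ancestral sampling in Python. The parameters follow the digit-"2" model from the slides; the function name and the use of Python's random module are illustrative choices, not from the slides:

```python
import random

random.seed(0)  # only for reproducibility

# Parameters from the digit-"2" model: forward topology, end state s6 (index 5).
pi = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
A = [
    [0.4, 0.6, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
mu_theta = [45.0, 0.0, -60.0, -120.0, 0.0]        # per-state orientation means (degrees)
mu_l = 10.0
std_l, std_theta = 1.5 ** 0.5, 10.0 ** 0.5        # stds from variances 1.5 and 10

def sample_sequence():
    """Ancestral sampling: draw z1 from pi, then alternate emitting xn and drawing zn+1."""
    X, Z = [], []
    z = random.choices(range(6), weights=pi)[0]
    while z != 5:                                  # stop at the end state s6
        Z.append(z)
        length = random.gauss(mu_l, std_l)         # l_n ~ N_l
        theta = random.gauss(mu_theta[z], std_theta)  # theta_n ~ N_{theta,k}
        X.append((length, theta))
        z = random.choices(range(6), weights=A[z])[0]
    return X, Z

X, Z = sample_sequence()
print(Z)  # always starts at s1 (index 0) and only moves forward one state at a time
```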
An Example of Synthetic Data
• The textbook shows this figure as an example of how synthetic data is generated.
• The top row shows some of the training images used to train a model of the digit "2".
– Not identical to the model we described before, but along the same lines.
• The bottom row shows three examples of synthetic patterns, generated using the approach we just described.
• What do you notice in the synthetic data?
An Example of Synthetic Data
• The synthetic data is not very realistic.
• The problem is that some states last longer than they should, and some states end sooner than they should.
• For example:
– In the leftmost synthetic example, the top curve is too big relative to the rest of the pattern.
– In the middle synthetic example, the diagonal line at the middle is too long.
– In the rightmost synthetic example, the top curve is too small relative to the bottom horizontal line.
An Example of Synthetic Data
• Why do we get this problem of disproportionate parts?
• As we saw earlier, each next state is chosen randomly, based on transition probabilities.
• There is no "memory" to say that, e.g.,, if the top curve is big (or small), the rest of the pattern should be proportional to that.
• This is the price we pay for the Markovian assumption, that the future is independent of the past, given the current state.
– The benefit of the Markovian assumption is efficient learning and classification algorithms, as we will see.
HMMs: Next Steps
• We have seen how HMMs are defined.
– Set of states.
– Initial state probabilities.
– State transition matrix.
– Observation probabilities.
• We have seen how an HMM defines a probability distribution 𝑝 𝑋, 𝑍 .
– We have also seen how to generate random samples from that distribution.
• Next we will see:
– How to use HMMs for various tasks, like classification.
– How to learn HMMs from training data.