Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.
Part of Speech Tagging & Hidden Markov Models
Mitch Marcus
CSE 391
CIS 391 - Intro to AI 2
NLP Task I – Determining Part of Speech Tags

The Problem:

  Word   POS listing in the Brown Corpus
  heat   noun, verb
  oil    noun
  in     prep, noun, adv
  a      det, noun, noun-proper
  large  adj, noun, adv
  pot    noun
NLP Task I – Determining Part of Speech Tags

The Old Solution: Depth-First Search
• If each of n words has k tags on average, try the k^n combinations until one works.

Machine Learning Solutions: automatically learn Part of Speech (POS) assignment
• The best techniques achieve 97%+ accuracy per word on new materials, given large training corpora.
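The blow-up that makes the old solution impractical is easy to quantify; the numbers below are illustrative, not from the slides:

```python
# With k candidate tags per word on average, an n-word string has k**n
# possible tag assignments for a depth-first search to try.
k, n = 3, 10          # illustrative: 3 tags/word, a 10-word sentence
print(k ** n)         # 59049 candidate tag sequences
```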
What is POS tagging good for?

Speech synthesis:
• How to pronounce "lead"?
• INsult vs. inSULT
• OBject vs. obJECT
• OVERflow vs. overFLOW
• DIScount vs. disCOUNT
• CONtent vs. conTENT

Stemming for information retrieval:
• Knowing a word is a noun tells you it gets plurals.
• Can search for "aardvarks", get "aardvark".

Parsing, speech recognition, etc.:
• Possessive pronouns (my, your, her) are followed by nouns.
• Personal pronouns (I, you, he) are likely to be followed by verbs.
Equivalent Problem in Bioinformatics

(Durbin et al., Biological Sequence Analysis, Cambridge University Press)

Several applications, e.g. proteins:
  From primary structure    ATCPLELLLD
  Infer secondary structure HHHBBBBBC
Penn Treebank Tagset I

  Tag   Description                              Example
  CC    coordinating conjunction                 and
  CD    cardinal number                          1, third
  DT    determiner                               the
  EX    existential there                        there is
  FW    foreign word                             d'oeuvre
  IN    preposition/subordinating conjunction    in, of, like
  JJ    adjective                                green
  JJR   adjective, comparative                   greener
  JJS   adjective, superlative                   greenest
  LS    list marker                              1)
  MD    modal                                    could, will
  NN    noun, singular or mass                   table
  NNS   noun, plural                             tables
  NNP   proper noun, singular                    John
  NNPS  proper noun, plural                      Vikings
Penn Treebank Tagset II

  Tag    Description              Example
  PDT    predeterminer            both the boys
  POS    possessive ending        friend's
  PRP    personal pronoun         I, me, him, he, it
  PRP$   possessive pronoun       my, his
  RB     adverb                   however, usually, here, good
  RBR    adverb, comparative      better
  RBS    adverb, superlative      best
  RP     particle                 give up
  TO     to                       to go, to him
  UH     interjection             uh, huh
Penn Treebank Tagset III

  Tag    Description                          Example
  VB     verb, base form                      take
  VBD    verb, past tense                     took
  VBG    verb, gerund/present participle      taking
  VBN    verb, past participle                taken
  VBP    verb, singular present, non-3rd      take
  VBZ    verb, 3rd person singular present    takes
  WDT    wh-determiner                        which
  WP     wh-pronoun                           who, what
  WP$    possessive wh-pronoun                whose
  WRB    wh-adverb                            where, when
Simple Statistical Approaches: Idea 1
Simple Statistical Approaches: Idea 2

For a string of words
  W = w_1 w_2 w_3 … w_n
find the string of POS tags
  T = t_1 t_2 t_3 … t_n
which maximizes P(T | W)
• i.e., the most likely POS tag t_i for each word w_i, given its surrounding context.
The Sparse Data Problem …

A Simple, Impossible Approach to Compute P(T | W):
Count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string.
A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with a million words of hand-tagged training data?
• Assume a uniform distribution of 5,000 words and 40 part-of-speech tags.

Rich models often require vast amounts of data.
Good estimates of models with bad assumptions often outperform better models that are badly estimated.
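The back-of-the-envelope arithmetic under the slide's uniformity assumption can be spelled out. The parameter counts below follow the standard bigram/trigram tagging models; they are my working of the estimate, not figures stated on the slide:

```python
V, T, N = 5000, 40, 1_000_000      # vocabulary, tagset, hand-tagged tokens

emission = V * T                   # P(w | t) parameters: 200,000
tag_bigram = T * T                 # P(t_i | t_{i-1}) parameters: 1,600
tag_trigram = T ** 3               # P(t_i | t_{i-2}, t_{i-1}) parameters: 64,000

# Average training observations available per parameter:
print(N / emission)                # 5.0    -- emissions are data-starved
print(N / tag_bigram)              # 625.0  -- the bigram tag model is estimable
print(N / tag_trigram)             # 15.625
```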
A Practical Statistical Tagger
A Practical Statistical Tagger II

But we can't accurately estimate more than tag bigrams or so…
Again, we change to a model that we CAN estimate.
A Practical Statistical Tagger III

So, for a given string W = w_1 w_2 w_3 … w_n, the tagger needs to find the string of tags T which maximizes:
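The maximized quantity, in the standard bigram-HMM formulation (an assumption consistent with the surrounding slides; the slide's own formula is not reproduced in the transcript), is:

```latex
\hat{T} \;=\; \arg\max_{T} P(T \mid W)
\;\approx\; \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)
```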
Training and Performance

To estimate the parameters of this model, we use counts from an annotated training corpus.
Because many of these counts are small, smoothing is necessary for best results…
Such taggers typically achieve about 95–96% correct tagging, for tag sets of 40–80 tags.
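A minimal sketch of the estimation step, assuming a corpus of (word, tag) sentences; the function name and the unsmoothed relative-frequency estimates are illustrative (the slide itself notes smoothing is needed for small counts):

```python
from collections import defaultdict

def train_bigram_hmm(tagged_sentences):
    """Relative-frequency estimates of P(t_i | t_{i-1}) and P(w | t)
    from sentences of (word, tag) pairs. No smoothing (sketch only)."""
    trans = defaultdict(lambda: defaultdict(int))   # counts of (t_{i-1}, t_i)
    emit = defaultdict(lambda: defaultdict(int))    # counts of (t, w)
    for sent in tagged_sentences:
        prev = "<s>"                                # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    # Normalize counts into conditional probabilities.
    p_trans = {p: {t: c / sum(cs.values()) for t, c in cs.items()}
               for p, cs in trans.items()}
    p_emit = {t: {w: c / sum(cs.values()) for w, c in cs.items()}
              for t, cs in emit.items()}
    return p_trans, p_emit
```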
Hidden Markov Models
This model is an instance of a Hidden Markov Model. Viewed graphically:

[Figure: a four-state HMM over the tags Det, Adj, Noun, and Verb, with transition probabilities on the arcs and per-state emission tables, e.g. P(w|Det): a .4, the .4; P(w|Adj): good .02, low .04; P(w|Noun): price .001, deal .0001]
Viewed as a generator, an HMM:

[Figure: the same four-state HMM (Det, Adj, Noun, Verb) drawn as a generator, with emission tables P(w|Det): the .4, a .4; P(w|Adj): low .04, good .02; P(w|Noun): deal .0001, price .001]
Recognition using an HMM
A Practical Statistical Tagger IV

Finding this maximum can be done using an exponential search through all strings for T.
However, there is a linear-time solution using dynamic programming, called Viterbi decoding.
Parameters of an HMM

States: a set of states S = s_1, …, s_n.
Transition probabilities: A = a_11, a_12, …, a_nn. Each a_ij represents the probability of transitioning from state s_i to s_j.
Emission probabilities: a set B of functions of the form b_i(o_t), the probability of observation o_t being emitted by s_i.
Initial state distribution: π_i is the probability that s_i is a start state.
The Three Basic HMM Problems

Problem 1 (Evaluation): Given the observation sequence O = o_1 … o_T and an HMM model λ = (A, B, π), how do we compute the probability of O given the model?

Problem 2 (Decoding): Given the observation sequence O = o_1 … o_T and an HMM model λ = (A, B, π), how do we find the state sequence that best explains the observations?

(This and following slides follow the classic formulation by Rabiner and Juang, as adapted by Manning and Schütze. Slides adapted from Dorr.)
The Three Basic HMM Problems

Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
Problem 1: Probability of an Observation Sequence

What is P(O | λ)? The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
Naïve computation is very expensive: given T observations and N states, there are N^T possible state sequences.
Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.
The solution, for this and for Problem 2, is dynamic programming.
The Trellis
Forward Probabilities

What is the probability that, given an HMM λ, at time t the state is i and the partial observation o_1 … o_t has been generated?

  α_t(i) = P(o_1 … o_t, q_t = s_i | λ)
Forward Probabilities

  α_t(j) = [ Σ_{i=1}^{N} α_{t−1}(i) a_ij ] b_j(o_t)

  α_t(i) = P(o_1 … o_t, q_t = s_i | λ)
Forward Algorithm

Initialization:  α_1(i) = π_i b_i(o_1),  1 ≤ i ≤ N

Induction:  α_t(j) = [ Σ_{i=1}^{N} α_{t−1}(i) a_ij ] b_j(o_t),  2 ≤ t ≤ T, 1 ≤ j ≤ N

Termination:  P(O | λ) = Σ_{i=1}^{N} α_T(i)
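The three steps above translate directly into code. This is a sketch: the list-based layout of A, B, and pi is an assumption of mine, not the lecture's notation:

```python
def forward(A, B, pi, obs):
    """Forward algorithm: alpha[t][j] = P(o_1..o_t, q_t = s_j | model).
    A[i][j]: transition prob, B[j][o]: emission prob, pi[i]: initial dist."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for j in range(N):                      # initialization
        alpha[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):                   # induction
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * A[i][j]
                              for i in range(N)) * B[j][obs[t]]
    return sum(alpha[T-1][j] for j in range(N))   # termination: P(O | model)
```

On a toy two-state model this agrees with brute-force summation over all N^T paths, at O(N²T) cost instead of O(2T·N^T).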
Forward Algorithm Complexity

The naïve approach takes O(2T·N^T) computation.
The forward algorithm, using dynamic programming, takes O(N²T) computations.
Backward Probabilities

What is the probability that, given an HMM λ and given that the state at time t is i, the partial observation o_{t+1} … o_T is generated?
Analogous to the forward probability, just in the other direction:

  β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ)
Backward Probabilities

  β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j)

  β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ)
Backward Algorithm

Initialization:  β_T(i) = 1,  1 ≤ i ≤ N

Induction:  β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j),  t = T−1, …, 1,  1 ≤ i ≤ N

Termination:  P(O | λ) = Σ_{i=1}^{N} π_i b_i(o_1) β_1(i)
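A matching sketch of the backward pass, with the same assumed toy parameter layout as a forward-pass sketch would use:

```python
def backward(A, B, pi, obs):
    """Backward algorithm: beta[t][i] = P(o_{t+1}..o_T | q_t = s_i, model)."""
    N, T = len(pi), len(obs)
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):                      # initialization: beta_T(i) = 1
        beta[T-1][i] = 1.0
    for t in range(T - 2, -1, -1):          # induction, t = T-1 .. 1
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j]
                             for j in range(N))
    # termination: P(O | model) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(N))
```

The termination value must equal the forward algorithm's P(O | λ), which makes a convenient sanity check.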
Problem 2: Decoding

The forward algorithm gives the sum over all paths through an HMM efficiently.
Here we want to find the highest-probability path.
We want to find the state sequence Q = q_1 … q_T such that

  Q̂ = argmax_Q P(Q | O, λ)
Viterbi Algorithm

Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum:

Forward:            α_t(j) = [ Σ_{i=1}^{N} α_{t−1}(i) a_ij ] b_j(o_t)

Viterbi recursion:  δ_t(j) = [ max_{1≤i≤N} δ_{t−1}(i) a_ij ] b_j(o_t)
Core Idea of Viterbi Algorithm
Viterbi Algorithm

Initialization:  δ_1(i) = π_i b_i(o_1),  1 ≤ i ≤ N

Induction (2 ≤ t ≤ T, 1 ≤ j ≤ N):
  δ_t(j) = [ max_{1≤i≤N} δ_{t−1}(i) a_ij ] b_j(o_t)
  ψ_t(j) = argmax_{1≤i≤N} δ_{t−1}(i) a_ij

Termination:
  p* = max_{1≤i≤N} δ_T(i)
  q*_T = argmax_{1≤i≤N} δ_T(i)

Read out path:  q*_t = ψ_{t+1}(q*_{t+1}),  t = T−1, …, 1
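Put together, the four steps above give linear-time decoding. A sketch, again with my assumed list-based parameter layout:

```python
def viterbi(A, B, pi, obs):
    """Viterbi decoding: highest-probability state sequence for obs.
    Returns (best path probability, list of state indices)."""
    N, T = len(pi), len(obs)
    delta = [[0.0] * N for _ in range(T)]   # best path prob ending in state j
    psi = [[0] * N for _ in range(T)]       # backpointers
    for j in range(N):                      # initialization
        delta[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):                   # induction: max instead of sum
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t-1][i] * A[i][j])
            psi[t][j] = best_i
            delta[t][j] = delta[t-1][best_i] * A[best_i][j] * B[j][obs[t]]
    q = [max(range(N), key=lambda j: delta[T-1][j])]   # termination
    for t in range(T - 1, 0, -1):           # read out path via backpointers
        q.append(psi[t][q[-1]])
    q.reverse()
    return delta[T-1][q[-1]], q
```

On a toy model the returned path matches a brute-force maximum over all N^T sequences.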
Problem 3: Learning

Up to now we've assumed that we know the underlying model λ = (A, B, π).
Often these parameters are estimated on annotated training data, but:
• Annotation is often difficult and/or expensive.
• Training data is different from the current data.
We want to maximize the parameters with respect to the current data, i.e., we're looking for a model λ̂ such that

  λ̂ = argmax_λ P(O | λ)
Problem 3: Learning (If Time Allows…)

Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ̂ such that

  λ̂ = argmax_λ P(O | λ)

But it is possible to find a local maximum: given an initial model λ, we can always find a model λ̂ such that

  P(O | λ̂) ≥ P(O | λ)
Forward-Backward (Baum-Welch) Algorithm

Key idea: parameter re-estimation by hill-climbing.
From an arbitrary initial parameter instantiation λ₀, the F-B algorithm iteratively re-estimates the parameters, improving the probability that a given observation sequence was generated by λ.
Parameter Re-estimation

Three parameters need to be re-estimated:
• Initial state distribution: π_i
• Transition probabilities: a_ij
• Emission probabilities: b_i(o_t)
Re-estimating Transition Probabilities

What's the probability of being in state s_i at time t and going to state s_j, given the current model and parameters?

  ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)
Re-estimating Transition Probabilities

  ξ_t(i, j) = [ α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) ] / [ Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) ]

  ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)
Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is:

  â_ij = (expected number of transitions from state s_i to state s_j) / (expected number of transitions from state s_i)

Formally:

  â_ij = [ Σ_{t=1}^{T−1} ξ_t(i, j) ] / [ Σ_{t=1}^{T−1} Σ_{j'=1}^{N} ξ_t(i, j') ]
Re-estimating Transition Probabilities

Defining

  γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)

as the probability of being in state s_i given the complete observation O, we can say:

  â_ij = [ Σ_{t=1}^{T−1} ξ_t(i, j) ] / [ Σ_{t=1}^{T−1} γ_t(i) ]
Re-estimating Initial State Probabilities

The initial state distribution: π_i is the probability that s_i is a start state.
Re-estimation is easy:

  π̂_i = expected number of times in state s_i at time 1

Formally:  π̂_i = γ_1(i)
Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as:

  b̂_i(k) = (expected number of times in state s_i observing symbol v_k) / (expected number of times in state s_i)

Formally:

  b̂_i(k) = [ Σ_{t=1}^{T} δ(o_t, v_k) γ_t(i) ] / [ Σ_{t=1}^{T} γ_t(i) ]

where δ(o_t, v_k) = 1 if o_t = v_k and 0 otherwise. Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm.
The Updated Model

Coming from λ = (A, B, π), we get to λ̂ = (Â, B̂, π̂) by the following update rules:

  â_ij = [ Σ_{t=1}^{T−1} ξ_t(i, j) ] / [ Σ_{t=1}^{T−1} γ_t(i) ]

  b̂_i(k) = [ Σ_{t=1}^{T} δ(o_t, v_k) γ_t(i) ] / [ Σ_{t=1}^{T} γ_t(i) ]

  π̂_i = γ_1(i)
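The three update rules combine with the forward and backward passes into one re-estimation step. A sketch for a single observation sequence of integer symbols; the function name and data layout are my assumptions, not the lecture's notation:

```python
def baum_welch_step(A, B, pi, obs):
    """One Forward-Backward (Baum-Welch) re-estimation step.
    Returns updated (A, B, pi) for one observation sequence."""
    N, T, M = len(pi), len(obs), len(B[0])
    # Forward pass: alpha[t][j] = P(o_1..o_t, q_t = s_j)
    alpha = [[0.0] * N for _ in range(T)]
    for j in range(N):
        alpha[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * A[i][j]
                              for i in range(N)) * B[j][obs[t]]
    # Backward pass: beta[t][i] = P(o_{t+1}..o_T | q_t = s_i)
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):
        beta[T-1][i] = 1.0
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j]
                             for j in range(N))
    PO = sum(alpha[T-1][i] for i in range(N))       # P(O | model)
    # E-step: gamma_t(i) and xi_t(i, j)
    gamma = [[alpha[t][i] * beta[t][i] / PO for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t+1]] * beta[t+1][j] / PO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M-step: the three update rules
    pi_new = gamma[0][:]
    A_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    B_new = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return A_new, B_new, pi_new
```

Each step yields properly normalized distributions and, per the EM guarantee cited on the previous slides, never decreases P(O | λ).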
Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithm:
• The E-step: compute the forward and backward probabilities for a given model.
• The M-step: re-estimate the model parameters.
- Part of Speech Tagging & Hidden Markov Models
- NLP Task I – Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for?
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches: Idea 1
- Simple Statistical Approaches: Idea 2
- The Sparse Data Problem …
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator, an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1: Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2: Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3: Learning
- Problem 3: Learning (If Time Allows…)
- Forward-Backward (Baum-Welch) Algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
CIS 391 - Intro to AI 2
NLP Task I ndash Determining Part of Speech Tags
The Problem
Word POS listing in Brown Corpus
heat noun verb
oil noun
in prep noun adv
a det noun noun-proper
large adj noun adv
pot noun
CIS 391 - Intro to AI 3
NLP Task I ndash Determining Part of Speech Tags
The Old Solution Depth First search bull If each of n words has k tags on average try the
nk combinations until one works
Machine Learning Solutions Automatically learn Part of Speech (POS) assignmentbull The best techniques achieve 97+ accuracy per word on
new materials given large training corpora
CIS 391 - Intro to AI 4
What is POS tagging good for
Speech synthesisbull How to pronounce ldquoleadrdquobull INsult inSULTbull OBject obJECTbull OVERflow overFLOWbull DIScount disCOUNTbull CONtent conTENT
Stemming for information retrievalbull Knowing a word is a N tells you it gets pluralsbull Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo
Parsing and speech recognition and etcbull Possessive pronouns (my your her) followed by nounsbull Personal pronouns (I you he) likely to be followed by verbs
CIS 391 - Intro to AI 5
Equivalent Problem in Bioinformatics Durbin et al Biological Sequence
Analysis Cambridge University Press
Several applications eg proteins From primary structure
ATCPLELLLD Infer secondary structure
HHHBBBBBC
CIS 391 - Intro to AI 6
Penn Treebank Tagset I
Tag Description Example CC coordinating conjunction and CD cardinal number 1 third DT determiner the EX existential there there is FW foreign word dhoevre IN prepositionsubordinating conjunction in of like JJ adjective green JJR adjective comparative greener JJS adjective superlative greenest LS list marker 1) MD modal could will NN noun singular or mass table NNS noun plural tables NNP proper noun singular John NNPS proper noun plural Vikings
CIS 391 - Intro to AI 7
Tag Description Example PDT predeterminer both the boys POS possessive ending friend s PRP personal pronoun I me him he it PRP$ possessive pronoun my his RB adverb however usually here good RBR adverb comparative betterRBS adverb superlative best RP particle give up TO to to go to him UH interjection uhhuhhuhh
Penn Treebank Tagset II
CIS 391 - Intro to AI 8
Tag Description Example VB verb base form take VBD verb past tense took VBG verb gerundpresent participle taking VBN verb past participle taken VBP verb sing present non-3d take VBZ verb 3rd person sing present takes WDT wh-determiner which WP wh-pronoun who what WP$ possessive wh-pronoun whose WRB wh-abverb where when
Penn Treebank Tagset III
CIS 391 - Intro to AI 9
Simple Statistical Approaches Idea 1
CIS 391 - Intro to AI 10
Simple Statistical Approaches Idea 2
For a string of words
W = w1w2w3hellipwn
find the string of POS tags
T = t1 t2 t3 helliptn
which maximizes P(T|W)
bull ie the most likely POS tag ti for each word wi given its surrounding context
CIS 391 - Intro to AI 11
The Sparse Data Problem hellip
A Simple Impossible Approach to Compute P(T|W)
Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string
CIS 391 - Intro to AI 12
A BOTEC Estimate of What We Can Estimate
What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech
tags
Rich Models often require vast amounts of data Good estimates of models with bad assumptions often
outperform better models which are badly estimated
CIS 391 - Intro to AI 13
A Practical Statistical Tagger
CIS 391 - Intro to AI 14
A Practical Statistical Tagger II
But we cant accurately estimate more than tag bigrams or sohellip
Again we change to a model that we CAN estimate
CIS 391 - Intro to AI 15
A Practical Statistical Tagger III
So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes
CIS 391 - Intro to AI 16
Training and Performance
To estimate the parameters of this model given an annotated training corpus
Because many of these counts are small smoothing is necessary for best resultshellip
Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags
CIS 391 - Intro to AI 17
Hidden Markov Models
This model is an instance of a Hidden Markov Model Viewed graphically
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1P(w|Det)
a 4the 4
P(w|Adj)good 02low 04
P(w|Noun)price 001deal 0001
CIS 391 - Intro to AI 18
Viewed as a generator an HMM
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1
4the
4a
P(w|Det)
04low
02good
P(w|Adj)
0001deal
001price
P(w|Noun)
CIS 391 - Intro to AI 19
Recognition using an HMM
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done using an exponential search through all strings for T
However there is a linear timelinear time solution using dynamic programming called Viterbi decoding
CIS 391 - Intro to AI 21
Parameters of an HMM
States A set of states S=s1hellipsn
Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj
Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si
Initial state distribution is the probability that si is a start state
i
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model
Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model
how do we find the state sequence that best explains the observations
(AB )
(AB )
(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)
CIS 391 - Intro to AI 23
Problem 3 (Learning) How do we adjust the model parameters to maximize
The Three Basic HMM Problems
(AB )
P(O | )
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is The probability of a observation sequence is the
sum of the probabilities of all possible state sequences in the HMM
Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences
Even small HMMs eg T=10 and N=10 contain 10 billion different paths
Solution to this and problem 2 is to use dynamic programming
P(O | )
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 3
NLP Task I ndash Determining Part of Speech Tags
The Old Solution Depth First search bull If each of n words has k tags on average try the
nk combinations until one works
Machine Learning Solutions Automatically learn Part of Speech (POS) assignmentbull The best techniques achieve 97+ accuracy per word on
new materials given large training corpora
CIS 391 - Intro to AI 4
What is POS tagging good for
Speech synthesisbull How to pronounce ldquoleadrdquobull INsult inSULTbull OBject obJECTbull OVERflow overFLOWbull DIScount disCOUNTbull CONtent conTENT
Stemming for information retrievalbull Knowing a word is a N tells you it gets pluralsbull Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo
Parsing and speech recognition and etcbull Possessive pronouns (my your her) followed by nounsbull Personal pronouns (I you he) likely to be followed by verbs
CIS 391 - Intro to AI 5
Equivalent Problem in Bioinformatics Durbin et al Biological Sequence
Analysis Cambridge University Press
Several applications eg proteins From primary structure
ATCPLELLLD Infer secondary structure
HHHBBBBBC
CIS 391 - Intro to AI 6
Penn Treebank Tagset I
Tag Description Example CC coordinating conjunction and CD cardinal number 1 third DT determiner the EX existential there there is FW foreign word dhoevre IN prepositionsubordinating conjunction in of like JJ adjective green JJR adjective comparative greener JJS adjective superlative greenest LS list marker 1) MD modal could will NN noun singular or mass table NNS noun plural tables NNP proper noun singular John NNPS proper noun plural Vikings
CIS 391 - Intro to AI 7
Tag Description Example PDT predeterminer both the boys POS possessive ending friend s PRP personal pronoun I me him he it PRP$ possessive pronoun my his RB adverb however usually here good RBR adverb comparative betterRBS adverb superlative best RP particle give up TO to to go to him UH interjection uhhuhhuhh
Penn Treebank Tagset II
CIS 391 - Intro to AI 8
Tag Description Example VB verb base form take VBD verb past tense took VBG verb gerundpresent participle taking VBN verb past participle taken VBP verb sing present non-3d take VBZ verb 3rd person sing present takes WDT wh-determiner which WP wh-pronoun who what WP$ possessive wh-pronoun whose WRB wh-abverb where when
Penn Treebank Tagset III
CIS 391 - Intro to AI 9
Simple Statistical Approaches Idea 1
CIS 391 - Intro to AI 10
Simple Statistical Approaches Idea 2
For a string of words
W = w1w2w3hellipwn
find the string of POS tags
T = t1 t2 t3 helliptn
which maximizes P(T|W)
bull ie the most likely POS tag ti for each word wi given its surrounding context
CIS 391 - Intro to AI 11
The Sparse Data Problem hellip
A Simple Impossible Approach to Compute P(T|W)
Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string
CIS 391 - Intro to AI 12
A BOTEC Estimate of What We Can Estimate
What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech
tags
Rich Models often require vast amounts of data Good estimates of models with bad assumptions often
outperform better models which are badly estimated
CIS 391 - Intro to AI 13
A Practical Statistical Tagger
CIS 391 - Intro to AI 14
A Practical Statistical Tagger II
But we cant accurately estimate more than tag bigrams or sohellip
Again we change to a model that we CAN estimate
CIS 391 - Intro to AI 15
A Practical Statistical Tagger III
So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes
CIS 391 - Intro to AI 16
Training and Performance
To estimate the parameters of this model given an annotated training corpus
Because many of these counts are small smoothing is necessary for best resultshellip
Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags
CIS 391 - Intro to AI 17
Hidden Markov Models
This model is an instance of a Hidden Markov Model Viewed graphically
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1P(w|Det)
a 4the 4
P(w|Adj)good 02low 04
P(w|Noun)price 001deal 0001
CIS 391 - Intro to AI 18
Viewed as a generator an HMM
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1
4the
4a
P(w|Det)
04low
02good
P(w|Adj)
0001deal
001price
P(w|Noun)
CIS 391 - Intro to AI 19
Recognition using an HMM
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done using an exponential search through all strings for T
However there is a linear timelinear time solution using dynamic programming called Viterbi decoding
CIS 391 - Intro to AI 21
Parameters of an HMM
States A set of states S=s1hellipsn
Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj
Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si
Initial state distribution is the probability that si is a start state
i
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model
Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model
how do we find the state sequence that best explains the observations
(AB )
(AB )
(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)
CIS 391 - Intro to AI 23
Problem 3 (Learning) How do we adjust the model parameters to maximize
The Three Basic HMM Problems
(AB )
P(O | )
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is P(O \mid \lambda)? The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM
Naïve computation is very expensive: given T observations and N states, there are N^T possible state sequences
Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths
The solution, for this and Problem 2, is to use dynamic programming
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that, given an HMM \lambda, at time t the state is i and the partial observation o_1 \ldots o_t has been generated?
\alpha_t(i) = P(o_1 \ldots o_t, q_t = s_i \mid \lambda)
CIS 391 - Intro to AI 27
Forward Probabilities
\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)
\alpha_t(i) = P(o_1 \ldots o_t, q_t = s_i \mid \lambda)
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization: \alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N
Induction: \alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t), \quad 2 \le t \le T, \; 1 \le j \le N
Termination: P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)
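The three steps above can be sketched directly in pure Python. The function name and the toy conventions (observations as integer symbol indices, nested lists for A and B) are ours, not from the slides:

```python
def forward(pi, A, B, obs):
    """Forward algorithm: returns P(O | lambda).

    pi[i]   -- initial probability of state i
    A[i][j] -- transition probability from state i to state j
    B[i][k] -- probability that state i emits observation symbol k
    obs     -- observation sequence as a list of symbol indices
    """
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)
```

A quick sanity check: for a proper model, summing `forward` over every possible observation sequence of a fixed length gives 1.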
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naïve approach takes O(2T N^T) computation
Forward algorithm using dynamic programming takes O(N^2 T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that, given an HMM \lambda and given that the state at time t is i, the partial observation o_{t+1} \ldots o_T is generated?
Analogous to the forward probability, just in the other direction
\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i, \lambda)
CIS 391 - Intro to AI 31
Backward Probabilities
\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)
\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i, \lambda)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization: \beta_T(i) = 1, \quad 1 \le i \le N
Induction: \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1, \ldots, 1, \; 1 \le i \le N
Termination: P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)
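A matching sketch for the backward pass, under the same toy conventions as the forward sketch; its termination value must agree with the forward algorithm's P(O | lambda):

```python
def backward(pi, A, B, obs):
    """Backward algorithm: returns P(O | lambda) computed via beta values."""
    N, T = len(pi), len(obs)
    # Initialization: beta_T(i) = 1
    beta = [1.0] * N
    # Induction (t = T-1 down to 1):
    # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        beta = [sum(A[i][j] * B[j][obs[t + 1]] * beta[j] for j in range(N))
                for i in range(N)]
    # Termination: P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[i] * B[i][obs[0]] * beta[i] for i in range(N))
```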
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm efficiently sums over all paths through an HMM
Here we instead want the single highest-probability path
We want to find the state sequence Q = q_1 \ldots q_T such that
Q^* = \operatorname{argmax}_{Q} P(Q \mid O, \lambda)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum
Forward: \alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)
Viterbi recursion: \delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t)
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization: \delta_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N
Induction: \delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t), \quad \psi_t(j) = \operatorname{argmax}_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}, \quad 2 \le t \le T, \; 1 \le j \le N
Termination: p^* = \max_{1 \le i \le N} \delta_T(i), \quad q_T^* = \operatorname{argmax}_{1 \le i \le N} \delta_T(i)
Read out path: q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, \ldots, 1
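The four steps above can be sketched in Python, with δ and ψ as `delta` and `psi` and the same toy model conventions as the earlier sketches:

```python
def viterbi(pi, A, B, obs):
    """Viterbi decoding: returns (p*, best state sequence q*_1 .. q*_T)."""
    N = len(pi)
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    psi = []  # back-pointers: psi[t][j] = argmax_i delta_t(i) * a_ij
    # Induction: delta_t(j) = max_i(delta_{t-1}(i) * a_ij) * b_j(o_t)
    for o in obs[1:]:
        best = [max(range(N), key=lambda i: delta[i] * A[i][j])
                for j in range(N)]
        psi.append(best)
        delta = [delta[best[j]] * A[best[j]][j] * B[j][o] for j in range(N)]
    # Termination: q*_T = argmax_i delta_T(i), p* = max_i delta_T(i)
    q = max(range(N), key=lambda i: delta[i])
    p_star = delta[q]
    # Read out the path backwards: q*_t = psi_{t+1}(q*_{t+1})
    path = [q]
    for back in reversed(psi):
        q = back[q]
        path.append(q)
    path.reverse()
    return p_star, path
```

In practice taggers work with log probabilities (sums instead of products) to avoid underflow on long sequences; the sketch above keeps raw probabilities for readability.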
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now we've assumed that we know the underlying model \lambda = (A, B, \pi)
Often these parameters are estimated on annotated training data, but: annotation is often difficult and/or expensive, and training data is different from the current data
We want to maximize the parameters with respect to the current data, i.e., we're looking for a model \hat\lambda such that \hat\lambda = \operatorname{argmax}_{\lambda} P(O \mid \lambda)
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately, there is no known way to analytically find a global maximum, i.e., a model \hat\lambda such that \hat\lambda = \operatorname{argmax}_{\lambda} P(O \mid \lambda)
But it is possible to find a local maximum: given an initial model \lambda, we can always find a model \hat\lambda such that P(O \mid \hat\lambda) \ge P(O \mid \lambda)
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key idea: parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation \lambda, the FB algorithm iteratively re-estimates the parameters, improving the probability that a given observation sequence was generated by \lambda
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimated:
• Initial state distribution \pi_i
• Transition probabilities a_{ij}
• Emission probabilities b_i(o_t)
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
What's the probability of being in state s_i at time t and going to state s_j, given the current model and parameters?
\xi_t(i, j) = P(q_t = s_i, q_{t+1} = s_j \mid O, \lambda)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}
\xi_t(i, j) = P(q_t = s_i, q_{t+1} = s_j \mid O, \lambda)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is:
\hat a_{ij} = \frac{\text{expected number of transitions from state } s_i \text{ to state } s_j}{\text{expected number of transitions from state } s_i}
Formally:
\hat a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'=1}^{N} \xi_t(i, j')}
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)
as the probability of being in state s_i, given the complete observation O, we can say
\hat a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution: \pi_i is the probability that s_i is a start state
Re-estimation is easy: \hat\pi_i = expected number of times in state s_i at time 1
Formally: \hat\pi_i = \gamma_1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
\hat b_i(k) = \frac{\text{expected number of times in state } s_i \text{ observing symbol } v_k}{\text{expected number of times in state } s_i}
Formally:
\hat b_i(k) = \frac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}
where \delta(o_t, v_k) = 1 if o_t = v_k, and 0 otherwise. Note that \delta here is the Kronecker delta function and is not related to the \delta in the discussion of the Viterbi algorithm
CIS 391 - Intro to AI 47
The Updated Model
Coming from \lambda = (A, B, \pi), we get to \hat\lambda = (\hat A, \hat B, \hat\pi) by the following update rules:
\hat a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \qquad \hat b_i(k) = \frac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)} \qquad \hat\pi_i = \gamma_1(i)
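One full re-estimation pass can be sketched by combining the update rules above with the forward and backward tables. This is a minimal illustration (unscaled probabilities, a single observation sequence, names of our choosing), not production Baum-Welch:

```python
def baum_welch_step(pi, A, B, obs):
    """One Forward-Backward re-estimation step: lambda -> lambda-hat."""
    N, T, M = len(pi), len(obs), len(B[0])
    # Forward table alpha[t][i]
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    # Backward table beta[t][i]
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    p_obs = sum(alpha[T - 1])  # P(O | lambda)
    # xi_t(i,j); gamma_t(i) = alpha_t(i) * beta_t(i) / P(O | lambda),
    # which equals sum_j xi_t(i,j) for t < T
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)] for t in range(T)]
    # Update rules from the slide
    pi_hat = list(gamma[0])
    a_hat = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    b_hat = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return pi_hat, a_hat, b_hat
```

Iterating this step until P(O | lambda) stops improving yields the local maximum described two slides back.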
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithm:
• The E step: compute the forward and backward probabilities for a given model
• The M step: re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
CIS 391 - Intro to AI 7
Tag Description Example
PDT predeterminer both the boys
POS possessive ending friend's
PRP personal pronoun I, me, him, he, it
PRP$ possessive pronoun my, his
RB adverb however, usually, here, good
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to to go, to him
UH interjection uh, huh
Penn Treebank Tagset II
CIS 391 - Intro to AI 8
Tag Description Example
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, non-3rd person sing. present take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
Penn Treebank Tagset III
CIS 391 - Intro to AI 9
Simple Statistical Approaches Idea 1
CIS 391 - Intro to AI 10
Simple Statistical Approaches Idea 2
For a string of words
W = w_1 w_2 w_3 \ldots w_n
find the string of POS tags
T = t_1 t_2 t_3 \ldots t_n
which maximizes P(T \mid W)
• i.e., the most likely POS tag t_i for each word w_i, given its surrounding context
CIS 391 - Intro to AI 11
The Sparse Data Problem hellip
A simple, impossible approach to computing P(T|W):
Count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string
CIS 391 - Intro to AI 12
A BOTEC Estimate of What We Can Estimate
What parameters can we estimate with a million words of hand-tagged training data?
• Assume a uniform distribution of 5,000 words and 40 part-of-speech tags
Rich models often require vast amounts of data
Good estimates of models with bad assumptions often outperform better models which are badly estimated
CIS 391 - Intro to AI 13
A Practical Statistical Tagger
CIS 391 - Intro to AI 14
A Practical Statistical Tagger II
But we can't accurately estimate more than tag bigrams or so…
Again, we change to a model that we CAN estimate
CIS 391 - Intro to AI 15
A Practical Statistical Tagger III
So, for a given string W = w_1 w_2 w_3 \ldots w_n, the tagger needs to find the string of tags T which maximizes
CIS 391 - Intro to AI 16
Training and Performance
To estimate the parameters of this model, given an annotated training corpus:
Because many of these counts are small, smoothing is necessary for best results…
Such taggers typically achieve about 95–96% correct tagging for tag sets of 40–80 tags
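For instance, the counts can be turned into smoothed probability estimates roughly as follows. Add-k (Laplace-style) smoothing stands in here for whatever smoothing method the course actually used, and all names are illustrative:

```python
from collections import Counter

def estimate_hmm_params(tagged_sents, k=1.0):
    """Estimate smoothed tag-bigram and word-emission probabilities
    from a corpus of [(word, tag), ...] sentences, using add-k smoothing."""
    bigram = Counter()      # counts of (previous tag, tag)
    emit = Counter()        # counts of (tag, word)
    tag_count = Counter()   # counts of each tag
    prev_count = Counter()  # counts of each tag in "previous" position
    tags, vocab = set(), set()
    for sent in tagged_sents:
        prev = "<s>"        # sentence-start pseudo-tag
        for word, tag in sent:
            bigram[(prev, tag)] += 1
            prev_count[prev] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            tags.add(tag)
            vocab.add(word)
            prev = tag
    def p_trans(prev, tag):
        # P(t_i | t_{i-1}) with add-k smoothing over the tag set
        return (bigram[(prev, tag)] + k) / (prev_count[prev] + k * len(tags))
    def p_emit(tag, word):
        # P(w_i | t_i); the "+1" in the denominator reserves mass for unseen words
        return (emit[(tag, word)] + k) / (tag_count[tag] + k * (len(vocab) + 1))
    return p_trans, p_emit
```

These two conditional distributions are exactly the transition (A) and emission (B) parameters of the HMM described on the next slide.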
CIS 391 - Intro to AI 17
Hidden Markov Models
This model is an instance of a Hidden Markov Model. Viewed graphically:
[HMM diagram: states Det, Adj, Noun, Verb linked by transition probabilities (.3, .6, .02, .47, .7, .51, .1), with emission distributions P(w|Det): a .4, the .4; P(w|Adj): good .02, low .04; P(w|Noun): price .001, deal .0001]
CIS 391 - Intro to AI 18
Viewed as a generator an HMM
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1
4the
4a
P(w|Det)
04low
02good
P(w|Adj)
0001deal
001price
P(w|Noun)
CIS 391 - Intro to AI 19
Recognition using an HMM
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done using an exponential search through all strings for T
However there is a linear timelinear time solution using dynamic programming called Viterbi decoding
CIS 391 - Intro to AI 21
Parameters of an HMM
States A set of states S=s1hellipsn
Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj
Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si
Initial state distribution is the probability that si is a start state
i
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model
Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model
how do we find the state sequence that best explains the observations
(AB )
(AB )
(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)
CIS 391 - Intro to AI 23
Problem 3 (Learning) How do we adjust the model parameters to maximize
The Three Basic HMM Problems
(AB )
P(O | )
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is The probability of a observation sequence is the
sum of the probabilities of all possible state sequences in the HMM
Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences
Even small HMMs eg T=10 and N=10 contain 10 billion different paths
Solution to this and problem 2 is to use dynamic programming
P(O | )
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 5
Equivalent Problem in Bioinformatics Durbin et al Biological Sequence
Analysis Cambridge University Press
Several applications eg proteins From primary structure
ATCPLELLLD Infer secondary structure
HHHBBBBBC
CIS 391 - Intro to AI 6
Penn Treebank Tagset I
Tag Description Example CC coordinating conjunction and CD cardinal number 1 third DT determiner the EX existential there there is FW foreign word dhoevre IN prepositionsubordinating conjunction in of like JJ adjective green JJR adjective comparative greener JJS adjective superlative greenest LS list marker 1) MD modal could will NN noun singular or mass table NNS noun plural tables NNP proper noun singular John NNPS proper noun plural Vikings
CIS 391 - Intro to AI 7
Tag Description Example PDT predeterminer both the boys POS possessive ending friend s PRP personal pronoun I me him he it PRP$ possessive pronoun my his RB adverb however usually here good RBR adverb comparative betterRBS adverb superlative best RP particle give up TO to to go to him UH interjection uhhuhhuhh
Penn Treebank Tagset II
CIS 391 - Intro to AI 8
Tag Description Example VB verb base form take VBD verb past tense took VBG verb gerundpresent participle taking VBN verb past participle taken VBP verb sing present non-3d take VBZ verb 3rd person sing present takes WDT wh-determiner which WP wh-pronoun who what WP$ possessive wh-pronoun whose WRB wh-abverb where when
Penn Treebank Tagset III
CIS 391 - Intro to AI 9
Simple Statistical Approaches Idea 1
CIS 391 - Intro to AI 10
Simple Statistical Approaches Idea 2
For a string of words
W = w1w2w3hellipwn
find the string of POS tags
T = t1 t2 t3 helliptn
which maximizes P(T|W)
bull ie the most likely POS tag ti for each word wi given its surrounding context
CIS 391 - Intro to AI 11
The Sparse Data Problem hellip
A Simple Impossible Approach to Compute P(T|W)
Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string
CIS 391 - Intro to AI 12
A BOTEC Estimate of What We Can Estimate
What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech
tags
Rich Models often require vast amounts of data Good estimates of models with bad assumptions often
outperform better models which are badly estimated
CIS 391 - Intro to AI 13
A Practical Statistical Tagger
CIS 391 - Intro to AI 14
A Practical Statistical Tagger II
But we cant accurately estimate more than tag bigrams or sohellip
Again we change to a model that we CAN estimate
CIS 391 - Intro to AI 15
A Practical Statistical Tagger III
So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes
CIS 391 - Intro to AI 16
Training and Performance
To estimate the parameters of this model given an annotated training corpus
Because many of these counts are small smoothing is necessary for best resultshellip
Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags
CIS 391 - Intro to AI 17
Hidden Markov Models
This model is an instance of a Hidden Markov Model Viewed graphically
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1P(w|Det)
a 4the 4
P(w|Adj)good 02low 04
P(w|Noun)price 001deal 0001
CIS 391 - Intro to AI 18
Viewed as a generator an HMM
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1
4the
4a
P(w|Det)
04low
02good
P(w|Adj)
0001deal
001price
P(w|Noun)
CIS 391 - Intro to AI 19
Recognition using an HMM
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done using an exponential search through all strings for T
However there is a linear timelinear time solution using dynamic programming called Viterbi decoding
CIS 391 - Intro to AI 21
Parameters of an HMM
States A set of states S=s1hellipsn
Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj
Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si
Initial state distribution is the probability that si is a start state
i
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model
Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model
how do we find the state sequence that best explains the observations
(AB )
(AB )
(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)
CIS 391 - Intro to AI 23
Problem 3 (Learning) How do we adjust the model parameters to maximize
The Three Basic HMM Problems
(AB )
P(O | )
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is The probability of a observation sequence is the
sum of the probabilities of all possible state sequences in the HMM
Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences
Even small HMMs eg T=10 and N=10 contain 10 billion different paths
Solution to this and problem 2 is to use dynamic programming
P(O | )
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 6
Penn Treebank Tagset I
Tag Description Example CC coordinating conjunction and CD cardinal number 1 third DT determiner the EX existential there there is FW foreign word dhoevre IN prepositionsubordinating conjunction in of like JJ adjective green JJR adjective comparative greener JJS adjective superlative greenest LS list marker 1) MD modal could will NN noun singular or mass table NNS noun plural tables NNP proper noun singular John NNPS proper noun plural Vikings
CIS 391 - Intro to AI 7
Tag Description Example PDT predeterminer both the boys POS possessive ending friend s PRP personal pronoun I me him he it PRP$ possessive pronoun my his RB adverb however usually here good RBR adverb comparative betterRBS adverb superlative best RP particle give up TO to to go to him UH interjection uhhuhhuhh
Penn Treebank Tagset II
CIS 391 - Intro to AI 8
Tag Description Example VB verb base form take VBD verb past tense took VBG verb gerundpresent participle taking VBN verb past participle taken VBP verb sing present non-3d take VBZ verb 3rd person sing present takes WDT wh-determiner which WP wh-pronoun who what WP$ possessive wh-pronoun whose WRB wh-abverb where when
Penn Treebank Tagset III
CIS 391 - Intro to AI 9
Simple Statistical Approaches Idea 1
CIS 391 - Intro to AI 10
Simple Statistical Approaches Idea 2
For a string of words
W = w1w2w3hellipwn
find the string of POS tags
T = t1 t2 t3 helliptn
which maximizes P(T|W)
bull ie the most likely POS tag ti for each word wi given its surrounding context
CIS 391 - Intro to AI 11
The Sparse Data Problem hellip
A simple, impossible approach to compute P(T|W):
Count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string.
CIS 391 - Intro to AI 12
A BOTEC Estimate of What We Can Estimate
What parameters can we estimate with a million words of hand-tagged training data?
Assume a uniform distribution of 5,000 words and 40 part-of-speech tags.
Rich models often require vast amounts of data.
Good estimates of models with bad assumptions often outperform better models which are badly estimated.
CIS 391 - Intro to AI 13
A Practical Statistical Tagger
CIS 391 - Intro to AI 14
A Practical Statistical Tagger II
But we can't accurately estimate more than tag bigrams or so…
Again, we change to a model that we CAN estimate.
CIS 391 - Intro to AI 15
A Practical Statistical Tagger III
So, for a given string W = w1w2w3…wn, the tagger needs to find the string of tags T which maximizes:
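The equation itself did not survive extraction from the slide. Reconstructed here is the standard bigram-HMM tagging objective, which matches the surrounding discussion (Bayes' rule, then the Markov and word-independence assumptions):

```latex
\hat{T} \;=\; \operatorname*{argmax}_{T} P(T \mid W)
\;=\; \operatorname*{argmax}_{T} \frac{P(W \mid T)\,P(T)}{P(W)}
\;\approx\; \operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```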
CIS 391 - Intro to AI 16
Training and Performance
To estimate the parameters of this model given an annotated training corpus
Because many of these counts are small smoothing is necessary for best resultshellip
Such taggers typically achieve about 95-96% correct tagging for tag sets of 40-80 tags
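The estimation step described above can be sketched as relative-frequency counts over a hand-tagged corpus. The two-sentence corpus below and the add-k scheme are illustrative assumptions; the slides say only that smoothing is necessary, without naming a method.

```python
from collections import Counter

# Tiny hand-tagged corpus (hypothetical example sentences)
corpus = [
    [("the", "DT"), ("heat", "NN"), ("rose", "VBD")],
    [("the", "DT"), ("oil", "NN"), ("boiled", "VBD")],
]

tag_bigrams = Counter()     # counts of (previous tag, tag)
context_counts = Counter()  # counts of each previous-tag context
emissions = Counter()       # counts of (tag, word)
tag_counts = Counter()      # counts of each tag
for sent in corpus:
    prev = "<s>"  # sentence-start pseudo-tag
    for word, tag in sent:
        tag_bigrams[(prev, tag)] += 1
        context_counts[prev] += 1
        emissions[(tag, word)] += 1
        tag_counts[tag] += 1
        prev = tag

word_vocab = {w for (_, w) in emissions}

def p_trans(tag, prev, k=1.0):
    # P(tag | prev) with add-k smoothing over the tag vocabulary
    return (tag_bigrams[(prev, tag)] + k) / (context_counts[prev] + k * len(tag_counts))

def p_emit(word, tag, k=1.0):
    # P(word | tag) with add-k smoothing over the word vocabulary
    return (emissions[(tag, word)] + k) / (tag_counts[tag] + k * len(word_vocab))
```

With k = 0 these reduce to the raw maximum-likelihood estimates; k > 0 keeps unseen tag bigrams and word/tag pairs from getting zero probability.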
CIS 391 - Intro to AI 17
Hidden Markov Models
This model is an instance of a Hidden Markov Model Viewed graphically
[Figure: HMM transition diagram over the states Det, Adj, Noun, and Verb (transition probabilities on the arcs include .02, .3, .47, .51, .6, .7), with emission tables P(w|Det): a .4, the .4; P(w|Adj): good .02, low .04; P(w|Noun): price .001, deal .0001]
CIS 391 - Intro to AI 18
Viewed as a generator an HMM
[Figure: the same HMM drawn as a generator: states Det, Adj, Noun, and Verb with transition probabilities on the arcs, each state emitting words according to P(w|Det): the .4, a .4; P(w|Adj): low .04, good .02; P(w|Noun): deal .0001, price .001]
CIS 391 - Intro to AI 19
Recognition using an HMM
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done using an exponential search through all possible tag strings T.
However, there is a linear-time solution using dynamic programming, called Viterbi decoding.
CIS 391 - Intro to AI 21
Parameters of an HMM
States: a set of states S = {s1, …, sN}
Transition probabilities: A = {a11, a12, …, aNN}; each aij represents the probability of transitioning from state si to sj
Emission probabilities: a set B of functions of the form bi(ot), the probability of observation ot being emitted by state si
Initial state distribution: πi is the probability that si is a start state
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation): Given the observation sequence O = o1…oT and an HMM model λ = (A, B, π), how do we compute the probability of O given the model?
Problem 2 (Decoding): Given the observation sequence O = o1…oT and an HMM model λ = (A, B, π), how do we find the state sequence that best explains the observations?
(This and the following slides follow the classic formulation by Rabiner and Juang, as adapted by Manning and Schütze. Slides adapted from Dorr.)
CIS 391 - Intro to AI 23
The Three Basic HMM Problems
Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is P(O | λ)? The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
Naïve computation is very expensive: given T observations and N states, there are N^T possible state sequences.
Even small HMMs, e.g. T=10 and N=10, contain 10 billion different paths.
The solution, to this and to Problem 2, is to use dynamic programming.
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that, given an HMM λ, at time t the state is i and the partial observation o1 … ot has been generated?
αt(i) = P(o1 … ot, qt = si | λ)
CIS 391 - Intro to AI 27
Forward Probabilities
αt(j) = [ Σi=1..N αt−1(i) aij ] bj(ot)
αt(i) = P(o1 … ot, qt = si | λ)
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization:  α1(i) = πi bi(o1),   1 ≤ i ≤ N
Induction:  αt(j) = [ Σi=1..N αt−1(i) aij ] bj(ot),   2 ≤ t ≤ T, 1 ≤ j ≤ N
Termination:  P(O | λ) = Σi=1..N αT(i)
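The three steps above can be sketched directly in Python. The toy two-state HMM at the bottom is a hypothetical example, not from the slides.

```python
def forward(obs, N, pi, A, B):
    """Forward algorithm: return (P(O | lambda), alpha trellis).

    pi[i]  : initial probability of state i
    A[i][j]: transition probability from state i to state j
    B[i][o]: probability that state i emits observation o
    """
    T = len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    # Induction: alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t)
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha[T - 1][i] for i in range(N)), alpha

# Toy 2-state HMM (hypothetical numbers)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [{'a': 0.9, 'b': 0.1}, {'a': 0.2, 'b': 0.8}]
prob, _ = forward(['a', 'b'], 2, pi, A, B)
```

For this two-observation sequence the result agrees with brute-force summation over all N^T = 4 state sequences, which is the point of the complexity comparison on the next slide.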
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
The naïve approach takes O(2T·N^T) computation; the Forward algorithm, using dynamic programming,
takes O(N^2 T) computations.
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that, given an HMM λ and given that the state at time t is i, the partial observation ot+1 … oT is generated?
Analogous to the forward probability, just in the other direction:
βt(i) = P(ot+1 … oT | qt = si, λ)
CIS 391 - Intro to AI 31
Backward Probabilities
βt(i) = Σj=1..N aij bj(ot+1) βt+1(j)
βt(i) = P(ot+1 … oT | qt = si, λ)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization:  βT(i) = 1,   1 ≤ i ≤ N
Induction:  βt(i) = Σj=1..N aij bj(ot+1) βt+1(j),   t = T−1 … 1, 1 ≤ i ≤ N
Termination:  P(O | λ) = Σi=1..N πi bi(o1) β1(i)
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm efficiently sums the probabilities of all paths through an HMM.
Here we instead want to find the single highest-probability path.
We want to find the state sequence Q=q1hellipqT such that
Q* = argmaxQ P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward:  αt(j) = [ Σi=1..N αt−1(i) aij ] bj(ot)
Viterbi recursion:  δt(j) = max1≤i≤N [ δt−1(i) aij ] bj(ot)
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization:  δ1(i) = πi bi(o1),   1 ≤ i ≤ N
Induction:  δt(j) = max1≤i≤N [ δt−1(i) aij ] bj(ot)
            ψt(j) = argmax1≤i≤N [ δt−1(i) aij ],   2 ≤ t ≤ T, 1 ≤ j ≤ N
Termination:  p* = max1≤i≤N δT(i)
              qT* = argmax1≤i≤N δT(i)
Read out path:  qt* = ψt+1(qt+1*),   t = T−1, …, 1
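The algorithm above, with its backpointers ψ and path read-out, can be sketched as follows; the toy two-state HMM is a hypothetical example, not from the slides.

```python
def viterbi(obs, N, pi, A, B):
    """Return (most likely state path, its probability) for obs."""
    T = len(obs)
    delta = [[0.0] * N for _ in range(T)]  # best path probability ending in state j at time t
    psi = [[0] * N for _ in range(T)]      # backpointers
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    for i in range(N):
        delta[0][i] = pi[i] * B[i][obs[0]]
    # Induction: max over predecessors instead of the forward algorithm's sum
    for t in range(1, T):
        for j in range(N):
            best = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            psi[t][j] = best
            delta[t][j] = delta[t - 1][best] * A[best][j] * B[j][obs[t]]
    # Termination, then read out the path via the backpointers
    q = max(range(N), key=lambda i: delta[T - 1][i])
    best_prob = delta[T - 1][q]
    path = [q]
    for t in range(T - 1, 0, -1):
        q = psi[t][q]
        path.append(q)
    path.reverse()
    return path, best_prob

# Toy 2-state HMM (hypothetical numbers)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [{'a': 0.9, 'b': 0.1}, {'a': 0.2, 'b': 0.8}]
path, best_prob = viterbi(['a', 'b'], 2, pi, A, B)
```

Note the only change from the forward sketch is replacing the sum over predecessors with a max, plus recording which predecessor achieved it.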
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now we've assumed that we know the underlying model λ = (A, B, π).
Often these parameters are estimated on annotated training data, but:
bull Annotation is often difficult and/or expensive
bull Training data is often different from the current data
We want to maximize the parameters with respect to the current data, i.e., we're looking for a model λ̂ such that λ̂ = argmaxλ P(O | λ)
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ̂ such that λ̂ = argmaxλ P(O | λ).
But it is possible to find a local maximum:
Given an initial model λ, we can always find a model λ̂ such that P(O | λ̂) ≥ P(O | λ).
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation, the F-B algorithm iteratively re-estimates the parameters, improving the probability that a given observation sequence was generated by the model
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimated:
bull Initial state distribution: πi
bull Transition probabilities: aij
bull Emission probabilities: bi(ot)
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
What's the probability of being in state si at time t and going to state sj at time t+1, given the current model and parameters?
ξt(i, j) = P(qt = si, qt+1 = sj | O, λ)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / [ Σi=1..N Σj=1..N αt(i) aij bj(ot+1) βt+1(j) ]
ξt(i, j) = P(qt = si, qt+1 = sj | O, λ)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is:
âij = (expected number of transitions from state si to state sj) / (expected number of transitions from state si)
Formally:
âij = Σt=1..T−1 ξt(i, j) / Σt=1..T−1 Σk=1..N ξt(i, k)
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining γt(i) = Σj=1..N ξt(i, j) as the probability of being in state si at time t, given the complete observation O, we can say:
âij = Σt=1..T−1 ξt(i, j) / Σt=1..T−1 γt(i)
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution: πi is the probability that si is a start state.
Re-estimation is easy:
π̂i = expected number of times in state si at time 1
Formally:  π̂i = γ1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as:
b̂i(k) = (expected number of times in state si observing symbol vk) / (expected number of times in state si)
Formally:
b̂i(k) = Σt=1..T δ(ot, vk) γt(i) / Σt=1..T γt(i)
where δ(ot, vk) = 1 if ot = vk, and 0 otherwise.
Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm.
CIS 391 - Intro to AI 47
The Updated Model
Coming from λ = (A, B, π), we get to λ̂ = (Â, B̂, π̂) by the following update rules:
âij = Σt=1..T−1 ξt(i, j) / Σt=1..T−1 γt(i)
b̂i(k) = Σt=1..T δ(ot, vk) γt(i) / Σt=1..T γt(i)
π̂i = γ1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithm:
bull The E step: compute the forward and backward probabilities for a given model
bull The M step: re-estimate the model parameters
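One full E step plus M step, as described above, can be sketched for a single observation sequence. The toy model values are assumptions for illustration; a real implementation would iterate to convergence and use scaling or log-space arithmetic to avoid underflow on long sequences.

```python
def baum_welch_step(obs, N, pi, A, B):
    """One Baum-Welch (EM) re-estimation step; returns (pi_hat, A_hat, B_hat)."""
    T = len(obs)
    # E step: forward (alpha) and backward (beta) trellises
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
        beta[T - 1][i] = 1.0
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    prob_O = sum(alpha[T - 1][i] for i in range(N))  # P(O | lambda)
    # gamma_t(i): probability of being in state i at time t, given O
    gamma = [[alpha[t][i] * beta[t][i] / prob_O for i in range(N)] for t in range(T)]
    # xi_t(i, j): probability of being in i at t and j at t+1, given O
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / prob_O
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M step: re-estimate pi, A, B from the expected counts
    pi_hat = gamma[0][:]
    A_hat = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    vocab = set(obs)
    B_hat = [{v: sum(gamma[t][i] for t in range(T) if obs[t] == v) /
                 sum(gamma[t][i] for t in range(T))
              for v in vocab} for i in range(N)]
    return pi_hat, A_hat, B_hat

# One step on a toy 2-state model (hypothetical numbers)
pi0 = [0.5, 0.5]
A0 = [[0.6, 0.4], [0.3, 0.7]]
B0 = [{'a': 0.7, 'b': 0.3}, {'a': 0.1, 'b': 0.9}]
pi1, A1, B1 = baum_welch_step(['a', 'b', 'b', 'a'], 2, pi0, A0, B0)
```

The M-step lines implement exactly the three update rules from the preceding slides: π̂i = γ1(i), âij from ξ and γ, and b̂i(k) from the γ mass on time steps where vk was observed.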
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 7
Tag Description Example PDT predeterminer both the boys POS possessive ending friend s PRP personal pronoun I me him he it PRP$ possessive pronoun my his RB adverb however usually here good RBR adverb comparative betterRBS adverb superlative best RP particle give up TO to to go to him UH interjection uhhuhhuhh
Penn Treebank Tagset II
CIS 391 - Intro to AI 8
Tag Description Example VB verb base form take VBD verb past tense took VBG verb gerundpresent participle taking VBN verb past participle taken VBP verb sing present non-3d take VBZ verb 3rd person sing present takes WDT wh-determiner which WP wh-pronoun who what WP$ possessive wh-pronoun whose WRB wh-abverb where when
Penn Treebank Tagset III
CIS 391 - Intro to AI 9
Simple Statistical Approaches Idea 1
CIS 391 - Intro to AI 10
Simple Statistical Approaches Idea 2
For a string of words
W = w1w2w3hellipwn
find the string of POS tags
T = t1 t2 t3 helliptn
which maximizes P(T|W)
bull ie the most likely POS tag ti for each word wi given its surrounding context
CIS 391 - Intro to AI 11
The Sparse Data Problem hellip
A Simple Impossible Approach to Compute P(T|W)
Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string
CIS 391 - Intro to AI 12
A BOTEC Estimate of What We Can Estimate
What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech
tags
Rich Models often require vast amounts of data Good estimates of models with bad assumptions often
outperform better models which are badly estimated
CIS 391 - Intro to AI 13
A Practical Statistical Tagger
CIS 391 - Intro to AI 14
A Practical Statistical Tagger II
But we cant accurately estimate more than tag bigrams or sohellip
Again we change to a model that we CAN estimate
CIS 391 - Intro to AI 15
A Practical Statistical Tagger III
So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes
CIS 391 - Intro to AI 16
Training and Performance
To estimate the parameters of this model given an annotated training corpus
Because many of these counts are small smoothing is necessary for best resultshellip
Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags
CIS 391 - Intro to AI 17
Hidden Markov Models
This model is an instance of a Hidden Markov Model Viewed graphically
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1P(w|Det)
a 4the 4
P(w|Adj)good 02low 04
P(w|Noun)price 001deal 0001
CIS 391 - Intro to AI 18
Viewed as a generator an HMM
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1
4the
4a
P(w|Det)
04low
02good
P(w|Adj)
0001deal
001price
P(w|Noun)
CIS 391 - Intro to AI 19
Recognition using an HMM
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done using an exponential search through all strings for T
However there is a linear timelinear time solution using dynamic programming called Viterbi decoding
CIS 391 - Intro to AI 21
Parameters of an HMM
States A set of states S=s1hellipsn
Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj
Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si
Initial state distribution is the probability that si is a start state
i
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model
Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model
how do we find the state sequence that best explains the observations
(AB )
(AB )
(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)
CIS 391 - Intro to AI 23
Problem 3 (Learning) How do we adjust the model parameters to maximize
The Three Basic HMM Problems
(AB )
P(O | )
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is The probability of a observation sequence is the
sum of the probabilities of all possible state sequences in the HMM
Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences
Even small HMMs eg T=10 and N=10 contain 10 billion different paths
Solution to this and problem 2 is to use dynamic programming
P(O | )
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 8
Tag Description Example VB verb base form take VBD verb past tense took VBG verb gerundpresent participle taking VBN verb past participle taken VBP verb sing present non-3d take VBZ verb 3rd person sing present takes WDT wh-determiner which WP wh-pronoun who what WP$ possessive wh-pronoun whose WRB wh-abverb where when
Penn Treebank Tagset III
CIS 391 - Intro to AI 9
Simple Statistical Approaches Idea 1
CIS 391 - Intro to AI 10
Simple Statistical Approaches Idea 2
For a string of words
W = w1w2w3hellipwn
find the string of POS tags
T = t1 t2 t3 helliptn
which maximizes P(T|W)
bull ie the most likely POS tag ti for each word wi given its surrounding context
CIS 391 - Intro to AI 11
The Sparse Data Problem hellip
A Simple Impossible Approach to Compute P(T|W)
Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string
CIS 391 - Intro to AI 12
A BOTEC Estimate of What We Can Estimate
What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech
tags
Rich Models often require vast amounts of data Good estimates of models with bad assumptions often
outperform better models which are badly estimated
CIS 391 - Intro to AI 13
A Practical Statistical Tagger
CIS 391 - Intro to AI 14
A Practical Statistical Tagger II
But we cant accurately estimate more than tag bigrams or sohellip
Again we change to a model that we CAN estimate
CIS 391 - Intro to AI 15
A Practical Statistical Tagger III
So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes
CIS 391 - Intro to AI 16
Training and Performance
To estimate the parameters of this model given an annotated training corpus
Because many of these counts are small smoothing is necessary for best resultshellip
Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags
CIS 391 - Intro to AI 17
Hidden Markov Models
This model is an instance of a Hidden Markov Model Viewed graphically
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1P(w|Det)
a 4the 4
P(w|Adj)good 02low 04
P(w|Noun)price 001deal 0001
CIS 391 - Intro to AI 18
Viewed as a generator an HMM
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1
4the
4a
P(w|Det)
04low
02good
P(w|Adj)
0001deal
001price
P(w|Noun)
CIS 391 - Intro to AI 19
Recognition using an HMM
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done using an exponential search through all strings for T
However there is a linear timelinear time solution using dynamic programming called Viterbi decoding
CIS 391 - Intro to AI 21
Parameters of an HMM
States A set of states S=s1hellipsn
Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj
Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si
Initial state distribution is the probability that si is a start state
i
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model
Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model
how do we find the state sequence that best explains the observations
(AB )
(AB )
(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)
CIS 391 - Intro to AI 23
Problem 3 (Learning) How do we adjust the model parameters to maximize
The Three Basic HMM Problems
(AB )
P(O | )
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is The probability of a observation sequence is the
sum of the probabilities of all possible state sequences in the HMM
Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences
Even small HMMs eg T=10 and N=10 contain 10 billion different paths
Solution to this and problem 2 is to use dynamic programming
P(O | )
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 9
Simple Statistical Approaches Idea 1
CIS 391 - Intro to AI 10
Simple Statistical Approaches Idea 2
For a string of words
W = w1w2w3hellipwn
find the string of POS tags
T = t1 t2 t3 helliptn
which maximizes P(T|W)
bull ie the most likely POS tag ti for each word wi given its surrounding context
CIS 391 - Intro to AI 11
The Sparse Data Problem hellip
A Simple Impossible Approach to Compute P(T|W)
Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string
CIS 391 - Intro to AI 12
A BOTEC Estimate of What We Can Estimate
What parameters can we estimate with a million words of hand-tagged training data?
• Assume a uniform distribution of 5,000 words and 40 part-of-speech tags.
Rich models often require vast amounts of data. Good estimates of models with bad assumptions often outperform better models which are badly estimated.
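The arithmetic behind this back-of-the-envelope estimate can be made concrete. The sketch below uses the slide's assumptions (5,000 word types, 40 tags, one million tokens); the tag-trigram row is an extra illustration of my own, not something the slide claims:

```python
# Back-of-the-envelope parameter counts vs. 1M training tokens.
# Assumptions from the slide: 5,000 word types, 40 POS tags.
WORDS, TAGS, TOKENS = 5_000, 40, 1_000_000

models = {
    "tag bigrams P(t_i|t_{i-1})": TAGS ** 2,           # 40^2 parameters
    "tag trigrams P(t_i|t_{i-2},t_{i-1})": TAGS ** 3,  # 40^3 parameters
    "emissions P(w_i|t_i)": WORDS * TAGS,              # 5000 * 40 parameters
}

for name, n in models.items():
    # Tokens per parameter is a rough proxy for how well we can estimate it.
    print(f"{name}: {n:,} parameters, ~{TOKENS / n:.1f} tokens each")
```

Bigrams get hundreds of tokens per parameter; emissions get only five, which is why smoothing matters so much below.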
CIS 391 - Intro to AI 13
A Practical Statistical Tagger
CIS 391 - Intro to AI 14
A Practical Statistical Tagger II
But we can't accurately estimate more than tag bigrams or so…
Again, we change to a model that we CAN estimate:
CIS 391 - Intro to AI 15
A Practical Statistical Tagger III
So, for a given string W = w1 w2 w3 … wn, the tagger needs to find the string of tags T which maximizes P(T) P(W|T), i.e., under the bigram tag model:
T* = argmaxT Πi P(ti | ti−1) P(wi | ti)
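As a sketch of what this maximization scores, the probability of one candidate tag string under a bigram tag model can be computed in log space. The tiny probability tables and the "<s>" start-of-sentence tag below are invented for illustration, not taken from any corpus:

```python
import math

# Score a candidate tag string T for word string W as P(T) * P(W|T)
# under a bigram tag model, in log space to avoid underflow.
# All probabilities are toy numbers; "<s>" is a hypothetical start tag.
trans = {("<s>", "DT"): 0.5, ("DT", "JJ"): 0.3, ("JJ", "NN"): 0.7}
emit = {("DT", "the"): 0.4, ("JJ", "large"): 0.02, ("NN", "pot"): 0.001}

def log_score(words, tags):
    # log prod_i P(t_i | t_{i-1}) * P(w_i | t_i)
    total, prev = 0.0, "<s>"
    for w, t in zip(words, tags):
        total += math.log(trans[(prev, t)]) + math.log(emit[(t, w)])
        prev = t
    return total

print(log_score(["the", "large", "pot"], ["DT", "JJ", "NN"]))
```

A real tagger would compute this score for the best tag string without enumerating candidates, which is exactly what Viterbi decoding (later slides) does.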
CIS 391 - Intro to AI 16
Training and Performance
To estimate the parameters of this model, given an annotated training corpus, use relative-frequency counts of tag bigrams and word/tag pairs.
Because many of these counts are small, smoothing is necessary for best results…
Such taggers typically achieve about 95–96% correct tagging for tag sets of 40–80 tags.
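A minimal sketch of this count-based estimation, using a toy six-token "corpus" and add-one smoothing as one simple smoothing choice (the slide does not commit to a particular smoothing method):

```python
from collections import Counter

# Relative-frequency estimation with add-one smoothing.
# The six-token tagged "corpus" is invented; a real tagger would train
# on a large annotated corpus such as the Penn Treebank.
tagged = [("the", "DT"), ("heat", "NN"), ("rises", "VBZ"),
          ("the", "DT"), ("pot", "NN"), ("boils", "VBZ")]

tags = [t for _, t in tagged]
tag_counts = Counter(tags)                 # count(t)
bigram_counts = Counter(zip(tags, tags[1:]))  # count(t1, t2)
emit_counts = Counter(tagged)              # count(w, t)
tagset = sorted(tag_counts)
vocab = {w for w, _ in tagged}

def p_trans(t1, t2):
    # P(t2 | t1), add-one smoothed over the tag set
    return (bigram_counts[(t1, t2)] + 1) / (tag_counts[t1] + len(tagset))

def p_emit(w, t):
    # P(w | t), add-one smoothed over the vocabulary
    return (emit_counts[(w, t)] + 1) / (tag_counts[t] + len(vocab))

print(p_trans("DT", "NN"), p_emit("pot", "NN"))
```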
CIS 391 - Intro to AI 17
Hidden Markov Models
This model is an instance of a Hidden Markov Model. Viewed graphically:
[Figure: a state-transition diagram over the states Det, Adj, Noun, and Verb, with transition probabilities on the arcs and an emission table at each state, e.g. P(w|Det): a .4, the .4; P(w|Adj): good .02, low .04; P(w|Noun): price .001, deal .0001]
CIS 391 - Intro to AI 18
Viewed as a generator, an HMM
[Figure: the same Det/Adj/Noun/Verb state diagram read as a generator: walk from state to state according to the transition probabilities, emitting a word at each state according to its emission table P(w|Det), P(w|Adj), P(w|Noun)]
CIS 391 - Intro to AI 19
Recognition using an HMM
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done using an exponential search through all strings for T.
However, there is a linear-time solution using dynamic programming called Viterbi decoding.
CIS 391 - Intro to AI 21
Parameters of an HMM
States: a set of states S = {s1, …, sN}
Transition probabilities: A = {a11, a12, …, aNN}; each aij represents the probability of transitioning from state si to sj
Emission probabilities: a set B of functions of the form bi(ot), the probability of observation ot being emitted by si
Initial state distribution: πi is the probability that si is a start state
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation): Given the observation sequence O = o1…oT and an HMM model μ = (A, B, π), how do we compute the probability of O given the model?
Problem 2 (Decoding): Given the observation sequence O = o1…oT and an HMM model μ = (A, B, π), how do we find the state sequence that best explains the observations?
(This and following slides follow the classic formulation by Rabiner and Juang, as adapted by Manning and Schütze. Slides adapted from Dorr.)
CIS 391 - Intro to AI 23
The Three Basic HMM Problems
Problem 3 (Learning): How do we adjust the model parameters μ = (A, B, π) to maximize P(O | μ)?
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is P(O | μ)? The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
Naïve computation is very expensive: given T observations and N states, there are N^T possible state sequences.
Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.
The solution to this problem and Problem 2 is to use dynamic programming.
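The naïve computation can be written down directly, which also makes the N^T blow-up visible. The two-state model below (states "H"/"C", observations "1"/"3") and all its probabilities are invented for illustration:

```python
import itertools

# Enumerate all N**T state sequences Q of a toy 2-state HMM and sum
# P(Q, O | model): exactly the naive computation described above.
A = {("H", "H"): 0.7, ("H", "C"): 0.3, ("C", "H"): 0.4, ("C", "C"): 0.6}
B = {("H", "1"): 0.2, ("H", "3"): 0.8, ("C", "1"): 0.9, ("C", "3"): 0.1}
pi = {"H": 0.5, "C": 0.5}
obs = ["3", "1", "3"]

total, n_paths = 0.0, 0
for path in itertools.product("HC", repeat=len(obs)):  # N**T = 2**3 paths
    p = pi[path[0]] * B[(path[0], obs[0])]
    for t in range(1, len(obs)):
        p *= A[(path[t - 1], path[t])] * B[(path[t], obs[t])]
    total += p
    n_paths += 1

print(n_paths, total)
```

With N = 10 and T = 10 this loop would already run ten billion times, which is the motivation for the forward algorithm.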
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that, given an HMM μ, at time t the state is i and the partial observation o1…ot has been generated?
αt(i) = P(o1…ot, qt = si | μ)
CIS 391 - Intro to AI 27
Forward Probabilities
αt(i) = P(o1…ot, qt = si | μ)
αt(j) = [Σi=1..N αt−1(i) aij] bj(ot)
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization: α1(i) = πi bi(o1), 1 ≤ i ≤ N
Induction: αt(j) = [Σi=1..N αt−1(i) aij] bj(ot), 2 ≤ t ≤ T, 1 ≤ j ≤ N
Termination: P(O | μ) = Σi=1..N αT(i)
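The three steps (initialization, induction, termination) translate almost line-for-line into code. Here is a sketch on an invented two-state model:

```python
# Forward algorithm on a toy 2-state HMM ("H"/"C"); all numbers invented.
states = ["H", "C"]
A = {("H", "H"): 0.7, ("H", "C"): 0.3, ("C", "H"): 0.4, ("C", "C"): 0.6}
B = {("H", "1"): 0.2, ("H", "3"): 0.8, ("C", "1"): 0.9, ("C", "3"): 0.1}
pi = {"H": 0.5, "C": 0.5}

def forward(obs):
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = {s: pi[s] * B[(s, obs[0])] for s in states}
    # Induction: alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t)
    for o in obs[1:]:
        alpha = {j: sum(alpha[i] * A[(i, j)] for i in states) * B[(j, o)]
                 for j in states}
    # Termination: P(O | model) = sum_i alpha_T(i)
    return sum(alpha.values())

print(forward(["3", "1", "3"]))
```

Only the previous time step's α values are needed, so the sketch keeps a single dictionary rather than the full trellis.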
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naïve approach takes O(2T·N^T) computation.
The forward algorithm, using dynamic programming, takes O(N²T) computations.
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that, given an HMM μ and given that the state at time t is i, the partial observation ot+1…oT is generated?
Analogous to the forward probability, just in the other direction:
βt(i) = P(ot+1…oT | qt = si, μ)
CIS 391 - Intro to AI 31
Backward Probabilities
βt(i) = Σj=1..N aij bj(ot+1) βt+1(j), where βt(i) = P(ot+1…oT | qt = si, μ)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization: βT(i) = 1, 1 ≤ i ≤ N
Induction: βt(i) = Σj=1..N aij bj(ot+1) βt+1(j), t = T−1…1, 1 ≤ i ≤ N
Termination: P(O | μ) = Σi=1..N πi bi(o1) β1(i)
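A matching sketch of the backward pass on the same kind of invented two-state model; note that the termination step recovers the same P(O | μ) as the forward algorithm:

```python
# Backward algorithm on a toy 2-state HMM; all numbers invented.
states = ["H", "C"]
A = {("H", "H"): 0.7, ("H", "C"): 0.3, ("C", "H"): 0.4, ("C", "C"): 0.6}
B = {("H", "1"): 0.2, ("H", "3"): 0.8, ("C", "1"): 0.9, ("C", "3"): 0.1}
pi = {"H": 0.5, "C": 0.5}

def backward(obs):
    # Initialization: beta_T(i) = 1
    beta = {s: 1.0 for s in states}
    # Induction: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(len(obs) - 2, -1, -1):
        beta = {i: sum(A[(i, j)] * B[(j, obs[t + 1])] * beta[j]
                       for j in states) for i in states}
    # Termination: P(O | model) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[i] * B[(i, obs[0])] * beta[i] for i in states)

print(backward(["3", "1", "3"]))
```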
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm efficiently gives the sum over all paths through an HMM.
Here we want to find the highest-probability path.
We want to find the state sequence Q = q1…qT such that
Q* = argmaxQ P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum.
Forward: αt(j) = [Σi=1..N αt−1(i) aij] bj(ot)
Viterbi recursion: δt(j) = max1≤i≤N [δt−1(i) aij] bj(ot)
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization: δ1(i) = πi bi(o1), 1 ≤ i ≤ N
Induction: δt(j) = max1≤i≤N [δt−1(i) aij] bj(ot); ψt(j) = argmax1≤i≤N [δt−1(i) aij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
Termination: p* = max1≤i≤N δT(i); qT* = argmax1≤i≤N δT(i)
Read out path: qt* = ψt+1(qt+1*), t = T−1, …, 1
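Putting the four steps together, here is a sketch with explicit backpointers (ψ) on an invented two-state model:

```python
# Viterbi decoding with backpointers (psi) on a toy 2-state HMM;
# all probabilities are invented.
states = ["H", "C"]
A = {("H", "H"): 0.7, ("H", "C"): 0.3, ("C", "H"): 0.4, ("C", "C"): 0.6}
B = {("H", "1"): 0.2, ("H", "3"): 0.8, ("C", "1"): 0.9, ("C", "3"): 0.1}
pi = {"H": 0.5, "C": 0.5}

def viterbi(obs):
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    delta = [{s: pi[s] * B[(s, obs[0])] for s in states}]
    psi = [{}]
    # Induction: delta_t(j) = max_i [delta_{t-1}(i) a_ij] * b_j(o_t)
    for o in obs[1:]:
        d, bp = {}, {}
        for j in states:
            best = max(states, key=lambda i: delta[-1][i] * A[(i, j)])
            bp[j] = best  # psi_t(j): best predecessor of j
            d[j] = delta[-1][best] * A[(best, j)] * B[(j, o)]
        delta.append(d)
        psi.append(bp)
    # Termination: p* = max_i delta_T(i); q_T* = argmax_i delta_T(i)
    q = max(states, key=lambda s: delta[-1][s])
    best_p = delta[-1][q]
    # Read out path: q_t* = psi_{t+1}(q_{t+1}*)
    path = [q]
    for t in range(len(obs) - 1, 0, -1):
        q = psi[t][q]
        path.append(q)
    return path[::-1], best_p

print(viterbi(["3", "1", "3"]))
```

The structure mirrors the forward algorithm exactly, with max in place of sum plus a table of backpointers for the path readout.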
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now we've assumed that we know the underlying model μ = (A, B, π).
Often these parameters are estimated on annotated training data, but:
• Annotation is often difficult and/or expensive.
• Training data is often different from the current data.
We want to maximize the parameters with respect to the current data, i.e., we're looking for a model μ̂ such that μ̂ = argmaxμ P(O | μ)
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately, there is no known way to analytically find a global maximum, i.e., a model μ̂ such that μ̂ = argmaxμ P(O | μ).
But it is possible to find a local maximum: given an initial model μ, we can always find a model μ̂ such that P(O | μ̂) ≥ P(O | μ).
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key idea: parameter re-estimation by hill-climbing.
From an arbitrary initial parameter instantiation μ0, the forward-backward (FB) algorithm iteratively re-estimates the parameters, improving the probability that a given observation sequence O was generated by the model.
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimated:
• Initial state distribution: πi
• Transition probabilities: aij
• Emission probabilities: bi(ot)
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
What's the probability of being in state si at time t and going to state sj, given the current model and parameters?
ξt(i, j) = P(qt = si, qt+1 = sj | O, μ)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
ξt(i, j) = P(qt = si, qt+1 = sj | O, μ)
ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / [Σi=1..N Σj=1..N αt(i) aij bj(ot+1) βt+1(j)]
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is:
âij = (expected number of transitions from state si to state sj) / (expected number of transitions from state si)
Formally:
âij = Σt=1..T−1 ξt(i, j) / Σt=1..T−1 Σj=1..N ξt(i, j)
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
γt(i) = Σj=1..N ξt(i, j)
as the probability of being in state si at time t, given the complete observation O, we can say:
âij = Σt=1..T−1 ξt(i, j) / Σt=1..T−1 γt(i)
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution: πi is the probability that si is a start state.
Re-estimation is easy:
π̂i = expected number of times in state si at time 1
Formally: π̂i = γ1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as:
b̂i(k) = (expected number of times in state si observing symbol vk) / (expected number of times in state si)
Formally:
b̂i(k) = Σt=1..T δ(ot, vk) γt(i) / Σt=1..T γt(i)
where δ(ot, vk) = 1 if ot = vk, and 0 otherwise. Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm.
CIS 391 - Intro to AI 47
The Updated Model
Coming from μ = (A, B, π), we get to μ̂ = (Â, B̂, π̂) by the following update rules:
b̂i(k) = Σt=1..T δ(ot, vk) γt(i) / Σt=1..T γt(i)
âij = Σt=1..T−1 ξt(i, j) / Σt=1..T−1 γt(i)
π̂i = γ1(i)
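The full update can be sketched end to end: one E step (forward/backward, then ξ and γ) followed by one M step (the three update rules), checking the local-improvement guarantee P(O | μ̂) ≥ P(O | μ). The two-state model and observation sequence are invented for illustration:

```python
# One forward-backward (Baum-Welch) re-estimation step on a toy
# 2-state HMM; all numbers and the observation sequence are invented.
states, vocab = ["H", "C"], ["1", "3"]
A = {("H", "H"): 0.7, ("H", "C"): 0.3, ("C", "H"): 0.4, ("C", "C"): 0.6}
B = {("H", "1"): 0.2, ("H", "3"): 0.8, ("C", "1"): 0.9, ("C", "3"): 0.1}
pi = {"H": 0.5, "C": 0.5}
obs = ["3", "1", "3", "1"]

def forward_all(A, B, pi):
    al = [{s: pi[s] * B[(s, obs[0])] for s in states}]
    for o in obs[1:]:
        al.append({j: sum(al[-1][i] * A[(i, j)] for i in states) * B[(j, o)]
                   for j in states})
    return al  # al[t][i] = alpha_{t+1}(i)

def backward_all(A, B):
    be = [{s: 1.0 for s in states}]
    for t in range(len(obs) - 2, -1, -1):
        be.insert(0, {i: sum(A[(i, j)] * B[(j, obs[t + 1])] * be[0][j]
                             for j in states) for i in states})
    return be  # be[t][i] = beta_{t+1}(i)

def reestimate(A, B, pi):
    T = len(obs)
    al, be = forward_all(A, B, pi), backward_all(A, B)
    prob = sum(al[-1].values())  # P(O | mu)
    # E step: xi_t(i,j), and gamma_t(i) = alpha_t(i) * beta_t(i) / P(O|mu)
    xi = [{(i, j): al[t][i] * A[(i, j)] * B[(j, obs[t + 1])] * be[t + 1][j]
           / prob for i in states for j in states} for t in range(T - 1)]
    ga = [{i: al[t][i] * be[t][i] / prob for i in states} for t in range(T)]
    # M step: the three update rules from the slide
    new_pi = {i: ga[0][i] for i in states}
    new_A = {(i, j): sum(x[(i, j)] for x in xi) /
                     sum(ga[t][i] for t in range(T - 1))
             for i in states for j in states}
    new_B = {(i, v): sum(g[i] for t, g in enumerate(ga) if obs[t] == v) /
                     sum(g[i] for g in ga) for i in states for v in vocab}
    return new_A, new_B, new_pi

A2, B2, pi2 = reestimate(A, B, pi)
old_p = sum(forward_all(A, B, pi)[-1].values())
new_p = sum(forward_all(A2, B2, pi2)[-1].values())
print(old_p, new_p)
```

Iterating `reestimate` until `new_p` stops improving gives the hill-climbing behavior described above; each step is guaranteed not to decrease P(O | μ).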
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithm:
• The E Step: compute the forward and backward probabilities for a given model.
• The M Step: re-estimate the model parameters.
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 10
Simple Statistical Approaches Idea 2
For a string of words
W = w1w2w3hellipwn
find the string of POS tags
T = t1 t2 t3 helliptn
which maximizes P(T|W)
bull ie the most likely POS tag ti for each word wi given its surrounding context
CIS 391 - Intro to AI 11
The Sparse Data Problem hellip
A Simple Impossible Approach to Compute P(T|W)
Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string
CIS 391 - Intro to AI 12
A BOTEC Estimate of What We Can Estimate
What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech
tags
Rich Models often require vast amounts of data Good estimates of models with bad assumptions often
outperform better models which are badly estimated
CIS 391 - Intro to AI 13
A Practical Statistical Tagger
CIS 391 - Intro to AI 14
A Practical Statistical Tagger II
But we cant accurately estimate more than tag bigrams or sohellip
Again we change to a model that we CAN estimate
CIS 391 - Intro to AI 15
A Practical Statistical Tagger III
So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes
CIS 391 - Intro to AI 16
Training and Performance
To estimate the parameters of this model given an annotated training corpus
Because many of these counts are small smoothing is necessary for best resultshellip
Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags
CIS 391 - Intro to AI 17
Hidden Markov Models
This model is an instance of a Hidden Markov Model Viewed graphically
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1P(w|Det)
a 4the 4
P(w|Adj)good 02low 04
P(w|Noun)price 001deal 0001
CIS 391 - Intro to AI 18
Viewed as a generator an HMM
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1
4the
4a
P(w|Det)
04low
02good
P(w|Adj)
0001deal
001price
P(w|Noun)
CIS 391 - Intro to AI 19
Recognition using an HMM
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done using an exponential search through all strings for T
However there is a linear timelinear time solution using dynamic programming called Viterbi decoding
CIS 391 - Intro to AI 21
Parameters of an HMM
States A set of states S=s1hellipsn
Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj
Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si
Initial state distribution is the probability that si is a start state
i
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model
Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model
how do we find the state sequence that best explains the observations
(AB )
(AB )
(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)
CIS 391 - Intro to AI 23
Problem 3 (Learning) How do we adjust the model parameters to maximize
The Three Basic HMM Problems
(AB )
P(O | )
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is The probability of a observation sequence is the
sum of the probabilities of all possible state sequences in the HMM
Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences
Even small HMMs eg T=10 and N=10 contain 10 billion different paths
Solution to this and problem 2 is to use dynamic programming
P(O | )
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 11
The Sparse Data Problem hellip
A Simple Impossible Approach to Compute P(T|W)
Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string
CIS 391 - Intro to AI 12
A BOTEC Estimate of What We Can Estimate
What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech
tags
Rich Models often require vast amounts of data Good estimates of models with bad assumptions often
outperform better models which are badly estimated
CIS 391 - Intro to AI 13
A Practical Statistical Tagger
CIS 391 - Intro to AI 14
A Practical Statistical Tagger II
But we cant accurately estimate more than tag bigrams or sohellip
Again we change to a model that we CAN estimate
CIS 391 - Intro to AI 15
A Practical Statistical Tagger III
So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes
CIS 391 - Intro to AI 16
Training and Performance
To estimate the parameters of this model given an annotated training corpus
Because many of these counts are small smoothing is necessary for best resultshellip
Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags
CIS 391 - Intro to AI 17
Hidden Markov Models
This model is an instance of a Hidden Markov Model Viewed graphically
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1P(w|Det)
a 4the 4
P(w|Adj)good 02low 04
P(w|Noun)price 001deal 0001
CIS 391 - Intro to AI 18
Viewed as a generator an HMM
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1
4the
4a
P(w|Det)
04low
02good
P(w|Adj)
0001deal
001price
P(w|Noun)
CIS 391 - Intro to AI 19
Recognition using an HMM
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done using an exponential search through all strings for T
However there is a linear timelinear time solution using dynamic programming called Viterbi decoding
CIS 391 - Intro to AI 21
Parameters of an HMM
States A set of states S=s1hellipsn
Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj
Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si
Initial state distribution is the probability that si is a start state
i
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model
Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model
how do we find the state sequence that best explains the observations
(AB )
(AB )
(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)
CIS 391 - Intro to AI 23
Problem 3 (Learning) How do we adjust the model parameters to maximize
The Three Basic HMM Problems
(AB )
P(O | )
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is The probability of a observation sequence is the
sum of the probabilities of all possible state sequences in the HMM
Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences
Even small HMMs eg T=10 and N=10 contain 10 billion different paths
Solution to this and problem 2 is to use dynamic programming
P(O | )
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 12
A BOTEC Estimate of What We Can Estimate
What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech
tags
Rich Models often require vast amounts of data Good estimates of models with bad assumptions often
outperform better models which are badly estimated
CIS 391 - Intro to AI 13
A Practical Statistical Tagger
CIS 391 - Intro to AI 14
A Practical Statistical Tagger II
But we cant accurately estimate more than tag bigrams or sohellip
Again we change to a model that we CAN estimate
CIS 391 - Intro to AI 15
A Practical Statistical Tagger III
So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes
CIS 391 - Intro to AI 16
Training and Performance
To estimate the parameters of this model given an annotated training corpus
Because many of these counts are small smoothing is necessary for best resultshellip
Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags
CIS 391 - Intro to AI 17
Hidden Markov Models
This model is an instance of a Hidden Markov Model Viewed graphically
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1P(w|Det)
a 4the 4
P(w|Adj)good 02low 04
P(w|Noun)price 001deal 0001
CIS 391 - Intro to AI 18
Viewed as a generator an HMM
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1
4the
4a
P(w|Det)
04low
02good
P(w|Adj)
0001deal
001price
P(w|Noun)
CIS 391 - Intro to AI 19
Recognition using an HMM
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done using an exponential search through all strings for T
However there is a linear timelinear time solution using dynamic programming called Viterbi decoding
CIS 391 - Intro to AI 21
Parameters of an HMM
States A set of states S=s1hellipsn
Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj
Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si
Initial state distribution is the probability that si is a start state
i
The Three Basic HMM Problems
Problem 1 (Evaluation): Given the observation sequence O = o1…oT and an HMM model λ = (A, B, π), how do we compute the probability of O given the model?
Problem 2 (Decoding): Given the observation sequence O = o1…oT and an HMM model λ = (A, B, π), how do we find the state sequence that best explains the observations?
(This and following slides follow the classic formulation by Rabiner and Juang, as adapted by Manning and Schütze. Slides adapted from Dorr.)
The Three Basic HMM Problems
Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
Problem 1: Probability of an Observation Sequence
What is P(O | λ)? The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
Naïve computation is very expensive: given T observations and N states, there are N^T possible state sequences.
Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.
The solution to this and Problem 2 is to use dynamic programming.
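The N^T blow-up can be seen directly by enumerating every state sequence for a toy model; the sketch below uses the same hypothetical two-state numbers as earlier examples:

```python
from itertools import product

# Hypothetical two-state model (illustrative numbers only).
pi = [0.8, 0.2]
A = [[0.1, 0.9], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]           # T = 3 observations, coded as vocabulary indices
N = len(pi)

# Sum the joint probability of every one of the N**T state sequences.
total = 0.0
for path in product(range(N), repeat=len(obs)):
    p = pi[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    total += p            # total is P(O | lambda), at O(N**T) cost
```

With N = 2 and T = 3 this loop runs 8 times; with N = 10 and T = 10 it would run 10 billion times, which is why the forward algorithm below matters.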
The Trellis
Forward Probabilities
What is the probability that, given an HMM λ, at time t the state is i and the partial observation o1 … ot has been generated?
αt(i) = P(o1 … ot, qt = si | λ)
Forward Probabilities
αt(j) = [ Σi=1..N αt−1(i) aij ] bj(ot)
where αt(i) = P(o1 … ot, qt = si | λ)
Forward Algorithm
Initialization: α1(i) = πi bi(o1), 1 ≤ i ≤ N
Induction: αt(j) = [ Σi=1..N αt−1(i) aij ] bj(ot), 2 ≤ t ≤ T, 1 ≤ j ≤ N
Termination: P(O | λ) = Σi=1..N αT(i)
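The three steps above translate directly into code. A minimal sketch, using the same hypothetical two-state model as the earlier examples:

```python
def forward(obs, pi, A, B):
    """Compute alpha[t][i] = P(o_1..o_t, q_t = s_i | lambda) and
    return P(O | lambda) via the termination step."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):                         # initialization
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):                      # induction
        for j in range(N):
            s = sum(alpha[t - 1][i] * A[i][j] for i in range(N))
            alpha[t][j] = s * B[j][obs[t]]
    return sum(alpha[T - 1][i] for i in range(N))   # termination

# Hypothetical toy model (illustrative numbers only):
pi = [0.8, 0.2]
A = [[0.1, 0.9], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p_obs = forward([0, 1, 0], pi, A, B)           # P(O | lambda)
```

The two nested loops over t and j (with the inner sum over i) are exactly the O(N²T) cost claimed on the next slide.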
Forward Algorithm Complexity
The naïve approach takes O(2T·N^T) computation.
The forward algorithm, using dynamic programming, takes O(N²T) computations.
Backward Probabilities
What is the probability that, given an HMM λ and given that the state at time t is i, the partial observation ot+1 … oT is generated?
Analogous to the forward probability, just in the other direction:
βt(i) = P(ot+1 … oT | qt = si, λ)
Backward Probabilities
βt(i) = Σj=1..N aij bj(ot+1) βt+1(j)
where βt(i) = P(ot+1 … oT | qt = si, λ)
Backward Algorithm
Initialization: βT(i) = 1, 1 ≤ i ≤ N
Induction: βt(i) = Σj=1..N aij bj(ot+1) βt+1(j), t = T−1 … 1, 1 ≤ i ≤ N
Termination: P(O | λ) = Σi=1..N πi bi(o1) β1(i)
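A corresponding sketch of the backward pass, on the same hypothetical toy model; its termination step recovers the same P(O | λ) as the forward algorithm, which is a useful correctness check:

```python
def backward(obs, pi, A, B):
    """Compute beta[t][i] = P(o_{t+1}..o_T | q_t = s_i, lambda) and
    return P(O | lambda) via the termination step."""
    N, T = len(pi), len(obs)
    beta = [[1.0] * N for _ in range(T)]       # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):             # induction, t = T-1 .. 1
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    # termination: P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(N))

# Hypothetical toy model (illustrative numbers only):
pi = [0.8, 0.2]
A = [[0.1, 0.9], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p_obs = backward([0, 1, 0], pi, A, B)          # same value the forward pass gives
```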
Problem 2: Decoding
The forward algorithm gives the sum over all paths through an HMM efficiently.
Here we want to find the highest-probability path.
We want to find the state sequence Q = q1…qT such that
Q* = argmaxQ P(Q | O, λ)
Viterbi Algorithm
Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, we compute the maximum.
Forward: αt(j) = [ Σi=1..N αt−1(i) aij ] bj(ot)
Viterbi recursion: δt(j) = [ max1≤i≤N δt−1(i) aij ] bj(ot)
Core Idea of Viterbi Algorithm
Viterbi Algorithm
Initialization: δ1(i) = πi bi(o1), 1 ≤ i ≤ N
Induction: δt(j) = [ max1≤i≤N δt−1(i) aij ] bj(ot)
           ψt(j) = argmax1≤i≤N δt−1(i) aij, 2 ≤ t ≤ T, 1 ≤ j ≤ N
Termination: p* = max1≤i≤N δT(i), qT* = argmax1≤i≤N δT(i)
Read out path: qt* = ψt+1(qt+1*), t = T−1, …, 1
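The four steps above can be sketched as follows; ψ (here `psi`) stores the backpointers and the path is read out in reverse. The model is the same hypothetical two-state toy used earlier:

```python
def viterbi(obs, pi, A, B):
    """Return the most probable state sequence and its probability,
    following the initialization/induction/termination steps."""
    N, T = len(pi), len(obs)
    delta = [[0.0] * N for _ in range(T)]
    psi = [[0] * N for _ in range(T)]          # backpointers
    for i in range(N):                         # initialization
        delta[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):                      # induction
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            psi[t][j] = best_i
            delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
    q = [0] * T                                # termination
    q[T - 1] = max(range(N), key=lambda i: delta[T - 1][i])
    for t in range(T - 2, -1, -1):             # read out path via backpointers
        q[t] = psi[t + 1][q[t + 1]]
    return q, delta[T - 1][q[T - 1]]

# Hypothetical toy model (illustrative numbers only):
pi = [0.8, 0.2]
A = [[0.1, 0.9], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
path, p = viterbi([0, 1, 0], pi, A, B)
```

Structurally this is the forward algorithm with `sum` replaced by `max` plus backpointers, so it shares the O(N²T) cost — the "linear-time" solution promised earlier.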
Problem 3: Learning
Up to now we've assumed that we know the underlying model λ = (A, B, π).
Often these parameters are estimated on annotated training data, but: annotation is often difficult and/or expensive, and training data is different from the current data.
We want to maximize the parameters with respect to the current data, i.e., we're looking for a model λ′ such that λ′ = argmaxλ P(O | λ)
Problem 3: Learning (If Time Allows…)
Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ′ such that λ′ = argmaxλ P(O | λ).
But it is possible to find a local maximum: given an initial model λ, we can always find a model λ′ such that P(O | λ′) ≥ P(O | λ).
Forward-Backward (Baum-Welch) Algorithm
Key idea: parameter re-estimation by hill-climbing.
From an arbitrary initial parameter instantiation, the FB algorithm iteratively re-estimates the parameters, improving the probability that a given observation was generated by the model.
Parameter Re-estimation
Three parameters need to be re-estimated:
• Initial state distribution: πi
• Transition probabilities: aij
• Emission probabilities: bi(ot)
Re-estimating Transition Probabilities
What's the probability of being in state si at time t and going to state sj, given the current model and parameters?
ξt(i, j) = P(qt = si, qt+1 = sj | O, λ)
Re-estimating Transition Probabilities
ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / Σi=1..N Σj=1..N αt(i) aij bj(ot+1) βt+1(j)
where ξt(i, j) = P(qt = si, qt+1 = sj | O, λ)
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is:
âij = (expected number of transitions from state si to state sj) / (expected number of transitions from state si)
Formally:
âij = Σt=1..T−1 ξt(i, j) / Σt=1..T−1 Σj′=1..N ξt(i, j′)
Re-estimating Transition Probabilities
Defining
γt(i) = Σj=1..N ξt(i, j)
as the probability of being in state si given the complete observation O, we can say:
âij = Σt=1..T−1 ξt(i, j) / Σt=1..T−1 γt(i)
Re-estimating Initial State Probabilities
Initial state distribution: πi is the probability that si is a start state.
Re-estimation is easy:
π̂i = expected number of times in state si at time 1
Formally: π̂i = γ1(i)
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as:
b̂i(k) = (expected number of times in state si observing symbol vk) / (expected number of times in state si)
Formally:
b̂i(k) = Σt=1..T δ(ot, vk) γt(i) / Σt=1..T γt(i)
where δ(ot, vk) = 1 if ot = vk and 0 otherwise. Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm.
The Updated Model
Coming from λ = (A, B, π), we get to λ′ = (Â, B̂, π̂) by the following update rules:
âij = Σt=1..T−1 ξt(i, j) / Σt=1..T−1 γt(i)
b̂i(k) = Σt=1..T δ(ot, vk) γt(i) / Σt=1..T γt(i)
π̂i = γ1(i)
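One full re-estimation step, combining the forward and backward passes with the update rules above, might look like the following sketch (plain lists, a single observation sequence, no smoothing; the toy model numbers are hypothetical):

```python
def baum_welch_step(obs, pi, A, B):
    """One Baum-Welch update: (A, B, pi) -> (A_hat, B_hat, pi_hat)."""
    N, T, V = len(pi), len(obs), len(B[0])
    alpha = [[0.0] * N for _ in range(T)]      # forward pass
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j]
                                             for i in range(N))
    beta = [[1.0] * N for _ in range(T)]       # backward pass
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    p_obs = sum(alpha[T - 1][i] for i in range(N))
    # gamma_t(i): probability of being in s_i at time t, given O
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]
    # xi_t(i, j): probability of the transition s_i -> s_j at time t, given O
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    pi_hat = gamma[0][:]
    A_hat = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    B_hat = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(V)] for i in range(N)]
    return pi_hat, A_hat, B_hat

# Hypothetical toy model (illustrative numbers only):
pi, A, B = [0.8, 0.2], [[0.1, 0.9], [0.4, 0.6]], [[0.9, 0.1], [0.2, 0.8]]
pi_hat, A_hat, B_hat = baum_welch_step([0, 1, 0], pi, A, B)
```

Iterating this step until P(O | λ) stops improving is the hill-climbing loop described on the Forward-Backward slide; each update leaves π̂, each row of Â, and each row of B̂ a proper probability distribution.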
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithm:
• The E step: compute the forward and backward probabilities for a given model.
• The M step: re-estimate the model parameters.
CIS 391 - Intro to AI 13
A Practical Statistical Tagger
CIS 391 - Intro to AI 14
A Practical Statistical Tagger II
But we cant accurately estimate more than tag bigrams or sohellip
Again we change to a model that we CAN estimate
CIS 391 - Intro to AI 15
A Practical Statistical Tagger III
So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes
CIS 391 - Intro to AI 16
Training and Performance
To estimate the parameters of this model given an annotated training corpus
Because many of these counts are small smoothing is necessary for best resultshellip
Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags
CIS 391 - Intro to AI 17
Hidden Markov Models
This model is an instance of a Hidden Markov Model Viewed graphically
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1P(w|Det)
a 4the 4
P(w|Adj)good 02low 04
P(w|Noun)price 001deal 0001
CIS 391 - Intro to AI 18
Viewed as a generator an HMM
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1
4the
4a
P(w|Det)
04low
02good
P(w|Adj)
0001deal
001price
P(w|Noun)
CIS 391 - Intro to AI 19
Recognition using an HMM
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done using an exponential search through all strings for T
However there is a linear timelinear time solution using dynamic programming called Viterbi decoding
CIS 391 - Intro to AI 21
Parameters of an HMM
States A set of states S=s1hellipsn
Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj
Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si
Initial state distribution is the probability that si is a start state
i
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model
Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model
how do we find the state sequence that best explains the observations
(AB )
(AB )
(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)
CIS 391 - Intro to AI 23
Problem 3 (Learning) How do we adjust the model parameters to maximize
The Three Basic HMM Problems
(AB )
P(O | )
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is The probability of a observation sequence is the
sum of the probabilities of all possible state sequences in the HMM
Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences
Even small HMMs eg T=10 and N=10 contain 10 billion different paths
Solution to this and problem 2 is to use dynamic programming
P(O | )
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 14
A Practical Statistical Tagger II
But we cant accurately estimate more than tag bigrams or sohellip
Again we change to a model that we CAN estimate
CIS 391 - Intro to AI 15
A Practical Statistical Tagger III
So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes
CIS 391 - Intro to AI 16
Training and Performance
To estimate the parameters of this model given an annotated training corpus
Because many of these counts are small smoothing is necessary for best resultshellip
Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags
CIS 391 - Intro to AI 17
Hidden Markov Models
This model is an instance of a Hidden Markov Model Viewed graphically
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1P(w|Det)
a 4the 4
P(w|Adj)good 02low 04
P(w|Noun)price 001deal 0001
CIS 391 - Intro to AI 18
Viewed as a generator an HMM
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1
4the
4a
P(w|Det)
04low
02good
P(w|Adj)
0001deal
001price
P(w|Noun)
CIS 391 - Intro to AI 19
Recognition using an HMM
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done using an exponential search through all strings for T
However there is a linear timelinear time solution using dynamic programming called Viterbi decoding
CIS 391 - Intro to AI 21
Parameters of an HMM
States A set of states S=s1hellipsn
Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj
Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si
Initial state distribution is the probability that si is a start state
i
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model
Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model
how do we find the state sequence that best explains the observations
(AB )
(AB )
(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)
CIS 391 - Intro to AI 23
Problem 3 (Learning) How do we adjust the model parameters to maximize
The Three Basic HMM Problems
(AB )
P(O | )
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is The probability of a observation sequence is the
sum of the probabilities of all possible state sequences in the HMM
Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences
Even small HMMs eg T=10 and N=10 contain 10 billion different paths
Solution to this and problem 2 is to use dynamic programming
P(O | )
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 15
A Practical Statistical Tagger III
So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes
CIS 391 - Intro to AI 16
Training and Performance
To estimate the parameters of this model given an annotated training corpus
Because many of these counts are small smoothing is necessary for best resultshellip
Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags
CIS 391 - Intro to AI 17
Hidden Markov Models
This model is an instance of a Hidden Markov Model Viewed graphically
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1P(w|Det)
a 4the 4
P(w|Adj)good 02low 04
P(w|Noun)price 001deal 0001
CIS 391 - Intro to AI 18
Viewed as a generator an HMM
Adj
3
6Det
02
47 Noun
3
7 Verb
51 1
4the
4a
P(w|Det)
04low
02good
P(w|Adj)
0001deal
001price
P(w|Noun)
CIS 391 - Intro to AI 19
Recognition using an HMM
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done using an exponential search through all strings for T
However there is a linear timelinear time solution using dynamic programming called Viterbi decoding
CIS 391 - Intro to AI 21
Parameters of an HMM
States A set of states S=s1hellipsn
Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj
Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si
Initial state distribution is the probability that si is a start state
i
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model
Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model
how do we find the state sequence that best explains the observations
(AB )
(AB )
(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)
CIS 391 - Intro to AI 23
Problem 3 (Learning) How do we adjust the model parameters to maximize
The Three Basic HMM Problems
(AB )
P(O | )
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is The probability of a observation sequence is the
sum of the probabilities of all possible state sequences in the HMM
Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences
Even small HMMs eg T=10 and N=10 contain 10 billion different paths
Solution to this and problem 2 is to use dynamic programming
P(O | )
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
- Part of Speech Tagging & Hidden Markov Models
- NLP Task I – Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches: Idea 1
- Simple Statistical Approaches: Idea 2
- The Sparse Data Problem …
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator, an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1: Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2: Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3: Learning
- Problem 3: Learning (If Time Allows…)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
CIS 391 - Intro to AI 16
Training and Performance
To estimate the parameters of this model, we count tag-sequence and word-tag frequencies over an annotated training corpus.

Because many of these counts are small, smoothing is necessary for best results…

Such taggers typically achieve about 95–96% correct tagging for tag sets of 40–80 tags.
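The count-based estimation sketched above can be written out as follows. This is a minimal, unsmoothed illustration (the corpus format and function name are my own, not from the slides; a real tagger would add the smoothing mentioned above):

```python
from collections import Counter

def estimate_hmm(tagged_sents):
    """Relative-frequency estimates for a bigram HMM tagger:
    pi[t]     ~ P(sentence starts with tag t)
    A[(t, u)] ~ P(next tag u | current tag t)
    B[(t, w)] ~ P(word w | tag t)
    No smoothing: unseen events get probability 0."""
    trans, emit, tag_n, start_n = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sents:
        prev = None
        for word, tag in sent:
            tag_n[tag] += 1
            emit[(tag, word)] += 1
            if prev is None:
                start_n[tag] += 1
            else:
                trans[(prev, tag)] += 1
            prev = tag
    pi = {t: start_n[t] / len(tagged_sents) for t in tag_n}
    A = {tu: c / tag_n[tu[0]] for tu, c in trans.items()}
    B = {tw: c / tag_n[tw[0]] for tw, c in emit.items()}
    return pi, A, B
```

With zero counts for unseen transitions and emissions, any sentence containing a novel word gets probability 0, which is exactly why smoothing matters in practice.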
CIS 391 - Intro to AI 17
Hidden Markov Models

This model is an instance of a Hidden Markov Model. Viewed graphically:

[State-transition diagram over the states Det, Adj, Noun, and Verb; the transition probabilities are not legible in this transcript. Emission tables: P(w|Det): a .4, the .4; P(w|Adj): good .02, low .04; P(w|Noun): price .001, deal .0001]
CIS 391 - Intro to AI 18
Viewed as a generator, an HMM

[The same Det/Adj/Noun/Verb state diagram, now viewed as a generator of word sequences, with the same emission tables: P(w|Det): the .4, a .4; P(w|Adj): low .04, good .02; P(w|Noun): deal .0001, price .001]
CIS 391 - Intro to AI 19
Recognition using an HMM
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done using an exponential search through all possible tag strings T.

However, there is a linear-time solution using dynamic programming called Viterbi decoding.
CIS 391 - Intro to AI 21
Parameters of an HMM

States: A set of states S = {s_1, …, s_n}

Transition probabilities: A = {a_11, a_12, …, a_nn}. Each a_ij represents the probability of transitioning from state s_i to s_j.

Emission probabilities: A set B of functions of the form b_i(o_t), the probability of observation o_t being emitted by s_i.

Initial state distribution: π_i is the probability that s_i is a start state.
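The parameter sets above can be collected into a small container. The sketch below is only illustrative: the two-state "Hot"/"Cold" model and all of its numbers are invented for this example, not taken from the slides.

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: list   # S = {s_1, ..., s_n}
    A: dict        # A[(i, j)]: probability of transitioning from state i to j
    B: dict        # B[(i, o)]: probability of state i emitting observation o
    pi: dict       # pi[i]: probability that i is a start state

# A tiny two-state example (numbers invented for illustration):
toy = HMM(
    states=["Hot", "Cold"],
    A={("Hot", "Hot"): 0.7, ("Hot", "Cold"): 0.3,
       ("Cold", "Hot"): 0.4, ("Cold", "Cold"): 0.6},
    B={("Hot", "ice"): 0.8, ("Hot", "soup"): 0.2,
       ("Cold", "ice"): 0.1, ("Cold", "soup"): 0.9},
    pi={"Hot": 0.5, "Cold": 0.5},
)
```

Note that π, each row of A, and each row of B must sum to 1, since they are probability distributions.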
CIS 391 - Intro to AI 22
The Three Basic HMM Problems

Problem 1 (Evaluation): Given the observation sequence O = o_1…o_T and an HMM model μ = (A, B, π), how do we compute the probability of O given the model?

Problem 2 (Decoding): Given the observation sequence O = o_1…o_T and an HMM model μ = (A, B, π), how do we find the state sequence that best explains the observations?

(This and the following slides follow the classic formulation by Rabiner and Juang, as adapted by Manning and Schütze. Slides adapted from Dorr.)

CIS 391 - Intro to AI 23

The Three Basic HMM Problems

Problem 3 (Learning): How do we adjust the model parameters μ = (A, B, π) to maximize P(O | μ)?
CIS 391 - Intro to AI 24
Problem 1: Probability of an Observation Sequence

What is P(O | μ)? The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.

Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences.

Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.

The solution to this (and to Problem 2) is to use dynamic programming.
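The N^T blow-up can be made concrete with a brute-force sketch that literally sums over every state sequence. The toy two-state model and its numbers are invented for illustration; only the enumeration idea comes from the slide.

```python
from itertools import product

def naive_likelihood(states, A, B, pi, obs):
    """P(O | mu) by brute force: sum the probability of O over all
    N**T state sequences. Exponential in T, as the slide warns."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        # probability of this state sequence generating obs
        p = pi[path[0]] * B[(path[0], obs[0])]
        for t in range(1, len(obs)):
            p *= A[(path[t - 1], path[t])] * B[(path[t], obs[t])]
        total += p
    return total
```

With N = 10 states and T = 10 observations this loop already visits 10^10 paths, which is why the forward algorithm below is needed.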
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities

α_t(i) = P(o_1 … o_t, q_t = s_i | μ)

α_t(j) = [ Σ_{i=1..N} α_{t−1}(i) a_ij ] b_j(o_t)
CIS 391 - Intro to AI 28
Forward Algorithm

Initialization: α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N

Induction: α_t(j) = [ Σ_{i=1..N} α_{t−1}(i) a_ij ] b_j(o_t), 2 ≤ t ≤ T, 1 ≤ j ≤ N

Termination: P(O | μ) = Σ_{i=1..N} α_T(i)
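The three steps above translate directly into code. A minimal sketch (the dict-based parameter representation and the toy model in the usage note are my own illustration):

```python
def forward(states, A, B, pi, obs):
    """Forward algorithm: alpha[t][j] = P(o_1 .. o_t, q_t = j | mu).
    Runs in O(N^2 T) instead of the naive O(N^T)."""
    # initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [{j: pi[j] * B[(j, obs[0])] for j in states}]
    # induction: alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(o_t)
    for t in range(1, len(obs)):
        alpha.append({
            j: sum(alpha[t - 1][i] * A[(i, j)] for i in states) * B[(j, obs[t])]
            for j in states
        })
    # termination: P(O | mu) = sum_j alpha_T(j)
    return sum(alpha[-1][j] for j in states)
```

On a toy two-state model this returns the same value as brute-force enumeration over all state sequences, while doing only N^2 work per time step.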
CIS 391 - Intro to AI 29
Forward Algorithm Complexity

The naïve approach takes O(2T · N^T) computation.

The forward algorithm, using dynamic programming, takes O(N^2 T) computations.
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities

β_t(i) = P(o_{t+1} … o_T | q_t = s_i, μ)

β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization: β_T(i) = 1, 1 ≤ i ≤ N

Induction: β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j), t = T−1 … 1, 1 ≤ i ≤ N

Termination: P(O | μ) = Σ_{i=1..N} π_i b_i(o_1) β_1(i)
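The backward recursion above can be sketched the same way as the forward one (dict-based representation and toy usage are my own illustration):

```python
def backward(states, A, B, pi, obs):
    """Backward algorithm: beta[t][i] = P(o_{t+1} .. o_T | q_t = i, mu)."""
    T = len(obs)
    beta = [{} for _ in range(T)]
    # initialization: beta_T(i) = 1
    for i in states:
        beta[T - 1][i] = 1.0
    # induction: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        for i in states:
            beta[t][i] = sum(A[(i, j)] * B[(j, obs[t + 1])] * beta[t + 1][j]
                             for j in states)
    # termination: P(O | mu) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[i] * B[(i, obs[0])] * beta[0][i] for i in states)
```

As a sanity check, the backward termination gives the same P(O | μ) as the forward algorithm.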
CIS 391 - Intro to AI 33
Problem 2: Decoding

The forward algorithm gives the sum over all paths through an HMM efficiently.

Here we want to find the highest-probability path.

We want to find the state sequence Q = q_1…q_T such that

Q* = argmax_Q P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum.

Forward: α_t(j) = [ Σ_{i=1..N} α_{t−1}(i) a_ij ] b_j(o_t)

Viterbi recursion: δ_t(j) = [ max_{1≤i≤N} δ_{t−1}(i) a_ij ] b_j(o_t)
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm

Initialization: δ_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N

Induction:
δ_t(j) = [ max_{1≤i≤N} δ_{t−1}(i) a_ij ] b_j(o_t)
ψ_t(j) = argmax_{1≤i≤N} δ_{t−1}(i) a_ij
for 2 ≤ t ≤ T, 1 ≤ j ≤ N

Termination:
p* = max_{1≤i≤N} δ_T(i)
q*_T = argmax_{1≤i≤N} δ_T(i)

Read out path: q*_t = ψ_{t+1}(q*_{t+1}), t = T−1 … 1
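The four steps above (initialization, induction, termination, path readout) can be sketched as follows; the dict-based representation and the toy usage are my own illustration:

```python
def viterbi(states, A, B, pi, obs):
    """Most probable state sequence for obs, via dynamic programming.
    delta[t][j] = best probability of any path ending in state j at time t;
    psi[t][j]   = backpointer: best predecessor of j at time t."""
    # initialization: delta_1(i) = pi_i * b_i(o_1)
    delta = [{i: pi[i] * B[(i, obs[0])] for i in states}]
    psi = [{}]
    # induction: maximize (instead of sum) over incoming states
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for j in states:
            best = max(states, key=lambda i: delta[t - 1][i] * A[(i, j)])
            psi[t][j] = best
            delta[t][j] = delta[t - 1][best] * A[(best, j)] * B[(j, obs[t])]
    # termination: best final state and its probability
    last = max(states, key=lambda i: delta[-1][i])
    # read out the path by following backpointers
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, delta[-1][last]
```

Real implementations usually work with log probabilities so long sequences do not underflow; that detail is omitted here.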
CIS 391 - Intro to AI 37
Problem 3: Learning

Up to now we've assumed that we know the underlying model μ = (A, B, π).

Often these parameters are estimated on annotated training data, but:
• Annotation is often difficult and/or expensive.
• Training data is different from the current data.

We want to maximize the parameters with respect to the current data, i.e., we're looking for a model μ' such that

μ' = argmax_μ P(O | μ)
CIS 391 - Intro to AI 38
Problem 3: Learning (If Time Allows…)

Unfortunately, there is no known way to analytically find a global maximum, i.e., a model μ' such that

μ' = argmax_μ P(O | μ)

But it is possible to find a local maximum.

Given an initial model μ, we can always find a model μ' such that

P(O | μ') ≥ P(O | μ)
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key idea: parameter re-estimation by hill-climbing.

From an arbitrary initial parameter instantiation μ, the FB algorithm iteratively re-estimates the parameters, improving the probability that a given observation sequence O was generated by the model.
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimated:
• Initial state distribution: π_i
• Transition probabilities: a_ij
• Emission probabilities: b_i(o_t)
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
What's the probability of being in state s_i at time t and going to state s_j, given the current model and parameters?

ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, μ)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, μ)

ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / [ Σ_{i=1..N} Σ_{j=1..N} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) ]
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is:

â_ij = (expected number of transitions from state s_i to state s_j) / (expected number of transitions from state s_i)

Formally:

â_ij = Σ_{t=1..T−1} ξ_t(i, j) / Σ_{t=1..T−1} Σ_{j'=1..N} ξ_t(i, j')
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining

γ_t(i) = Σ_{j=1..N} ξ_t(i, j)

as the probability of being in state s_i at time t, given the complete observation O, we can say:

â_ij = Σ_{t=1..T−1} ξ_t(i, j) / Σ_{t=1..T−1} γ_t(i)
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
The initial state distribution π_i is the probability that s_i is a start state.

Re-estimation is easy:

π̂_i = expected number of times in state s_i at time 1

Formally: π̂_i = γ_1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as:

b̂_i(k) = (expected number of times in state s_i observing symbol v_k) / (expected number of times in state s_i)

Formally:

b̂_i(k) = Σ_{t=1..T} δ(o_t, v_k) γ_t(i) / Σ_{t=1..T} γ_t(i)

where δ(o_t, v_k) = 1 if o_t = v_k, and 0 otherwise. Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm.
CIS 391 - Intro to AI 47
The Updated Model
Coming from μ = (A, B, π), we get to μ' = (Â, B̂, π̂) by the following update rules:

â_ij = Σ_{t=1..T−1} ξ_t(i, j) / Σ_{t=1..T−1} γ_t(i)

b̂_i(k) = Σ_{t=1..T} δ(o_t, v_k) γ_t(i) / Σ_{t=1..T} γ_t(i)

π̂_i = γ_1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithm:
• The E step: compute the forward and backward probabilities for a given model.
• The M step: re-estimate the model parameters.
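One full E step plus M step for a single observation sequence can be sketched as below. This is a simplified illustration (representation and toy example are my own; production implementations use scaling or log-space arithmetic to avoid underflow, and pool expected counts over many sequences):

```python
def baum_welch_step(states, A, B, pi, obs):
    """One forward-backward (Baum-Welch) re-estimation step.
    Returns updated (pi, A, B) for a single observation sequence."""
    T = len(obs)
    # E step, part 1: forward and backward probabilities
    alpha = [{j: pi[j] * B[(j, obs[0])] for j in states}]
    for t in range(1, T):
        alpha.append({j: sum(alpha[t - 1][i] * A[(i, j)] for i in states)
                         * B[(j, obs[t])] for j in states})
    beta = [{} for _ in range(T)]
    for i in states:
        beta[T - 1][i] = 1.0
    for t in range(T - 2, -1, -1):
        for i in states:
            beta[t][i] = sum(A[(i, j)] * B[(j, obs[t + 1])] * beta[t + 1][j]
                             for j in states)
    p_obs = sum(alpha[T - 1][i] for i in states)
    # E step, part 2: gamma_t(i) and xi_t(i, j)
    gamma = [{i: alpha[t][i] * beta[t][i] / p_obs for i in states}
             for t in range(T)]
    xi = [{(i, j): alpha[t][i] * A[(i, j)] * B[(j, obs[t + 1])]
                   * beta[t + 1][j] / p_obs
           for i in states for j in states} for t in range(T - 1)]
    # M step: re-estimate pi, A, B from expected counts
    new_pi = {i: gamma[0][i] for i in states}
    new_A = {(i, j): sum(x[(i, j)] for x in xi)
                     / sum(g[i] for g in gamma[:-1])
             for i in states for j in states}
    vocab = set(obs)
    new_B = {(i, v): sum(g[i] for t, g in enumerate(gamma) if obs[t] == v)
                     / sum(g[i] for g in gamma)
             for i in states for v in vocab}
    return new_pi, new_A, new_B
```

Because Σ_j ξ_t(i, j) = γ_t(i), each re-estimated transition row sums to 1, and likewise for the emission rows and the initial distribution.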
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 20
A Practical Statistical Tagger IV
Finding this maximum can be done by a naïve exponential search through all possible tag sequences.

However, there is a linear-time solution using dynamic programming, called Viterbi decoding.
CIS 391 - Intro to AI 21
Parameters of an HMM
States: a set of states S = {s1, …, sn}

Transition probabilities: A = {a11, a12, …, ann}. Each aij represents the probability of transitioning from state si to state sj.

Emission probabilities: a set B of functions of the form bi(ot), which give the probability of observation ot being emitted by state si.

Initial state distribution: πi is the probability that si is a start state.
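As a concrete illustration of these parameters, here is a tiny two-state HMM written as NumPy arrays. All states, symbols, and numbers are invented for illustration:

```python
import numpy as np

# Hypothetical two-state HMM: states s1 = "Hot", s2 = "Cold";
# observation symbols v1 = "ice cream", v2 = "soup".
states = ["Hot", "Cold"]

# Transition probabilities A: A[i, j] = P(q_{t+1} = s_j | q_t = s_i)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Emission probabilities B: B[i, k] = b_i(v_k) = P(o_t = v_k | q_t = s_i)
B = np.array([[0.8, 0.2],
              [0.1, 0.9]])

# Initial state distribution pi: pi[i] = P(q_1 = s_i)
pi = np.array([0.6, 0.4])

# Sanity checks: each row of A and B, and pi itself, must sum to 1.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```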
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation): Given the observation sequence O = o1…oT and an HMM model μ = (A, B, π), how do we compute the probability of O given the model?

Problem 2 (Decoding): Given the observation sequence O = o1…oT and an HMM model μ = (A, B, π), how do we find the state sequence that best explains the observations?
(This and the following slides follow the classic formulation by Rabiner and Juang, as adapted by Manning and Schütze. Slides adapted from Dorr.)
CIS 391 - Intro to AI 23
The Three Basic HMM Problems

Problem 3 (Learning): How do we adjust the model parameters μ = (A, B, π) to maximize P(O | μ)?
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is P(O | μ)? The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
Naïve computation is very expensive: given T observations and N states, there are N^T possible state sequences.

Even a small HMM, e.g. with T = 10 and N = 10, contains 10 billion different paths.

The solution, for this problem and for Problem 2, is dynamic programming.
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that, given an HMM μ, at time t the state is s_i and the partial observation o1 … ot has been generated?

$\alpha_t(i) = P(o_1 \cdots o_t,\ q_t = s_i \mid \mu)$
CIS 391 - Intro to AI 27
Forward Probabilities
$\alpha_t(j) = \Big[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\Big]\, b_j(o_t)$

$\alpha_t(i) = P(o_1 \cdots o_t,\ q_t = s_i \mid \mu)$
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization: $\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$

Induction: $\alpha_t(j) = \Big[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\Big]\, b_j(o_t), \quad 2 \le t \le T,\ 1 \le j \le N$

Termination: $P(O \mid \mu) = \sum_{i=1}^{N} \alpha_T(i)$
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
The naïve approach takes O(2T·N^T) computation. The forward algorithm, using dynamic programming, takes only O(N²T) computations.
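The initialization, induction, and termination steps above can be sketched in a few lines of NumPy. The two-state model and the observation sequence are invented for illustration:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Return the trellis alpha and P(O | model) for observation indices obs."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]          # initialization: alpha_1(i) = pi_i b_i(o_1)
    for t in range(1, T):                 # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()         # termination: sum_i alpha_T(i)

# Toy two-state model (hypothetical numbers).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.8, 0.2], [0.1, 0.9]])
pi = np.array([0.6, 0.4])

alpha, p_obs = forward(A, B, pi, [0, 1, 0])
```

Each row of `alpha` is one time-slice of the trellis, so the whole computation is O(N²T) as stated above.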
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that, given an HMM and given that the state at time t is i, the partial observation o_{t+1} … o_T is generated?

Analogous to the forward probability, just in the other direction:

$\beta_t(i) = P(o_{t+1} \cdots o_T \mid q_t = s_i, \mu)$
CIS 391 - Intro to AI 31
Backward Probabilities
$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$

$\beta_t(i) = P(o_{t+1} \cdots o_T \mid q_t = s_i, \mu)$
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization: $\beta_T(i) = 1, \quad 1 \le i \le N$

Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1, \ldots, 1,\ 1 \le i \le N$

Termination: $P(O \mid \mu) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$
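The backward recursion can be sketched the same way as the forward one; run on the same toy model (numbers invented for illustration), the termination step recovers the same observation probability as the forward pass, which is a useful sanity check:

```python
import numpy as np

def backward(A, B, pi, obs):
    """Return the trellis beta and P(O | model) computed backwards."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                        # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):        # induction, from T-1 down to 1
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # termination: sum_i pi_i b_i(o_1) beta_1(i)
    return beta, (pi * B[:, obs[0]] * beta[0]).sum()

# Same toy two-state model as before (hypothetical numbers).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.8, 0.2], [0.1, 0.9]])
pi = np.array([0.6, 0.4])

beta, p_obs = backward(A, B, pi, [0, 1, 0])
```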
CIS 391 - Intro to AI 33
Problem 2 Decoding
The forward algorithm efficiently computes the sum over all paths through an HMM.

Here, instead, we want to find the single highest-probability path.
We want to find the state sequence Q = q1…qT such that

$Q^{*} = \operatorname{argmax}_{Q} P(Q \mid O, \mu)$
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, we compute the maximum.
Forward: $\alpha_t(j) = \Big[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\Big]\, b_j(o_t)$

Viterbi recursion: $\delta_t(j) = \Big[\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}\Big]\, b_j(o_t)$
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization: $\delta_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$

Induction: $\delta_t(j) = \Big[\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}\Big]\, b_j(o_t)$ and $\psi_t(j) = \operatorname{argmax}_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}$, for $2 \le t \le T,\ 1 \le j \le N$

Termination: $p^{*} = \max_{1 \le i \le N} \delta_T(i)$ and $q_T^{*} = \operatorname{argmax}_{1 \le i \le N} \delta_T(i)$

Read out path: $q_t^{*} = \psi_{t+1}(q_{t+1}^{*}), \quad t = T-1, \ldots, 1$
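The four steps above (initialization, induction with backpointers, termination, and path readout) can be sketched as follows; the two-state model is again invented for illustration:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return the most likely state sequence for obs and its probability."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))            # delta_t(j): best path probability ending in j
    psi = np.zeros((T, N), dtype=int)   # psi_t(j): backpointer to the best predecessor
    delta[0] = pi * B[:, obs[0]]        # initialization
    for t in range(1, T):               # induction
        scores = delta[t - 1][:, None] * A      # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Termination and path readout: follow backpointers from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()

# Toy two-state model (hypothetical numbers).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.8, 0.2], [0.1, 0.9]])
pi = np.array([0.6, 0.4])

path, p_best = viterbi(A, B, pi, [0, 1, 0])
```

Note that `p_best` is at most the forward probability of the same observation sequence, since it counts only the single best path rather than the sum over all paths.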
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now, we've assumed that we know the underlying model μ = (A, B, π).
Often these parameters are estimated from annotated training data, but: annotation is often difficult and/or expensive, and the training data may differ from the current data.
We want to maximize the parameters with respect to the current data, i.e., we're looking for a model μ′ such that

$\mu' = \operatorname{argmax}_{\mu} P(O \mid \mu)$
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allows…)

Unfortunately, there is no known way to analytically find a global maximum, i.e., a model $\hat{\mu}$ such that $\hat{\mu} = \operatorname{argmax}_{\mu} P(O \mid \mu)$.

But it is possible to find a local maximum: given an initial model μ, we can always find a model $\hat{\mu}$ such that $P(O \mid \hat{\mu}) \ge P(O \mid \mu)$.
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key idea: parameter re-estimation by hill-climbing.

From an arbitrary initial parameter instantiation, the F-B algorithm iteratively re-estimates the parameters, improving the probability that a given observation sequence was generated by the model.
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three sets of parameters need to be re-estimated:
• Initial state distribution: πi
• Transition probabilities: aij
• Emission probabilities: bi(ot)
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
What's the probability of being in state si at time t and going to state sj, given the current model and parameters?

$\xi_t(i,j) = P(q_t = s_i,\ q_{t+1} = s_j \mid O, \mu)$
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
$\xi_t(i,j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$

$\xi_t(i,j) = P(q_t = s_i,\ q_{t+1} = s_j \mid O, \mu)$
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is:

$\hat{a}_{ij} = \dfrac{\text{expected number of transitions from state } s_i \text{ to state } s_j}{\text{expected number of transitions from state } s_i}$

Formally:

$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_{j'=1}^{N} \xi_t(i,j')}$
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining

$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)$

as the probability of being in state si given the complete observation O, we can say:

$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
The initial state distribution πi is the probability that si is a start state. Re-estimation is easy:

$\hat{\pi}_i = \text{expected number of times in state } s_i \text{ at time } 1$

Formally: $\hat{\pi}_i = \gamma_1(i)$
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as

$\hat{b}_i(k) = \dfrac{\text{expected number of times in state } s_i \text{ observing symbol } v_k}{\text{expected number of times in state } s_i}$

Formally:

$\hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$

where $\delta(o_t, v_k) = 1$ if $o_t = v_k$, and 0 otherwise. Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm.
CIS 391 - Intro to AI 47
The Updated Model
Coming from $\mu = (A, B, \pi)$, we get to $\hat{\mu} = (\hat{A}, \hat{B}, \hat{\pi})$ by the following update rules:

$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$

$\hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$

$\hat{\pi}_i = \gamma_1(i)$
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithm:
• The E step: compute the forward and backward probabilities for a given model.
• The M step: re-estimate the model parameters.
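Putting the pieces together, one E-step/M-step iteration can be sketched as below: the forward and backward trellises give ξ and γ, which in turn give the update rules for π, A, and B. The toy model and observation sequence are invented for illustration:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One forward-backward re-estimation step; returns (A_new, B_new, pi_new, P(O))."""
    N, T = A.shape[0], len(obs)
    # E step: forward and backward trellises.
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()
    # gamma_t(i): probability of being in state i at time t, given O.
    gamma = alpha * beta / p_obs
    # xi_t(i, j): probability of the transition i -> j at time t, given O.
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / p_obs
    # M step: re-estimate parameters from expected counts.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = np.array([o == k for o in obs])
        B_new[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return A_new, B_new, pi_new, p_obs

# Toy two-state model and observation sequence (hypothetical).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.8, 0.2], [0.1, 0.9]])
pi = np.array([0.6, 0.4])
obs = [0, 1, 0, 0, 1]

A2, B2, pi2, p1 = baum_welch_step(A, B, pi, obs)
```

Iterating this step is exactly the hill-climbing described above: each iteration leaves the likelihood of the observation sequence no worse than before.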
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 21
Parameters of an HMM
States A set of states S=s1hellipsn
Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj
Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si
Initial state distribution is the probability that si is a start state
i
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model
Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model
how do we find the state sequence that best explains the observations
(AB )
(AB )
(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)
CIS 391 - Intro to AI 23
Problem 3 (Learning) How do we adjust the model parameters to maximize
The Three Basic HMM Problems
(AB )
P(O | )
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is The probability of a observation sequence is the
sum of the probabilities of all possible state sequences in the HMM
Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences
Even small HMMs eg T=10 and N=10 contain 10 billion different paths
Solution to this and problem 2 is to use dynamic programming
P(O | )
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 22
The Three Basic HMM Problems
Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model
Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model
how do we find the state sequence that best explains the observations
(AB )
(AB )
(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)
CIS 391 - Intro to AI 23
Problem 3 (Learning) How do we adjust the model parameters to maximize
The Three Basic HMM Problems
(AB )
P(O | )
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is The probability of a observation sequence is the
sum of the probabilities of all possible state sequences in the HMM
Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences
Even small HMMs eg T=10 and N=10 contain 10 billion different paths
Solution to this and problem 2 is to use dynamic programming
P(O | )
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 23
Problem 3 (Learning) How do we adjust the model parameters to maximize
The Three Basic HMM Problems
(AB )
P(O | )
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is The probability of a observation sequence is the
sum of the probabilities of all possible state sequences in the HMM
Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences
Even small HMMs eg T=10 and N=10 contain 10 billion different paths
Solution to this and problem 2 is to use dynamic programming
P(O | )
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now we've assumed that we know the underlying model λ = (A, B, π).

Often these parameters are estimated on annotated training data, but:
• Annotation is often difficult and/or expensive
• Training data is often different from the current data

We want to maximize the parameters with respect to the current data, i.e., we're looking for a model λ̂ such that

λ̂ = argmax_λ P(O | λ)
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allows…)
Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ̂ such that

λ̂ = argmax_λ P(O | λ)

But it is possible to find a local maximum: given an initial model λ, we can always find a model λ̂ such that

P(O | λ̂) ≥ P(O | λ)
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key idea: parameter re-estimation by hill-climbing.

From an arbitrary initial parameter instantiation, the FB algorithm iteratively re-estimates the parameters, improving the probability that a given observation sequence was generated by the model λ.
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimated:
• Initial state distribution: π_i
• Transition probabilities: a_{ij}
• Emission probabilities: b_i(o_t)
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
What's the probability of being in state s_i at time t and going to state s_j, given the current model and parameters?

ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
ξ_t(i, j) = α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j) / [ Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j) ]

ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is:

â_{ij} = expected number of transitions from state s_i to state s_j / expected number of transitions from state s_i

Formally:

â_{ij} = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} Σ_{j′=1}^{N} ξ_t(i, j′)
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining

γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)

as the probability of being in state s_i, given the complete observation O, we can say:

â_{ij} = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i)
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
The initial state distribution π_i is the probability that s_i is a start state.

Re-estimation is easy:

π̂_i = expected number of times in state s_i at time 1

Formally:  π̂_i = γ_1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as:

b̂_i(k) = expected number of times in state s_i observing symbol v_k / expected number of times in state s_i

Formally:

b̂_i(k) = Σ_{t=1}^{T} δ(o_t, v_k) γ_t(i) / Σ_{t=1}^{T} γ_t(i)

where δ(o_t, v_k) = 1 if o_t = v_k, and 0 otherwise. Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm.
CIS 391 - Intro to AI 47
The Updated Model
Coming from λ = (A, B, π), we get to λ̂ = (Â, B̂, π̂) by the following update rules:

â_{ij} = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i)

b̂_i(k) = Σ_{t=1}^{T} δ(o_t, v_k) γ_t(i) / Σ_{t=1}^{T} γ_t(i)

π̂_i = γ_1(i)
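The update rules above combine into one E step plus one M step. Below is an illustrative NumPy sketch of a single Baum-Welch iteration for one observation sequence, with my own array conventions (pi as a length-N vector, A[i, j] = a_ij, B[i, k] = b_i(v_k)); γ and ξ are computed from the forward/backward passes exactly as defined on the previous slides:

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One forward-backward (Baum-Welch) re-estimation step.

    Returns (pi_hat, A_hat, B_hat) for a single observation sequence.
    """
    N, M, T = len(pi), B.shape[1], len(obs)

    # E step: forward and backward probabilities
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()                 # P(O | lambda)

    gamma = alpha * beta / p_obs            # gamma[t, i] = P(q_t = s_i | O)
    xi = np.zeros((T - 1, N, N))            # xi[t, i, j] = P(q_t = s_i, q_{t+1} = s_j | O)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
    xi /= p_obs

    # M step: the three update rules
    pi_hat = gamma[0]
    A_hat = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_hat = np.zeros((N, M))
    for k in range(M):
        mask = np.array(obs) == k           # Kronecker delta(o_t, v_k)
        B_hat[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return pi_hat, A_hat, B_hat
```

Iterating this step is the hill-climbing from the earlier slide: each iteration returns a model λ̂ with P(O | λ̂) ≥ P(O | λ).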
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithm:
• The E step: compute the forward and backward probabilities for a given model
• The M step: re-estimate the model parameters
- Part of Speech Tagging & Hidden Markov Models
- NLP Task I – Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem …
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allows…)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 24
Problem 1 Probability of an Observation Sequence
What is The probability of a observation sequence is the
sum of the probabilities of all possible state sequences in the HMM
Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences
Even small HMMs eg T=10 and N=10 contain 10 billion different paths
Solution to this and problem 2 is to use dynamic programming
P(O | )
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 25
The Trellis
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 26
Forward Probabilities
What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated
t (i) P(o1 ot qt si | )
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 27
Forward Probabilities
t ( j) t 1(i) aij
i1
N
b j (ot )
t (i) P(o1ot qt si | )
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 28
Forward Algorithm
Initialization
Induction
Termination
t ( j) t 1(i) aij
i1
N
b j (ot ) 2 t T1 j N
1(i) ibi(o1) 1i N
P(O | ) T (i)i1
N
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now we've assumed that we know the underlying model λ = (A, B, π).
Often these parameters are estimated on annotated training data, but:
• Annotation is often difficult and/or expensive
• Training data is different from the current data
We want to maximize the parameters with respect to the current data, i.e., we're looking for a model λ* such that

    λ* = argmax_λ P(O | λ)
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ* such that

    λ* = argmax_λ P(O | λ)

But it is possible to find a local maximum: given an initial model λ, we can always find a model λ̂ such that

    P(O | λ̂) ≥ P(O | λ)
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key idea: parameter re-estimation by hill-climbing.
From an arbitrary initial parameter instantiation, the F-B algorithm iteratively re-estimates the parameters, improving the probability that a given observation O was generated by the model.
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimated:
• Initial state distribution π_i
• Transition probabilities a_ij
• Emission probabilities b_i(o_t)
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
What's the probability of being in state s_i at time t and going to state s_j, given the current model and parameters?

    ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
    ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)

where

    ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is:

    â_ij = (expected number of transitions from state s_i to state s_j) / (expected number of transitions from state s_i)

Formally:

    â_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} Σ_{j′=1}^{N} ξ_t(i, j′)
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining

    γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)

as the probability of being in state s_i given the complete observation O, we can say:

    â_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
The initial state distribution π_i is the probability that s_i is a start state.
Re-estimation is easy:

    π̂_i = expected number of times in state s_i at time 1

Formally:

    π̂_i = γ_1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as:

    b̂_i(k) = (expected number of times in state s_i observing symbol v_k) / (expected number of times in state s_i)

Formally:

    b̂_i(k) = Σ_{t=1}^{T} δ(o_t, v_k) γ_t(i) / Σ_{t=1}^{T} γ_t(i)

where δ(o_t, v_k) = 1 if o_t = v_k, and 0 otherwise. Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm.
CIS 391 - Intro to AI 47
The Updated Model
Coming from λ = (A, B, π) we get to λ̂ = (Â, B̂, π̂) by the following update rules:

    â_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

    b̂_i(k) = Σ_{t=1}^{T} δ(o_t, v_k) γ_t(i) / Σ_{t=1}^{T} γ_t(i)

    π̂_i = γ_1(i)
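One full re-estimation pass can be sketched as follows, assuming a single observation sequence; the function name `baum_welch_step` and the matrix layout (`A[i, j] = a_ij`, `B[i, k] = b_i(v_k)`) are illustrative, not from the slides. The E step computes α, β, ξ, and γ; the M step applies the three update rules above:

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One Baum-Welch re-estimation step; returns updated (pi, A, B)."""
    N, T = len(pi), len(obs)
    # E step: forward and backward probabilities
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()                       # P(O | lambda)
    # xi_t(i, j) and gamma_t(i)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / p_obs
    gamma = alpha * beta / p_obs
    # M step: apply the three update rules
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = np.array([o == k for o in obs])    # Kronecker delta(o_t, v_k)
        B_new[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return pi_new, A_new, B_new
```

Iterating this step is the hill-climbing procedure from the earlier slide: each pass keeps the rows of Â and B̂ (and π̂) normalized and can only increase P(O | λ).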
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithm:
• The E step: compute the forward and backward probabilities for a given model
• The M step: re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 29
Forward Algorithm Complexity
Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming
takes O(N2T) computations
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 30
Backward Probabilities
What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated
Analogous to forward probability just in the other direction
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 31
Backward Probabilities
t (i) aijb j (ot1)t1( j)j1
N
t (i) P(ot1oT | qt si)
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 32
Backward Algorithm
Initialization
Induction
Termination
T (i) 1 1i N
t (i) aijb j (ot1)t1( j)j1
N
t T 111i N
P(O | ) i 1(i)i1
N
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 33
Problem 2 Decoding
The Forward algorithm gives the sum of all paths through an HMM efficiently
Here we want to find the highest probability path
We want to find the state sequence Q=q1hellipqT such that
Q argmaxQ
P(Q | O)
CIS 391 - Intro to AI 34
Viterbi Algorithm
Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum
Forward
Viterbi Recursion
t ( j) t 1(i)aij
i1
N
b j (ot )
t ( j) max1iN
t 1(i)aij b j (ot )
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging & Hidden Markov Models
- NLP Task I – Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for?
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches: Idea 1
- Simple Statistical Approaches: Idea 2
- The Sparse Data Problem …
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator, an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1: Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2: Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3: Learning
- Problem 3: Learning (If Time Allows…)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
Viterbi Algorithm

Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum:

  Forward:           α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) a_ij ] b_j(o_t)
  Viterbi recursion: δ_t(j) = max_{1≤i≤N} [ δ_{t-1}(i) a_ij ] b_j(o_t)
Core Idea of Viterbi Algorithm
Viterbi Algorithm

Initialization:   δ_1(i) = π_i b_i(o_1),   1 ≤ i ≤ N

Induction:        δ_t(j) = max_{1≤i≤N} [ δ_{t-1}(i) a_ij ] b_j(o_t)
                  ψ_t(j) = argmax_{1≤i≤N} [ δ_{t-1}(i) a_ij ],   2 ≤ t ≤ T, 1 ≤ j ≤ N

Termination:      p* = max_{1≤i≤N} δ_T(i)
                  q*_T = argmax_{1≤i≤N} δ_T(i)

Read out path:    q*_t = ψ_{t+1}(q*_{t+1}),   t = T−1, …, 1
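The initialization / induction / termination steps can be sketched in Python with NumPy. A minimal sketch: the matrix layout (A[i, j] = a_ij, B[i, k] = b_i(v_k)) and the integer encoding of observations are assumptions of this illustration, not notation from the slides:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence for a sequence obs of symbol indices.

    pi : (N,)   initial state distribution
    A  : (N, N) transition matrix, A[i, j] = a_ij
    B  : (N, K) emission matrix,   B[i, k] = b_i(v_k)
    """
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))            # delta_t(j): best-path probability
    psi = np.zeros((T, N), dtype=int)   # psi_t(j): argmax backpointers
    delta[0] = pi * B[:, obs[0]]        # initialization
    for t in range(1, T):               # induction
        scores = delta[t - 1][:, None] * A     # delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # termination and path read-out via the backpointers
    q = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        q.append(int(psi[t][q[-1]]))
    return delta[-1].max(), q[::-1]
```

On a near-deterministic two-state model where state 0 strongly prefers symbol 0 and state 1 prefers symbol 1, the decoded path simply tracks the observed symbols, which makes the backpointer logic easy to check by hand.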
Problem 3: Learning

Up to now we've assumed that we know the underlying model λ = (A, B, π).

Often these parameters are estimated on annotated training data, but:
• Annotation is often difficult and/or expensive.
• Training data is different from the current data.

We want to maximize the parameters with respect to the current data, i.e., we're looking for a model λ̂ such that

  λ̂ = argmax_λ P(O | λ)
Problem 3: Learning (If Time Allows…)

Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ̂ such that

  λ̂ = argmax_λ P(O | λ)

But it is possible to find a local maximum: given an initial model λ, we can always find a model λ̂ such that

  P(O | λ̂) ≥ P(O | λ)
Forward-Backward (Baum-Welch) Algorithm

Key idea: parameter re-estimation by hill-climbing.

From an arbitrary initial parameter instantiation, the FB algorithm iteratively re-estimates the parameters, improving the probability that a given observation sequence O was generated by the model.
Parameter Re-estimation

Three parameters need to be re-estimated:
• Initial state distribution: π_i
• Transition probabilities: a_ij
• Emission probabilities: b_i(o_t)
Re-estimating Transition Probabilities

What's the probability of being in state s_i at time t and going to state s_j, given the current model and parameters?

  ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)
Re-estimating Transition Probabilities

  ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)

  ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)
              / Σ_{i=1..N} Σ_{j=1..N} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j)
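In code, this ratio is a per-timestep normalization of an N×N table. A minimal NumPy sketch, assuming α and β are stored as (T, N) arrays and observations as symbol indices (these layout choices are assumptions of the illustration, not the slides'):

```python
import numpy as np

def xi_table(alpha, beta, A, B, obs):
    """xi[t, i, j] = P(q_t = s_i, q_{t+1} = s_j | O), for t = 0 .. T-2.

    alpha : (T, N) forward probabilities  alpha_t(i)
    beta  : (T, N) backward probabilities beta_t(i)
    A     : (N, N) transition matrix, B : (N, K) emission matrix
    """
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # numerator: alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        num = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] = num / num.sum()   # denominator sums over all i, j
    return xi
```

The double sum in the denominator guarantees that each ξ_t slice is a proper joint distribution over state pairs, i.e., sums to 1.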
Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is:

  â_ij = (expected number of transitions from state s_i to state s_j)
          / (expected number of transitions from state s_i)

Formally:

  â_ij = Σ_{t=1..T-1} ξ_t(i, j)  /  Σ_{t=1..T-1} Σ_{j'=1..N} ξ_t(i, j')
Re-estimating Transition Probabilities

Defining

  γ_t(i) = Σ_{j=1..N} ξ_t(i, j)

as the probability of being in state s_i, given the complete observation O, we can say:

  â_ij = Σ_{t=1..T-1} ξ_t(i, j)  /  Σ_{t=1..T-1} γ_t(i)
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 35
Core Idea of Viterbi Algorithm
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 36
Viterbi Algorithm
Initialization Induction
Termination
Read out path
1(i) ib j (o1) 1i N
t ( j) max1iN
t 1(i) aij b j (ot )
t ( j) argmax1iN
t 1(i) aij
2 t T1 j N
p max1iN
T (i)
qT argmax
1iNT (i)
qt t1(qt1
) t T 11
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 37
Problem 3 Learning
Up to now wersquove assumed that we know the underlying model
Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data
We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that
(AB )
argmax
P(O | )
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 38
Problem 3 Learning (If Time Allowshellip)
Unfortunately there is no known way to analytically find a global maximum ie a model such that
But it is possible to find a local maximum
Given an initial model we can always find a model such that
argmax
P(O | )
P(O | ) P(O | )
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 39
Forward-Backward (Baum-Welch) algorithm
Key Idea parameter re-estimation by hill-climbing
From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 40
Parameter Re-estimation
Three parameters need to be re-estimatedbull Initial state distribution
bull Transition probabilities aij
bull Emission probabilities bi(ot)
i
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as
Formally
where Note that here is the Kronecker delta function and
is not related to the in the discussion of the Viterbi algorithm
i
kii s statein times of number expected
v symbolobserve and s statein times of number expected)k(b =
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
(ot vk ) 1 if ot vk and 0 otherwise
CIS 391 - Intro to AI 47
The Updated Model
Coming from we get to
by the following update rules
(AB )
( ˆ A ˆ B ˆ )
ˆ b i(k) (ot vk )t (i)
t1
T
t (i)t1
T
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
ˆ i 1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward
probabilities for a give modelbull The M Step Re-estimate the model parameters
- Part of Speech Tagging amp Hidden Markov Models
- NLP Task I ndash Determining Part of Speech Tags
- Slide 3
- What is POS tagging good for
- Equivalent Problem in Bioinformatics
- Penn Treebank Tagset I
- Slide 7
- Slide 8
- Simple Statistical Approaches Idea 1
- Simple Statistical Approaches Idea 2
- The Sparse Data Problem hellip
- A BOTEC Estimate of What We Can Estimate
- A Practical Statistical Tagger
- A Practical Statistical Tagger II
- A Practical Statistical Tagger III
- Training and Performance
- Hidden Markov Models
- Viewed as a generator an HMM
- Recognition using an HMM
- A Practical Statistical Tagger IV
- Parameters of an HMM
- The Three Basic HMM Problems
- Slide 23
- Problem 1 Probability of an Observation Sequence
- The Trellis
- Forward Probabilities
- Slide 27
- Forward Algorithm
- Forward Algorithm Complexity
- Backward Probabilities
- Slide 31
- Backward Algorithm
- Problem 2 Decoding
- Viterbi Algorithm
- Core Idea of Viterbi Algorithm
- Slide 36
- Problem 3 Learning
- Problem 3 Learning (If Time Allowshellip)
- Forward-Backward (Baum-Welch) algorithm
- Parameter Re-estimation
- Re-estimating Transition Probabilities
- Slide 42
- Slide 43
- Slide 44
- Re-estimating Initial State Probabilities
- Re-estimation of Emission Probabilities
- The Updated Model
- Expectation Maximization
-
CIS 391 - Intro to AI 41
Re-estimating Transition Probabilities
Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 42
Re-estimating Transition Probabilities
t (i j) t (i) ai j b j (ot1) t1( j)
t (i) ai j b j (ot1) t1( j)j1
N
i1
N
t (i j) P(qt si qt1 s j | O)
CIS 391 - Intro to AI 43
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is
Formallyi
ji
ji s statefrom stransition of number expected
s stateto s statefrom stransition of number expecteda =
ˆ a i j t (i j)
t1
T 1
t (i j )j 1
N
t1
T 1
CIS 391 - Intro to AI 44
Re-estimating Transition Probabilities
Defining
As the probability of being in state si given the complete observation O
We can say
ˆ a i j t (i j)
t1
T 1
t (i)t1
T 1
t (i) t (i j)j1
N
CIS 391 - Intro to AI 45
Re-estimating Initial State Probabilities
Initial state distribution is the probability that si is a start state
Re-estimation is easy
Formally
i
1 time at s statein times of number expectedπ ii =
ˆ i 1(i)
CIS 391 - Intro to AI 46
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as:

  b̂_i(k) = expected number of times in state s_i observing symbol v_k / expected number of times in state s_i

Formally:

  b̂_i(k) = Σ_{t=1}^{T} δ(o_t, v_k) γ_t(i) / Σ_{t=1}^{T} γ_t(i)

where δ(o_t, v_k) = 1 if o_t = v_k and 0 otherwise. Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm.
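A corresponding NumPy sketch for the emission update (shapes and names are my own assumptions): the Kronecker delta simply selects, for each state, the γ mass at the time steps where symbol v_k was observed:

```python
import numpy as np

def reestimate_emissions(gamma, obs, M):
    """Re-estimate B from gamma, where gamma[t, i] = gamma_t(i).

    b_hat_i(k) = sum over t with o_t = v_k of gamma_t(i) / sum_t gamma_t(i);
    obs holds symbol indices o_t in range(M).
    """
    T, N = gamma.shape
    B_hat = np.zeros((N, M))
    for t in range(T):
        B_hat[:, obs[t]] += gamma[t]        # delta(o_t, v_k) picks column o_t
    return B_hat / gamma.sum(axis=0)[:, None]
```

Each γ_t(i) is credited to exactly one symbol column, so every row of b̂ sums to 1.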
CIS 391 - Intro to AI 47
The Updated Model
Coming from λ = (A, B, π), we get to λ̂ = (Â, B̂, π̂) by the following update rules:

  â_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

  b̂_i(k) = Σ_{t=1}^{T} δ(o_t, v_k) γ_t(i) / Σ_{t=1}^{T} γ_t(i)

  π̂_i = γ_1(i)
CIS 391 - Intro to AI 48
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithm:
• The E Step: Compute the forward and backward probabilities for a given model
• The M Step: Re-estimate the model parameters
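To make the E/M split concrete, one full Baum-Welch iteration might be sketched as follows (a minimal NumPy implementation, assuming a single observation sequence and omitting the scaling a real implementation needs for long sequences):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward table: alpha[t, i] = P(o_1..o_t, q_t = s_i)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    """Backward table: beta[t, i] = P(o_{t+1}..o_T | q_t = s_i)."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(pi, A, B, obs):
    """One EM iteration: E step computes alpha/beta (hence gamma and xi);
    M step applies the update rules from the slides."""
    T, N = len(obs), len(pi)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    P_O = alpha[-1].sum()                                  # P(O | lambda)
    gamma = alpha * beta / P_O                             # (T, N)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / P_O  # (T-1, N, N)
    pi_hat = gamma[0]
    A_hat = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_hat = np.zeros_like(B)
    for t in range(T):
        B_hat[:, obs[t]] += gamma[t]
    B_hat /= gamma.sum(axis=0)[:, None]
    return pi_hat, A_hat, B_hat
```

By the EM guarantee, iterating this step can never decrease P(O | λ), though it converges only to a local maximum of the likelihood.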