Part of Speech Tagging & Hidden Markov Models

Mitch Marcus

CSE 391

CIS 391 - Intro to AI 2

NLP Task I – Determining Part of Speech Tags

The Problem

Word   POS listing in Brown Corpus

heat   noun, verb
oil    noun
in     prep, noun, adv
a      det, noun, noun-proper
large  adj, noun, adv
pot    noun

CIS 391 - Intro to AI 3

NLP Task I – Determining Part of Speech Tags

The Old Solution: Depth-first search
• If each of n words has k tags on average, try the k^n combinations until one works

Machine Learning Solutions: automatically learn Part of Speech (POS) assignment
• The best techniques achieve 97%+ accuracy per word on new material, given large training corpora

CIS 391 - Intro to AI 4

What is POS tagging good for

Speech synthesis
• How to pronounce "lead"?
• INsult vs. inSULT
• OBject vs. obJECT
• OVERflow vs. overFLOW
• DIScount vs. disCOUNT
• CONtent vs. conTENT

Stemming for information retrieval
• Knowing a word is a noun tells you it gets plurals
• Can search for "aardvarks", get "aardvark"

Parsing, speech recognition, etc.
• Possessive pronouns (my, your, her) followed by nouns
• Personal pronouns (I, you, he) likely to be followed by verbs

CIS 391 - Intro to AI 5

Equivalent Problem in Bioinformatics
(Durbin et al., Biological Sequence Analysis, Cambridge University Press)

Several applications, e.g. proteins:
From the primary structure ATCPLELLLD,
infer the secondary structure HHHBBBBBC.

CIS 391 - Intro to AI 6

Penn Treebank Tagset I

Tag    Description                              Example
CC     coordinating conjunction                 and
CD     cardinal number                          1, third
DT     determiner                               the
EX     existential there                        there is
FW     foreign word                             d'hoeuvre
IN     preposition/subordinating conjunction    in, of, like
JJ     adjective                                green
JJR    adjective, comparative                   greener
JJS    adjective, superlative                   greenest
LS     list marker                              1)
MD     modal                                    could, will
NN     noun, singular or mass                   table
NNS    noun, plural                             tables
NNP    proper noun, singular                    John
NNPS   proper noun, plural                      Vikings

CIS 391 - Intro to AI 7

Penn Treebank Tagset II

Tag    Description              Example
PDT    predeterminer            both the boys
POS    possessive ending        friend's
PRP    personal pronoun         I, me, him, he, it
PRP$   possessive pronoun       my, his
RB     adverb                   however, usually, here, good
RBR    adverb, comparative      better
RBS    adverb, superlative      best
RP     particle                 give up
TO     to                       to go, to him
UH     interjection             uhhuhhuhh

CIS 391 - Intro to AI 8

Penn Treebank Tagset III

Tag    Description                          Example
VB     verb, base form                      take
VBD    verb, past tense                     took
VBG    verb, gerund/present participle      taking
VBN    verb, past participle                taken
VBP    verb, sing. present, non-3d          take
VBZ    verb, 3rd person sing. present       takes
WDT    wh-determiner                        which
WP     wh-pronoun                           who, what
WP$    possessive wh-pronoun                whose
WRB    wh-adverb                            where, when

CIS 391 - Intro to AI 9

Simple Statistical Approaches Idea 1

CIS 391 - Intro to AI 10

Simple Statistical Approaches Idea 2

For a string of words

W = w1 w2 w3 … wn

find the string of POS tags

T = t1 t2 t3 … tn

which maximizes P(T | W)

• i.e., the most likely POS tag ti for each word wi, given its surrounding context

CIS 391 - Intro to AI 11

The Sparse Data Problem …

A Simple, Impossible Approach to Compute P(T | W):

Count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string.

CIS 391 - Intro to AI 12

A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with a million words of hand-tagged training data?
• Assume a uniform distribution of 5,000 words and 40 part-of-speech tags

Rich models often require vast amounts of data.
Good estimates of models with bad assumptions often outperform better models which are badly estimated.
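A rough back-of-the-envelope count under exactly these assumptions (my own illustration; the slide's worked-out numbers did not survive extraction):
• Tag bigrams P(ti | ti-1): 40 × 40 = 1,600 parameters, about 625 training tokens per parameter
• Tag trigrams P(ti | ti-2, ti-1): 40^3 = 64,000 parameters, about 16 tokens per parameter
• Word emissions P(wi | ti): 5,000 × 40 = 200,000 parameters, about 5 tokens per parameter
• Word-and-tag context P(ti | wi, ti-1): 5,000 × 40 × 40 = 8,000,000 parameters, far more than the 1,000,000 training tokens
So tag bigrams and per-word emission probabilities are (barely) estimable from a million words, while richer conditioning is not.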

CIS 391 - Intro to AI 13

A Practical Statistical Tagger
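The equations for this slide did not survive extraction. The standard move such a tagger makes at this point (a reconstruction, not a verbatim copy of the slide) is to flip P(T | W) with Bayes' rule and drop the denominator, which does not depend on T:

$\operatorname{argmax}_{T} P(T \mid W) = \operatorname{argmax}_{T} \dfrac{P(W \mid T)\, P(T)}{P(W)} = \operatorname{argmax}_{T} P(W \mid T)\, P(T)$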

CIS 391 - Intro to AI 14

A Practical Statistical Tagger II

But we can't accurately estimate more than tag bigrams or so…

Again, we change to a model that we CAN estimate:

CIS 391 - Intro to AI 15

A Practical Statistical Tagger III

So, for a given string W = w1 w2 w3 … wn, the tagger needs to find the string of tags T which maximizes:
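The maximized expression is missing from the extraction; under the bigram (first-order Markov) assumptions introduced on the previous slide, the standard form of the objective (again a reconstruction) is:

$T^{*} = \operatorname{argmax}_{T} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$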

CIS 391 - Intro to AI 16

Training and Performance

To estimate the parameters of this model, given an annotated training corpus:
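The count-based estimates themselves are missing from the extraction; the standard maximum-likelihood estimates (a reconstruction, writing c(·) for a corpus count) are:

$\hat{P}(t_i \mid t_{i-1}) = \dfrac{c(t_{i-1}, t_i)}{c(t_{i-1})} \qquad \hat{P}(w_i \mid t_i) = \dfrac{c(w_i, t_i)}{c(t_i)}$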

Because many of these counts are small, smoothing is necessary for best results…

Such taggers typically achieve about 95-96% correct tagging for tag sets of 40-80 tags.

CIS 391 - Intro to AI 17

Hidden Markov Models

This model is an instance of a Hidden Markov Model. Viewed graphically:

[Figure: state-transition diagram over the states Det, Adj, Noun, and Verb, with transition probabilities on the arcs and emission tables attached to the states: P(w|Det): a .4, the .4; P(w|Adj): good .02, low .04; P(w|Noun): price .001, deal .0001]

CIS 391 - Intro to AI 18

Viewed as a generator an HMM

[Figure: the same HMM drawn as a generator: the states Det, Adj, Noun, and Verb with transition probabilities on the arcs, each state emitting words according to P(w|Det): the .4, a .4; P(w|Adj): low .04, good .02; P(w|Noun): deal .0001, price .001]

CIS 391 - Intro to AI 19

Recognition using an HMM

CIS 391 - Intro to AI 20

A Practical Statistical Tagger IV

Finding this maximum can be done using an exponential search through all possible tag strings T.

However, there is a linear-time solution using dynamic programming, called Viterbi decoding.

CIS 391 - Intro to AI 21

Parameters of an HMM

States: a set of states S = s1, …, sn

Transition probabilities: A = a11, a12, …, ann, where each aij represents the probability of transitioning from state si to state sj

Emission probabilities: a set B of functions of the form bi(ot), giving the probability of observation ot being emitted by state si

Initial state distribution: πi is the probability that si is a start state
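As a concrete illustration, here is a minimal Python sketch of these parameters for the toy Det/Adj/Noun/Verb tagger from the earlier figure. The specific probability values are my own assumptions (the figure's numbers were garbled in extraction), and only a few words per tag are shown, so the emission rows are deliberately incomplete.

```python
# A minimal sketch of the HMM parameters (S, A, B, pi) for the toy POS example.
# The probability values are illustrative assumptions, not the slide's exact numbers.
states = ["Det", "Adj", "Noun", "Verb"]

# A[i][j]: probability of transitioning from state i to state j
A = {
    "Det":  {"Adj": 0.3, "Noun": 0.7},
    "Adj":  {"Adj": 0.1, "Noun": 0.9},
    "Noun": {"Noun": 0.5, "Verb": 0.5},
    "Verb": {"Det": 0.6, "Noun": 0.4},
}

# B[i][o]: probability that state i emits word o (only a few words listed per tag)
B = {
    "Det":  {"a": 0.4, "the": 0.4},
    "Adj":  {"good": 0.02, "low": 0.04},
    "Noun": {"price": 0.001, "deal": 0.0001},
    "Verb": {"deal": 0.0001},
}

# pi[i]: probability that state i is the start state
pi = {"Det": 0.8, "Adj": 0.05, "Noun": 0.1, "Verb": 0.05}
```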

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation): Given the observation sequence O = o1 … oT and an HMM model λ = (A, B, π), how do we compute the probability of O given the model?

Problem 2 (Decoding): Given the observation sequence O = o1 … oT and an HMM model λ = (A, B, π), how do we find the state sequence that best explains the observations?

(This and the following slides follow the classic formulation by Rabiner and Juang, as adapted by Manning and Schütze. Slides adapted from Dorr.)

CIS 391 - Intro to AI 23

The Three Basic HMM Problems

Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is P(O | λ)? The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.

Naïve computation is very expensive: given T observations and N states, there are N^T possible state sequences.

Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.

The solution to this and to Problem 2 is to use dynamic programming.

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that, given an HMM λ, at time t the state is i and the partial observation o1 … ot has been generated?

$\alpha_t(i) = P(o_1 \ldots o_t,\; q_t = s_i \mid \lambda)$

CIS 391 - Intro to AI 27

Forward Probabilities

$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$

$\alpha_t(i) = P(o_1 \ldots o_t,\; q_t = s_i \mid \lambda)$

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization:   $\alpha_1(i) = \pi_i\, b_i(o_1), \qquad 1 \le i \le N$

Induction:        $\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t), \qquad 2 \le t \le T,\; 1 \le j \le N$

Termination:      $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$
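A minimal Python sketch of this algorithm, assuming the dictionary-based parameters (states, A, B, pi) from the sketch above; a real tagger would work in log space to avoid underflow, which is omitted here.

```python
def forward(obs, states, A, B, pi):
    """Forward algorithm: returns P(O | lambda) and the full alpha table."""
    T = len(obs)
    alpha = [{} for _ in range(T)]
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    for i in states:
        alpha[0][i] = pi.get(i, 0.0) * B.get(i, {}).get(obs[0], 0.0)
    # Induction: alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t)
    for t in range(1, T):
        for j in states:
            total = sum(alpha[t - 1][i] * A.get(i, {}).get(j, 0.0) for i in states)
            alpha[t][j] = total * B.get(j, {}).get(obs[t], 0.0)
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha[T - 1].values()), alpha
```

For example, forward(["the", "low", "price"], states, A, B, pi) returns the total probability of that word string under the illustrative toy model above.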

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naïve approach takes O(2T·N^T) computation.
The forward algorithm, using dynamic programming, takes O(N^2·T) computations.
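To make the gap concrete (my own numbers, not the slide's): with N = 40 tags and a T = 20-word sentence, the forward algorithm needs on the order of N^2·T = 32,000 multiplications, whereas the naïve sum ranges over N^T = 40^20 ≈ 10^32 state sequences.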

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that, given an HMM λ and given that the state at time t is i, the partial observation o_{t+1} … o_T is generated?

Analogous to forward probability just in the other direction

$\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i, \lambda)$

CIS 391 - Intro to AI 31

Backward Probabilities

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$

$\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i, \lambda)$

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization:   $\beta_T(i) = 1, \qquad 1 \le i \le N$

Induction:        $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \qquad t = T-1, \ldots, 1,\; 1 \le i \le N$

Termination:      $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$
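A matching Python sketch, with the same dictionary-based conventions as the forward() sketch above:

```python
def backward(obs, states, A, B):
    """Backward algorithm: returns the full beta table."""
    T = len(obs)
    beta = [{} for _ in range(T)]
    # Initialization: beta_T(i) = 1
    for i in states:
        beta[T - 1][i] = 1.0
    # Induction, from t = T-1 down to 1:
    # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        for i in states:
            beta[t][i] = sum(
                A.get(i, {}).get(j, 0.0)
                * B.get(j, {}).get(obs[t + 1], 0.0)
                * beta[t + 1][j]
                for j in states
            )
    return beta
```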

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm efficiently gives the sum of the probabilities of all paths through an HMM.

Here we want to find the highest probability path

We want to find the state sequence Q = q1 … qT such that

$Q^{*} = \operatorname{argmax}_{Q} P(Q \mid O)$

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum.

Forward recursion:

$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$

Viterbi recursion:

$\delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t)$

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization:   $\delta_1(i) = \pi_i\, b_i(o_1), \qquad 1 \le i \le N$

Induction:        $\delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t)$
                  $\psi_t(j) = \operatorname{argmax}_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}, \qquad 2 \le t \le T,\; 1 \le j \le N$

Termination:      $p^{*} = \max_{1 \le i \le N} \delta_T(i)$
                  $q_T^{*} = \operatorname{argmax}_{1 \le i \le N} \delta_T(i)$

Read out path:    $q_t^{*} = \psi_{t+1}(q_{t+1}^{*}), \qquad t = T-1, \ldots, 1$
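A minimal Python sketch of the full algorithm, reusing the dictionary-based parameters from the earlier sketches; as before, a production tagger would use log probabilities.

```python
def viterbi(obs, states, A, B, pi):
    """Viterbi decoding: returns the most probable state path and its probability."""
    T = len(obs)
    delta = [{} for _ in range(T)]   # delta_t(j): probability of the best path ending in j
    psi = [{} for _ in range(T)]     # psi_t(j): backpointer to the best predecessor of j
    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    for i in states:
        delta[0][i] = pi.get(i, 0.0) * B.get(i, {}).get(obs[0], 0.0)
    # Induction: maximize over predecessors instead of summing
    for t in range(1, T):
        for j in states:
            best_i = max(states, key=lambda i: delta[t - 1][i] * A.get(i, {}).get(j, 0.0))
            delta[t][j] = (delta[t - 1][best_i] * A.get(best_i, {}).get(j, 0.0)
                           * B.get(j, {}).get(obs[t], 0.0))
            psi[t][j] = best_i
    # Termination and path read-out: follow the backpointers from the best final state
    best_last = max(states, key=lambda i: delta[T - 1][i])
    path = [best_last]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path, delta[T - 1][best_last]
```

Under the illustrative toy parameters above, viterbi(["the", "low", "price"], states, A, B, pi) would return the path ["Det", "Adj", "Noun"] together with its probability.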

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now we've assumed that we know the underlying model λ = (A, B, π).

Often these parameters are estimated on annotated training data, but:
• Annotation is often difficult and/or expensive
• Training data is different from the current data

We want to maximize the parameters with respect to the current data, i.e., we're looking for a model $\hat{\lambda}$ such that

$\hat{\lambda} = \operatorname{argmax}_{\lambda} P(O \mid \lambda)$

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately, there is no known way to analytically find a global maximum, i.e., a model $\hat{\lambda}$ such that

$\hat{\lambda} = \operatorname{argmax}_{\lambda} P(O \mid \lambda)$

But it is possible to find a local maximum:

Given an initial model $\lambda$, we can always find a model $\hat{\lambda}$ such that

$P(O \mid \hat{\lambda}) \ge P(O \mid \lambda)$

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key idea: parameter re-estimation by hill-climbing.

From an arbitrary initial parameter instantiation, the FB algorithm iteratively re-estimates the parameters, improving the probability that a given observation sequence was generated by the model.

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimated:
• Initial state distribution: πi
• Transition probabilities: aij
• Emission probabilities: bi(ot)

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

What's the probability of being in state si at time t and going to state sj, given the current model and parameters?

$\xi_t(i,j) = P(q_t = s_i,\; q_{t+1} = s_j \mid O, \lambda)$

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

$\xi_t(i,j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$

$\xi_t(i,j) = P(q_t = s_i,\; q_{t+1} = s_j \mid O, \lambda)$

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is:

$\hat{a}_{ij} = \dfrac{\text{expected number of transitions from state } s_i \text{ to state } s_j}{\text{expected number of transitions from state } s_i}$

Formally:

$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} \xi_t(i,j)}$

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)$

as the probability of being in state si, given the complete observation O, we can say:

$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution: πi is the probability that si is a start state.

Re-estimation is easy

Formally:

$\hat{\pi}_i = \text{expected number of times in state } s_i \text{ at time } 1$

$\hat{\pi}_i = \gamma_1(i)$

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as:

$\hat{b}_i(k) = \dfrac{\text{expected number of times in state } s_i \text{ and observing symbol } v_k}{\text{expected number of times in state } s_i}$

Formally:

$\hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$

where $\delta(o_t, v_k) = 1$ if $o_t = v_k$, and 0 otherwise.

Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm.

CIS 391 - Intro to AI 47

The Updated Model

Coming from $\lambda = (A, B, \pi)$, we get to $\hat{\lambda} = (\hat{A}, \hat{B}, \hat{\pi})$ by the following update rules:

$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \qquad \hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)} \qquad \hat{\pi}_i = \gamma_1(i)$
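Putting the three update rules together, here is a minimal Python sketch of one Forward-Backward iteration for a single observation sequence, reusing the forward() and backward() sketches from earlier; it omits the smoothing, multiple-sequence handling, and log-space arithmetic a real implementation would need.

```python
def baum_welch_step(obs, states, A, B, pi):
    """One Baum-Welch re-estimation step: returns updated (A, B, pi)."""
    T = len(obs)
    p_obs, alpha = forward(obs, states, A, B, pi)
    beta = backward(obs, states, A, B)

    def safe_div(num, den):
        # Guard for states with zero expected occupancy (real code would smooth instead)
        return num / den if den > 0.0 else 0.0

    # E step: xi_t(i, j) and gamma_t(i) from the forward and backward probabilities
    xi = [{(i, j): safe_div(alpha[t][i] * A.get(i, {}).get(j, 0.0)
                            * B.get(j, {}).get(obs[t + 1], 0.0) * beta[t + 1][j], p_obs)
           for i in states for j in states}
          for t in range(T - 1)]
    gamma = [{i: safe_div(alpha[t][i] * beta[t][i], p_obs) for i in states}
             for t in range(T)]

    # M step: re-estimate the three parameter sets from the expected counts
    new_pi = {i: gamma[0][i] for i in states}
    new_A = {i: {j: safe_div(sum(xi[t][(i, j)] for t in range(T - 1)),
                             sum(gamma[t][i] for t in range(T - 1)))
                 for j in states}
             for i in states}
    new_B = {i: {v: safe_div(sum(gamma[t][i] for t in range(T) if obs[t] == v),
                             sum(gamma[t][i] for t in range(T)))
                 for v in set(obs)}
             for i in states}
    return new_A, new_B, new_pi
```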

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithm:
• The E step: compute the forward and backward probabilities for a given model
• The M step: re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 2: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 2

NLP Task I ndash Determining Part of Speech Tags

The Problem

Word POS listing in Brown Corpus

heat noun verb

oil noun

in prep noun adv

a det noun noun-proper

large adj noun adv

pot noun

CIS 391 - Intro to AI 3

NLP Task I ndash Determining Part of Speech Tags

The Old Solution Depth First search bull If each of n words has k tags on average try the

nk combinations until one works

Machine Learning Solutions Automatically learn Part of Speech (POS) assignmentbull The best techniques achieve 97+ accuracy per word on

new materials given large training corpora

CIS 391 - Intro to AI 4

What is POS tagging good for

Speech synthesisbull How to pronounce ldquoleadrdquobull INsult inSULTbull OBject obJECTbull OVERflow overFLOWbull DIScount disCOUNTbull CONtent conTENT

Stemming for information retrievalbull Knowing a word is a N tells you it gets pluralsbull Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo

Parsing and speech recognition and etcbull Possessive pronouns (my your her) followed by nounsbull Personal pronouns (I you he) likely to be followed by verbs

CIS 391 - Intro to AI 5

Equivalent Problem in Bioinformatics Durbin et al Biological Sequence

Analysis Cambridge University Press

Several applications eg proteins From primary structure

ATCPLELLLD Infer secondary structure

HHHBBBBBC

CIS 391 - Intro to AI 6

Penn Treebank Tagset I

Tag Description Example CC coordinating conjunction and CD cardinal number 1 third DT determiner the EX existential there there is FW foreign word dhoevre IN prepositionsubordinating conjunction in of like JJ adjective green JJR adjective comparative greener JJS adjective superlative greenest LS list marker 1) MD modal could will NN noun singular or mass table NNS noun plural tables NNP proper noun singular John NNPS proper noun plural Vikings

CIS 391 - Intro to AI 7

Tag Description Example PDT predeterminer both the boys POS possessive ending friend s PRP personal pronoun I me him he it PRP$ possessive pronoun my his RB adverb however usually here good RBR adverb comparative betterRBS adverb superlative best RP particle give up TO to to go to him UH interjection uhhuhhuhh

Penn Treebank Tagset II

CIS 391 - Intro to AI 8

Tag Description Example VB verb base form take VBD verb past tense took VBG verb gerundpresent participle taking VBN verb past participle taken VBP verb sing present non-3d take VBZ verb 3rd person sing present takes WDT wh-determiner which WP wh-pronoun who what WP$ possessive wh-pronoun whose WRB wh-abverb where when

Penn Treebank Tagset III

CIS 391 - Intro to AI 9

Simple Statistical Approaches Idea 1

CIS 391 - Intro to AI 10

Simple Statistical Approaches Idea 2

For a string of words

W = w1w2w3hellipwn

find the string of POS tags

T = t1 t2 t3 helliptn

which maximizes P(T|W)

bull ie the most likely POS tag ti for each word wi given its surrounding context

CIS 391 - Intro to AI 11

The Sparse Data Problem hellip

A Simple Impossible Approach to Compute P(T|W)

Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string

CIS 391 - Intro to AI 12

A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech

tags

Rich Models often require vast amounts of data Good estimates of models with bad assumptions often

outperform better models which are badly estimated

CIS 391 - Intro to AI 13

A Practical Statistical Tagger

CIS 391 - Intro to AI 14

A Practical Statistical Tagger II

But we cant accurately estimate more than tag bigrams or sohellip

Again we change to a model that we CAN estimate

CIS 391 - Intro to AI 15

A Practical Statistical Tagger III

So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes

CIS 391 - Intro to AI 16

Training and Performance

To estimate the parameters of this model given an annotated training corpus

Because many of these counts are small smoothing is necessary for best resultshellip

Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags

CIS 391 - Intro to AI 17

Hidden Markov Models

This model is an instance of a Hidden Markov Model Viewed graphically

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1P(w|Det)

a 4the 4

P(w|Adj)good 02low 04

P(w|Noun)price 001deal 0001

CIS 391 - Intro to AI 18

Viewed as a generator an HMM

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1

4the

4a

P(w|Det)

04low

02good

P(w|Adj)

0001deal

001price

P(w|Noun)

CIS 391 - Intro to AI 19

Recognition using an HMM

CIS 391 - Intro to AI 20

A Practical Statistical Tagger IV

Finding this maximum can be done using an exponential search through all strings for T

However there is a linear timelinear time solution using dynamic programming called Viterbi decoding

CIS 391 - Intro to AI 21

Parameters of an HMM

States A set of states S=s1hellipsn

Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj

Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si

Initial state distribution is the probability that si is a start state

i

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model

Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model

how do we find the state sequence that best explains the observations

(AB )

(AB )

(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)

CIS 391 - Intro to AI 23

Problem 3 (Learning) How do we adjust the model parameters to maximize

The Three Basic HMM Problems

(AB )

P(O | )

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is The probability of a observation sequence is the

sum of the probabilities of all possible state sequences in the HMM

Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences

Even small HMMs eg T=10 and N=10 contain 10 billion different paths

Solution to this and problem 2 is to use dynamic programming

P(O | )

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 3: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 3

NLP Task I ndash Determining Part of Speech Tags

The Old Solution Depth First search bull If each of n words has k tags on average try the

nk combinations until one works

Machine Learning Solutions Automatically learn Part of Speech (POS) assignmentbull The best techniques achieve 97+ accuracy per word on

new materials given large training corpora

CIS 391 - Intro to AI 4

What is POS tagging good for

Speech synthesisbull How to pronounce ldquoleadrdquobull INsult inSULTbull OBject obJECTbull OVERflow overFLOWbull DIScount disCOUNTbull CONtent conTENT

Stemming for information retrievalbull Knowing a word is a N tells you it gets pluralsbull Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo

Parsing and speech recognition and etcbull Possessive pronouns (my your her) followed by nounsbull Personal pronouns (I you he) likely to be followed by verbs

CIS 391 - Intro to AI 5

Equivalent Problem in Bioinformatics Durbin et al Biological Sequence

Analysis Cambridge University Press

Several applications eg proteins From primary structure

ATCPLELLLD Infer secondary structure

HHHBBBBBC

CIS 391 - Intro to AI 6

Penn Treebank Tagset I

Tag Description Example CC coordinating conjunction and CD cardinal number 1 third DT determiner the EX existential there there is FW foreign word dhoevre IN prepositionsubordinating conjunction in of like JJ adjective green JJR adjective comparative greener JJS adjective superlative greenest LS list marker 1) MD modal could will NN noun singular or mass table NNS noun plural tables NNP proper noun singular John NNPS proper noun plural Vikings

CIS 391 - Intro to AI 7

Tag Description Example PDT predeterminer both the boys POS possessive ending friend s PRP personal pronoun I me him he it PRP$ possessive pronoun my his RB adverb however usually here good RBR adverb comparative betterRBS adverb superlative best RP particle give up TO to to go to him UH interjection uhhuhhuhh

Penn Treebank Tagset II

CIS 391 - Intro to AI 8

Tag Description Example VB verb base form take VBD verb past tense took VBG verb gerundpresent participle taking VBN verb past participle taken VBP verb sing present non-3d take VBZ verb 3rd person sing present takes WDT wh-determiner which WP wh-pronoun who what WP$ possessive wh-pronoun whose WRB wh-abverb where when

Penn Treebank Tagset III

CIS 391 - Intro to AI 9

Simple Statistical Approaches Idea 1

CIS 391 - Intro to AI 10

Simple Statistical Approaches Idea 2

For a string of words

W = w1w2w3hellipwn

find the string of POS tags

T = t1 t2 t3 helliptn

which maximizes P(T|W)

bull ie the most likely POS tag ti for each word wi given its surrounding context

CIS 391 - Intro to AI 11

The Sparse Data Problem hellip

A Simple Impossible Approach to Compute P(T|W)

Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string

CIS 391 - Intro to AI 12

A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech

tags

Rich Models often require vast amounts of data Good estimates of models with bad assumptions often

outperform better models which are badly estimated

CIS 391 - Intro to AI 13

A Practical Statistical Tagger

CIS 391 - Intro to AI 14

A Practical Statistical Tagger II

But we cant accurately estimate more than tag bigrams or sohellip

Again we change to a model that we CAN estimate

CIS 391 - Intro to AI 15

A Practical Statistical Tagger III

So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes

CIS 391 - Intro to AI 16

Training and Performance

To estimate the parameters of this model given an annotated training corpus

Because many of these counts are small smoothing is necessary for best resultshellip

Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags

CIS 391 - Intro to AI 17

Hidden Markov Models

This model is an instance of a Hidden Markov Model Viewed graphically

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1P(w|Det)

a 4the 4

P(w|Adj)good 02low 04

P(w|Noun)price 001deal 0001

CIS 391 - Intro to AI 18

Viewed as a generator an HMM

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1

4the

4a

P(w|Det)

04low

02good

P(w|Adj)

0001deal

001price

P(w|Noun)

CIS 391 - Intro to AI 19

Recognition using an HMM

CIS 391 - Intro to AI 20

A Practical Statistical Tagger IV

Finding this maximum can be done using an exponential search through all strings for T

However there is a linear timelinear time solution using dynamic programming called Viterbi decoding

CIS 391 - Intro to AI 21

Parameters of an HMM

States A set of states S=s1hellipsn

Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj

Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si

Initial state distribution is the probability that si is a start state

i

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model

Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model

how do we find the state sequence that best explains the observations

(AB )

(AB )

(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)

CIS 391 - Intro to AI 23

Problem 3 (Learning) How do we adjust the model parameters to maximize

The Three Basic HMM Problems

(AB )

P(O | )

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is The probability of a observation sequence is the

sum of the probabilities of all possible state sequences in the HMM

Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences

Even small HMMs eg T=10 and N=10 contain 10 billion different paths

Solution to this and problem 2 is to use dynamic programming

P(O | )

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 4: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 4

What is POS tagging good for

Speech synthesisbull How to pronounce ldquoleadrdquobull INsult inSULTbull OBject obJECTbull OVERflow overFLOWbull DIScount disCOUNTbull CONtent conTENT

Stemming for information retrievalbull Knowing a word is a N tells you it gets pluralsbull Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo

Parsing and speech recognition and etcbull Possessive pronouns (my your her) followed by nounsbull Personal pronouns (I you he) likely to be followed by verbs

CIS 391 - Intro to AI 5

Equivalent Problem in Bioinformatics Durbin et al Biological Sequence

Analysis Cambridge University Press

Several applications eg proteins From primary structure

ATCPLELLLD Infer secondary structure

HHHBBBBBC

CIS 391 - Intro to AI 6

Penn Treebank Tagset I

Tag Description Example CC coordinating conjunction and CD cardinal number 1 third DT determiner the EX existential there there is FW foreign word dhoevre IN prepositionsubordinating conjunction in of like JJ adjective green JJR adjective comparative greener JJS adjective superlative greenest LS list marker 1) MD modal could will NN noun singular or mass table NNS noun plural tables NNP proper noun singular John NNPS proper noun plural Vikings

CIS 391 - Intro to AI 7

Tag Description Example PDT predeterminer both the boys POS possessive ending friend s PRP personal pronoun I me him he it PRP$ possessive pronoun my his RB adverb however usually here good RBR adverb comparative betterRBS adverb superlative best RP particle give up TO to to go to him UH interjection uhhuhhuhh

Penn Treebank Tagset II

CIS 391 - Intro to AI 8

Tag Description Example VB verb base form take VBD verb past tense took VBG verb gerundpresent participle taking VBN verb past participle taken VBP verb sing present non-3d take VBZ verb 3rd person sing present takes WDT wh-determiner which WP wh-pronoun who what WP$ possessive wh-pronoun whose WRB wh-abverb where when

Penn Treebank Tagset III

CIS 391 - Intro to AI 9

Simple Statistical Approaches Idea 1

CIS 391 - Intro to AI 10

Simple Statistical Approaches Idea 2

For a string of words

W = w1w2w3hellipwn

find the string of POS tags

T = t1 t2 t3 helliptn

which maximizes P(T|W)

bull ie the most likely POS tag ti for each word wi given its surrounding context

CIS 391 - Intro to AI 11

The Sparse Data Problem hellip

A Simple Impossible Approach to Compute P(T|W)

Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string

CIS 391 - Intro to AI 12

A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech

tags

Rich Models often require vast amounts of data Good estimates of models with bad assumptions often

outperform better models which are badly estimated

CIS 391 - Intro to AI 13

A Practical Statistical Tagger

CIS 391 - Intro to AI 14

A Practical Statistical Tagger II

But we cant accurately estimate more than tag bigrams or sohellip

Again we change to a model that we CAN estimate

CIS 391 - Intro to AI 15

A Practical Statistical Tagger III

So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes

CIS 391 - Intro to AI 16

Training and Performance

To estimate the parameters of this model given an annotated training corpus

Because many of these counts are small smoothing is necessary for best resultshellip

Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags

CIS 391 - Intro to AI 17

Hidden Markov Models

This model is an instance of a Hidden Markov Model Viewed graphically

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1P(w|Det)

a 4the 4

P(w|Adj)good 02low 04

P(w|Noun)price 001deal 0001

CIS 391 - Intro to AI 18

Viewed as a generator an HMM

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1

4the

4a

P(w|Det)

04low

02good

P(w|Adj)

0001deal

001price

P(w|Noun)

CIS 391 - Intro to AI 19

Recognition using an HMM

CIS 391 - Intro to AI 20

A Practical Statistical Tagger IV

Finding this maximum can be done using an exponential search through all strings for T

However there is a linear timelinear time solution using dynamic programming called Viterbi decoding

CIS 391 - Intro to AI 21

Parameters of an HMM

States A set of states S=s1hellipsn

Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj

Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si

Initial state distribution is the probability that si is a start state

i

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model

Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model

how do we find the state sequence that best explains the observations

(AB )

(AB )

(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)

CIS 391 - Intro to AI 23

Problem 3 (Learning) How do we adjust the model parameters to maximize

The Three Basic HMM Problems

(AB )

P(O | )

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is The probability of a observation sequence is the

sum of the probabilities of all possible state sequences in the HMM

Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences

Even small HMMs eg T=10 and N=10 contain 10 billion different paths

Solution to this and problem 2 is to use dynamic programming

P(O | )

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 5: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 5

Equivalent Problem in Bioinformatics Durbin et al Biological Sequence

Analysis Cambridge University Press

Several applications eg proteins From primary structure

ATCPLELLLD Infer secondary structure

HHHBBBBBC

CIS 391 - Intro to AI 6

Penn Treebank Tagset I

Tag Description Example CC coordinating conjunction and CD cardinal number 1 third DT determiner the EX existential there there is FW foreign word dhoevre IN prepositionsubordinating conjunction in of like JJ adjective green JJR adjective comparative greener JJS adjective superlative greenest LS list marker 1) MD modal could will NN noun singular or mass table NNS noun plural tables NNP proper noun singular John NNPS proper noun plural Vikings

CIS 391 - Intro to AI 7

Tag Description Example PDT predeterminer both the boys POS possessive ending friend s PRP personal pronoun I me him he it PRP$ possessive pronoun my his RB adverb however usually here good RBR adverb comparative betterRBS adverb superlative best RP particle give up TO to to go to him UH interjection uhhuhhuhh

Penn Treebank Tagset II

CIS 391 - Intro to AI 8

Tag Description Example VB verb base form take VBD verb past tense took VBG verb gerundpresent participle taking VBN verb past participle taken VBP verb sing present non-3d take VBZ verb 3rd person sing present takes WDT wh-determiner which WP wh-pronoun who what WP$ possessive wh-pronoun whose WRB wh-abverb where when

Penn Treebank Tagset III

CIS 391 - Intro to AI 9

Simple Statistical Approaches Idea 1

CIS 391 - Intro to AI 10

Simple Statistical Approaches Idea 2

For a string of words

W = w1w2w3hellipwn

find the string of POS tags

T = t1 t2 t3 helliptn

which maximizes P(T|W)

bull ie the most likely POS tag ti for each word wi given its surrounding context

CIS 391 - Intro to AI 11

The Sparse Data Problem hellip

A Simple Impossible Approach to Compute P(T|W)

Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string

CIS 391 - Intro to AI 12

A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech

tags

Rich Models often require vast amounts of data Good estimates of models with bad assumptions often

outperform better models which are badly estimated

CIS 391 - Intro to AI 13

A Practical Statistical Tagger

CIS 391 - Intro to AI 14

A Practical Statistical Tagger II

But we cant accurately estimate more than tag bigrams or sohellip

Again we change to a model that we CAN estimate

CIS 391 - Intro to AI 15

A Practical Statistical Tagger III

So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes

CIS 391 - Intro to AI 16

Training and Performance

To estimate the parameters of this model given an annotated training corpus

Because many of these counts are small smoothing is necessary for best resultshellip

Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags

CIS 391 - Intro to AI 17

Hidden Markov Models

This model is an instance of a Hidden Markov Model Viewed graphically

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1P(w|Det)

a 4the 4

P(w|Adj)good 02low 04

P(w|Noun)price 001deal 0001

CIS 391 - Intro to AI 18

Viewed as a generator an HMM

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1

4the

4a

P(w|Det)

04low

02good

P(w|Adj)

0001deal

001price

P(w|Noun)

CIS 391 - Intro to AI 19

Recognition using an HMM

CIS 391 - Intro to AI 20

A Practical Statistical Tagger IV

Finding this maximum can be done using an exponential search through all strings for T

However there is a linear timelinear time solution using dynamic programming called Viterbi decoding

CIS 391 - Intro to AI 21

Parameters of an HMM

States: a set of states S = {s1, …, sn}.

Transition probabilities: A = {a11, a12, …, ann}. Each aij represents the probability of transitioning from state si to state sj.

Emission probabilities: a set B of functions of the form bi(ot), which is the probability of observation ot being emitted by si.

Initial state distribution: πi is the probability that si is a start state.
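As a concrete (toy) illustration of these parameters, here is a minimal sketch in Python/NumPy; the state and vocabulary names echo the earlier diagram, but the numbers are placeholders, not the probabilities from the slides:

    import numpy as np

    states = ["Det", "Adj", "Noun", "Verb"]        # S = {s_1, ..., s_n}
    vocab  = ["the", "a", "good", "low", "price", "deal"]
    N, M   = len(states), len(vocab)

    pi = np.array([0.7, 0.1, 0.1, 0.1])            # initial state distribution pi_i
    A  = np.full((N, N), 1.0 / N)                  # transition probabilities a_ij
    B  = np.full((N, M), 1.0 / M)                  # emission probabilities b_i(o_t)

    # every distribution must sum to one
    assert np.isclose(pi.sum(), 1.0)
    assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)

The later sketches (forward, backward, Viterbi, re-estimation) assume observation sequences are given as lists of indices into vocab.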

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation): Given the observation sequence O = o1…oT and an HMM model λ = (A, B, π), how do we compute the probability of O given the model?

Problem 2 (Decoding): Given the observation sequence O = o1…oT and an HMM model λ = (A, B, π), how do we find the state sequence that best explains the observations?

(This and the following slides follow the classic formulation by Rabiner and Juang, as adapted by Manning and Schütze. Slides adapted from Dorr.)

CIS 391 - Intro to AI 23

The Three Basic HMM Problems

Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?

CIS 391 - Intro to AI 24

Problem 1: Probability of an Observation Sequence

What is P(O | λ)? The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.

Naïve computation is very expensive: given T observations and N states, there are N^T possible state sequences.

Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.

The solution to this and to Problem 2 is to use dynamic programming.

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that, given an HMM λ, at time t the state is i and the partial observation o1 … ot has been generated?

$\alpha_t(i) = P(o_1 \ldots o_t,\ q_t = s_i \mid \lambda)$

CIS 391 - Intro to AI 27

Forward Probabilities

$\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$

$\alpha_t(i) = P(o_1 \ldots o_t,\ q_t = s_i \mid \lambda)$

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization: $\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$

Induction: $\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t), \quad 2 \le t \le T,\ 1 \le j \le N$

Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$
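A minimal sketch of the forward algorithm, under the conventions of the parameter sketch above (obs is a list of indices into the vocabulary; time indexing is 0-based):

    import numpy as np

    def forward(obs, pi, A, B):
        """Return the alpha table (T x N) and P(O | lambda)."""
        T, N = len(obs), len(pi)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                      # initialization
        for t in range(1, T):                             # induction
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        return alpha, alpha[-1].sum()                     # termination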

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

The naïve approach takes O(2T·N^T) computation. The Forward algorithm, using dynamic programming, takes O(N²·T) computations.

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that, given an HMM λ and given that the state at time t is i, the partial observation o_{t+1} … o_T is generated?

Analogous to the forward probability, just in the other direction:

$\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i,\ \lambda)$

CIS 391 - Intro to AI 31

Backward Probabilities

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$

$\beta_t(i) = P(o_{t+1} \ldots o_T \mid q_t = s_i,\ \lambda)$

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization: $\beta_T(i) = 1, \quad 1 \le i \le N$

Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1, \ldots, 1,\ 1 \le i \le N$

Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)$
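And the corresponding backward sketch; its termination value should agree (up to floating-point noise) with the one returned by forward() above:

    import numpy as np

    def backward(obs, pi, A, B):
        """Return the beta table (T x N) and P(O | lambda)."""
        T, N = len(obs), len(pi)
        beta = np.zeros((T, N))
        beta[-1] = 1.0                                    # initialization
        for t in range(T - 2, -1, -1):                    # induction, t = T-1, ..., 1
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        return beta, (pi * B[:, obs[0]] * beta[0]).sum()  # termination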

CIS 391 - Intro to AI 33

Problem 2: Decoding

The Forward algorithm efficiently gives the sum over all paths through an HMM.

Here we want to find the highest-probability path.

We want to find the state sequence Q = q1…qT such that

$Q^{*} = \arg\max_{Q} P(Q \mid O,\ \lambda)$

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum.

Forward: $\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t)$

Viterbi recursion: $\delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t)$

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization: $\delta_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N$

Induction:
$\delta_t(j) = \left[ \max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t), \quad 2 \le t \le T,\ 1 \le j \le N$
$\psi_t(j) = \arg\max_{1 \le i \le N} \delta_{t-1}(i)\, a_{ij}$

Termination:
$p^{*} = \max_{1 \le i \le N} \delta_T(i)$
$q_T^{*} = \arg\max_{1 \le i \le N} \delta_T(i)$

Read out path: $q_t^{*} = \psi_{t+1}(q_{t+1}^{*}), \quad t = T-1, \ldots, 1$
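A minimal Viterbi sketch following the equations above (same array conventions as the forward/backward sketches; it returns the best state sequence as indices plus its probability):

    import numpy as np

    def viterbi(obs, pi, A, B):
        T, N = len(obs), len(pi)
        delta = np.zeros((T, N))                 # best path probability ending in state j at time t
        psi   = np.zeros((T, N), dtype=int)      # backpointers
        delta[0] = pi * B[:, obs[0]]             # initialization
        for t in range(1, T):                    # induction
            scores = delta[t - 1][:, None] * A   # scores[i, j] = delta_{t-1}(i) * a_ij
            psi[t]   = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) * B[:, obs[t]]
        path = [int(delta[-1].argmax())]         # termination
        for t in range(T - 1, 0, -1):            # read out path by backtracking
            path.append(int(psi[t][path[-1]]))
        return list(reversed(path)), delta[-1].max()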

CIS 391 - Intro to AI 37

Problem 3: Learning

Up to now we've assumed that we know the underlying model λ = (A, B, π).

Often these parameters are estimated on annotated training data, but:
• Annotation is often difficult and/or expensive.
• The training data is different from the current data.

We want to maximize the parameters with respect to the current data, i.e., we're looking for a model λ' such that

$\lambda' = \arg\max_{\lambda} P(O \mid \lambda)$

CIS 391 - Intro to AI 38

Problem 3: Learning (If Time Allows…)

Unfortunately, there is no known way to analytically find a global maximum, i.e., a model $\hat{\lambda}$ such that $\hat{\lambda} = \arg\max_{\lambda} P(O \mid \lambda)$.

But it is possible to find a local maximum.

Given an initial model λ, we can always find a model $\hat{\lambda}$ such that $P(O \mid \hat{\lambda}) \ge P(O \mid \lambda)$.

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key idea: parameter re-estimation by hill-climbing.

From an arbitrary initial parameter instantiation, the FB algorithm iteratively re-estimates the parameters, improving the probability that the given observation sequence was generated by the model.

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three sets of parameters need to be re-estimated:
• Initial state distribution: πi
• Transition probabilities: aij
• Emission probabilities: bi(ot)

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

What's the probability of being in state si at time t and going to state sj, given the current model and parameters?

$\xi_t(i,j) = P(q_t = s_i,\ q_{t+1} = s_j \mid O,\ \lambda)$

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

$\xi_t(i,j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$

$\xi_t(i,j) = P(q_t = s_i,\ q_{t+1} = s_j \mid O,\ \lambda)$

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is:

$\hat{a}_{ij} = \dfrac{\text{expected number of transitions from state } s_i \text{ to state } s_j}{\text{expected number of transitions from state } s_i}$

Formally:

$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} \xi_t(i,j)}$

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)$

as the probability of being in state si given the complete observation O, we can say:

$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$
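The two E-step quantities can be computed directly from the alpha and beta tables of the earlier sketches; a minimal version, normalizing each time slice over all state pairs as in the formula above:

    import numpy as np

    def xi_gamma(obs, A, B, alpha, beta):
        """Return xi (T-1 x N x N) and gamma (T-1 x N) = sum_j xi."""
        T, N = alpha.shape
        xi = np.zeros((T - 1, N, N))
        for t in range(T - 1):
            num = alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]
            xi[t] = num / num.sum()
        gamma = xi.sum(axis=2)
        return xi, gamma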

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution: πi is the probability that si is a start state.

Re-estimation is easy:

$\hat{\pi}_i = \text{expected number of times in state } s_i \text{ at time } 1$

Formally: $\hat{\pi}_i = \gamma_1(i)$

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as:

$\hat{b}_i(k) = \dfrac{\text{expected number of times in state } s_i \text{ and observing symbol } v_k}{\text{expected number of times in state } s_i}$

Formally:

$\hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$

where $\delta(o_t, v_k) = 1$ if $o_t = v_k$, and 0 otherwise. Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm.

CIS 391 - Intro to AI 47

The Updated Model

Coming from $\lambda = (A, B, \pi)$, we get to $\hat{\lambda} = (\hat{A}, \hat{B}, \hat{\pi})$ by the following update rules:

$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \qquad \hat{b}_i(k) = \dfrac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)} \qquad \hat{\pi}_i = \gamma_1(i)$
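Putting the three update rules together, here is a sketch of one Baum-Welch (M-step) update. It recomputes gamma from alpha·beta so that the emission and initial-state sums can run over all T time steps, as in the formulas above (for t < T this agrees with the sum-over-j definition of gamma used for the transition update):

    import numpy as np

    def reestimate(obs, alpha, beta, xi, M):
        """Return (pi_hat, A_hat, B_hat) from one observation sequence."""
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)            # gamma_t(i), t = 1..T

        pi_hat = gamma[0]                                     # pi_i = gamma_1(i)
        A_hat  = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B_hat  = np.zeros((alpha.shape[1], M))
        for t, o in enumerate(obs):                           # delta(o_t, v_k) selects column o_t
            B_hat[:, o] += gamma[t]
        B_hat /= gamma.sum(axis=0)[:, None]
        return pi_hat, A_hat, B_hat

Iterating forward/backward, xi_gamma, and reestimate until P(O | λ) stops improving is exactly the hill-climbing loop described on the Forward-Backward slide.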

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithm:
• The E step: compute the forward and backward probabilities for a given model.
• The M step: re-estimate the model parameters.

  • Part of Speech Tagging & Hidden Markov Models
  • NLP Task I – Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem …
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allows…)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 6: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 6

Penn Treebank Tagset I

Tag Description Example CC coordinating conjunction and CD cardinal number 1 third DT determiner the EX existential there there is FW foreign word dhoevre IN prepositionsubordinating conjunction in of like JJ adjective green JJR adjective comparative greener JJS adjective superlative greenest LS list marker 1) MD modal could will NN noun singular or mass table NNS noun plural tables NNP proper noun singular John NNPS proper noun plural Vikings

CIS 391 - Intro to AI 7

Tag Description Example PDT predeterminer both the boys POS possessive ending friend s PRP personal pronoun I me him he it PRP$ possessive pronoun my his RB adverb however usually here good RBR adverb comparative betterRBS adverb superlative best RP particle give up TO to to go to him UH interjection uhhuhhuhh

Penn Treebank Tagset II

CIS 391 - Intro to AI 8

Tag Description Example VB verb base form take VBD verb past tense took VBG verb gerundpresent participle taking VBN verb past participle taken VBP verb sing present non-3d take VBZ verb 3rd person sing present takes WDT wh-determiner which WP wh-pronoun who what WP$ possessive wh-pronoun whose WRB wh-abverb where when

Penn Treebank Tagset III

CIS 391 - Intro to AI 9

Simple Statistical Approaches Idea 1

CIS 391 - Intro to AI 10

Simple Statistical Approaches Idea 2

For a string of words

W = w1w2w3hellipwn

find the string of POS tags

T = t1 t2 t3 helliptn

which maximizes P(T|W)

bull ie the most likely POS tag ti for each word wi given its surrounding context

CIS 391 - Intro to AI 11

The Sparse Data Problem hellip

A Simple Impossible Approach to Compute P(T|W)

Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string

CIS 391 - Intro to AI 12

A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech

tags

Rich Models often require vast amounts of data Good estimates of models with bad assumptions often

outperform better models which are badly estimated

CIS 391 - Intro to AI 13

A Practical Statistical Tagger

CIS 391 - Intro to AI 14

A Practical Statistical Tagger II

But we cant accurately estimate more than tag bigrams or sohellip

Again we change to a model that we CAN estimate

CIS 391 - Intro to AI 15

A Practical Statistical Tagger III

So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes

CIS 391 - Intro to AI 16

Training and Performance

To estimate the parameters of this model given an annotated training corpus

Because many of these counts are small smoothing is necessary for best resultshellip

Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags

CIS 391 - Intro to AI 17

Hidden Markov Models

This model is an instance of a Hidden Markov Model Viewed graphically

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1P(w|Det)

a 4the 4

P(w|Adj)good 02low 04

P(w|Noun)price 001deal 0001

CIS 391 - Intro to AI 18

Viewed as a generator an HMM

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1

4the

4a

P(w|Det)

04low

02good

P(w|Adj)

0001deal

001price

P(w|Noun)

CIS 391 - Intro to AI 19

Recognition using an HMM

CIS 391 - Intro to AI 20

A Practical Statistical Tagger IV

Finding this maximum can be done using an exponential search through all strings for T

However there is a linear timelinear time solution using dynamic programming called Viterbi decoding

CIS 391 - Intro to AI 21

Parameters of an HMM

States A set of states S=s1hellipsn

Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj

Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si

Initial state distribution is the probability that si is a start state

i

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model

Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model

how do we find the state sequence that best explains the observations

(AB )

(AB )

(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)

CIS 391 - Intro to AI 23

Problem 3 (Learning) How do we adjust the model parameters to maximize

The Three Basic HMM Problems

(AB )

P(O | )

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is The probability of a observation sequence is the

sum of the probabilities of all possible state sequences in the HMM

Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences

Even small HMMs eg T=10 and N=10 contain 10 billion different paths

Solution to this and problem 2 is to use dynamic programming

P(O | )

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 7: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 7

Tag Description Example PDT predeterminer both the boys POS possessive ending friend s PRP personal pronoun I me him he it PRP$ possessive pronoun my his RB adverb however usually here good RBR adverb comparative betterRBS adverb superlative best RP particle give up TO to to go to him UH interjection uhhuhhuhh

Penn Treebank Tagset II

CIS 391 - Intro to AI 8

Tag Description Example VB verb base form take VBD verb past tense took VBG verb gerundpresent participle taking VBN verb past participle taken VBP verb sing present non-3d take VBZ verb 3rd person sing present takes WDT wh-determiner which WP wh-pronoun who what WP$ possessive wh-pronoun whose WRB wh-abverb where when

Penn Treebank Tagset III

CIS 391 - Intro to AI 9

Simple Statistical Approaches Idea 1

CIS 391 - Intro to AI 10

Simple Statistical Approaches Idea 2

For a string of words

W = w1w2w3hellipwn

find the string of POS tags

T = t1 t2 t3 helliptn

which maximizes P(T|W)

bull ie the most likely POS tag ti for each word wi given its surrounding context

CIS 391 - Intro to AI 11

The Sparse Data Problem hellip

A Simple Impossible Approach to Compute P(T|W)

Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string

CIS 391 - Intro to AI 12

A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech

tags

Rich Models often require vast amounts of data Good estimates of models with bad assumptions often

outperform better models which are badly estimated

CIS 391 - Intro to AI 13

A Practical Statistical Tagger

CIS 391 - Intro to AI 14

A Practical Statistical Tagger II

But we cant accurately estimate more than tag bigrams or sohellip

Again we change to a model that we CAN estimate

CIS 391 - Intro to AI 15

A Practical Statistical Tagger III

So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes

CIS 391 - Intro to AI 16

Training and Performance

To estimate the parameters of this model given an annotated training corpus

Because many of these counts are small smoothing is necessary for best resultshellip

Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags

CIS 391 - Intro to AI 17

Hidden Markov Models

This model is an instance of a Hidden Markov Model Viewed graphically

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1P(w|Det)

a 4the 4

P(w|Adj)good 02low 04

P(w|Noun)price 001deal 0001

CIS 391 - Intro to AI 18

Viewed as a generator an HMM

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1

4the

4a

P(w|Det)

04low

02good

P(w|Adj)

0001deal

001price

P(w|Noun)

CIS 391 - Intro to AI 19

Recognition using an HMM

CIS 391 - Intro to AI 20

A Practical Statistical Tagger IV

Finding this maximum can be done using an exponential search through all strings for T

However there is a linear timelinear time solution using dynamic programming called Viterbi decoding

CIS 391 - Intro to AI 21

Parameters of an HMM

States A set of states S=s1hellipsn

Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj

Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si

Initial state distribution is the probability that si is a start state

i

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model

Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model

how do we find the state sequence that best explains the observations

(AB )

(AB )

(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)

CIS 391 - Intro to AI 23

Problem 3 (Learning) How do we adjust the model parameters to maximize

The Three Basic HMM Problems

(AB )

P(O | )

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is The probability of a observation sequence is the

sum of the probabilities of all possible state sequences in the HMM

Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences

Even small HMMs eg T=10 and N=10 contain 10 billion different paths

Solution to this and problem 2 is to use dynamic programming

P(O | )

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 8: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 8

Tag Description Example VB verb base form take VBD verb past tense took VBG verb gerundpresent participle taking VBN verb past participle taken VBP verb sing present non-3d take VBZ verb 3rd person sing present takes WDT wh-determiner which WP wh-pronoun who what WP$ possessive wh-pronoun whose WRB wh-abverb where when

Penn Treebank Tagset III

CIS 391 - Intro to AI 9

Simple Statistical Approaches Idea 1

CIS 391 - Intro to AI 10

Simple Statistical Approaches Idea 2

For a string of words

W = w1w2w3hellipwn

find the string of POS tags

T = t1 t2 t3 helliptn

which maximizes P(T|W)

bull ie the most likely POS tag ti for each word wi given its surrounding context

CIS 391 - Intro to AI 11

The Sparse Data Problem hellip

A Simple Impossible Approach to Compute P(T|W)

Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string

CIS 391 - Intro to AI 12

A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech

tags

Rich Models often require vast amounts of data Good estimates of models with bad assumptions often

outperform better models which are badly estimated

CIS 391 - Intro to AI 13

A Practical Statistical Tagger

CIS 391 - Intro to AI 14

A Practical Statistical Tagger II

But we cant accurately estimate more than tag bigrams or sohellip

Again we change to a model that we CAN estimate

CIS 391 - Intro to AI 15

A Practical Statistical Tagger III

So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes

CIS 391 - Intro to AI 16

Training and Performance

To estimate the parameters of this model given an annotated training corpus

Because many of these counts are small smoothing is necessary for best resultshellip

Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags

CIS 391 - Intro to AI 17

Hidden Markov Models

This model is an instance of a Hidden Markov Model Viewed graphically

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1P(w|Det)

a 4the 4

P(w|Adj)good 02low 04

P(w|Noun)price 001deal 0001

CIS 391 - Intro to AI 18

Viewed as a generator an HMM

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1

4the

4a

P(w|Det)

04low

02good

P(w|Adj)

0001deal

001price

P(w|Noun)

CIS 391 - Intro to AI 19

Recognition using an HMM

CIS 391 - Intro to AI 20

A Practical Statistical Tagger IV

Finding this maximum can be done using an exponential search through all strings for T

However there is a linear timelinear time solution using dynamic programming called Viterbi decoding

CIS 391 - Intro to AI 21

Parameters of an HMM

States A set of states S=s1hellipsn

Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj

Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si

Initial state distribution is the probability that si is a start state

i

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model

Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model

how do we find the state sequence that best explains the observations

(AB )

(AB )

(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)

CIS 391 - Intro to AI 23

Problem 3 (Learning) How do we adjust the model parameters to maximize

The Three Basic HMM Problems

(AB )

P(O | )

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is The probability of a observation sequence is the

sum of the probabilities of all possible state sequences in the HMM

Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences

Even small HMMs eg T=10 and N=10 contain 10 billion different paths

Solution to this and problem 2 is to use dynamic programming

P(O | )

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 9: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 9

Simple Statistical Approaches Idea 1

CIS 391 - Intro to AI 10

Simple Statistical Approaches Idea 2

For a string of words

W = w1w2w3hellipwn

find the string of POS tags

T = t1 t2 t3 helliptn

which maximizes P(T|W)

bull ie the most likely POS tag ti for each word wi given its surrounding context

CIS 391 - Intro to AI 11

The Sparse Data Problem hellip

A Simple Impossible Approach to Compute P(T|W)

Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string

CIS 391 - Intro to AI 12

A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech

tags

Rich Models often require vast amounts of data Good estimates of models with bad assumptions often

outperform better models which are badly estimated

CIS 391 - Intro to AI 13

A Practical Statistical Tagger

CIS 391 - Intro to AI 14

A Practical Statistical Tagger II

But we cant accurately estimate more than tag bigrams or sohellip

Again we change to a model that we CAN estimate

CIS 391 - Intro to AI 15

A Practical Statistical Tagger III

So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes

CIS 391 - Intro to AI 16

Training and Performance

To estimate the parameters of this model given an annotated training corpus

Because many of these counts are small smoothing is necessary for best resultshellip

Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags

CIS 391 - Intro to AI 17

Hidden Markov Models

This model is an instance of a Hidden Markov Model Viewed graphically

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1P(w|Det)

a 4the 4

P(w|Adj)good 02low 04

P(w|Noun)price 001deal 0001

CIS 391 - Intro to AI 18

Viewed as a generator an HMM

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1

4the

4a

P(w|Det)

04low

02good

P(w|Adj)

0001deal

001price

P(w|Noun)

CIS 391 - Intro to AI 19

Recognition using an HMM

CIS 391 - Intro to AI 20

A Practical Statistical Tagger IV

Finding this maximum can be done using an exponential search through all strings for T

However there is a linear timelinear time solution using dynamic programming called Viterbi decoding

CIS 391 - Intro to AI 21

Parameters of an HMM

States A set of states S=s1hellipsn

Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj

Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si

Initial state distribution is the probability that si is a start state

i

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model

Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model

how do we find the state sequence that best explains the observations

(AB )

(AB )

(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)

CIS 391 - Intro to AI 23

Problem 3 (Learning) How do we adjust the model parameters to maximize

The Three Basic HMM Problems

(AB )

P(O | )

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is The probability of a observation sequence is the

sum of the probabilities of all possible state sequences in the HMM

Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences

Even small HMMs eg T=10 and N=10 contain 10 billion different paths

Solution to this and problem 2 is to use dynamic programming

P(O | )

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 10: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 10

Simple Statistical Approaches Idea 2

For a string of words

W = w1w2w3hellipwn

find the string of POS tags

T = t1 t2 t3 helliptn

which maximizes P(T|W)

bull ie the most likely POS tag ti for each word wi given its surrounding context

CIS 391 - Intro to AI 11

The Sparse Data Problem hellip

A Simple Impossible Approach to Compute P(T|W)

Count up instances of the string heat oil in a large pot in the training corpus and pick the most common tag assignment to the string

CIS 391 - Intro to AI 12

A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with a million words of hand tagged training databull Assume a uniform distribution of 5000 words and 40 part of speech

tags

Rich Models often require vast amounts of data Good estimates of models with bad assumptions often

outperform better models which are badly estimated

CIS 391 - Intro to AI 13

A Practical Statistical Tagger

CIS 391 - Intro to AI 14

A Practical Statistical Tagger II

But we cant accurately estimate more than tag bigrams or sohellip

Again we change to a model that we CAN estimate

CIS 391 - Intro to AI 15

A Practical Statistical Tagger III

So for a given string W = w1w2w3hellipwn the tagger needs to find the string of tags T which maximizes

CIS 391 - Intro to AI 16

Training and Performance

To estimate the parameters of this model given an annotated training corpus

Because many of these counts are small smoothing is necessary for best resultshellip

Such taggers typically achieve about 95-96 correct tagging for tag sets of 40-80 tags

CIS 391 - Intro to AI 17

Hidden Markov Models

This model is an instance of a Hidden Markov Model Viewed graphically

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1P(w|Det)

a 4the 4

P(w|Adj)good 02low 04

P(w|Noun)price 001deal 0001

CIS 391 - Intro to AI 18

Viewed as a generator an HMM

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1

4the

4a

P(w|Det)

04low

02good

P(w|Adj)

0001deal

001price

P(w|Noun)

CIS 391 - Intro to AI 19

Recognition using an HMM

CIS 391 - Intro to AI 20

A Practical Statistical Tagger IV

Finding this maximum can be done using an exponential search through all strings for T

However there is a linear timelinear time solution using dynamic programming called Viterbi decoding

CIS 391 - Intro to AI 21

Parameters of an HMM

States A set of states S=s1hellipsn

Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj

Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si

Initial state distribution is the probability that si is a start state

i

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model

Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model

how do we find the state sequence that best explains the observations

(AB )

(AB )

(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)

CIS 391 - Intro to AI 23

Problem 3 (Learning) How do we adjust the model parameters to maximize

The Three Basic HMM Problems

(AB )

P(O | )

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is P(O | λ)? The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.

Naïve computation is very expensive: given T observations and N states, there are N^T possible state sequences.

Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.

The solution to this (and to Problem 2) is to use dynamic programming.

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that, given an HMM λ, at time t the state is s_i and the partial observation o_1 … o_t has been generated?

α_t(i) = P(o_1 … o_t, q_t = s_i | λ)

CIS 391 - Intro to AI 27

Forward Probabilities

α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) a_ij ] · b_j(o_t)

α_t(i) = P(o_1 … o_t, q_t = s_i | λ)

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization:  α_1(i) = π_i b_i(o_1),  1 ≤ i ≤ N

Induction:  α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) a_ij ] · b_j(o_t),  2 ≤ t ≤ T, 1 ≤ j ≤ N

Termination:  P(O | λ) = Σ_{i=1..N} α_T(i)
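The same recurrence as a short runnable sketch (illustrative, not from the slides), using the array layout of the parameter sketch above; obs is a list of integer symbol ids:

import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns alpha with shape (T, N) and P(O | lambda)."""
    pi, A, B = map(np.asarray, (pi, A, B))
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                    # initialization: alpha_1(i) = pi_i * b_i(o_1)
    for t in range(1, T):                           # induction over t = 2 .. T
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                   # termination: P(O | lambda) = sum_i alpha_T(i)

# e.g. with the toy HMM sketched earlier:  forward(hmm.pi, hmm.A, hmm.B, [0, 2, 1])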

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naïve approach takes O(2T · N^T) computation. The Forward algorithm, using dynamic programming, takes O(N^2 T) computations.

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that, given an HMM λ and given that the state at time t is s_i, the partial observation o_{t+1} … o_T is generated?

Analogous to the forward probability, just in the other direction:

β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ)

CIS 391 - Intro to AI 31

Backward Probabilities

β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j)

β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization:  β_T(i) = 1,  1 ≤ i ≤ N

Induction:  β_t(i) = Σ_{j=1..N} a_ij b_j(o_{t+1}) β_{t+1}(j),  t = T-1, …, 1,  1 ≤ i ≤ N

Termination:  P(O | λ) = Σ_{i=1..N} π_i b_i(o_1) β_1(i)
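And the mirror-image sketch for the backward pass (again illustrative, not from the slides):

import numpy as np

def backward(pi, A, B, obs):
    """Backward algorithm: returns beta with shape (T, N) and P(O | lambda)."""
    pi, A, B = map(np.asarray, (pi, A, B))
    N, T = len(pi), len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                          # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                          # induction, t = T-1 down to 1
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])      # sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    return beta, (pi * B[:, obs[0]] * beta[0]).sum()        # termination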

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm efficiently gives the sum over all paths through an HMM.

Here we instead want to find the highest-probability path.

We want to find the state sequence Q = q_1 … q_T such that

Q* = argmax_Q P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum:

Forward:  α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) a_ij ] · b_j(o_t)

Viterbi recursion:  δ_t(j) = [ max_{1≤i≤N} δ_{t-1}(i) a_ij ] · b_j(o_t)

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization:  δ_1(i) = π_i b_i(o_1),  1 ≤ i ≤ N

Induction:  δ_t(j) = [ max_{1≤i≤N} δ_{t-1}(i) a_ij ] · b_j(o_t);  ψ_t(j) = argmax_{1≤i≤N} δ_{t-1}(i) a_ij,  2 ≤ t ≤ T, 1 ≤ j ≤ N

Termination:  p* = max_{1≤i≤N} δ_T(i);  q_T* = argmax_{1≤i≤N} δ_T(i)

Read out path:  q_t* = ψ_{t+1}(q_{t+1}*),  t = T-1, …, 1
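A runnable sketch of the whole procedure, including the backpointers ψ and the path read-out (illustrative, not from the slides):

import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi decoding: returns the most probable state sequence and its probability."""
    pi, A, B = map(np.asarray, (pi, A, B))
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                     # initialization
    for t in range(1, T):                            # induction
        scores = delta[t - 1][:, None] * A           # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)               # best predecessor for each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                    # termination
    for t in range(T - 2, -1, -1):                   # read out path by backtracking
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()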

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now we've assumed that we know the underlying model λ = (A, B, π).

Often these parameters are estimated on annotated training data, but:
• Annotation is often difficult and/or expensive
• Training data is different from the current data

We want to maximize the parameters with respect to the current data, i.e. we're looking for a model λ̂ such that

λ̂ = argmax_λ P(O | λ)

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately, there is no known way to analytically find a global maximum, i.e. a model λ̂ such that λ̂ = argmax_λ P(O | λ).

But it is possible to find a local maximum.

Given an initial model λ, we can always find a model λ̂ such that P(O | λ̂) ≥ P(O | λ).

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key idea: parameter re-estimation by hill-climbing.

From an arbitrary initial parameter instantiation, the FB algorithm iteratively re-estimates the parameters, improving the probability that a given observation O was generated by the model λ.

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimated:
• Initial state distribution π_i
• Transition probabilities a_ij
• Emission probabilities b_i(o_t)

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

What's the probability of being in state s_i at time t and going to state s_j, given the current model and parameters?

ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / [ Σ_{i=1..N} Σ_{j=1..N} α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) ]

ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is:

â_ij = (expected number of transitions from state s_i to state s_j) / (expected number of transitions from state s_i)

Formally:

â_ij = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} Σ_{j=1..N} ξ_t(i, j)

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

γ_t(i) = Σ_{j=1..N} ξ_t(i, j)

as the probability of being in state s_i given the complete observation O, we can say:

â_ij = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i)
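(Equivalently, by a standard identity not spelled out on the slide, γ_t(i) = α_t(i) β_t(i) / P(O | λ); this is how γ is computed in the re-estimation sketch further below.)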

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution: π_i is the probability that s_i is a start state.

Re-estimation is easy:

π̂_i = expected number of times in state s_i at time 1

Formally:  π̂_i = γ_1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as:

b̂_i(k) = (expected number of times in state s_i observing symbol v_k) / (expected number of times in state s_i)

Formally:

b̂_i(k) = Σ_{t=1..T} δ(o_t, v_k) γ_t(i) / Σ_{t=1..T} γ_t(i)

where δ(o_t, v_k) = 1 if o_t = v_k, and 0 otherwise. Note that here δ is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm.

CIS 391 - Intro to AI 47

The Updated Model

Coming from λ = (A, B, π), we get to λ̂ = (Â, B̂, π̂) by the following update rules:

π̂_i = γ_1(i)

â_ij = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i)

b̂_i(k) = Σ_{t=1..T} δ(o_t, v_k) γ_t(i) / Σ_{t=1..T} γ_t(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithm:
• The E step: compute the forward and backward probabilities for a given model
• The M step: re-estimate the model parameters
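For completeness, a hypothetical driver loop repeating the E and M steps until P(O | λ) stops improving (illustrative; the tolerance and iteration cap are arbitrary):

# Hypothetical driver for the sketches above.
pi, A, B = hmm.pi, hmm.A, hmm.B
obs = [0, 2, 1, 1, 2]                       # toy observation sequence of integer symbol ids
prev = 0.0
for _ in range(100):
    pi, A, B = baum_welch_step(pi, A, B, obs)
    _, likelihood = forward(pi, A, B, obs)
    if abs(likelihood - prev) < 1e-9:       # stop once P(O | lambda) has converged
        break
    prev = likelihood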


We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 17: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 17

Hidden Markov Models

This model is an instance of a Hidden Markov Model Viewed graphically

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1P(w|Det)

a 4the 4

P(w|Adj)good 02low 04

P(w|Noun)price 001deal 0001

CIS 391 - Intro to AI 18

Viewed as a generator an HMM

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1

4the

4a

P(w|Det)

04low

02good

P(w|Adj)

0001deal

001price

P(w|Noun)

CIS 391 - Intro to AI 19

Recognition using an HMM

CIS 391 - Intro to AI 20

A Practical Statistical Tagger IV

Finding this maximum can be done using an exponential search through all strings for T

However there is a linear timelinear time solution using dynamic programming called Viterbi decoding

CIS 391 - Intro to AI 21

Parameters of an HMM

States A set of states S=s1hellipsn

Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj

Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si

Initial state distribution is the probability that si is a start state

i

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model

Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model

how do we find the state sequence that best explains the observations

(AB )

(AB )

(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)

CIS 391 - Intro to AI 23

Problem 3 (Learning) How do we adjust the model parameters to maximize

The Three Basic HMM Problems

(AB )

P(O | )

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is The probability of a observation sequence is the

sum of the probabilities of all possible state sequences in the HMM

Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences

Even small HMMs eg T=10 and N=10 contain 10 billion different paths

Solution to this and problem 2 is to use dynamic programming

P(O | )

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 18: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 18

Viewed as a generator an HMM

Adj

3

6Det

02

47 Noun

3

7 Verb

51 1

4the

4a

P(w|Det)

04low

02good

P(w|Adj)

0001deal

001price

P(w|Noun)

CIS 391 - Intro to AI 19

Recognition using an HMM

CIS 391 - Intro to AI 20

A Practical Statistical Tagger IV

Finding this maximum can be done using an exponential search through all strings for T

However there is a linear timelinear time solution using dynamic programming called Viterbi decoding

CIS 391 - Intro to AI 21

Parameters of an HMM

States A set of states S=s1hellipsn

Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj

Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si

Initial state distribution is the probability that si is a start state

i

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model

Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model

how do we find the state sequence that best explains the observations

(AB )

(AB )

(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)

CIS 391 - Intro to AI 23

Problem 3 (Learning) How do we adjust the model parameters to maximize

The Three Basic HMM Problems

(AB )

P(O | )

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is The probability of a observation sequence is the

sum of the probabilities of all possible state sequences in the HMM

Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences

Even small HMMs eg T=10 and N=10 contain 10 billion different paths

Solution to this and problem 2 is to use dynamic programming

P(O | )

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 19: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 19

Recognition using an HMM

CIS 391 - Intro to AI 20

A Practical Statistical Tagger IV

Finding this maximum can be done using an exponential search through all strings for T

However there is a linear timelinear time solution using dynamic programming called Viterbi decoding

CIS 391 - Intro to AI 21

Parameters of an HMM

States A set of states S=s1hellipsn

Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj

Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si

Initial state distribution is the probability that si is a start state

i

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model

Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model

how do we find the state sequence that best explains the observations

(AB )

(AB )

(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)

CIS 391 - Intro to AI 23

Problem 3 (Learning) How do we adjust the model parameters to maximize

The Three Basic HMM Problems

(AB )

P(O | )

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is The probability of a observation sequence is the

sum of the probabilities of all possible state sequences in the HMM

Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences

Even small HMMs eg T=10 and N=10 contain 10 billion different paths

Solution to this and problem 2 is to use dynamic programming

P(O | )

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 20: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 20

A Practical Statistical Tagger IV

Finding this maximum can be done using an exponential search through all strings for T

However there is a linear timelinear time solution using dynamic programming called Viterbi decoding

CIS 391 - Intro to AI 21

Parameters of an HMM

States A set of states S=s1hellipsn

Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj

Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si

Initial state distribution is the probability that si is a start state

i

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model

Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model

how do we find the state sequence that best explains the observations

(AB )

(AB )

(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)

CIS 391 - Intro to AI 23

Problem 3 (Learning) How do we adjust the model parameters to maximize

The Three Basic HMM Problems

(AB )

P(O | )

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is The probability of a observation sequence is the

sum of the probabilities of all possible state sequences in the HMM

Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences

Even small HMMs eg T=10 and N=10 contain 10 billion different paths

Solution to this and problem 2 is to use dynamic programming

P(O | )

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 21: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 21

Parameters of an HMM

States A set of states S=s1hellipsn

Transition probabilities A= a11a12hellipann Each aij represents the probability of transitioning from state si to sj

Emission probabilities A set B of functions of the form bi(ot) which is the probability of observation ot being emitted by si

Initial state distribution is the probability that si is a start state

i

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model

Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model

how do we find the state sequence that best explains the observations

(AB )

(AB )

(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)

CIS 391 - Intro to AI 23

Problem 3 (Learning) How do we adjust the model parameters to maximize

The Three Basic HMM Problems

(AB )

P(O | )

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is The probability of a observation sequence is the

sum of the probabilities of all possible state sequences in the HMM

Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences

Even small HMMs eg T=10 and N=10 contain 10 billion different paths

Solution to this and problem 2 is to use dynamic programming

P(O | )

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 22: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 22

The Three Basic HMM Problems

Problem 1 (Evaluation) Given the observation sequence O=o1hellipoT and an HMM model how do we compute the probability of O given the model

Problem 2 (Decoding) Given the observation sequence O=o1hellipoT and an HMM model

how do we find the state sequence that best explains the observations

(AB )

(AB )

(This and following slides follow classic formulation by Rabiner and Juang as adapted by Manning and Schutze Slides adapted from Dorr)

CIS 391 - Intro to AI 23

Problem 3 (Learning) How do we adjust the model parameters to maximize

The Three Basic HMM Problems

(AB )

P(O | )

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is The probability of a observation sequence is the

sum of the probabilities of all possible state sequences in the HMM

Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences

Even small HMMs eg T=10 and N=10 contain 10 billion different paths

Solution to this and problem 2 is to use dynamic programming

P(O | )

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimated:
• Initial state distribution \pi_i
• Transition probabilities a_{ij}
• Emission probabilities b_i(o_t)

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

What's the probability of being in state s_i at time t and going to state s_j, given the current model and parameters?

\xi_t(i, j) = P(q_t = s_i,\ q_{t+1} = s_j \mid O, \lambda)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}

\xi_t(i, j) = P(q_t = s_i,\ q_{t+1} = s_j \mid O, \lambda)

CIS 391 - Intro to AI 43
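Given forward and backward tables from the earlier sketches, the ξ_t(i, j) values on this slide can be computed directly; the denominator is the same for every (i, j) pair at a given t (it is just P(O|λ)), so one normalization per time step suffices. The snippet below is a sketch under the same assumed conventions, with an illustrative function name.

import numpy as np

def xi_table(A, B, obs, alpha, beta):
    """xi[t, i, j] = P(q_t = s_i, q_{t+1} = s_j | O, lambda), for t = 1..T-1."""
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # Numerator: alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        num = alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]
        xi[t] = num / num.sum()   # denominator sums the numerator over all i and j
    return xi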

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is:

\hat{a}_{ij} = \frac{\text{expected number of transitions from state } s_i \text{ to state } s_j}{\text{expected number of transitions from state } s_i}

Formally:

\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'=1}^{N} \xi_t(i, j')}

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)

as the probability of being in state s_i at time t given the complete observation O, we can say:

\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}

CIS 391 - Intro to AI 45
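In code, γ and the transition update fall straight out of the ξ table from the previous sketch; the identity γ_t(i) = α_t(i) β_t(i) / P(O|λ), a standard equivalence not stated on the slide, gives the same values and is a useful cross-check. Function names below are illustrative.

import numpy as np

def gamma_table(xi):
    """gamma[t, i] = P(q_t = s_i | O, lambda) for t = 1..T-1, via gamma_t(i) = sum_j xi_t(i, j)."""
    return xi.sum(axis=2)

def reestimate_transitions(xi):
    """a_hat[i, j] = sum_t xi_t(i, j) / sum_t gamma_t(i)."""
    gamma = gamma_table(xi)
    return xi.sum(axis=0) / gamma.sum(axis=0)[:, None]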

Re-estimating Initial State Probabilities

Initial state distribution: \pi_i is the probability that s_i is a start state.

Re-estimation is easy

\hat{\pi}_i = \text{expected number of times in state } s_i \text{ at time } 1

Formally:

\hat{\pi}_i = \gamma_1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

\hat{b}_i(k) = \frac{\text{expected number of times in state } s_i \text{ and observing symbol } v_k}{\text{expected number of times in state } s_i}

Formally:

\hat{b}_i(k) = \frac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}

where \delta(o_t, v_k) = 1 if o_t = v_k, and 0 otherwise. Note that \delta here is the Kronecker delta function and is not related to the \delta in the discussion of the Viterbi algorithm.

CIS 391 - Intro to AI 47

The Updated Model

Coming from \lambda = (A, B, \pi), we get to \hat{\lambda} = (\hat{A}, \hat{B}, \hat{\pi}) by the following update rules:

\hat{\pi}_i = \gamma_1(i) \qquad \hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \qquad \hat{b}_i(k) = \frac{\sum_{t=1}^{T} \delta(o_t, v_k)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}

CIS 391 - Intro to AI 48
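Putting the update rules together, one full re-estimation step might look like the sketch below, which reuses the forward, backward, and xi_table sketches given earlier (same assumed conventions, single observation sequence; not the slides' reference implementation). Iterating it until P(O|λ) stops improving is exactly the hill-climbing behaviour described on the Baum-Welch slide.

import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One E step + M step; returns (pi_hat, A_hat, B_hat, P(O|lambda))."""
    # E step: forward and backward probabilities, then xi and gamma
    alpha, prob = forward(pi, A, B, obs)
    beta, _ = backward(pi, A, B, obs)
    xi = xi_table(A, B, obs, alpha, beta)
    gamma = alpha * beta / prob                 # gamma_t(i) for all t = 1..T
    # M step: re-estimate the parameters
    pi_hat = gamma[0]                           # pi_hat_i = gamma_1(i)
    A_hat = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_hat = np.zeros_like(B)
    for k in range(B.shape[1]):                 # b_hat_i(k): Kronecker delta picks out o_t = v_k
        mask = np.asarray(obs) == k
        B_hat[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return pi_hat, A_hat, B_hat, prob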

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithm:
• The E Step: Compute the forward and backward probabilities for a given model
• The M Step: Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 23: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 23

Problem 3 (Learning) How do we adjust the model parameters to maximize

The Three Basic HMM Problems

(AB )

P(O | )

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is The probability of a observation sequence is the

sum of the probabilities of all possible state sequences in the HMM

Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences

Even small HMMs eg T=10 and N=10 contain 10 billion different paths

Solution to this and problem 2 is to use dynamic programming

P(O | )

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 24: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 24

Problem 1 Probability of an Observation Sequence

What is The probability of a observation sequence is the

sum of the probabilities of all possible state sequences in the HMM

Naiumlve computation is very expensive Given T observations and N states there are NT possible state sequences

Even small HMMs eg T=10 and N=10 contain 10 billion different paths

Solution to this and problem 2 is to use dynamic programming

P(O | )

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 25: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 25

The Trellis

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 26: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 26

Forward Probabilities

What is the probability that given an HMM at time t the state is i and the partial observation o1 hellip ot has been generated

t (i) P(o1 ot qt si | )

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 27: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 27

Forward Probabilities

t ( j) t 1(i) aij

i1

N

b j (ot )

t (i) P(o1ot qt si | )

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 28: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 28

Forward Algorithm

Initialization

Induction

Termination

t ( j) t 1(i) aij

i1

N

b j (ot ) 2 t T1 j N

1(i) ibi(o1) 1i N

P(O | ) T (i)i1

N

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 29: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 29

Forward Algorithm Complexity

Naiumlve approach takes O(2TNT) computation Forward algorithm using dynamic programming

takes O(N2T) computations

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 30: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 30

Backward Probabilities

What is the probability that given an HMM and given the state at time t is i the partial observation ot+1 hellip oT is generated

Analogous to forward probability just in the other direction

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 31

Backward Probabilities

t (i) aijb j (ot1)t1( j)j1

N

t (i) P(ot1oT | qt si)

CIS 391 - Intro to AI 32

Backward Algorithm

Initialization

Induction

Termination

T (i) 1 1i N

t (i) aijb j (ot1)t1( j)j1

N

t T 111i N

P(O | ) i 1(i)i1

N

CIS 391 - Intro to AI 33

Problem 2 Decoding

The Forward algorithm gives the sum of all paths through an HMM efficiently

Here we want to find the highest probability path

We want to find the state sequence Q=q1hellipqT such that

Q argmaxQ

P(Q | O)

CIS 391 - Intro to AI 34

Viterbi Algorithm

Similar to computing the forward probabilities but instead of summing over transitions from incoming states compute the maximum

Forward

Viterbi Recursion

t ( j) t 1(i)aij

i1

N

b j (ot )

t ( j) max1iN

t 1(i)aij b j (ot )

CIS 391 - Intro to AI 35

Core Idea of Viterbi Algorithm

CIS 391 - Intro to AI 36

Viterbi Algorithm

Initialization Induction

Termination

Read out path

1(i) ib j (o1) 1i N

t ( j) max1iN

t 1(i) aij b j (ot )

t ( j) argmax1iN

t 1(i) aij

2 t T1 j N

p max1iN

T (i)

qT argmax

1iNT (i)

qt t1(qt1

) t T 11

CIS 391 - Intro to AI 37

Problem 3 Learning

Up to now wersquove assumed that we know the underlying model

Often these parameters are estimated on annotated training data but Annotation is often difficult andor expensive Training data is different from the current data

We want to maximize the parameters with respect to the current data ie wersquore looking for a model such that

(AB )

argmax

P(O | )

CIS 391 - Intro to AI 38

Problem 3 Learning (If Time Allowshellip)

Unfortunately there is no known way to analytically find a global maximum ie a model such that

But it is possible to find a local maximum

Given an initial model we can always find a model such that

argmax

P(O | )

P(O | ) P(O | )

CIS 391 - Intro to AI 39

Forward-Backward (Baum-Welch) algorithm

Key Idea parameter re-estimation by hill-climbing

From an arbitrary initial parameter instantiation the FB algorithm iteratively re-estimates the parameters improving the probability that a given observation was generated by

CIS 391 - Intro to AI 40

Parameter Re-estimation

Three parameters need to be re-estimatedbull Initial state distribution

bull Transition probabilities aij

bull Emission probabilities bi(ot)

i

CIS 391 - Intro to AI 41

Re-estimating Transition Probabilities

Whatrsquos the probability of being in state si at time t and going to state sj given the current model and parameters

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 42

Re-estimating Transition Probabilities

t (i j) t (i) ai j b j (ot1) t1( j)

t (i) ai j b j (ot1) t1( j)j1

N

i1

N

t (i j) P(qt si qt1 s j | O)

CIS 391 - Intro to AI 43

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is

Formallyi

ji

ji s statefrom stransition of number expected

s stateto s statefrom stransition of number expecteda =

ˆ a i j t (i j)

t1

T 1

t (i j )j 1

N

t1

T 1

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • The Updated Model
  • Expectation Maximization
Page 44: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 44

Re-estimating Transition Probabilities

Defining

As the probability of being in state si given the complete observation O

We can say

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

t (i) t (i j)j1

N

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 45: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 45

Re-estimating Initial State Probabilities

Initial state distribution is the probability that si is a start state

Re-estimation is easy

Formally

i

1 time at s statein times of number expectedπ ii =

ˆ i 1(i)

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 46: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 46

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as

Formally

where Note that here is the Kronecker delta function and

is not related to the in the discussion of the Viterbi algorithm

i

kii s statein times of number expected

v symbolobserve and s statein times of number expected)k(b =

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

(ot vk ) 1 if ot vk and 0 otherwise

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 47: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 47

The Updated Model

Coming from we get to

by the following update rules

(AB )

( ˆ A ˆ B ˆ )

ˆ b i(k) (ot vk )t (i)

t1

T

t (i)t1

T

ˆ a i j t (i j)

t1

T 1

t (i)t1

T 1

ˆ i 1(i)

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization
Page 48: Part of Speech Tagging & Hidden Markov Models Mitch Marcus CSE 391.

CIS 391 - Intro to AI 48

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithmbull The E Step Compute the forward and backward

probabilities for a give modelbull The M Step Re-estimate the model parameters

  • Part of Speech Tagging amp Hidden Markov Models
  • NLP Task I ndash Determining Part of Speech Tags
  • Slide 3
  • What is POS tagging good for
  • Equivalent Problem in Bioinformatics
  • Penn Treebank Tagset I
  • Slide 7
  • Slide 8
  • Simple Statistical Approaches Idea 1
  • Simple Statistical Approaches Idea 2
  • The Sparse Data Problem hellip
  • A BOTEC Estimate of What We Can Estimate
  • A Practical Statistical Tagger
  • A Practical Statistical Tagger II
  • A Practical Statistical Tagger III
  • Training and Performance
  • Hidden Markov Models
  • Viewed as a generator an HMM
  • Recognition using an HMM
  • A Practical Statistical Tagger IV
  • Parameters of an HMM
  • The Three Basic HMM Problems
  • Slide 23
  • Problem 1 Probability of an Observation Sequence
  • The Trellis
  • Forward Probabilities
  • Slide 27
  • Forward Algorithm
  • Forward Algorithm Complexity
  • Backward Probabilities
  • Slide 31
  • Backward Algorithm
  • Problem 2 Decoding
  • Viterbi Algorithm
  • Core Idea of Viterbi Algorithm
  • Slide 36
  • Problem 3 Learning
  • Problem 3 Learning (If Time Allowshellip)
  • Forward-Backward (Baum-Welch) algorithm
  • Parameter Re-estimation
  • Re-estimating Transition Probabilities
  • Slide 42
  • Slide 43
  • Slide 44
  • Re-estimating Initial State Probabilities
  • Re-estimation of Emission Probabilities
  • The Updated Model
  • Expectation Maximization