Transcript of Albert Gatt, Corpora and Statistical Methods: POS Tagging.

Page 1:

Albert Gatt

Corpora and Statistical Methods

Page 2:

POS Tagging

Assign each word in continuous text a tag indicating its part of speech. Essentially a classification problem.

Current state of the art:
taggers typically have 96-97% accuracy
the figure is evaluated on a per-word basis
in a corpus with sentences of average length 20 words, 96% per-word accuracy can mean one tagging error per sentence

Page 3:

Sources of difficulty in POS tagging

Mostly due to ambiguity, when words have more than one possible tag:
we need context to make a good guess about POS
but context alone won't suffice

A simple approach which assigns only the most common tag to each word performs with 90% accuracy!

Page 4:

The information sources

1. Syntagmatic information: the tags of other words in the context of w.
Not sufficient on its own: e.g. Greene and Rubin (1971) describe a context-only tagger with only 77% accuracy.

2. Lexical information ("dictionary"): the most common tag(s) for a given word.
e.g. in English, many nouns can be used as verbs (flour the pan, wax the car...)
however, their most likely tag remains NN
the distribution of a word's usages across different POSs is uneven: usually one tag is highly likely, the others much less so

Page 5:

Tagging in other languages (than English)

In English, high reliance on context is a good idea, because of fixed word order.

Free word order languages make this assumption harder.
Compensation: these languages typically have rich morphology, a good source of clues for a tagger.

Page 6:

Evaluation and error analysis

Training a statistical POS tagger requires splitting the corpus into training and test data. Often, we need a development set as well, to tune parameters.

Using (n-fold) cross-validation is a good idea to save data:
randomly divide the data into train + test
train, and evaluate on the test portion
repeat n times and take an average

NB: cross-validation requires the whole corpus to be blind. To examine the training data, it is best to have fixed training & test sets, perform cross-validation on the training data, and do the final evaluation on the test set.
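To make the procedure concrete, here is a minimal n-fold cross-validation sketch in Python; the tagged-sentence data format and the train_and_evaluate function are illustrative assumptions, not part of the original slides:

```python
import random

def cross_validate(train_sents, train_and_evaluate, n=10, seed=0):
    """n-fold cross-validation on the *training* portion only.

    train_sents: list of tagged sentences (whatever the tagger consumes)
    train_and_evaluate: hypothetical function(train, test) -> accuracy
    """
    sents = list(train_sents)
    random.Random(seed).shuffle(sents)
    fold_size = len(sents) // n
    scores = []
    for i in range(n):
        test = sents[i * fold_size:(i + 1) * fold_size]
        train = sents[:i * fold_size] + sents[(i + 1) * fold_size:]
        scores.append(train_and_evaluate(train, test))
    return sum(scores) / len(scores)   # average accuracy over folds
```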

Page 7:

Evaluation

Typically carried out against a gold standard, based on accuracy (% correct).

Ideal to compare the accuracy of our tagger with:
baseline (lower bound): the standard is to choose the unigram most likely tag
ceiling (upper bound): e.g. see how well humans do at the same task; humans apparently agree on 96-97% of tags
this means it is highly suspect for a tagger to get 100% accuracy

Page 8:

Part 1

HMM taggers

Page 9:

Using Markov models

Basic idea: sequences of tags are a Markov chain:
Limited horizon assumption: it is sufficient to look at the previous tag for information about the current tag
Time invariance: the probability of a sequence remains the same over time

Page 10:

Implications/limitations

Limited horizon: ignores long-distance dependencies
e.g. can't deal with WH-constructions
Chomsky (1957): this was one of the reasons cited against probabilistic approaches

Time invariance:
e.g. P(finite verb | pronoun) is constant
but we may be more likely to find a finite verb following a pronoun at the start of a sentence than in the middle!

Page 11:

Notation

We let t_i range over tags, and w_i range over words. Subscripts denote position in a sequence.

We use superscripts to denote word/tag types: w^j = an instance of word type j in the lexicon; t^j = tag t assigned to word w^j.

The limited horizon property becomes:

P(t_{i+1} | t_{1,...,i}) = P(t_{i+1} | t_i)

Page 12:

Basic strategy

Training set of manually tagged text: extract probabilities of tag sequences.
e.g. using the Brown Corpus, P(NN|JJ) = 0.45, but P(VBP|JJ) = 0.0005

Next step: estimate the word/tag probabilities:

P(t^k | t^j) = C(t^j, t^k) / C(t^j)

P(w^l | t^j) = C(w^l, t^j) / C(t^j)

These are basically symbol emission probabilities.
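To make these estimates concrete, here is a minimal counting sketch in Python, assuming the training set is a list of sentences of (word, tag) pairs (an illustrative format, not prescribed by the slides):

```python
from collections import defaultdict

def estimate_probs(tagged_sents):
    """MLE estimates of tag-transition and word-emission probabilities.

    tagged_sents: list of sentences, each a list of (word, tag) pairs.
    Returns (trans, emit): trans[(t_prev, t)] = P(t | t_prev),
    emit[(w, t)] = P(w | t).
    """
    tag_count = defaultdict(int)       # C(t^j)
    pair_count = defaultdict(int)      # C(t^j, t^k)
    word_tag_count = defaultdict(int)  # C(w^l, t^j)
    for sent in tagged_sents:
        prev = "<s>"                   # sentence-boundary pseudo-tag
        tag_count[prev] += 1
        for word, tag in sent:
            pair_count[(prev, tag)] += 1
            word_tag_count[(word, tag)] += 1
            tag_count[tag] += 1
            prev = tag
    # note: boundary effects at sentence ends are ignored in this sketch
    trans = {(tp, t): c / tag_count[tp] for (tp, t), c in pair_count.items()}
    emit = {(w, t): c / tag_count[t] for (w, t), c in word_tag_count.items()}
    return trans, emit
```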

Page 13:

Training the tagger: basic algorithm

1. Estimate the probability of all possible sequences of 2 tags in the tagset from training data.
2. For each tag t^j and for each word w^l, estimate P(w^l | t^j).
3. Apply smoothing.

Page 14:

Finding the best tag sequence

Given: a sentence of n words. Find: t_{1,n} = the best n tags.

Application of Bayes' rule; the denominator can be eliminated as it is the same for all tag sequences:

argmax_{t_{1,n}} P(t_{1,n} | w_{1,n}) = argmax_{t_{1,n}} P(w_{1,n} | t_{1,n}) P(t_{1,n}) / P(w_{1,n})
                                      = argmax_{t_{1,n}} P(w_{1,n} | t_{1,n}) P(t_{1,n})

Page 15:

Finding the best tag sequence

The expression needs to be reduced to parameters that can be estimated from the training corpus. We need to make some simplifying assumptions:
1. words are independent of each other
2. a word's identity depends only on its tag

Page 16:

The independence assumption

• The probability of a sequence of words given a sequence of tags is computed as a function of each word independently:

P(w_{1,n} | t_{1,n}) P(t_{1,n}) = ∏_{i=1}^{n} P(w_i | t_{1,n}) × P(t_1) P(t_2 | t_1) ... P(t_n | t_{n-1})

Page 17:

The identity assumption

P(w_{1,n} | t_{1,n}) P(t_{1,n}) = ∏_{i=1}^{n} P(w_i | t_{1,n}) × P(t_1) P(t_2 | t_1) ... P(t_n | t_{n-1})
                                = ∏_{i=1}^{n} P(w_i | t_i) × P(t_1) P(t_2 | t_1) ... P(t_n | t_{n-1})

Probability of a word given a tag sequence = probability of the word given its own tag.

Page 18:

Applying these assumptions

argmax_{t_{1,n}} P(t_{1,n} | w_{1,n}) = argmax_{t_{1,n}} P(w_{1,n} | t_{1,n}) P(t_{1,n})
                                      = argmax_{t_{1,n}} ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})

Page 19:

Tagging with the Markov Model

Can use the Viterbi algorithm to find the best sequence of tags given a sequence of words (sentence).

Reminder:
δ_i(j) = probability of being in state (tag) j at word i on the best path
ψ_{i+1}(j) = most probable state (tag) at word i given that we're in state j at word i+1

Page 20:

The algorithm: initialisation

δ_1(PERIOD) = 1.0
δ_1(t) = 0 for all other tags t

Assume that P(PERIOD) = 1 at the end of a sentence; set all other tag probabilities to 0.

Page 21:

Algorithm: induction step

for i = 1 to n step 1:
  for all tags t^j do:

δ_{i+1}(t^j) = max_{1≤k≤T} [δ_i(t^k) P(t^j | t^k) P(w_{i+1} | t^j)]
(probability of tag t^j at i+1 on the best path through i)

ψ_{i+1}(t^j) = argmax_{1≤k≤T} [δ_i(t^k) P(t^j | t^k) P(w_{i+1} | t^j)]
(most probable tag leading to t^j at i+1)

Page 22:

Algorithm: backtrace

X_{n+1} = argmax_{1≤j≤T} δ_{n+1}(t^j)
(state at n+1)

for j = n to 1 do:
  X_j = ψ_{j+1}(X_{j+1})
(retrieve the most probable tags for every point in the sequence)

P(X_1, ..., X_n) = max_{1≤j≤T} δ_{n+1}(t^j)
(calculate the probability for the sequence of tags selected)

Page 23:

Some observations

The model is a Hidden Markov Model: we only observe words when we tag.

In actuality, during training we have a visible Markov Model, because the training corpus provides words + tags.

Page 24:

"True" HMM taggers

Applied in cases where we do not have a large training corpus.

We maintain the usual MM assumptions.

Initialisation: use a dictionary: set the emission probability for a word/tag pair to 0 if it's not in the dictionary.

Training: apply to data, use the forward-backward algorithm.

Tagging: exactly as before.

Page 25:

Part 2

Transformation-based error-driven learning

Page 26:

Transformation-based learning

Approach proposed by Brill (1995):
uses quantitative information at the training stage
the outcome of training is a set of rules
tagging is then symbolic, using the rules

Components:
a set of transformation rules
a learning algorithm

Page 27:

Transformations

General form: t1 → t2
"replace t1 with t2 if certain conditions are satisfied"

Examples:
Morphological: Change the tag from NN to NNS if the word has the suffix "s"
  dogs_NN → dogs_NNS
Syntactic: Change the tag from NN to VB if the word occurs after "to"
  go_NN to_TO → go_VB
Lexical: Change the tag to JJ if deleting the prefix "un" results in a word
  uncool_XXX → uncool_JJ
  uncle_NN -/-> uncle_JJ
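Rules of this form translate directly into code. The following Python sketch (names and condition encoding are mine, not Brill's) applies one rule to a tagged sequence:

```python
def make_rule(from_tag, to_tag, condition):
    """A transformation: retag from_tag -> to_tag where condition holds.

    condition(words, tags, i) inspects the context of position i.
    """
    def apply_rule(words, tags):
        out = list(tags)
        for i, t in enumerate(tags):   # delayed effect: test the *input* tags
            if t == from_tag and condition(words, tags, i):
                out[i] = to_tag
        return out
    return apply_rule

# "Change NN to VB if the word occurs after 'to'"
nn_to_vb = make_rule("NN", "VB",
                     lambda words, tags, i: i > 0 and words[i - 1] == "to")
print(nn_to_vb(["to", "go"], ["TO", "NN"]))   # ['TO', 'VB']
```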

Page 28:

Learning

Unannotated text is passed through an initial-state annotator, e.g. one that assigns each word its most frequent tag in a dictionary.

Truth: a manually annotated version of the corpus against which to compare.

The learner learns rules by comparing the initial state to the truth; its output is a set of rules.

Page 29:

Learning algorithm

Simple iterative process (sketched in code below):
apply a rule to the corpus
compare to the truth
if the error rate is reduced, keep the results

A priori specifications:
how the initial-state annotator works
the space of possible transformations (Brill (1995) used a set of initial templates)
the function to compare the result of applying the rules to the truth
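A minimal sketch of this greedy loop in Python, assuming a candidate_rules list (rules instantiated from the templates, e.g. via make_rule above) and an errors helper; both are illustrative stand-ins, not Brill's actual implementation:

```python
def errors(tags, truth):
    """Number of tagging errors relative to the gold standard."""
    return sum(t != g for t, g in zip(tags, truth))

def learn_rules(words, tags, truth, candidate_rules, max_rules=100):
    """Greedily pick the rule that most reduces errors, then re-apply."""
    learned = []
    for _ in range(max_rules):
        best_rule, best_err = None, errors(tags, truth)
        for rule in candidate_rules:
            err = errors(rule(words, tags), truth)
            if err < best_err:
                best_rule, best_err = rule, err
        if best_rule is None:         # no rule reduces the error rate: stop
            break
        tags = best_rule(words, tags)
        learned.append(best_rule)
    return learned, tags
```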

Page 30:

Non-lexicalised rule templates

Take only tags into account, not the shape of words.

Change tag a to tag b when:
1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the three preceding (following) words is tagged z.
4. The preceding (following) word is tagged z and the word two before (after) is tagged w.
5. ...

Page 31:

Lexicalised rule templates

Take into account specific words in the context.

Change tag a to tag b when:
1. The preceding (following) word is w.
2. The word two before (after) is w.
3. The current word is w, the preceding (following) word is w2 and the preceding (following) tag is t.
4. ...

Page 32:

Morphological rule templates

Useful for completely unknown words. Sensitive to the word's "shape".

Change the tag of an unknown word (from X) to Y if:
1. Deleting the prefix (suffix) x, |x| ≤ 4, results in a word.
2. The first (last) (1,2,3,4) characters of the word are x.
3. Adding the character string x as a prefix (suffix) results in a word (|x| ≤ 4).
4. Word w ever appears immediately to the left (right) of the word.
5. Character z appears in the word.
6. ...

Page 33:

Order-dependence of rules

Rules are triggered by environments satisfying their conditions.
E.g. "A → B if the preceding tag is A"
Suppose our sequence is "AAAA".

Two possible forms of rule application:
immediate effect: applications of the same transformation can influence each other; result: ABAB
delayed effect: the rule is triggered multiple times from the same initial input; result: ABBB
Brill (1995) opts for the latter.

Page 34:

More on transformation-based tagging

Can be used for unsupervised learning:
like HMM-based tagging, the only info available is the allowable tags for each word
takes advantage of the fact that most words have only one tag

E.g. word can = NN in the context AT ___ BEZ, because most other words in this context are NN
therefore, the learning algorithm would learn the rule "change tag to NN in context AT ___ BEZ"

The unsupervised method achieves 95.6% accuracy!!

Page 35:

Part 3

Maximum Entropy models and POS Tagging

Page 36:

Limitations of HMMs

An HMM tagger relies on:
P(tag | previous tag)
P(word | tag)
these are combined by multiplication

TBL includes many other useful features which are hard to model in an HMM:
prefixes, suffixes
capitalisation
...

Can we combine both, i.e. have HMM-style tagging with multiple features?

Page 37:

The rationale

In order to tag a word, we consider its context or "history" h. We want to estimate a probability distribution p(h,t) from sparse data.
h is encoded in terms of features (e.g. morphological features, surrounding tag features, etc.)

There are some constraints on these features that we discover from training data.

We want our model to make the fewest possible assumptions beyond these constraints.

Page 38:

Motivating example

Suppose we wanted to tag the word w.

Assume we have a set T of 45 different tags: T = {NN, JJ, NNS, NNP, VVS, VB, ...}

The probabilistic tagging model that makes the fewest assumptions assigns a uniform distribution over the tags:

∀t ∈ T: P(t) = 1/45

Page 39:

Motivating example

Suppose we find that the possible tags for w are NN, JJ, NNS, VB.

We therefore impose our first constraint on the model:

p(NN) + p(JJ) + p(NNS) + p(VB) = 1

(and the probability of every other tag is 0)

The simplest model satisfying this constraint:

p(NN) = p(JJ) = p(NNS) = p(VB) = 1/4

Page 40:

Motivating example

We suddenly discover that w is tagged as NN or NNS 8 out of 10 times.
The model now has two constraints:

P(NN) + P(JJ) + P(NNS) + P(VB) = 1
P(word = w and (t = NN or t = NNS)) = 8/10

Again, we require our model to make no further assumptions. The simplest distribution leaves the probabilities of all tags except NN/NNS equal:
P(NN) = 4/10
P(NNS) = 4/10
P(JJ) = 1/10
P(VB) = 1/10

Page 41:

Motivating example

We suddenly discover that verbs (VB) occur 1 in every 20 words.
The model now has three constraints:

P(NN) + P(JJ) + P(NNS) + P(VB) = 1
P(word = w and (t = NN or t = NNS)) = 8/10
P(VB) = 1/20

The simplest distribution is now:
P(NN) = 4/10
P(NNS) = 4/10
P(JJ) = 3/20
P(VB) = 1/20
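As a quick numerical sanity check (illustrative, not from the slides), this distribution satisfies all three constraints:

```python
from math import log2

p = {"NN": 4/10, "NNS": 4/10, "JJ": 3/20, "VB": 1/20}

assert abs(sum(p.values()) - 1.0) < 1e-9          # constraint 1
assert abs(p["NN"] + p["NNS"] - 8/10) < 1e-9      # constraint 2
assert abs(p["VB"] - 1/20) < 1e-9                 # constraint 3

entropy = -sum(q * log2(q) for q in p.values())
print(f"H(p) = {entropy:.3f} bits")               # ≈ 1.68 bits
```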

Page 42:

What we've been doing

Maximum entropy builds a distribution by continuously adding features.

Each feature picks out a subset of the training observations.

For each feature, we add a constraint on our total distribution.

Our task is then to find the best distribution given the constraints.

Page 43:

Features for POS tagging

Each tagging decision for a word occurs in a specific context or "history" h.

For tagging, we consider as context:
the word itself
morphological properties of the word
other words surrounding the word
previous tags

For each relevant aspect of the context h_i, we can define a feature f_j that allows us to learn how well that aspect is associated with a tag t_i.

The probability of a tag given a context is a weighted function of the features.

Page 44:

Features for POS tagging

In a maximum entropy model, this information is captured by a binary or indicator feature, e.g.:

f_j(h_i, t_i) = 1 if suffix(w_i) = "ing" & t_i = VBG, 0 otherwise

f_j(h_i, t_i) = 1 if t_{i-2} = DET & t_{i-1} = NN & t_i = VB, 0 otherwise

Each feature f_i has a weight α_i reflecting its importance.
NB: each α_i is uniquely associated with a feature.

Page 45:

Features for POS tagging in Ratnaparkhi (1996)

Ratnaparkhi had three sets of features: for non-rare words, for rare words, and for all words.

Page 46:

Features for POS tagging

Given the large number of possible features, which ones will be part of the model?

We do not want redundant features.
We do not want unreliable and rarely occurring features (to avoid overfitting).
Ratnaparkhi (1996) used only those features which occur 10 times or more in the training data.

Page 47:

The form of the model

Features f_j and their parameters α_j are used to compute the probability p(h_i, t_i):

p(h_i, t_i) = (1/Z) ∏_j α_j^{f_j(h_i, t_i)}

where j ranges over features and Z is a normalisation constant.

Transform into a linear equation:

log p(h_i, t_i) = log(1/Z) + Σ_j f_j(h_i, t_i) log α_j
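A minimal sketch of this product form in Python, reusing indicator features like those above; note that the joint Z sums over all (history, tag) pairs, which is intractable in practice and motivates the conditional form on the next slide:

```python
def joint_prob(h, t, features, alphas, histories, tags):
    """p(h, t) = (1/Z) * prod_j alpha_j ** f_j(h, t)."""
    def unnorm(h_, t_):
        score = 1.0
        for f, a in zip(features, alphas):
            score *= a ** f(h_, t_)    # alpha_j contributes only if f_j fires
        return score
    # Z normalises over the (history, tag) pairs supplied; illustrative only
    Z = sum(unnorm(h_, t_) for h_ in histories for t_ in tags)
    return unnorm(h, t) / Z
```

For instance, with features = [f_ing_vbg, f_det_nn_vb] and made-up weights alphas = [2.5, 1.8], joint_prob scores any (h, t) pair relative to the others.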

Page 48:

Conditional probabilities

The conditional probabilities can be computed based on the joint probabilities:

p(t_i | h_i) = p(h_i, t_i) / Σ_{t'} p(h_i, t')

Probability of a sequence of tags given a sequence of words:

p(t_1, ..., t_n | w_1, ..., w_n) = ∏_{i=1}^{n} p(t_i | h_i)

NB: unlike an HMM, we have one probability here:
we directly estimate p(t|h)
the model combines all features in h_i into a single estimate
no limit in principle on what features we can take into account

Page 49:

The use of constraints

Every feature we have imposes a constraint or expectation on the probability model. We want:

E(f_j) = Ẽ(f_j)

where:

Ẽ(f_j) = (1/n) Σ_{i=1}^{n} f_j(h_i, t_i)   (the empirical expectation of f_j)

E(f_j) = Σ_{h,t} p(h, t) f_j(h, t)   (the model p's expectation of f_j)

Page 50:

Why maximum entropy?

Recall that entropy is a measure of uncertainty in a distribution. Without any knowledge, the simplest distribution is uniform, and uniform distributions have the highest entropy.

As we add constraints, the MaxEnt principle dictates that we find the simplest model p* satisfying the constraints:

p* = argmax_{p ∈ P} H(p)

where P is the set of possible distributions with E(f_j) = Ẽ(f_j). p* is unique and has the form given earlier.

Basically, an application of Occam's razor: make no further assumptions than necessary.

Page 51:

Part 4

Training a MaxEnt model

Page 52:

Training 1: computing the empirical expectation

Recall that:

Ẽ(f_j) = (1/n) Σ_{i=1}^{n} f_j(h_i, t_i)

Suppose we are interested in the feature:

f_j(h_i, t_i) = 1 if w_i = "moving" & t_i = VBG, 0 otherwise

In a corpus of 10,000 words + tags, where the word moving occurs as VBG 20 times:

Ẽ(f_j) = (1/10000) Σ_{i=1}^{10000} f_j(h_i, t_i) = 20/10000 = 0.002

Page 53:

Training 2: computing the model expectation

Recall that:

E(f_j) = Σ_{h,t} p(h, t) f_j(h, t)

This requires a sum over all possible histories and tags!

Approximate by computing the model expectation of the feature on the training data only:

E(f_j) ≈ Σ_{i=1}^{N} p̃(h_i) Σ_t p(t | h_i) f_j(h_i, t)

where p̃(h_i) is the empirical probability of history h_i in the training data.

Page 54:

Learning the optimal parameters

Our goal is to learn the best parameter α_j for each feature f_j, such that:

Ẽ(f_j) = E(f_j)

i.e.:

(1/n) Σ_{i=1}^{n} f_j(h_i, t_i) = Σ_{i=1}^{N} p̃(h_i) Σ_t p(t | h_i) f_j(h_i, t)

One method is Generalised Iterative Scaling.

Page 55:

Generalised Iterative Scaling: assumptions

For all (h, t), the features sum to a constant value:

Σ_j f_j(h, t) = C

If this is not the case, we set C to:

C = max_{(h,t)} Σ_j f_j(h, t)

and add a filler feature f_l, such that:

∀(h, t): f_l(h, t) = C − Σ_j f_j(h, t)

Page 56:

Generalised Iterative Scaling: assumptions (II)

For all (h, t), there is at least one feature f which is active, i.e.:

∀(h, t): ∃f: f(h, t) = 1

Page 57:

Generalised Iterative Scaling

Input: features f_1, ..., f_n and the empirical distribution
Output: optimal parameter values α_1, ..., α_n

1. Initialise α_i = 0 for all i ∈ {1, 2, ..., n}
2. For each i do:
   a. set E(f_i), the model expectation of f_i under the current parameters
   b. set α_i ← α_i + (1/C) log( Ẽ(f_i) / E(f_i) )
3. If the model has not converged, repeat from (2)
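A compact GIS sketch in Python under the slide's assumptions (log-space weights initialised to 0, model expectation approximated on training histories); the filler feature and a proper convergence test are omitted for brevity, and the data format is illustrative:

```python
from math import exp, log

def gis(corpus, features, tags, iters=100):
    """Generalised Iterative Scaling with log-space weights (sketch).

    corpus: list of (history, gold_tag) pairs; features: list of f(h, t).
    """
    # C: the constant that feature sums are assumed (or padded) to reach;
    # guarded against the degenerate all-zero case
    C = max(1.0, max(sum(f(h, t) for f in features)
                     for h, _ in corpus for t in tags))
    lam = [0.0] * len(features)

    def p_cond(h, t):
        # p(t | h) proportional to exp(sum_j lam_j * f_j(h, t))
        score = lambda t_: exp(sum(l * f(h, t_) for l, f in zip(lam, features)))
        return score(t) / sum(score(t_) for t_ in tags)

    n = len(corpus)
    emp = [sum(f(h, t) for h, t in corpus) / n for f in features]  # empirical E
    for _ in range(iters):
        mod = [sum(p_cond(h, t) * f(h, t) for h, _ in corpus for t in tags) / n
               for f in features]                                  # model E
        for j in range(len(features)):
            if emp[j] > 0 and mod[j] > 0:
                lam[j] += (1.0 / C) * log(emp[j] / mod[j])
    return lam
```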

Page 58:

Part 5

Finding tag sequences with MaxEnt models

Page 59:

Tagging sequences

We want to tag a sequence w_1, ..., w_n.

This can be decomposed into:

p(t_1, ..., t_n | w_1, ..., w_n) = ∏_{i=1}^{n} p(t_i | h_i)

The history h_i consists of the words w_1, ..., w_{i-1} and the previous tags t_1, ..., t_{i-1}.

Page 60:

Finding the best tag sequence: beam search (Ratnaparkhi, 1996)

To find the best sequence of n tags, keeping the N highest-probability candidates at each step:
s_ij = the jth highest-probability tag sequence up to word i.

1. Generate all tags for w_1:
   a. find the top N tags
   b. set s_1j for 1 ≤ j ≤ N
2. for i = 2 to n do:
   a. for j = 1 to N do:
      i. generate tags for w_i given s_(i-1)j
      ii. append each tag to s_(i-1)j to create a new sequence
   b. find the N highest-probability sequences generated by loop 2a
3. Return s_n1
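A minimal beam-search sketch in Python; p_cond is assumed to be the conditional model from before, and a history here is encoded simply as (words, position, previous tags), an illustrative interface:

```python
def beam_search(words, tags, p_cond, N=3):
    """Keep the N most probable tag sequences at each position.

    p_cond((words, i, prev_tags), t) -> p(t | h_i)  (assumed model interface)
    Returns the single best sequence found.
    """
    beam = [((), 1.0)]                        # (tag sequence, probability)
    for i in range(len(words)):
        candidates = []
        for seq, p in beam:
            for t in tags:
                candidates.append((seq + (t,), p * p_cond((words, i, seq), t)))
        candidates.sort(key=lambda x: x[1], reverse=True)
        beam = candidates[:N]                 # prune to the top N
    return list(beam[0][0])
```

The worked example on the next slides traces exactly this loop with N = 1.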

Page 61:

Worked example

Suppose our data consists of the sequence: a, b, c
Assume the correct tags are A, B, C.
Assume that N = 1 (i.e. we only ever consider the single most likely tag).

Page 62:

Worked example

a    b    c
s11  s21  s31

Step 1: generate all possible tags for a: A, B, C
Step 2: find the most likely tag for a: A

Page 63:

Worked example

a    b    c
s11  s21  s31
A

Step 2: generate all possible tags for b: A, B, C
merge with s11: A-A, A-B, A-C
Find the most likely sequence: A-B

Page 64:

Worked example

a    b    c
s11  s21  s31
A    A-B

Step 3: generate all possible tags for w3: A, B, C
merge with s21: A-B-A, A-B-B, A-B-C
Find the most likely: A-B-C

Page 65:

Worked example

a    b    c
s11  s21  s31
A    A-B  A-B-C

Return s31 (= A-B-C)

Page 66:

Part 6

Markov Models vs. MaxEnt

Page 67:

HMM vs MaxEnt

Standard HMMs cannot compute conditional probabilities directly. E.g. for tagging:
we want p(t_{1,n} | w_{1,n})
we obtain it via Bayes' rule, combining p(w_{1,n} | t_{1,n}) with the prior p(t_{1,n})

HMMs are generative models which optimise p(w_{1,n} | t_{1,n}).

By contrast, a MaxEnt Markov Model (MEMM) is a discriminative model which optimises p(t_{1,n} | w_{1,n}) directly.

Page 68:

Graphically (after Jurafsky & Martin 2009)

HMM has separate models for P(w|t) and for P(t)

MEMM has a single model to estimate P(t|w)

Page 69:

More formally...

With an HMM:

argmax_{t_{1,n}} P(t_{1,n} | w_{1,n}) = argmax_{t_{1,n}} ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})

With a MEMM:

argmax_{t_{1,n}} P(t_{1,n} | w_{1,n}) = argmax_{t_{1,n}} ∏_{i=1}^{n} P(t_i | w_i, t_{i-1})

Page 70:

Adapting Viterbi

We can easily adapt the Viterbi algorithm to find the best state sequence in a MEMM. Recall that with HMMs:

δ_{t+1}(j) = max_{1≤i≤N} δ_t(i) P(s_j | s_i) P(o_{t+1} | s_j)

Adaptation for MEMMs: the transition and emission factors are replaced by a single conditional probability:

δ_{t+1}(j) = max_{1≤i≤N} δ_t(i) P(s_j | s_i, o_{t+1})
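Relative to the earlier Viterbi sketch, the change is confined to the induction step. A minimal sketch, assuming a conditional model p_cond(t, tp, obs) -> P(t | tp, obs) (an illustrative interface):

```python
# HMM induction step (from the earlier sketch):
#   score = delta[i][tp] * trans[(tp, t)] * emit[(words[i], t)]
# MEMM induction step: one conditional model replaces the two factors.
def memm_step(delta_i, tags, obs, p_cond):
    """One Viterbi induction step for a MEMM."""
    delta_next, psi_next = {}, {}
    for t in tags:
        tp_best = max(delta_i, key=lambda tp: delta_i[tp] * p_cond(t, tp, obs))
        delta_next[t] = delta_i[tp_best] * p_cond(t, tp_best, obs)
        psi_next[t] = tp_best
    return delta_next, psi_next
```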

Page 71:

Summary

MaxEnt is a powerful classification model with some advantages over HMMs:
direct computation of conditional probabilities from the training data
can handle multiple features

First introduced by Ratnaparkhi (1996) for POS tagging. Used for many other applications since then.