Albert Gatt
Corpora and Statistical Methods
POS Tagging
Assign each word in continuous text a tag indicating its part of speech. Essentially a classification problem.
Current state of the art: taggers typically have 96-97% accuracy, evaluated on a per-word basis. In a corpus with sentences of average length 20 words, 96% accuracy can mean one tagging error per sentence.
Sources of difficulty in POS tagging
Mostly due to ambiguity, when words have more than one possible tag:
- we need context to make a good guess about POS
- context alone won't suffice
A simple approach which assigns only the most common tag to each word performs with 90% accuracy!
The information sources
1. Syntagmatic information: the tags of other words in the context of w. Not sufficient on its own; e.g. Greene & Rubin (1971) describe a context-only tagger with only 77% accuracy.
2. Lexical information ("dictionary"): the most common tag(s) for a given word.
- e.g. in English, many nouns can be used as verbs (flour the pan, wax the car...); however, their most likely tag remains NN
- the distribution of a word's usages across different POSs is uneven: usually one tag is highly likely, the others much less so
Tagging in other languages (than English)
In English, high reliance on context is a good idea, because of fixed word order.
Free word order languages make this assumption harder.
Compensation: these languages typically have rich morphology, a good source of clues for a tagger.
Evaluation and error analysis
Training a statistical POS tagger requires splitting the corpus into training and test data. Often we need a development set as well, to tune parameters.
Using (n-fold) cross-validation is a good idea to save data:
- randomly divide the data into train + test
- train, and evaluate on test
- repeat n times and take an average
NB: cross-validation requires the whole corpus to be blind. To examine the training data, it is best to have fixed training & test sets, perform cross-validation on the training data, and do the final evaluation on the test set.
Evaluation
Typically carried out against a gold standard, based on accuracy (% correct).
Ideal to compare the accuracy of our tagger with:
- baseline (lower bound): the standard is to choose the unigram most likely tag
- ceiling (upper bound): e.g. see how well humans do at the same task; humans apparently agree on 96-97% of tags, which means it is highly suspect for a tagger to get 100% accuracy
Part 1
HMM taggers
Using Markov models
Basic idea: sequences of tags are a Markov chain:
- Limited horizon assumption: it is sufficient to look at the previous tag for information about the current tag
- Time invariance: the probability of a sequence remains the same over time
Implications/limitations
Limited horizon ignores long-distance dependencies, e.g. it can't deal with WH-constructions. Chomsky (1957): this was one of the reasons cited against probabilistic approaches.
Time invariance: e.g. P(finite verb | pronoun) is constant, but we may be more likely to find a finite verb following a pronoun at the start of a sentence than in the middle!
Notation
We let t_i range over tags and w_i range over words; subscripts denote position in a sequence.
Use superscripts to denote word types: w^j = an instance of word type j in the lexicon; t^j = tag t assigned to word w^j.
The limited horizon property becomes:
$P(t_{i+1} \mid t_{1,\dots,i}) = P(t_{i+1} \mid t_i)$
Basic strategy
Training set of manually tagged text; extract probabilities of tag sequences:
e.g. using the Brown Corpus, P(NN|JJ) = 0.45, but P(VBP|JJ) = 0.0005
Next step: estimate the word/tag probabilities:
$P(t^k \mid t^j) = \frac{C(t^j, t^k)}{C(t^j)}$
$P(w^l \mid t^j) = \frac{C(w^l, t^j)}{C(t^j)}$
These are basically symbol emission probabilities.
Training the tagger: basic algorithm
1. Estimate the probability of all possible sequences of 2 tags in the tagset from the training data.
2. For each tag t^j and for each word w^l, estimate P(w^l | t^j).
3. Apply smoothing.
A minimal sketch of steps 1-2 is given below.
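The following is a minimal sketch of the relative-frequency estimates above (no smoothing), assuming a toy corpus of (word, tag) sentences; the function name and the corpus are illustrative, not part of the original slides.

```python
from collections import defaultdict

def train_hmm(tagged_sentences):
    """Estimate P(t_k | t_j) and P(w_l | t_j) by relative frequency."""
    tag_bigrams = defaultdict(int)   # C(t_j, t_k)
    word_tag = defaultdict(int)      # C(w_l, t_j)
    tag_count = defaultdict(int)     # C(t_j)

    for sentence in tagged_sentences:
        prev = "<s>"                 # sentence-initial pseudo-tag
        for word, tag in sentence:
            tag_bigrams[(prev, tag)] += 1
            word_tag[(word, tag)] += 1
            tag_count[tag] += 1
            prev = tag

    transition = {(tj, tk): c / tag_count[tj]
                  for (tj, tk), c in tag_bigrams.items() if tj != "<s>"}
    emission = {(w, tj): c / tag_count[tj]
                for (w, tj), c in word_tag.items()}
    return transition, emission

corpus = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]
transition, emission = train_hmm(corpus)
print(transition[("DT", "NN")], emission[("dog", "NN")])
```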
Finding the best tag sequence
Given: a sentence of n words. Find: t_{1,n} = the best n tags.
Application of Bayes' rule; the denominator can be eliminated as it's the same for all tag sequences:
$\arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} \frac{P(w_{1,n} \mid t_{1,n})\, P(t_{1,n})}{P(w_{1,n})} = \arg\max_{t_{1,n}} P(w_{1,n} \mid t_{1,n})\, P(t_{1,n})$
Finding the best tag sequence
The expression needs to be reduced to parameters that can be estimated from the training corpus, so we need to make some simplifying assumptions:
1. words are independent of each other
2. a word's identity depends only on its tag
The independence assumption
The probability of a sequence of words given a sequence of tags is computed as a function of each word independently:
$P(w_{1,n} \mid t_{1,n})\, P(t_{1,n}) \approx \prod_{i=1}^{n} P(w_i \mid t_{1,n}) \times P(t_1)\, P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1})$
The identity assumption
The probability of a word given a tag sequence = the probability of the word given its own tag:
$\prod_{i=1}^{n} P(w_i \mid t_{1,n}) \times P(t_1)\, P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1}) \approx \prod_{i=1}^{n} P(w_i \mid t_i) \times P(t_1)\, P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1})$
Applying these assumptions
$\arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} P(w_{1,n} \mid t_{1,n})\, P(t_{1,n}) = \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$
Tagging with the Markov Model
Can use the Viterbi algorithm to find the best sequence of tags given a sequence of words (sentence).
Reminder:
- $\delta_i(j)$ = probability of being in state (tag) j at word i on the best path
- $\psi_{i+1}(j)$ = most probable state (tag) at word i given that we're in state j at word i+1
The algorithm: initialisation
$\delta_1(\text{PERIOD}) = 1.0$
$\delta_1(t) = 0$ for all other tags t
Assume that P(PERIOD) = 1 at the end of a sentence; set all other tag probabilities to 0.
Algorithm: induction step
for i = 1 to n step 1: for all tags t^j do:
$\delta_{i+1}(t^j) = \max_{1 \le k \le T} \left[ \delta_i(t^k)\, P(t^j \mid t^k)\, P(w_{i+1} \mid t^j) \right]$ (probability of tag t^j at i+1 on the best path through i)
$\psi_{i+1}(t^j) = \arg\max_{1 \le k \le T} \left[ \delta_i(t^k)\, P(t^j \mid t^k)\, P(w_{i+1} \mid t^j) \right]$ (most probable tag leading to t^j at i+1)
Algorithm: backtrace
$X_{n+1} = \arg\max_{1 \le j \le T} \delta_{n+1}(t^j)$ (state at n+1)
for j = n to 1 do: $X_j = \psi_{j+1}(X_{j+1})$ (retrieve the most probable tags for every point in the sequence)
$P(X_1, \dots, X_n) = \max_{1 \le j \le T} \delta_{n+1}(t^j)$ (probability for the sequence of tags selected)
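A sketch of Viterbi decoding for the HMM tagger described above, assuming the `transition` and `emission` dictionaries estimated earlier; for simplicity it starts from the first word's emission probabilities rather than the PERIOD trick, and unseen pairs get probability 0 (no smoothing).

```python
def viterbi(words, tags, transition, emission):
    n = len(words)
    delta = [{} for _ in range(n)]   # delta[i][t]: best prob of a path ending in tag t at word i
    psi = [{} for _ in range(n)]     # psi[i][t]: predecessor tag on that best path

    for t in tags:                   # initialisation
        delta[0][t] = emission.get((words[0], t), 0.0)
        psi[0][t] = None

    for i in range(1, n):            # induction step
        for t in tags:
            best_prev, best_score = None, 0.0
            for prev in tags:
                score = (delta[i - 1][prev]
                         * transition.get((prev, t), 0.0)
                         * emission.get((words[i], t), 0.0))
                if score >= best_score:
                    best_prev, best_score = prev, score
            delta[i][t], psi[i][t] = best_score, best_prev

    last = max(tags, key=lambda t: delta[n - 1][t])   # backtrace
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(psi[i][path[-1]])
    return list(reversed(path))
```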
Some observations
The model is a Hidden Markov Model: we only observe words when we tag.
In actuality, during training we have a visible Markov Model, because the training corpus provides words + tags.
"True" HMM taggers
Applied to cases where we do not have a large training corpus. We maintain the usual MM assumptions.
Initialisation: use a dictionary; set the emission probability for a word/tag pair to 0 if it's not in the dictionary.
Training: apply to data, using the forward-backward algorithm.
Tagging: exactly as before.
Part 2
Transformation-based error-driven learning
Transformation-based learning
Approach proposed by Brill (1995); uses quantitative information at the training stage. The outcome of training is a set of rules; tagging is then symbolic, using the rules.
Components:
- a set of transformation rules
- a learning algorithm
Transformations
General form: t1 → t2, "replace t1 with t2 if certain conditions are satisfied".
Examples:
- Morphological: change the tag from NN to NNS if the word has the suffix "s": dogs_NN → dogs_NNS
- Syntactic: change the tag from NN to VB if the word occurs after "to": go_NN to_TO → go_VB
- Lexical: change the tag to JJ if deleting the prefix "un" results in a word: uncool_XXX → uncool_JJ; but uncle_NN -/-> uncle_JJ
Learning
Unannotated text is passed through an initial state annotator, e.g. assign each word its most frequent tag in a dictionary.
Truth: a manually annotated version of the corpus against which to compare.
Learner: learns rules by comparing the initial state to the Truth; the output is a set of rules.
Learning algorithm
Simple iterative process (see the sketch after this list):
- apply a rule to the corpus
- compare to the Truth
- if the error rate is reduced, keep the results
A priori specifications:
- how the initial state annotator works
- the space of possible transformations; Brill (1995) used a set of initial templates
- the function to compare the result of applying the rules to the truth
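A minimal sketch of the greedy learning loop, assuming hypothetical helper functions `apply_rule(rule, tags)` and `candidate_rules(tags, truth, templates)` that instantiate the rule templates; neither is defined in the original slides.

```python
def errors(tags, truth):
    # number of tagging errors with respect to the manually annotated Truth
    return sum(1 for a, b in zip(tags, truth) if a != b)

def learn_rules(initial_tags, truth, templates, apply_rule, candidate_rules):
    tags, learned = list(initial_tags), []
    while True:
        best_rule, best_gain = None, 0
        for rule in candidate_rules(tags, truth, templates):
            # gain = reduction in error rate if we apply this rule
            gain = errors(tags, truth) - errors(apply_rule(rule, tags), truth)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:        # no rule reduces the error rate: stop
            break
        tags = apply_rule(best_rule, tags)
        learned.append(best_rule)
    return learned
```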
Non-lexicalised rule templates
Take only tags into account, not the shape of words.
Change tag a to tag b when:
1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the three preceding (following) words is tagged z.
4. The preceding (following) word is tagged z and the word two before (after) is tagged w.
5. ...
Lexicalised rule templates
Take into account specific words in the context.
Change tag a to tag b when:
1. The preceding (following) word is w.
2. The word two before (after) is w.
3. The current word is w, the preceding (following) word is w2 and the preceding (following) tag is t.
4. ...
Morphological rule templates
Useful for completely unknown words; sensitive to the word's "shape".
Change the tag of an unknown word (from X) to Y if:
1. Deleting the prefix (suffix) x, |x| ≤ 4, results in a word.
2. The first (last) (1,2,3,4) characters of the word are x.
3. Adding the character string x as a prefix (suffix) results in a word (|x| ≤ 4).
4. Word w ever appears immediately to the left (right) of the word.
5. Character z appears in the word.
6. ...
Order-dependence of rules
Rules are triggered by environments satisfying their conditions, e.g. "A → B if the preceding tag is A". Suppose our sequence is "AAAA". There are two possible forms of rule application:
- immediate effect: applications of the same transformation can influence each other; result: ABAB
- delayed effect: the rule is triggered multiple times from the same initial input; result: ABBB. Brill (1995) opts for this solution.
More on transformation-based tagging
Can be used for unsupervised learning: as with HMM-based tagging, the only information available is the allowable tags for each word. It takes advantage of the fact that most words have only one tag.
E.g. the word can = NN in the context AT ___ BEZ, because most other words in this context are NN; therefore, the learning algorithm would learn the rule "change tag to NN in the context AT ___ BEZ".
The unsupervised method achieves 95.6% accuracy!
Part 3
Maximum Entropy models and POS Tagging
Limitations of HMMs
An HMM tagger relies on P(tag | previous tag) and P(word | tag); these are combined by multiplication.
TBL includes many other useful features which are hard to model in an HMM: prefixes, suffixes, capitalisation, ...
Can we combine both, i.e. have HMM-style tagging with multiple features?
The rationale
In order to tag a word, we consider its context or "history" h. We want to estimate a probability distribution p(h,t) from sparse data.
- h is encoded in terms of features (e.g. morphological features, surrounding tag features, etc.)
- There are some constraints on these features that we discover from training data.
- We want our model to make the fewest possible assumptions beyond these constraints.
Motivating example
Suppose we wanted to tag the word w. Assume we have a set T of 45 different tags: T = {NN, JJ, NNS, NNP, VVS, VB, ...}
The probabilistic tagging model that makes fewest assumptions assigns a uniform distribution over the tags:
$\forall t \in T: P(t) = \frac{1}{45}$
Motivating example
Suppose we find that the possible tags for w are NN, JJ, NNS, VB.
We therefore impose our first constraint on the model:
$p(NN) + p(JJ) + p(NNS) + p(VB) = 1$
(and the probability of every other tag is 0)
The simplest model satisfying this constraint:
$p(NN) = p(JJ) = p(NNS) = p(VB) = \frac{1}{4}$
Motivating example
We suddenly discover that w is tagged as NN or NNS 8 out of 10 times. The model now has two constraints:
$p(NN) + p(JJ) + p(NNS) + p(VB) = 1$
$p(\text{word} = w \text{ and } (t = NN \text{ or } t = NNS)) = \frac{8}{10}$
Again, we require our model to make no further assumptions. The simplest distribution leaves the probabilities for all tags except NN/NNS equal:
P(NN) = 4/10, P(NNS) = 4/10, P(JJ) = 1/10, P(VB) = 1/10
Motivating example
We suddenly discover that verbs (VB) occur 1 in every 20 words. The model now has three constraints:
$p(NN) + p(JJ) + p(NNS) + p(VB) = 1$
$p(\text{word} = w \text{ and } (t = NN \text{ or } t = NNS)) = \frac{8}{10}$
$p(VB) = \frac{1}{20}$
The simplest distribution is now:
P(NN) = 4/10, P(NNS) = 4/10, P(JJ) = 3/20, P(VB) = 1/20
What we've been doing
Maximum entropy builds a distribution by continuously adding features. Each feature picks out a subset of the training observations. For each feature, we add a constraint on our total distribution. Our task is then to find the best distribution given the constraints.
Features for POS tagging
Each tagging decision for a word occurs in a specific context or "history" h.
For tagging, we consider as context:
- the word itself
- morphological properties of the word
- other words surrounding the word
- previous tags
For each relevant aspect of the context h_i, we can define a feature f_j that allows us to learn how well that aspect is associated with a tag t_i.
The probability of a tag given a context is a weighted function of the features.
Features for POS tagging
In a maximum entropy model, this information is captured by binary or indicator features, e.g.:
$f_j(h_i, t_i) = 1$ if suffix(w_i) = "ing" and t_i = VBG, 0 otherwise
$f_j(h_i, t_i) = 1$ if t_{i-2} = DET and t_{i-1} = NN and t_i = VB, 0 otherwise
Each feature f_j has a weight α_j reflecting its importance. NB: each α_j is uniquely associated with a feature.
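A small sketch of indicator features like the two above, assuming the history `h` is represented as a dict with keys such as "word" and "tags" (previous tags); this representation is illustrative, not from the original slides.

```python
def f_ing_vbg(h, t):
    # 1 if the current word ends in "ing" and the proposed tag is VBG
    return 1 if h["word"].endswith("ing") and t == "VBG" else 0

def f_det_nn_vb(h, t):
    # 1 if the two preceding tags are DET, NN and the proposed tag is VB
    return 1 if h["tags"][-2:] == ["DET", "NN"] and t == "VB" else 0

h = {"word": "moving", "tags": ["DET", "NN"]}
print(f_ing_vbg(h, "VBG"), f_det_nn_vb(h, "VB"))   # -> 1 1
```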
Features for POS tagging in Ratnaparkhi (1996)
Ratnaparkhi had three sets of features, for non-rare, rare and all words.
Features for POS tagging
Given the large number of possible features, which ones will be part of the model?
- We do not want redundant features.
- We do not want unreliable and rarely occurring features (to avoid overfitting).
Ratnaparkhi (1996) used only those features which occur 10 times or more in the training data.
The form of the model
Features f_j and their parameters are used to compute the probability p(h_i, t_i):
$p(h_i, t_i) = \frac{1}{Z} \prod_j \alpha_j^{f_j(h_i, t_i)}$
where j ranges over features and Z is a normalisation constant.
Transform into a linear equation:
$\log p(h_i, t_i) = \log \frac{1}{Z} + \sum_j f_j(h_i, t_i) \log \alpha_j$
Conditional probabilities
The conditional probabilities can be computed based on the joint probabilities:
$p(t_i \mid h_i) = \frac{p(h_i, t_i)}{\sum_{t'} p(h_i, t')}$
Probability of a sequence of tags given a sequence of words:
$p(t_1, \dots, t_n \mid w_1, \dots, w_n) = \prod_{i=1}^{n} p(t_i \mid h_i)$
NB: unlike an HMM, we have one probability here:
- we directly estimate p(t|h)
- the model combines all features in h_i into a single estimate
- no limit in principle on what features we can take into account
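A sketch of computing p(t|h) from feature weights, following the product form above; `features` is a list of feature functions f_j(h, t) (as in the earlier snippet) and `alpha` the corresponding weights, both illustrative.

```python
import math

def p_joint_unnorm(h, t, features, alpha):
    # product over features of alpha_j ** f_j(h, t); the constant 1/Z is omitted
    return math.prod(a ** f(h, t) for f, a in zip(features, alpha))

def p_conditional(h, t, tagset, features, alpha):
    # Z cancels in the conditional: only the sum over candidate tags matters
    scores = {t2: p_joint_unnorm(h, t2, features, alpha) for t2 in tagset}
    return scores[t] / sum(scores.values())
```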
The use of constraints
Every feature we have imposes a constraint or expectation on the probability model. We want:
$E_p(f_j) = E_{\tilde{p}}(f_j)$
where the model p's expectation of f_j is:
$E_p(f_j) = \sum_{h,t} p(h,t)\, f_j(h,t)$
and the empirical expectation of f_j is:
$E_{\tilde{p}}(f_j) = \frac{1}{n} \sum_{i=1}^{n} f_j(h_i, t_i)$
Why maximum entropy?
Recall that entropy is a measure of uncertainty in a distribution. Without any knowledge, the simplest distribution is uniform, and uniform distributions have the highest entropy.
As we add constraints, the MaxEnt principle dictates that we find the simplest model p* satisfying the constraints:
$p^* = \arg\max_{p \in P} H(p)$ subject to $E_p(f_j) = E_{\tilde{p}}(f_j)$
where P is the set of possible distributions; p* is unique and has the form given earlier.
Basically, an application of Occam's razor: make no further assumptions than necessary.
Part 4
Training a MaxEnt model
Training 1: computing the empirical expectation
Recall that:
$E_{\tilde{p}}(f_j) = \frac{1}{n} \sum_{i=1}^{n} f_j(h_i, t_i)$
Suppose we are interested in the feature:
$f_j(h_i, t_i) = 1$ if w_i = "moving" and t_i = VBG, 0 otherwise
In a corpus of 10k words + tags, where the word moving occurs as VBG 20 times:
$E_{\tilde{p}}(f_j) = \frac{1}{10000} \sum_{i=1}^{10000} f_j(h_i, t_i) = \frac{20}{10000} = 0.002$
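In code, the empirical expectation is just the feature's relative frequency over the training pairs, mirroring the moving/VBG example; the representation of the corpus as (history, tag) pairs is an assumption.

```python
def empirical_expectation(f, tagged_corpus):
    # tagged_corpus: list of (history, tag) pairs; f: a binary feature f(h, t)
    n = len(tagged_corpus)
    return sum(f(h, t) for h, t in tagged_corpus) / n
```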
Training 2: computing the model expectation
Recall that:
$E_p(f_j) = \sum_{h,t} p(h,t)\, f_j(h,t)$
This requires a sum over all possible histories and tags! Approximate by computing the model expectation of the feature on the training data only:
$E_p(f_j) \approx \sum_{i=1}^{N} \tilde{p}(h_i) \sum_{t} p(t \mid h_i)\, f_j(h_i, t) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t} p(t \mid h_i)\, f_j(h_i, t)$
Learning the optimal parameters
Our goal is to learn the best parameter α_j for each feature f_j, such that:
$E_{\tilde{p}}(f_j) = E_p(f_j)$
i.e.:
$\frac{1}{n} \sum_{i=1}^{n} f_j(h_i, t_i) = \sum_{i=1}^{N} \tilde{p}(h_i) \sum_{t} p(t \mid h_i)\, f_j(h_i, t)$
One method is Generalised Iterative Scaling.
Generalised Iterative Scaling: assumptions
For all (h,t), the features sum to a constant value:
$\sum_j f_j(h,t) = C$
If this is not the case, we set C to:
$C = \max_{(h,t)} \sum_j f_j(h,t)$
and add a filler feature f_l, such that:
$\forall (h,t): f_l(h,t) = C - \sum_j f_j(h,t)$
Generalised Iterative Scaling: assumptions (II)
For all (h,t), there is at least one feature f which is active, i.e.:
$\forall (h,t)\ \exists f: f(h,t) = 1$
Generalised Iterative Scaling
Input: features f_1, ..., f_n and the empirical distribution
Output: optimal parameter values α_1, ..., α_n
1. Initialise α_i = 0 for all i ∈ {1, 2, ..., n}
2. For each i do:
   a. compute the model expectation $E_p(f_i)$ under the current parameters
   b. set $\alpha_i \leftarrow \alpha_i + \frac{1}{C} \log \frac{E_{\tilde{p}}(f_i)}{E_p(f_i)}$
3. If the model has not converged, repeat from (2)
A sketch of this loop is given below.
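A minimal sketch of GIS in the log-parameter form shown above, assuming binary features that sum to C for every (h, t) (filler feature included); the function names and the fixed iteration count are illustrative.

```python
import math

def gis(features, histories, truth, tagset, C, iterations=50):
    alpha = [0.0] * len(features)            # log-parameters, initialised to 0

    def p_cond(h):                           # the current model's p(t | h)
        scores = {t: math.exp(sum(a * f(h, t) for f, a in zip(features, alpha)))
                  for t in tagset}
        z = sum(scores.values())
        return {t: s / z for t, s in scores.items()}

    n = len(histories)
    # empirical expectations, computed once from the training data
    e_emp = [sum(f(h, t) for h, t in zip(histories, truth)) / n for f in features]

    for _ in range(iterations):
        # model expectations, approximated on the training histories
        e_model = [0.0] * len(features)
        for h in histories:
            dist = p_cond(h)
            for j, f in enumerate(features):
                e_model[j] += sum(dist[t] * f(h, t) for t in tagset) / n
        # GIS update: alpha_i += (1/C) * log(E_emp / E_model)
        for j in range(len(features)):
            if e_emp[j] > 0 and e_model[j] > 0:
                alpha[j] += math.log(e_emp[j] / e_model[j]) / C
    return alpha
```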
Part 5
Finding tag sequences with MaxEnt models
Tagging sequences
We want to tag a sequence w_1, ..., w_n. This can be decomposed into:
$p(t_1, \dots, t_n \mid w_1, \dots, w_n) = \prod_{i=1}^{n} p(t_i \mid h_i)$
The history h_i consists of the words w_1, ..., w_{i-1} and the previous tags t_1, ..., t_{i-1}.
Finding the best tag sequence: beam search (Ratnaparkhi, 1996)
To find the best sequence of n tags, we keep only the N best candidate sequences at each step (beam width N). Let s_ij = the jth highest probability tag sequence up to word i.
1. Generate all tags for w_1:
   a. find the top N tags
   b. set s_1j for 1 ≤ j ≤ N
2. For i = 2 to n do:
   a. for j = 1 to N do:
      i. generate tags for w_i given s_(i-1)j
      ii. append each tag to s_(i-1)j to create a new sequence
   b. find the N highest probability sequences generated by loop 2a
3. Return s_n1
A sketch of this procedure is given below.
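A sketch of the beam search, assuming a scoring function `p_cond(history, tag)` returning p(t|h) and a helper `make_history(words, tags_so_far, i)` that builds the history; both names are hypothetical.

```python
def beam_search(words, tagset, p_cond, make_history, N=3):
    # each beam entry is (probability, tag_sequence)
    beam = [(1.0, [])]
    for i, _ in enumerate(words):
        candidates = []
        for prob, seq in beam:
            h = make_history(words, seq, i)
            for t in tagset:
                candidates.append((prob * p_cond(h, t), seq + [t]))
        # keep only the N highest-probability sequences (s_i1, ..., s_iN)
        beam = sorted(candidates, key=lambda x: x[0], reverse=True)[:N]
    return beam[0][1]                        # s_n1: the best full sequence
```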
Worked example
Suppose our data consists of the sequence: a, b, c.
Assume the correct tags are A, B, C.
Assume that N = 1 (i.e. we only ever consider the single most likely tag).
Worked example
  a    b    c
  s11  s21  s31
Step 1: generate all possible tags for a: A, B, C.
Step 2: find the most likely tag for a: A.
Worked example
  a    b    c
  s11  s21  s31
  A
Step 2: generate all possible tags for b: A, B, C; merge with s11: A-A, A-B, A-C; find the most likely sequence: A-B.
Worked example
  a    b    c
  s11  s21  s31
  A    A-B
Step 3: generate all possible tags for w3: A, B, C; merge with s21: A-B-A, A-B-B, A-B-C; find the most likely: A-B-C.
Worked example
  a    b    c
  s11  s21  s31
  A    A-B  A-B-C
Return s31 (= A-B-C).
Part 6
Markov Models vs. MaxEnt
HMM vs MaxEnt
Standard HMMs cannot compute the conditional probability directly. E.g. for tagging:
- we want p(t_{1,n} | w_{1,n})
- we obtain it via Bayes' rule, combining p(w_{1,n} | t_{1,n}) with the prior p(t_{1,n})
HMMs are generative models which optimise p(w_{1,n} | t_{1,n}).
By contrast, a MaxEnt Markov Model (MEMM) is a discriminative model which optimises p(t_{1,n} | w_{1,n}) directly.
Graphically (after Jurafsky & Martin 2009)
The HMM has separate models for P(w|t) and for P(t); the MEMM has a single model to estimate P(t|w).
More formally...
With an HMM:
$\hat{t}_{1,n} = \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$
With a MEMM:
$\hat{t}_{1,n} = \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(t_i \mid w_i, t_{i-1})$
Adapting Viterbi
We can easily adapt the Viterbi algorithm to find the best state sequence in a MEMM.
Recall that with HMMs:
$\delta_{i+1}(s_j) = \max_{k} \delta_i(s_k)\, P(s_j \mid s_k)\, P(o_{i+1} \mid s_j)$
Adaptation for MEMMs:
$\delta_{i+1}(s_j) = \max_{k} \delta_i(s_k)\, P(s_j \mid s_k, o_{i+1})$
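In code, the only change to the earlier Viterbi induction step is that the transition and emission terms are replaced by a single conditional; a sketch, assuming a hypothetical `p_cond(prev_tag, word, tag)` returning P(t | t_prev, w).

```python
def memm_induction_step(delta_prev, tags, word, p_cond):
    # delta_prev: dict mapping each tag to the best path probability at the previous word
    delta, psi = {}, {}
    for t in tags:
        best_prev = max(tags, key=lambda k: delta_prev[k] * p_cond(k, word, t))
        delta[t] = delta_prev[best_prev] * p_cond(best_prev, word, t)
        psi[t] = best_prev
    return delta, psi
```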
Summary
MaxEnt is a powerful classification model with some advantages over the HMM:
- direct computation of conditional probabilities from the training data
- can handle multiple features
First introduced by Ratnaparkhi (1996) for POS tagging; used for many other applications since then.