Corpora and Statistical Methods

Albert Gatt

POS Tagging
Assign each word in continuous text a tag indicating its part of speech. This is essentially a classification problem.

Current state of the art: taggers typically have 96-97% accuracy, evaluated on a per-word basis. In a corpus with sentences of average length 20 words, 96% accuracy can mean one tagging error per sentence.

Sources of difficulty in POS tagging
Difficulty is mostly due to ambiguity, when words have more than one possible tag. We need context to make a good guess about POS, but context alone won't suffice.

A simple approach which assigns only the most common tag to each word performs with 90% accuracy!
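As a concrete illustration, here is a minimal sketch of such a most-frequent-tag baseline; the toy corpus and tag names are illustrative.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Learn the most frequent tag for each word from sentences of
    (word, tag) pairs; also record the overall most common tag."""
    counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
            tag_counts[tag] += 1
    most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = tag_counts.most_common(1)[0][0]
    return most_frequent, default_tag

def tag_baseline(words, most_frequent, default_tag):
    """Assign each word its most frequent training tag; unknown words
    get the overall most common tag."""
    return [(w, most_frequent.get(w, default_tag)) for w in words]

# Toy usage (illustrative data)
train = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
         [("the", "DT"), ("flour", "NN"), ("spills", "VBZ")]]
mf, default = train_baseline(train)
print(tag_baseline(["the", "dog", "spills"], mf, default))
```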

The information sources
Syntagmatic information: the tags of other words in the context of w. This is not sufficient on its own: e.g. Greene & Rubin (1971) describe a context-only tagger with only 77% accuracy.

Lexical information (dictionary): the most common tag(s) for a given word. E.g. in English, many nouns can be used as verbs (flour the pan, wax the car); however, their most likely tag remains NN. The distribution of a word's usages across different POSs is uneven: usually one is highly likely, the others much less so.

Tagging in other languages (than English)
In English, heavy reliance on context is a good idea because of the fixed word order.

Free word order languages make this assumption harder. As compensation, these languages typically have rich morphology, which is a good source of clues for a tagger.

Evaluation and error analysis
Training a statistical POS tagger requires splitting the corpus into training and test data. Often, we need a development set as well, to tune parameters.

Using (n-fold) cross-validation is a good idea to save data: randomly divide the data into train + test, train, evaluate on the test portion, then repeat n times and take an average.
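A minimal sketch of this n-fold procedure; train_tagger and evaluate are hypothetical stand-ins for whatever tagger and per-word accuracy function are being tested.

```python
import random

def cross_validate(tagged_sentences, train_tagger, evaluate, n_folds=10, seed=0):
    """Shuffle the data, split it into n folds, train on n-1 folds,
    evaluate on the held-out fold, and average the accuracies."""
    data = list(tagged_sentences)
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for i in range(n_folds):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_tagger(train)
        accuracies.append(evaluate(model, test))
    return sum(accuracies) / n_folds
```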

NB: cross-validation requires the whole corpus to be blind. If we want to examine the training data, it is best to have fixed training & test sets, perform cross-validation on the training data, and do the final evaluation on the test set.

Evaluation
Typically carried out against a gold standard, based on accuracy (% correct).

It is ideal to compare the accuracy of our tagger with:
a baseline (lower bound): the standard is to choose the unigram most likely tag;
a ceiling (upper bound): e.g. see how well humans do at the same task. Humans apparently agree on 96-97% of tags, which means it is highly suspect for a tagger to get 100% accuracy.

Part 1
HMM taggers

Using Markov models
Basic idea: sequences of tags are a Markov chain.
Limited horizon assumption: it is sufficient to look at the previous tag for information about the current tag.
Time invariance: the probability of a sequence remains the same over time.

Implications/limitations
Limited horizon ignores long-distance dependencies: e.g. it can't deal with WH-constructions. Chomsky (1957) cited this as one of the reasons against probabilistic approaches.

Time invariance: e.g. P(finite verb|pronoun) is constant, but we may be more likely to find a finite verb following a pronoun at the start of a sentence than in the middle!

Notation
We let ti range over tags and wi range over words. Subscripts denote position in a sequence.

We use superscripts to denote word types: wj = an instance of word type j in the lexicon; tj = the tag t assigned to word wj.

The limited horizon property becomes: P(ti+1 | t1,...,ti) = P(ti+1 | ti)

Basic strategy
From a training set of manually tagged text, extract the probabilities of tag sequences:

e.g. using Brown Corpus, P(NN|JJ) = 0.45, but P(VBP|JJ) = 0.0005

Next step: estimate the word/tag probabilities P(wl | tj).

These are basically symbol emission probabilities.

Training the tagger: basic algorithm
Estimate the probability of all possible sequences of 2 tags in the tagset from the training data.

For each tag tj and for each word wl estimate P(wl| tj).
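A count-based sketch of these two estimation steps, using relative frequencies (smoothing, mentioned next, is omitted; the corpus format is the same illustrative one as in the earlier sketch).

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of transition P(t_i | t_{i-1}) and
    emission P(w | t) from a tagged corpus. "<s>" marks sentence start."""
    transition, emission = Counter(), Counter()
    tag_totals, prev_totals = Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            transition[(prev, tag)] += 1
            prev_totals[prev] += 1
            emission[(tag, word)] += 1
            tag_totals[tag] += 1
            prev = tag
    trans_p = {(p, t): c / prev_totals[p] for (p, t), c in transition.items()}
    emit_p = {(t, w): c / tag_totals[t] for (t, w), c in emission.items()}
    return trans_p, emit_p
```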

Apply smoothing.

Finding the best tag sequence
Given: a sentence of n words. Find: t1,n = the best n tags.

Application of Bayes' rule: P(t1,n | w1,n) = P(w1,n | t1,n) P(t1,n) / P(w1,n). The denominator can be eliminated, as it is the same for all tag sequences.

Finding the best tag sequence
The expression P(w1,n | t1,n) P(t1,n) needs to be reduced to parameters that can be estimated from the training corpus.

We need to make some simplifying assumptions: words are independent of each other; a word's identity depends only on its tag.

The independence assumption
The probability of a sequence of words given a sequence of tags is computed as a function of each word independently: P(w1,n | t1,n) = ∏i P(wi | t1,n)

The identity assumption

The probability of a word given a tag sequence = the probability of the word given its own tag: P(wi | t1,n) = P(wi | ti)

Applying these assumptions
The best tag sequence is then: argmax t1,n ∏i P(wi | ti) P(ti | ti-1)

Tagging with the Markov Model
We can use the Viterbi algorithm to find the best sequence of tags given a sequence of words (a sentence). Reminder:
δi(j) = the probability of being in state (tag) j at word i on the best path;

ψi+1(j) = the most probable state (tag) at word i, given that we're in state j at word i+1.

The algorithm: initialisation

Assume that P(PERIOD) = 1 at the end of a sentence; set all other tag probabilities to 0.

Algorithm: induction step
for i = 1 to n step 1: for all tags tj do:

δi+1(tj) = maxk [δi(tk) · P(tj | tk) · P(wi+1 | tj)], the probability of tag tj at i+1 on the best path through i;
ψi+1(tj) = argmaxk [δi(tk) · P(tj | tk) · P(wi+1 | tj)], the most probable tag leading to tj at i+1.

Algorithm: backtrace

State at n+1: Xn+1 = argmaxj δn+1(j)

for j = n to 1 do: retrieve the most probable tag for every point in the sequence (Xj = ψj+1(Xj+1)); then calculate the probability for the sequence of tags selected.

Some observations
The model is a Hidden Markov Model: we only observe words when we tag.

In actuality, during training we have a visible Markov Model, because the training corpus provides words + tags.

True HMM taggers
Applied in cases where we do not have a large training corpus.

We maintain the usual MM assumptions

Initialisation: use a dictionary; set the emission probability for a word/tag pair to 0 if it is not in the dictionary.

Training: apply to data, use forward-backward algorithm

Tagging: exactly as before
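Putting Part 1 together, here is a minimal sketch of Viterbi decoding for a bigram HMM tagger. trans_p and emit_p are assumed to be tables like those estimated earlier, and a small floor probability stands in for proper smoothing of unseen events.

```python
def viterbi_tag(words, tags, trans_p, emit_p, floor=1e-6):
    """Find the most probable tag sequence for `words` under a bigram HMM.
    trans_p[(prev_tag, tag)] and emit_p[(tag, word)] are probabilities."""
    # delta[i][t]: probability of the best path ending in tag t at word i
    # psi[i][t]:  the previous tag on that best path (for backtracing)
    delta = [{t: trans_p.get(("<s>", t), floor) * emit_p.get((t, words[0]), floor)
              for t in tags}]
    psi = [{}]
    for i in range(1, len(words)):
        delta.append({})
        psi.append({})
        for t in tags:
            best_prev, best_score = max(
                ((p, delta[i - 1][p] * trans_p.get((p, t), floor)) for p in tags),
                key=lambda x: x[1])
            delta[i][t] = best_score * emit_p.get((t, words[i]), floor)
            psi[i][t] = best_prev
    # Backtrace from the best final tag
    last = max(delta[-1], key=delta[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(psi[i][path[-1]])
    return list(reversed(path))
```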

Part 2
Transformation-based error-driven learning

Transformation-based learning
An approach proposed by Brill (1995). It uses quantitative information at the training stage; the outcome of training is a set of rules; tagging is then symbolic, using the rules.

Components: a set of transformation rules, and a learning algorithm.

Transformations
General form: t1 → t2, i.e. replace t1 with t2 if certain conditions are satisfied.

Examples:
Morphological: change the tag from NN to NNS if the word has the suffix "s": dogs_NN → dogs_NNS

Syntactic: change the tag from NN to VB if the word occurs after "to": to_TO go_NN → to_TO go_VB

Lexical: change the tag to JJ if deleting the prefix "un" results in a word: uncool_XXX → uncool_JJ, but uncle_NN -/-> uncle_JJ.

Learning
Unannotated text is passed to an initial state annotator (e.g. assign each word its most frequent tag in a dictionary). The truth is a manually annotated version of the corpus against which to compare. The learner learns rules by comparing the initial state to the truth, and outputs rules.

Learning algorithm
A simple iterative process: apply a rule to the corpus, compare to the truth, and if the error rate is reduced, keep the results.
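A simplified sketch of this greedy loop; the candidate_rules list and apply_rule function stand in for Brill's template-generated transformations and are illustrative assumptions, not his actual implementation.

```python
def tbl_learn(corpus_tags, truth_tags, candidate_rules, apply_rule):
    """Greedy transformation-based learning: repeatedly pick the candidate
    rule that most reduces the error count, keep it, and stop when no rule
    helps. corpus_tags/truth_tags are parallel lists of tag sequences."""
    def errors(tags):
        return sum(t != g for seq, gold in zip(tags, truth_tags)
                   for t, g in zip(seq, gold))

    learned = []
    current = [list(seq) for seq in corpus_tags]
    while True:
        best_rule, best_result, best_err = None, None, errors(current)
        for rule in candidate_rules:
            result = [apply_rule(rule, seq) for seq in current]
            err = errors(result)
            if err < best_err:
                best_rule, best_result, best_err = rule, result, err
        if best_rule is None:
            return learned
        learned.append(best_rule)
        current = best_result
```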

A priori specifications: how the initial state annotator works; the space of possible transformations (Brill (1995) used a set of templates); and the function to compare the result of applying the rules to the truth.

Non-lexicalised rule templates
These take only tags into account, not the shape of words.

Change tag a to tag b when: the preceding (following) word is tagged z; the word two before (after) is tagged z; one of the three preceding (following) words is tagged z; the preceding (following) word is tagged z and the word two before (after) is tagged w.

Lexicalised rule templates
These take into account specific words in the context.

Change tag a to tag b when: the preceding (following) word is w; the word two before (after) is w; the current word is w, the preceding (following) word is w2 and the preceding (following) tag is t.

Morphological rule templates
Useful for completely unknown words; sensitive to the word's shape.

Change the tag of an unknown word (from X) to Y if: deleting the prefix (suffix) x, |x| ≤ 4, results in a word; the first (last) 1, 2, 3 or 4 characters of the word are x; adding the character string x as a prefix (suffix) results in a word (|x| ≤ 4); word w ever appears immediately to the left (right) of the word; character z appears in the word.

Order-dependence of rules
Rules are triggered by environments satisfying their conditions, e.g. A → B if the preceding tag is A. Suppose our sequence is AAAA. There are two possible forms of rule application:
immediate effect: applications of the same transformation can influence each other; the result is ABAB;
delayed effect: the result is ABBB, because the rule is triggered multiple times from the same initial input. Brill (1995) opts for this solution.

More on transformation-based tagging
It can be used for unsupervised learning: as in HMM-based tagging, the only information available is the allowable tags for each word. It takes advantage of the fact that most words have only one tag. E.g. the word can = NN in the context AT ___ BEZ, because most other words in this context are NN; therefore, the learning algorithm would learn the rule "change the tag to NN in the context AT ___ BEZ".

The unsupervised method achieves 95.6% accuracy!

Part 3
Maximum Entropy models and POS Tagging

Limitations of HMMs
An HMM tagger relies on P(tag|previous tag) and P(word|tag); these are combined by multiplication.

TBL includes many other useful features which are hard to model in an HMM: prefixes, suffixes, capitalisation.

Can we combine both, i.e. have HMM-style tagging with multiple features?

The rationale
In order to tag a word, we consider its context or history h. We want to estimate a probability distribution p(h,t) from sparse data. h is encoded in terms of features (e.g. morphological features, surrounding-tag features, etc.).

There are some constraints on these features that we discover from training data.

We want our model to make the fewest possible assumptions beyond these constraints.

Motivating example
Suppose we wanted to tag the word w.

Assume we have a set T of 45 different tags: T = {NN, JJ, NNS, NNP, VVS, VB, ...}

The probabilistic tagging model that makes the fewest assumptions assigns a uniform distribution over the tags: p(t) = 1/45 for every t in T.

Motivating example
Suppose we find that the possible tags for w are NN, JJ, NNS and VB.

We therefore impose our first constraint on the model: p(NN) + p(JJ) + p(NNS) + p(VB) = 1

(and the prob. of every other tag is 0)

The simplest model satisfying this constraint: p(NN) = p(JJ) = p(NNS) = p(VB) = 1/4

Motivating example
We suddenly discover that w is tagged as NN or NNS 8 out of 10 times. The model now has two constraints:
p(NN) + p(JJ) + p(NNS) + p(VB) = 1
p(NN) + p(NNS) = 8/10

Again, we require our model to make no further assumptions. The simplest distribution leaves the probabilities for all tags except NN/NNS equal:
P(NN) = 4/10, P(NNS) = 4/10, P(JJ) = 1/10, P(VB) = 1/10

Motivating example
We suddenly discover that verbs (VB) occur once in every 20 words. The model now has three constraints:
p(NN) + p(JJ) + p(NNS) + p(VB) = 1
p(NN) + p(NNS) = 8/10
p(VB) = 1/20

The simplest distribution is now:
P(NN) = 4/10, P(NNS) = 4/10, P(JJ) = 3/20, P(VB) = 1/20
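A small sketch that reproduces this arithmetic: probability mass fixed by a constraint is spread uniformly inside the constrained set, and the remainder is spread uniformly over the unconstrained tags.

```python
def spread_uniformly(tags, fixed_groups):
    """fixed_groups: list of (tag_set, total_mass) constraints.
    Mass inside each group is spread uniformly over its tags; leftover
    probability is spread uniformly over tags in no group."""
    p, covered, used = {}, set(), 0.0
    for group, mass in fixed_groups:
        for t in group:
            p[t] = mass / len(group)
        covered |= set(group)
        used += mass
    rest = [t for t in tags if t not in covered]
    for t in rest:
        p[t] = (1.0 - used) / len(rest)
    return p

tags = ["NN", "NNS", "JJ", "VB"]
print(spread_uniformly(tags, [({"NN", "NNS"}, 0.8)]))                 # second step
print(spread_uniformly(tags, [({"NN", "NNS"}, 0.8), ({"VB"}, 0.05)])) # third step
```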

What we've been doing
Maximum entropy builds a distribution by continuously adding features.

Each feature picks out a subset of the training observations.

For each feature, we add a constraint on our total distribution.

Our task is then to find the best distribution given the constraints.

Features for POS Tagging
Each tagging decision for a word occurs in a specific context or history h.

For tagging, we consider as context: the word itself; morphological properties of the word; other words surrounding the word; previous tags.

For each relevant aspect of the context hi, we can define a feature fj that allows us to learn how well that aspect is associated with a tag ti.

The probability of a tag given a context is a weighted function of the features.

Features for POS Tagging
In a maximum entropy model, this information is captured by a binary or indicator feature, e.g. fj(h,t) = 1 if the condition on (h,t) holds, and 0 otherwise.

Each feature fi has a weight αi reflecting its importance. NB: each αi is uniquely associated with a feature.
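A sketch of what such indicator features might look like in code; the particular templates are illustrative, loosely modelled on the feature types listed above.

```python
def f_suffix_ing_vbg(history, tag):
    """Fires when the current word ends in -ing and the proposed tag is VBG."""
    return 1 if history["word"].endswith("ing") and tag == "VBG" else 0

def f_prev_tag_dt_nn(history, tag):
    """Fires when the previous tag is DT and the proposed tag is NN."""
    return 1 if history["prev_tag"] == "DT" and tag == "NN" else 0

def f_capitalised_nnp(history, tag):
    """Fires when the current word is capitalised and the proposed tag is NNP."""
    return 1 if history["word"][:1].isupper() and tag == "NNP" else 0

history = {"word": "moving", "prev_tag": "DT"}
print(f_suffix_ing_vbg(history, "VBG"))  # 1
print(f_prev_tag_dt_nn(history, "NN"))   # 1
```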

Features for POS Tagging in Ratnaparkhi (1996)
Ratnaparkhi had three sets of features, for non-rare, rare and all words.

Features for POS Tagging
Given the large number of possible features, which ones will be part of the model? We do not want redundant features, and we do not want unreliable and rarely occurring features (to avoid overfitting). Ratnaparkhi (1996) used only those features which occur 10 times or more in the training data.

The form of the model
The features fj and their parameters αj are used to compute the probability p(hi, ti):
p(hi, ti) = (1/Z) ∏j αj^fj(hi, ti)
where j ranges over the features and Z is a normalisation constant.

Transforming into a linear equation (taking logs): log p(hi, ti) = Σj fj(hi, ti) log αj − log Z

Conditional probabilities
The conditional probabilities can be computed from the joint probabilities: p(t|h) = p(h,t) / Σt' p(h,t')

The probability of a sequence of tags given a sequence of words: p(t1,n | w1,n) = ∏i p(ti | hi)

NB: unlike an HMM, we have a single probability here: we directly estimate p(t|h); the model combines all features of hi into a single estimate; there is no limit in principle on what features we can take into account.
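A sketch of this direct conditional estimate, using the log-linear form of the model (log-weights λj = log αj); the two features and hand-picked weights are illustrative.

```python
import math

def p_tag_given_history(history, tags, features, weights):
    """p(t | h) proportional to exp(sum_j lambda_j * f_j(h, t)),
    normalised over the candidate tag set."""
    scores = {t: math.exp(sum(w * f(history, t)
                              for f, w in zip(features, weights)))
              for t in tags}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

# Two illustrative indicator features with hand-picked weights
features = [
    lambda h, t: 1 if h["word"].endswith("ing") and t == "VBG" else 0,
    lambda h, t: 1 if h["prev_tag"] == "TO" and t == "VB" else 0,
]
weights = [1.5, 0.8]
print(p_tag_given_history({"word": "moving", "prev_tag": "TO"},
                          ["VBG", "VB", "NN"], features, weights))
```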

The use of constraints
Every feature we have imposes a constraint or expectation on the probability model. We want: Ep(fj) = E~p(fj)

Where:

Ep(fj) = Σh,t p(h,t) fj(h,t) is the model p's expectation of fj, and E~p(fj) = Σh,t ~p(h,t) fj(h,t) is the empirical expectation of fj (~p being the relative frequency in the training data).

Why maximum entropy?
Recall that entropy is a measure of the uncertainty in a distribution. Without any knowledge, the simplest distribution is uniform, and uniform distributions have the highest entropy. As we add constraints, the MaxEnt principle dictates that we find the simplest model p* satisfying the constraints:
p* = argmax p∈P H(p)

where P is the set of distributions consistent with the constraints. p* is unique and has the form given earlier.

Basically, this is an application of Occam's razor: make no more assumptions than necessary.

Part 4
Training a MaxEnt model

Training 1: computing the empirical expectation
Recall that: E~p(fj) = Σh,t ~p(h,t) fj(h,t)

Suppose we are interested in the feature: fj(h,t) = 1 if the current word is "moving" and t = VBG, and 0 otherwise.

In a corpus of 10k words + tags, where the word "moving" occurs as VBG 20 times: E~p(fj) = 20/10000 = 0.002
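A sketch of this computation over a list of (history, tag) training events, which is an illustrative representation of the tagged corpus.

```python
def empirical_expectation(feature, events):
    """Empirical expectation of a binary feature: its average value over
    the observed (history, tag) pairs in the training data."""
    return sum(feature(h, t) for h, t in events) / len(events)

# E.g. a feature that fires only for the word "moving" tagged VBG, active on
# 20 of 10,000 training events, has empirical expectation 20/10000 = 0.002.
```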

Training 2: computing the model expectation
Recall that: Ep(fj) = Σh,t p(h,t) fj(h,t)

Requires sum over all possible histories and tags!

We approximate by computing the model expectation of the feature on the training data only: Ep(fj) ≈ (1/n) Σi Σt p(t|hi) fj(hi,t), where the hi are the histories observed in the training data.

Learning the optimal parameters
Our goal is to learn the best parameter αj for each feature fj, such that the constraints are satisfied,

i.e.: Ep(fj) = E~p(fj) for every feature fj.

One method is Generalised Iterative Scaling

Generalised Iterative Scaling: assumptions
For all (h,t), the features sum to a constant value: Σj fj(h,t) = C

If this is not the case, we set C to the maximum feature sum: C = maxh,t Σj fj(h,t)

and add a filler feature fl such that: fl(h,t) = C − Σj fj(h,t)

Generalised Iterative Scaling: assumptions (II)
For all (h,t), there is at least one feature f which is active, i.e.: f(h,t) = 1.

Generalised Iterative Scaling
Input: features f1, ..., fn and the empirical distribution ~p. Output: optimal parameter values α1, ..., αn.
(1) Initialise αi for all i ∈ {1, 2, ..., n} (e.g. αi = 1), and set t = 0.
(2) For each i do: compute the model expectation Ep(t)(fi) under the current parameters; set αi(t+1) = αi(t) · (E~p(fi) / Ep(t)(fi))^(1/C).
(3) If the model has not converged, increment t and repeat from (2).
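A runnable sketch of this procedure for a conditional MaxEnt tagger, under the assumptions above; it works in log space with λj = log αj, so the multiplicative update becomes additive, and the training-data interface is illustrative.

```python
import math

def gis_train(events, tags, features, iterations=50):
    """Generalised Iterative Scaling sketch for a conditional model
    p(t|h) proportional to exp(sum_j lambda_j * f_j(h, t)).
    events: training (history, tag) pairs; features: binary feature functions."""
    # C = maximum number of active features on any (history, candidate tag) pair
    C = max(sum(f(h, t) for f in features) for h, _ in events for t in tags)
    # Slack feature so that the feature sum is exactly C everywhere
    feats = features + [lambda h, t: C - sum(f(h, t) for f in features)]
    lambdas = [0.0] * len(feats)

    def p_t_given_h(h):
        scores = {t: math.exp(sum(l * f(h, t) for l, f in zip(lambdas, feats)))
                  for t in tags}
        z = sum(scores.values())
        return {t: s / z for t, s in scores.items()}

    n = len(events)
    empirical = [sum(f(h, t) for h, t in events) / n for f in feats]
    for _ in range(iterations):
        # Model expectations, approximated over the training histories
        model = [0.0] * len(feats)
        for h, _ in events:
            dist = p_t_given_h(h)
            for j, f in enumerate(feats):
                model[j] += sum(dist[t] * f(h, t) for t in tags) / n
        # GIS update in log space: lambda_j += (1/C) * log(empirical / model)
        for j in range(len(feats)):
            if empirical[j] > 0 and model[j] > 0:
                lambdas[j] += math.log(empirical[j] / model[j]) / C
    return lambdas, p_t_given_h
```

Working in log space is only a reparameterisation: the additive λ update here is equivalent to the multiplicative α update in the algorithm above.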

Part 5
Finding tag sequences with MaxEnt models

Tagging sequences
We want to tag a sequence w1, ..., wn. This can be decomposed into: p(t1,n | w1,n) = ∏i p(ti | hi)

The history hi consists of the words w1, ..., wi-1 and the previous tags t1, ..., ti-1.

Finding the best tag sequence: beam search (Ratnaparkhi, 1996)
The goal is to find the best sequence of n tags, keeping a beam of the N best candidate sequences. sij = the jth highest probability tag sequence up to word i.

Generate all tags for w1, find the top N tags, and set s1j for 1 ≤ j ≤ N.

for i = 2 to n do: for j = 1 to N do: generate tags for wi given s(i-1)j, and append each tag to s(i-1)j to create a new sequence; then find the N highest probability sequences generated by the inner loop.

Return sn1.

Worked example
Suppose our data consists of the sequence: a, b, c.

Assume the correct tags are A, B, C

Assume that N = 1 (i.e. we only ever consider the single most likely tag).

Worked example (filling the cells s11, s21, s31 for the words a, b, c)
Step 1: generate all possible tags for a: A, B, C; find the most likely tag for a: A (s11 = A).

Step 2: generate all possible tags for b: A, B, C; merge with s11: A-A, A-B, A-C; find the most likely sequence: A-B (s21 = A-B).

Step 3: generate all possible tags for w3 (= c): A, B, C; merge with s21: A-B-A, A-B-B, A-B-C; find the most likely: A-B-C (s31 = A-B-C).

Return s31 (= A-B-C).
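A sketch of this beam search; the p_tag_given_history interface is an illustrative assumption, and with beam_width = 1 it reproduces the greedy walk in the worked example.

```python
def beam_search_tag(words, p_tag_given_history, beam_width=3):
    """Keep the beam_width highest-probability tag sequences at each word.
    p_tag_given_history(words, prev_tags, i) returns a dict tag -> probability
    for position i (an illustrative interface)."""
    beam = [([], 1.0)]  # each entry: (tag_sequence, probability)
    for i in range(len(words)):
        candidates = []
        for seq, prob in beam:
            dist = p_tag_given_history(words, seq, i)
            for tag, p in dist.items():
                candidates.append((seq + [tag], prob * p))
        # Keep only the top beam_width sequences
        candidates.sort(key=lambda x: x[1], reverse=True)
        beam = candidates[:beam_width]
    return beam[0]  # highest-probability tag sequence and its probability
```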

Part 6
Markov Models vs. MaxEnt

HMM vs MaxEnt
Standard HMMs cannot compute the conditional probability directly. E.g. for tagging, we want p(t1,n|w1,n); we obtain it via Bayes' rule, combining p(w1,n|t1,n) with the prior p(t1,n).

HMMs are generative models which optimise p(w1,n|t1,n)

By contrast, a MaxEnt Markov Model (MEMM) is a discriminative model which optimises p(t1,n|w1,n) directly.

Graphically (after Jurafsky & Martin 2009):

HMM has separate models for P(w|t) and for P(t)

A MEMM has a single model to estimate P(t|w).

More formally
With an HMM: the best sequence is argmax t1,n ∏i P(wi | ti) P(ti | ti-1)

With a MEMM: the best sequence is argmax t1,n ∏i P(ti | wi, ti-1)

Adapting Viterbi
We can easily adapt the Viterbi algorithm to find the best state sequence in a MEMM. Recall that with HMMs:
δi(tj) = maxk δi-1(tk) · P(tj | tk) · P(wi | tj)

Adaptation for MEMMs: δi(tj) = maxk δi-1(tk) · P(tj | tk, wi)
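A minimal sketch of this adapted recursion; p_tag(tag, prev_tag, word) is a hypothetical conditional model returning P(tag | prev_tag, word).

```python
def viterbi_memm(words, tags, p_tag):
    """Viterbi decoding for a MEMM, conditioning each step on the previous
    tag and the current word via p_tag(tag, prev_tag, word)."""
    delta = [{t: p_tag(t, "<s>", words[0]) for t in tags}]
    psi = [{}]
    for i in range(1, len(words)):
        delta.append({})
        psi.append({})
        for t in tags:
            prev, score = max(((p, delta[i - 1][p] * p_tag(t, p, words[i]))
                               for p in tags), key=lambda x: x[1])
            delta[i][t] = score
            psi[i][t] = prev
    # Backtrace from the most probable final tag
    last = max(delta[-1], key=delta[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(psi[i][path[-1]])
    return list(reversed(path))
```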

Summary
MaxEnt is a powerful classification model with some advantages over HMMs: direct computation of conditional probabilities from the training data, and the ability to handle multiple features. It was first applied to POS tagging by Ratnaparkhi (1996) and has been used for many other applications since then.