Albert Gatt
Corpora and Statistical Methods
POS Tagging
Assign each word in continuous text a tag indicating its part of speech. Essentially a classification problem.
Current state of the art: taggers typically have 96-97% accuracy, evaluated on a per-word basis. In a corpus with sentences of average length 20 words, 96% accuracy can mean one tagging error per sentence.
Sources of difficulty in POS tagging
Mostly due to ambiguity, when words have more than one possible tag:
- we need context to make a good guess about POS
- context alone won't suffice
A simple approach which assigns only the most common tag to each word performs with 90% accuracy!
The information sources
1. Syntagmatic information: the tags of other words in the context of w. Not sufficient on its own; e.g. Greene & Rubin (1971) describe a context-only tagger with only 77% accuracy.
2. Lexical information ("dictionary"): the most common tag(s) for a given word.
- e.g. in English, many nouns can be used as verbs (flour the pan, wax the car...); however, their most likely tag remains NN
- the distribution of a word's usages across different POSs is uneven: usually one tag is highly likely, the others much less so
Tagging in other languages (than English)
In English, high reliance on context is a good idea, because of fixed word order.
Free word order languages make this assumption harder.
Compensation: these languages typically have rich morphology, a good source of clues for a tagger.
Evaluation and error analysis
Training a statistical POS tagger requires splitting the corpus into training and test data. Often we need a development set as well, to tune parameters.
Using (n-fold) cross-validation is a good idea to save data:
- randomly divide the data into train + test
- train, and evaluate on test
- repeat n times and take an average
NB: cross-validation requires the whole corpus to be blind. To examine the training data, it is best to have fixed training & test sets, perform cross-validation on the training data, and do the final evaluation on the test set.
Evaluation
Typically carried out against a gold standard, based on accuracy (% correct).
Ideal to compare the accuracy of our tagger with:
- baseline (lower bound): the standard is to choose the unigram most likely tag
- ceiling (upper bound): e.g. see how well humans do at the same task; humans apparently agree on 96-97% of tags, which means it is highly suspect for a tagger to get 100% accuracy
Part 1
HMM taggers
Using Markov models
Basic idea: sequences of tags are a Markov chain:
- Limited horizon assumption: it is sufficient to look at the previous tag for information about the current tag
- Time invariance: the probability of a sequence remains the same over time
Implications/limitations
Limited horizon ignores long-distance dependencies, e.g. it can't deal with WH-constructions. Chomsky (1957): this was one of the reasons cited against probabilistic approaches.
Time invariance: e.g. P(finite verb | pronoun) is constant, but we may be more likely to find a finite verb following a pronoun at the start of a sentence than in the middle!
Notation
We let t_i range over tags and w_i range over words; subscripts denote position in a sequence.
Use superscripts to denote word types: w^j = an instance of word type j in the lexicon; t^j = tag t assigned to word w^j.
The limited horizon property becomes:
$P(t_{i+1} \mid t_{1,\dots,i}) = P(t_{i+1} \mid t_i)$
Basic strategy
Training set of manually tagged text; extract probabilities of tag sequences:
e.g. using the Brown Corpus, P(NN|JJ) = 0.45, but P(VBP|JJ) = 0.0005
Next step: estimate the word/tag probabilities:
$P(t^k \mid t^j) = \frac{C(t^j, t^k)}{C(t^j)}$
$P(w^l \mid t^j) = \frac{C(w^l, t^j)}{C(t^j)}$
These are basically symbol emission probabilities.
Training the tagger: basic algorithm
1. Estimate the probability of all possible sequences of 2 tags in the tagset from the training data.
2. For each tag t^j and for each word w^l, estimate P(w^l | t^j).
3. Apply smoothing.
A minimal sketch of steps 1-2 is given below.
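The following is a minimal sketch of the relative-frequency estimates above (no smoothing), assuming a toy corpus of (word, tag) sentences; the function name and the corpus are illustrative, not part of the original slides.

```python
from collections import defaultdict

def train_hmm(tagged_sentences):
    """Estimate P(t_k | t_j) and P(w_l | t_j) by relative frequency."""
    tag_bigrams = defaultdict(int)   # C(t_j, t_k)
    word_tag = defaultdict(int)      # C(w_l, t_j)
    tag_count = defaultdict(int)     # C(t_j)

    for sentence in tagged_sentences:
        prev = "<s>"                 # sentence-initial pseudo-tag
        for word, tag in sentence:
            tag_bigrams[(prev, tag)] += 1
            word_tag[(word, tag)] += 1
            tag_count[tag] += 1
            prev = tag

    transition = {(tj, tk): c / tag_count[tj]
                  for (tj, tk), c in tag_bigrams.items() if tj != "<s>"}
    emission = {(w, tj): c / tag_count[tj]
                for (w, tj), c in word_tag.items()}
    return transition, emission

corpus = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]
transition, emission = train_hmm(corpus)
print(transition[("DT", "NN")], emission[("dog", "NN")])
```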
Finding the best tag sequence
Given: a sentence of n words. Find: t_{1,n} = the best n tags.
Application of Bayes' rule; the denominator can be eliminated as it's the same for all tag sequences:
$\arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} \frac{P(w_{1,n} \mid t_{1,n})\, P(t_{1,n})}{P(w_{1,n})} = \arg\max_{t_{1,n}} P(w_{1,n} \mid t_{1,n})\, P(t_{1,n})$
Finding the best tag sequence
The expression needs to be reduced to parameters that can be estimated from the training corpus, so we need to make some simplifying assumptions:
1. words are independent of each other
2. a word's identity depends only on its tag
The independence assumption
The probability of a sequence of words given a sequence of tags is computed as a function of each word independently:
$P(w_{1,n} \mid t_{1,n})\, P(t_{1,n}) \approx \prod_{i=1}^{n} P(w_i \mid t_{1,n}) \times P(t_1)\, P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1})$
The identity assumption
The probability of a word given a tag sequence = the probability of the word given its own tag:
$\prod_{i=1}^{n} P(w_i \mid t_{1,n}) \times P(t_1)\, P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1}) \approx \prod_{i=1}^{n} P(w_i \mid t_i) \times P(t_1)\, P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1})$
Applying these assumptions
$\arg\max_{t_{1,n}} P(t_{1,n} \mid w_{1,n}) = \arg\max_{t_{1,n}} P(w_{1,n} \mid t_{1,n})\, P(t_{1,n}) = \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$
Tagging with the Markov Model
Can use the Viterbi algorithm to find the best sequence of tags given a sequence of words (sentence).
Reminder:
- $\delta_i(j)$ = probability of being in state (tag) j at word i on the best path
- $\psi_{i+1}(j)$ = most probable state (tag) at word i given that we're in state j at word i+1
The algorithm: initialisation
$\delta_1(\text{PERIOD}) = 1.0$
$\delta_1(t) = 0$ for all other tags t
Assume that P(PERIOD) = 1 at the end of a sentence; set all other tag probabilities to 0.
Algorithm: induction step
for i = 1 to n step 1: for all tags t^j do:
$\delta_{i+1}(t^j) = \max_{1 \le k \le T} \left[ \delta_i(t^k)\, P(t^j \mid t^k)\, P(w_{i+1} \mid t^j) \right]$ (probability of tag t^j at i+1 on the best path through i)
$\psi_{i+1}(t^j) = \arg\max_{1 \le k \le T} \left[ \delta_i(t^k)\, P(t^j \mid t^k)\, P(w_{i+1} \mid t^j) \right]$ (most probable tag leading to t^j at i+1)
Algorithm: backtrace
$X_{n+1} = \arg\max_{1 \le j \le T} \delta_{n+1}(t^j)$ (state at n+1)
for j = n to 1 do: $X_j = \psi_{j+1}(X_{j+1})$ (retrieve the most probable tags for every point in the sequence)
$P(X_1, \dots, X_n) = \max_{1 \le j \le T} \delta_{n+1}(t^j)$ (probability for the sequence of tags selected)
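A sketch of Viterbi decoding for the HMM tagger described above, assuming the `transition` and `emission` dictionaries estimated earlier; for simplicity it starts from the first word's emission probabilities rather than the PERIOD trick, and unseen pairs get probability 0 (no smoothing).

```python
def viterbi(words, tags, transition, emission):
    n = len(words)
    delta = [{} for _ in range(n)]   # delta[i][t]: best prob of a path ending in tag t at word i
    psi = [{} for _ in range(n)]     # psi[i][t]: predecessor tag on that best path

    for t in tags:                   # initialisation
        delta[0][t] = emission.get((words[0], t), 0.0)
        psi[0][t] = None

    for i in range(1, n):            # induction step
        for t in tags:
            best_prev, best_score = None, 0.0
            for prev in tags:
                score = (delta[i - 1][prev]
                         * transition.get((prev, t), 0.0)
                         * emission.get((words[i], t), 0.0))
                if score >= best_score:
                    best_prev, best_score = prev, score
            delta[i][t], psi[i][t] = best_score, best_prev

    last = max(tags, key=lambda t: delta[n - 1][t])   # backtrace
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(psi[i][path[-1]])
    return list(reversed(path))
```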
Some observations
The model is a Hidden Markov Model: we only observe words when we tag.
In actuality, during training we have a visible Markov Model, because the training corpus provides words + tags.
"True" HMM taggers
Applied to cases where we do not have a large training corpus. We maintain the usual MM assumptions.
Initialisation: use a dictionary; set the emission probability for a word/tag pair to 0 if it's not in the dictionary.
Training: apply to data, using the forward-backward algorithm.
Tagging: exactly as before.
Part 2
Transformation-based error-driven learning
Transformation-based learning
Approach proposed by Brill (1995); uses quantitative information at the training stage. The outcome of training is a set of rules; tagging is then symbolic, using the rules.
Components:
- a set of transformation rules
- a learning algorithm
Transformations
General form: t1 → t2, "replace t1 with t2 if certain conditions are satisfied".
Examples:
- Morphological: change the tag from NN to NNS if the word has the suffix "s": dogs_NN → dogs_NNS
- Syntactic: change the tag from NN to VB if the word occurs after "to": go_NN to_TO → go_VB
- Lexical: change the tag to JJ if deleting the prefix "un" results in a word: uncool_XXX → uncool_JJ; but uncle_NN -/-> uncle_JJ
Learning
Unannotated text is passed through an initial state annotator, e.g. assign each word its most frequent tag in a dictionary.
Truth: a manually annotated version of the corpus against which to compare.
Learner: learns rules by comparing the initial state to the Truth; the output is a set of rules.
Learning algorithm
Simple iterative process (see the sketch after this list):
- apply a rule to the corpus
- compare to the Truth
- if the error rate is reduced, keep the results
A priori specifications:
- how the initial state annotator works
- the space of possible transformations; Brill (1995) used a set of initial templates
- the function to compare the result of applying the rules to the truth
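A minimal sketch of the greedy learning loop, assuming hypothetical helper functions `apply_rule(rule, tags)` and `candidate_rules(tags, truth, templates)` that instantiate the rule templates; neither is defined in the original slides.

```python
def errors(tags, truth):
    # number of tagging errors with respect to the manually annotated Truth
    return sum(1 for a, b in zip(tags, truth) if a != b)

def learn_rules(initial_tags, truth, templates, apply_rule, candidate_rules):
    tags, learned = list(initial_tags), []
    while True:
        best_rule, best_gain = None, 0
        for rule in candidate_rules(tags, truth, templates):
            # gain = reduction in error rate if we apply this rule
            gain = errors(tags, truth) - errors(apply_rule(rule, tags), truth)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:        # no rule reduces the error rate: stop
            break
        tags = apply_rule(best_rule, tags)
        learned.append(best_rule)
    return learned
```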
Non-lexicalised rule templates
Take only tags into account, not the shape of words.
Change tag a to tag b when:
1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the three preceding (following) words is tagged z.
4. The preceding (following) word is tagged z and the word two before (after) is tagged w.
5. ...
Lexicalised rule templates
Take into account specific words in the context.
Change tag a to tag b when:
1. The preceding (following) word is w.
2. The word two before (after) is w.
3. The current word is w, the preceding (following) word is w2 and the preceding (following) tag is t.
4. ...
Morphological rule templates
Useful for completely unknown words; sensitive to the word's "shape".
Change the tag of an unknown word (from X) to Y if:
1. Deleting the prefix (suffix) x, |x| ≤ 4, results in a word.
2. The first (last) (1,2,3,4) characters of the word are x.
3. Adding the character string x as a prefix (suffix) results in a word (|x| ≤ 4).
4. Word w ever appears immediately to the left (right) of the word.
5. Character z appears in the word.
6. ...
Order-dependence of rules
Rules are triggered by environments satisfying their conditions, e.g. "A → B if the preceding tag is A". Suppose our sequence is "AAAA". There are two possible forms of rule application:
- immediate effect: applications of the same transformation can influence each other; result: ABAB
- delayed effect: the rule is triggered multiple times from the same initial input; result: ABBB. Brill (1995) opts for this solution.
More on transformation-based tagging
Can be used for unsupervised learning: as with HMM-based tagging, the only information available is the allowable tags for each word. It takes advantage of the fact that most words have only one tag.
E.g. the word can = NN in the context AT ___ BEZ, because most other words in this context are NN; therefore, the learning algorithm would learn the rule "change tag to NN in the context AT ___ BEZ".
The unsupervised method achieves 95.6% accuracy!
Part 3
Maximum Entropy models and POS Tagging
Limitations of HMMs
An HMM tagger relies on P(tag | previous tag) and P(word | tag); these are combined by multiplication.
TBL includes many other useful features which are hard to model in an HMM: prefixes, suffixes, capitalisation, ...
Can we combine both, i.e. have HMM-style tagging with multiple features?
The rationale
In order to tag a word, we consider its context or "history" h. We want to estimate a probability distribution p(h,t) from sparse data.
- h is encoded in terms of features (e.g. morphological features, surrounding tag features, etc.)
- There are some constraints on these features that we discover from training data.
- We want our model to make the fewest possible assumptions beyond these constraints.
Motivating example
Suppose we wanted to tag the word w. Assume we have a set T of 45 different tags: T = {NN, JJ, NNS, NNP, VVS, VB, ...}
The probabilistic tagging model that makes fewest assumptions assigns a uniform distribution over the tags:
$\forall t \in T: P(t) = \frac{1}{45}$
Motivating example
Suppose we find that the possible tags for w are NN, JJ, NNS, VB.
We therefore impose our first constraint on the model:
$p(NN) + p(JJ) + p(NNS) + p(VB) = 1$
(and the probability of every other tag is 0)
The simplest model satisfying this constraint:
$p(NN) = p(JJ) = p(NNS) = p(VB) = \frac{1}{4}$
Motivating example
We suddenly discover that w is tagged as NN or NNS 8 out of 10 times. The model now has two constraints:
$p(NN) + p(JJ) + p(NNS) + p(VB) = 1$
$p(\text{word} = w \text{ and } (t = NN \text{ or } t = NNS)) = \frac{8}{10}$
Again, we require our model to make no further assumptions. The simplest distribution leaves the probabilities for all tags except NN/NNS equal:
P(NN) = 4/10, P(NNS) = 4/10, P(JJ) = 1/10, P(VB) = 1/10
Motivating example
We suddenly discover that verbs (VB) occur 1 in every 20 words. The model now has three constraints:
$p(NN) + p(JJ) + p(NNS) + p(VB) = 1$
$p(\text{word} = w \text{ and } (t = NN \text{ or } t = NNS)) = \frac{8}{10}$
$p(VB) = \frac{1}{20}$
The simplest distribution is now:
P(NN) = 4/10, P(NNS) = 4/10, P(JJ) = 3/20, P(VB) = 1/20
What we've been doing
Maximum entropy builds a distribution by continuously adding features. Each feature picks out a subset of the training observations. For each feature, we add a constraint on our total distribution. Our task is then to find the best distribution given the constraints.
Features for POS tagging
Each tagging decision for a word occurs in a specific context or "history" h.
For tagging, we consider as context:
- the word itself
- morphological properties of the word
- other words surrounding the word
- previous tags
For each relevant aspect of the context h_i, we can define a feature f_j that allows us to learn how well that aspect is associated with a tag t_i.
The probability of a tag given a context is a weighted function of the features.
Features for POS tagging
In a maximum entropy model, this information is captured by binary or indicator features, e.g.:
$f_j(h_i, t_i) = 1$ if suffix(w_i) = "ing" and t_i = VBG, 0 otherwise
$f_j(h_i, t_i) = 1$ if t_{i-2} = DET and t_{i-1} = NN and t_i = VB, 0 otherwise
Each feature f_j has a weight α_j reflecting its importance. NB: each α_j is uniquely associated with a feature.
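A small sketch of indicator features like the two above, assuming the history `h` is represented as a dict with keys such as "word" and "tags" (previous tags); this representation is illustrative, not from the original slides.

```python
def f_ing_vbg(h, t):
    # 1 if the current word ends in "ing" and the proposed tag is VBG
    return 1 if h["word"].endswith("ing") and t == "VBG" else 0

def f_det_nn_vb(h, t):
    # 1 if the two preceding tags are DET, NN and the proposed tag is VB
    return 1 if h["tags"][-2:] == ["DET", "NN"] and t == "VB" else 0

h = {"word": "moving", "tags": ["DET", "NN"]}
print(f_ing_vbg(h, "VBG"), f_det_nn_vb(h, "VB"))   # -> 1 1
```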
Features for POS tagging in Ratnaparkhi (1996)
Ratnaparkhi had three sets of features, for non-rare, rare and all words.
Features for POS tagging
Given the large number of possible features, which ones will be part of the model?
- We do not want redundant features.
- We do not want unreliable and rarely occurring features (to avoid overfitting).
Ratnaparkhi (1996) used only those features which occur 10 times or more in the training data.
The form of the model
Features f_j and their parameters are used to compute the probability p(h_i, t_i):
$p(h_i, t_i) = \frac{1}{Z} \prod_j \alpha_j^{f_j(h_i, t_i)}$
where j ranges over features and Z is a normalisation constant.
Transform into a linear equation:
$\log p(h_i, t_i) = \log \frac{1}{Z} + \sum_j f_j(h_i, t_i) \log \alpha_j$
Conditional probabilities
The conditional probabilities can be computed based on the joint probabilities:
$p(t_i \mid h_i) = \frac{p(h_i, t_i)}{\sum_{t'} p(h_i, t')}$
Probability of a sequence of tags given a sequence of words:
$p(t_1, \dots, t_n \mid w_1, \dots, w_n) = \prod_{i=1}^{n} p(t_i \mid h_i)$
NB: unlike an HMM, we have one probability here:
- we directly estimate p(t|h)
- the model combines all features in h_i into a single estimate
- no limit in principle on what features we can take into account
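A sketch of computing p(t|h) from feature weights, following the product form above; `features` is a list of feature functions f_j(h, t) (as in the earlier snippet) and `alpha` the corresponding weights, both illustrative.

```python
import math

def p_joint_unnorm(h, t, features, alpha):
    # product over features of alpha_j ** f_j(h, t); the constant 1/Z is omitted
    return math.prod(a ** f(h, t) for f, a in zip(features, alpha))

def p_conditional(h, t, tagset, features, alpha):
    # Z cancels in the conditional: only the sum over candidate tags matters
    scores = {t2: p_joint_unnorm(h, t2, features, alpha) for t2 in tagset}
    return scores[t] / sum(scores.values())
```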
The use of constraints
Every feature we have imposes a constraint or expectation on the probability model. We want:
$E_p(f_j) = E_{\tilde{p}}(f_j)$
where the model p's expectation of f_j is:
$E_p(f_j) = \sum_{h,t} p(h,t)\, f_j(h,t)$
and the empirical expectation of f_j is:
$E_{\tilde{p}}(f_j) = \frac{1}{n} \sum_{i=1}^{n} f_j(h_i, t_i)$
Why maximum entropy?
Recall that entropy is a measure of uncertainty in a distribution. Without any knowledge, the simplest distribution is uniform, and uniform distributions have the highest entropy.
As we add constraints, the MaxEnt principle dictates that we find the simplest model p* satisfying the constraints:
$p^* = \arg\max_{p \in P} H(p)$ subject to $E_p(f_j) = E_{\tilde{p}}(f_j)$
where P is the set of possible distributions; p* is unique and has the form given earlier.
Basically, an application of Occam's razor: make no further assumptions than necessary.
Part 4
Training a MaxEnt model
Training 1: computing the empirical expectation
Recall that:
$E_{\tilde{p}}(f_j) = \frac{1}{n} \sum_{i=1}^{n} f_j(h_i, t_i)$
Suppose we are interested in the feature:
$f_j(h_i, t_i) = 1$ if w_i = "moving" and t_i = VBG, 0 otherwise
In a corpus of 10k words + tags, where the word moving occurs as VBG 20 times:
$E_{\tilde{p}}(f_j) = \frac{1}{10000} \sum_{i=1}^{10000} f_j(h_i, t_i) = \frac{20}{10000} = 0.002$
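In code, the empirical expectation is just the feature's relative frequency over the training pairs, mirroring the moving/VBG example; the representation of the corpus as (history, tag) pairs is an assumption.

```python
def empirical_expectation(f, tagged_corpus):
    # tagged_corpus: list of (history, tag) pairs; f: a binary feature f(h, t)
    n = len(tagged_corpus)
    return sum(f(h, t) for h, t in tagged_corpus) / n
```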
Training 2: computing the model expectation
Recall that:
$E_p(f_j) = \sum_{h,t} p(h,t)\, f_j(h,t)$
This requires a sum over all possible histories and tags! Approximate by computing the model expectation of the feature on the training data only:
$E_p(f_j) \approx \sum_{i=1}^{N} \tilde{p}(h_i) \sum_{t} p(t \mid h_i)\, f_j(h_i, t) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t} p(t \mid h_i)\, f_j(h_i, t)$
Learning the optimal parameters
Our goal is to learn the best parameter α_j for each feature f_j, such that:
$E_{\tilde{p}}(f_j) = E_p(f_j)$
i.e.:
$\frac{1}{n} \sum_{i=1}^{n} f_j(h_i, t_i) = \sum_{i=1}^{N} \tilde{p}(h_i) \sum_{t} p(t \mid h_i)\, f_j(h_i, t)$
One method is Generalised Iterative Scaling.
Generalised Iterative Scaling: assumptions
For all (h,t), the features sum to a constant value:
$\sum_j f_j(h,t) = C$
If this is not the case, we set C to:
$C = \max_{(h,t)} \sum_j f_j(h,t)$
and add a filler feature f_l, such that:
$\forall (h,t): f_l(h,t) = C - \sum_j f_j(h,t)$
Generalised Iterative Scaling: assumptions (II)
For all (h,t), there is at least one feature f which is active, i.e.:
$\forall (h,t)\ \exists f: f(h,t) = 1$
Generalised Iterative Scaling
Input: features f_1, ..., f_n and the empirical distribution
Output: optimal parameter values α_1, ..., α_n
1. Initialise α_i = 0 for all i ∈ {1, 2, ..., n}
2. For each i do:
   a. compute the model expectation $E_p(f_i)$ under the current parameters
   b. set $\alpha_i \leftarrow \alpha_i + \frac{1}{C} \log \frac{E_{\tilde{p}}(f_i)}{E_p(f_i)}$
3. If the model has not converged, repeat from (2)
A sketch of this loop is given below.
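A minimal sketch of GIS in the log-parameter form shown above, assuming binary features that sum to C for every (h, t) (filler feature included); the function names and the fixed iteration count are illustrative.

```python
import math

def gis(features, histories, truth, tagset, C, iterations=50):
    alpha = [0.0] * len(features)            # log-parameters, initialised to 0

    def p_cond(h):                           # the current model's p(t | h)
        scores = {t: math.exp(sum(a * f(h, t) for f, a in zip(features, alpha)))
                  for t in tagset}
        z = sum(scores.values())
        return {t: s / z for t, s in scores.items()}

    n = len(histories)
    # empirical expectations, computed once from the training data
    e_emp = [sum(f(h, t) for h, t in zip(histories, truth)) / n for f in features]

    for _ in range(iterations):
        # model expectations, approximated on the training histories
        e_model = [0.0] * len(features)
        for h in histories:
            dist = p_cond(h)
            for j, f in enumerate(features):
                e_model[j] += sum(dist[t] * f(h, t) for t in tagset) / n
        # GIS update: alpha_i += (1/C) * log(E_emp / E_model)
        for j in range(len(features)):
            if e_emp[j] > 0 and e_model[j] > 0:
                alpha[j] += math.log(e_emp[j] / e_model[j]) / C
    return alpha
```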
Part 5
Finding tag sequences with MaxEnt models
Tagging sequences
We want to tag a sequence w_1, ..., w_n. This can be decomposed into:
$p(t_1, \dots, t_n \mid w_1, \dots, w_n) = \prod_{i=1}^{n} p(t_i \mid h_i)$
The history h_i consists of the words w_1, ..., w_{i-1} and the previous tags t_1, ..., t_{i-1}.
Finding the best tag sequence: beam search (Ratnaparkhi, 1996)
To find the best sequence of n tags, we keep only the N best candidate sequences at each step (beam width N). Let s_ij = the jth highest probability tag sequence up to word i.
1. Generate all tags for w_1:
   a. find the top N tags
   b. set s_1j for 1 ≤ j ≤ N
2. For i = 2 to n do:
   a. for j = 1 to N do:
      i. generate tags for w_i given s_(i-1)j
      ii. append each tag to s_(i-1)j to create a new sequence
   b. find the N highest probability sequences generated by loop 2a
3. Return s_n1
A sketch of this procedure is given below.
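A sketch of the beam search, assuming a scoring function `p_cond(history, tag)` returning p(t|h) and a helper `make_history(words, tags_so_far, i)` that builds the history; both names are hypothetical.

```python
def beam_search(words, tagset, p_cond, make_history, N=3):
    # each beam entry is (probability, tag_sequence)
    beam = [(1.0, [])]
    for i, _ in enumerate(words):
        candidates = []
        for prob, seq in beam:
            h = make_history(words, seq, i)
            for t in tagset:
                candidates.append((prob * p_cond(h, t), seq + [t]))
        # keep only the N highest-probability sequences (s_i1, ..., s_iN)
        beam = sorted(candidates, key=lambda x: x[0], reverse=True)[:N]
    return beam[0][1]                        # s_n1: the best full sequence
```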
Worked example
Suppose our data consists of the sequence: a, b, c.
Assume the correct tags are A, B, C.
Assume that N = 1 (i.e. we only ever consider the single most likely tag).
Worked example
  a    b    c
  s11  s21  s31
Step 1: generate all possible tags for a: A, B, C.
Step 2: find the most likely tag for a: A.
Worked example
  a    b    c
  s11  s21  s31
  A
Step 2: generate all possible tags for b: A, B, C; merge with s11: A-A, A-B, A-C; find the most likely sequence: A-B.
Worked example
  a    b    c
  s11  s21  s31
  A    A-B
Step 3: generate all possible tags for w3: A, B, C; merge with s21: A-B-A, A-B-B, A-B-C; find the most likely: A-B-C.
Worked example
  a    b    c
  s11  s21  s31
  A    A-B  A-B-C
Return s31 (= A-B-C).
Part 6
Markov Models vs. MaxEnt
HMM vs MaxEnt
Standard HMMs cannot compute the conditional probability directly. E.g. for tagging:
- we want p(t_{1,n} | w_{1,n})
- we obtain it via Bayes' rule, combining p(w_{1,n} | t_{1,n}) with the prior p(t_{1,n})
HMMs are generative models which optimise p(w_{1,n} | t_{1,n}).
By contrast, a MaxEnt Markov Model (MEMM) is a discriminative model which optimises p(t_{1,n} | w_{1,n}) directly.
Graphically (after Jurafsky & Martin 2009)
The HMM has separate models for P(w|t) and for P(t); the MEMM has a single model to estimate P(t|w).
More formally...
With an HMM:
$\hat{t}_{1,n} = \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$
With a MEMM:
$\hat{t}_{1,n} = \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(t_i \mid w_i, t_{i-1})$
Adapting Viterbi
We can easily adapt the Viterbi algorithm to find the best state sequence in a MEMM.
Recall that with HMMs:
$\delta_{i+1}(s_j) = \max_{k} \delta_i(s_k)\, P(s_j \mid s_k)\, P(o_{i+1} \mid s_j)$
Adaptation for MEMMs:
$\delta_{i+1}(s_j) = \max_{k} \delta_i(s_k)\, P(s_j \mid s_k, o_{i+1})$
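In code, the only change to the earlier Viterbi induction step is that the transition and emission terms are replaced by a single conditional; a sketch, assuming a hypothetical `p_cond(prev_tag, word, tag)` returning P(t | t_prev, w).

```python
def memm_induction_step(delta_prev, tags, word, p_cond):
    # delta_prev: dict mapping each tag to the best path probability at the previous word
    delta, psi = {}, {}
    for t in tags:
        best_prev = max(tags, key=lambda k: delta_prev[k] * p_cond(k, word, t))
        delta[t] = delta_prev[best_prev] * p_cond(best_prev, word, t)
        psi[t] = best_prev
    return delta, psi
```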
Summary
MaxEnt is a powerful classification model with some advantages over the HMM:
- direct computation of conditional probabilities from the training data
- can handle multiple features
First introduced by Ratnaparkhi (1996) for POS tagging; used for many other applications since then.