Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network
Kristina Toutanova, Dan Klein, Christopher Manning, Yoram Singer
Stanford University; The Hebrew University of Jerusalem
Highlights
• Just using P(t|w) works even better than you thought, when using a better unknown word model
• You can tag really well with no sequence model at all
• Conditioning on BOTH left AND right tags yields the best published tagging performance
• If you are using a maxent model:
  - Use proper smoothing
  - Consider more lexicalization
  - Use conjunctions of features
Sequential Classifiers
• Learn classifiers for local decisions: predict the tag of a word based on features we like (neighboring words, tags, etc.)
• Combine the decisions of the classifiers using their output probabilities or scores and choose the best global tag sequence
[Figure: tag t0 with neighboring tags t-1, t1 and words w-1, w0, w1]
When the dependencies are not cyclic and the classifier is probabilistic, this corresponds to a Bayesian Network (CMM)
Experiments for Part-of-Speech Tagging
• Data: WSJ sections 0-18 training, 19-21 dev, 22-24 test
• Log-linear models for local distributions
• All features are binary and formed by instantiating templates
  - f1(h,t)=1 iff w0="to" and t=TO (0 otherwise)
• Separate feature templates targeted at unknown words: prefixes, suffixes, etc.
\[ P(t \mid h) = \frac{\exp\left(\sum_{j=1}^{m} \lambda_j f_j(h,t)\right)}{\sum_{k=1}^{K} \exp\left(\sum_{j=1}^{m} \lambda_j f_j(h,t_k)\right)} \]
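A minimal sketch of a log-linear local distribution over tags, built from binary features like the f1 example on this slide. The second feature, the weights, and the three-tag tag set are hypothetical values chosen for illustration; they are not taken from the talk.

```python
import math

def f_to_TO(history, tag):
    """Binary feature from the slide: fires iff w0 == "to" and t == TO."""
    return 1.0 if history["w0"] == "to" and tag == "TO" else 0.0

def f_suffix_ing_VBG(history, tag):
    """Hypothetical unknown-word feature: suffix "ing" paired with tag VBG."""
    return 1.0 if history["w0"].endswith("ing") and tag == "VBG" else 0.0

FEATURES = [f_to_TO, f_suffix_ing_VBG]
WEIGHTS = [2.0, 1.5]          # hypothetical lambda_j values, not trained
TAGSET = ["TO", "VBG", "NN"]  # hypothetical tag set with K = 3

def local_distribution(history, tagset=TAGSET):
    """Return P(t | h) for every tag t, normalized over the tag set."""
    scores = {
        t: sum(w * f(history, t) for w, f in zip(WEIGHTS, FEATURES))
        for t in tagset
    }
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}

dist = local_distribution({"w0": "to"})
print(max(dist, key=dist.get))  # the tag with the highest local probability
```

Only the feature tied to the observed word and candidate tag fires, so the weight on f_to_TO is what pushes probability mass toward TO for the word "to".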
Tagging Without Sequence Information
[Figure: Baseline model (t0 with w0) and Three Words model (t0 with w-1, w0, w1)]
Model      Features   Token    Unknown   Sentence
Baseline   56,805     93.69%   82.61%    26.74%
3Words     239,767    96.57%   86.78%    48.27%
Using words only works significantly better than using the previous two or three tags!
CMM Tagging Models - I
Independence Assumptions of Left-to-Right CMM
• ti is independent of t1…ti-2 and w1…wi-1 given ti-1
• ti is independent of all following observations
Similar assumptions in the Right-to-Left CMM
• ti is independent of all preceding observations
[Figure: Left-to-Right and Right-to-Left CMM graphs over tags t1, t2, t3 and words w1, w2, w3]
CMM Tagging Models - II
The bad independence assumptions lead to label bias (Bottou 91, Lafferty 01) and observation bias (Klein & Manning 02)
Example: will {MD, NN} to {TO} fight {NN, VB, VBP}
In the Left-to-Right CMM, "will" will be mis-tagged as MD, because MD is the most common tagging
[Figure: tag nodes t1, TO, t3 over the words will, to, fight]
P(t1=MD, t2=TO | will, to) = P(MD | will, sos) * P(TO | to, MD) = P(MD | will, sos) * 1
CMM Tagging Models - III
Example: will {MD, NN} to {TO} fight {NN, VB, VBP}
In the Right-to-Left CMM, "fight" will most likely be mis-tagged as NN
[Figure: tag nodes t1, TO, t3 over the words will, to, fight]
P(t2=TO, t3=NN | to, fight) = P(NN | fight, X) * P(TO | to, NN) = P(NN | fight, X) * 1
Dependency Networks
Conditioning on both left and right tags fixes the problem
[Figure: tag nodes t1, TO, t3 over the words will, to, fight]
Dependency Networks
\[ \mathrm{Score}(t_1,\dots,t_n \mid w_1,\dots,w_n) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i+1}, w_i) \]
• We do not attempt to construct a joint distribution.
• We classify to the highest scoring sequence.
• An efficient dynamic programming algorithm, similar to Viterbi, exists for finding the most likely sequence.
[Figure: two-position dependency network over tags t1, t2 and words w1, w2]
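Inference for Linear Dependency Networks
[Figure: chain of tags ti-1, ti, ti+1, ti+2 over words wi-1, wi, wi+1, wi+2]
\[ \mathrm{bestScore}(i+1, t_{i-1}, t_i, t_{i+1}) = \max_{t_{i-2}} \mathrm{bestScore}(i, t_{i-2}, t_{i-1}, t_i)\, P(t_i \mid t_{i-1}, t_{i+1}, w_i) \]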
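The search for the highest scoring sequence can be sketched as a Viterbi-style dynamic program. This is an illustration under stated assumptions, not the authors' implementation: it scores a sequence as the product of local terms P(t_i | t_{i-1}, t_{i+1}, w_i), keeps a two-tag state (t_i, t_{i+1}) rather than the three-tag bestScore table on the slide, and uses a toy local model whose numbers, boundary symbols, and helper names are all hypothetical.

```python
from itertools import product

START, END = "<s>", "</s>"  # hypothetical boundary symbols
TAGS = ["MD", "NN", "TO", "VB", "VBP"]

# Hypothetical lexical weights, chosen only for illustration.
LEXICAL = {
    ("will", "MD"): 0.8, ("will", "NN"): 0.2,
    ("to", "TO"): 1.0,
    ("fight", "NN"): 0.6, ("fight", "VB"): 0.3, ("fight", "VBP"): 0.1,
}

def local_prob(prev, cur, nxt, word):
    """Toy P(t_i | t_{i-1}, t_{i+1}, w_i), normalized over the tag set."""
    def raw(tag):
        weight = LEXICAL.get((word, tag), 0.0)
        if nxt == "TO" and tag == "NN":
            weight *= 5.0  # hypothetical boost: noun before TO
        if prev == "TO" and tag == "VB":
            weight *= 4.0  # hypothetical boost: base verb after TO
        return weight
    total = sum(raw(t) for t in TAGS)
    return raw(cur) / total if total > 0 else 0.0

def sequence_score(tags, words):
    """Product of the local conditionals for one complete tag sequence."""
    padded = [START] + list(tags) + [END]
    score = 1.0
    for i, word in enumerate(words):
        score *= local_prob(padded[i], padded[i + 1], padded[i + 2], word)
    return score

def best_sequence(words, tagset=TAGS):
    """Return (score, tags) for the highest scoring sequence."""
    n = len(words)
    # State (a, b) = (tag before the current word, tag of the current word);
    # the stored path holds the tags of all words before the current one.
    best = {(START, a): (1.0, []) for a in tagset}
    for i, word in enumerate(words):
        next_tags = tagset if i < n - 1 else [END]
        new_best = {}
        for (p, a), (score, path) in best.items():
            for b in next_tags:
                cand = score * local_prob(p, a, b, word)
                if (a, b) not in new_best or cand > new_best[(a, b)][0]:
                    new_best[(a, b)] = (cand, path + [a])
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])

def brute_force_score(words, tagset=TAGS):
    """Exhaustive maximum, used only to check the dynamic program."""
    return max(sequence_score(c, words) for c in product(tagset, repeat=len(words)))

score, tags = best_sequence(["will", "to", "fight"])
print(tags)
```

With these toy numbers the right tag TO raises the score of NN for "will" and the left tag TO raises the score of VB for "fight", which is the two-sided effect the slides describe. The work per word is proportional to the cube of the tag set size.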
Using Tags: Left Context is Better
[Figure: Baseline (t0 with w0), Model L (t-1, t0 with w0), Model R (t0, t1 with w0)]
Model      Features   Token    Unknown   Sentence
Baseline   56,805     93.69%   82.61%    26.74%
L          27,474     95.79%   85.49%    41.89%
R          27,648     95.14%   85.65%    36.31%
Model L has 13.4% error reduction from Model R
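The error reduction figures quoted on these slides can be reproduced from the token accuracies. The sketch below assumes "error reduction" means relative reduction in token error rate; the function name is made up for this example.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Fraction of the baseline's errors removed; accuracies are in percent."""
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return (baseline_err - improved_err) / baseline_err

# Token accuracies from the slide: Model R 95.14%, Model L 95.79%.
print(f"{relative_error_reduction(95.14, 95.79):.1%}")  # prints 13.4%
```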
Centered Context is Better
[Figure: Model L+L2 (t-2, t-1, t0 with w0), Model R+R2 (t0, t1, t2 with w0), Model L+R (t-1, t0, t1 with w0)]
Model   Features   Token    Unknown   Sentence
L+L2    32,935     96.05%   85.92%    44.04%
R+R2    33,423     95.25%   84.49%    37.20%
L+R     32,610     96.57%   87.15%    49.50%
Model L+R has 13.2% error reduction from Model L+L2
Centered Context is Better in the End
[Figure: Model L+LL+LLL (t-3, t-2, t-1, t0 with w0) and Model L+LL+LR+R+RR (t-2, t-1, t0, t1, t2 with w0)]
Model          Features   Token    Unknown   Sentence
L+LL+LLL       118,752    96.20%   86.52%    45.14%
L+LL+LR+R+RR   81,049     96.92%   87.91%    53.23%
15% error reduction due to including right word tags
Lexicalization and More Unknown Word Features
[Figure: Model L+LL+LR+R+RR+3W (t-2, t-1, t0, t1, t2 with w-1, w0, w+1)]
Model                               Features   Token    Unknown   Sentence
L+LL+LR+R+RR (TAGS)                 81,049     96.92%   87.91%    53.23%
TAGS+3W                             263,160    97.02%   88.05%    53.83%
TAGS+3W+LW0+RW0+W-1W0+W0W1 (BEST)   460,552    97.15%   88.61%    55.83%
BEST test set                       460,552    97.24%   89.04%    56.34%
Final Test Results
Model           Features   Token    Unknown   Sentence
BEST test set   460,552    97.24%   89.04%    56.34%
[Figure: bar chart of token error rate on the test set, Us vs. Collins]
Comparison to best published results (Collins 02):
• 4.4% error reduction
• Statistically significant
Unknown Word Features
• Because we use a conditional model, it is easy to define complex features of the words
• A crude company name detector: the feature is on if the word is capitalized and followed by a company name suffix like Co. or Inc. within 3 words
• Conjunctions of character-level features: capitalized, contains digit, contains dash, all capitalized, etc. (e.g. CFC-12, F/A-18)
• Prefixes and suffixes up to length 10
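The unknown word features listed above could be extracted roughly as follows. This is a sketch under assumptions: the feature name strings, the suffix set, and the way the character-level flags are joined into one conjunction feature are illustrative choices, not the feature templates used in the talk.

```python
# Illustrative suffix list; the slide names only Co. and Inc as examples.
COMPANY_SUFFIXES = {"Co.", "Inc", "Inc."}

def unknown_word_features(words, i, max_affix=10):
    """Return a set of binary feature names for words[i]."""
    word = words[i]
    feats = set()

    # Prefixes and suffixes up to length 10 (or the word length, if shorter).
    for k in range(1, min(max_affix, len(word)) + 1):
        feats.add(f"prefix={word[:k]}")
        feats.add(f"suffix={word[-k:]}")

    # Character-level flags, joined into a single conjunction feature.
    flags = []
    if word[:1].isupper():
        flags.append("cap")
    if word.isupper():
        flags.append("allcap")
    if any(c.isdigit() for c in word):
        flags.append("digit")
    if "-" in word:
        flags.append("dash")
    if flags:
        feats.add("shape=" + "+".join(flags))

    # Crude company name detector: capitalized word followed by a
    # company suffix within the next 3 words.
    following = words[i + 1:i + 4]
    if word[:1].isupper() and any(w in COMPANY_SUFFIXES for w in following):
        feats.add("company_name")
    return feats

print(sorted(unknown_word_features(["CFC-12"], 0)))
```

In a conditional model each returned name would be paired with a candidate tag to form one binary feature f_j(h, t).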
Regularization Helps a Lot
Higher accuracy, faster convergence, more features can be added before overfitting
\[ \mathrm{Objective}(D) = \sum_{i=1}^{n} \log P(t_i \mid h_i) - \sum_{j=1}^{m} \frac{\lambda_j^2}{2\sigma^2} \]
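A small sketch of how such a regularized objective can be evaluated, assuming it is the training log-likelihood minus a Gaussian-prior penalty of lambda_j^2 / (2 sigma^2) per weight. The function name and the numbers in the example call are hypothetical.

```python
import math

def penalized_log_likelihood(log_probs, weights, sigma_sq):
    """Sum of log P(t_i | h_i) minus sum_j lambda_j^2 / (2 * sigma^2)."""
    data_term = sum(log_probs)
    penalty = sum(w * w for w in weights) / (2.0 * sigma_sq)
    return data_term - penalty

# Two training events with probability 0.5 each and two weights of size 1;
# all values are made up for illustration.
value = penalized_log_likelihood([math.log(0.5)] * 2, [1.0, -1.0], sigma_sq=0.5)
print(value)
```

A smaller sigma_sq penalizes large weights more strongly, which is what lets more features be added before the model overfits.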
[Figure: two bar charts, Token Accuracy and Unknown Accuracy, each comparing Smoothing with No Smoothing. Caption: Accuracy with and without Gaussian smoothing]
Effect of reducing feature support cutoffs in smoothed and un-smoothed models
[Figure: two bar charts. Token Accuracy of Unsmoothed Model (Cutoff=0 vs. Cutoff=5); Token Accuracy of Smoothed Model (Cutoff=1 vs. Cutoff=5)]
Semantics of Dependency Networks
Let X=(X1,…,Xn). A dependency network for X is a pair (G,P) where G is a cyclic dependency graph and P is a set of probability distributions.
Each node in G corresponds to a variable Xi, and the parents of Xi are all nodes Pa(Xi) such that P(Xi | X1, …, Xi-1, Xi+1, …, Xn) = P(Xi | Pa(Xi)).
The distributions in P are the local probability distributions P(Xi | Pa(Xi)). If there exists a joint distribution P(X) such that the conditional distributions in P are derivable from it, then the dependency network is called consistent.
For positive distributions P, we can obtain the joint distribution P(X) by Gibbs sampling
Hofmann and Tresp (1997); Heckerman (2000)
Dependency Networks - Problems
The dependency network probabilities learned from data may be inconsistent – there may not be a joint distribution having these conditionals
Even if they define a consistent network, the scoring criterion is susceptible to mutually reinforcing but unlikely sequences
Suppose we have the following sequence of observations of two variables (a, b): <11, 11, 12, 33>
The most likely state is <11>, but Score(11) = 2/3 * 1 = 2/3 and Score(33) = 1
[Figure: two-node dependency network over a and b]
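The arithmetic in this example can be checked directly. The sketch below assumes each observation is a pair of values (a, b), estimates the two conditionals by counting, and scores a state as P(a | b) * P(b | a); the function names are made up for this illustration.

```python
from fractions import Fraction

# The observations <11, 11, 12, 33>, read as (a, b) pairs.
OBSERVATIONS = [(1, 1), (1, 1), (1, 2), (3, 3)]

def joint(a, b):
    """Empirical joint probability of the state (a, b)."""
    return Fraction(OBSERVATIONS.count((a, b)), len(OBSERVATIONS))

def p_a_given_b(a, b):
    """Empirical P(a | b), estimated by counting."""
    matching = [obs for obs in OBSERVATIONS if obs[1] == b]
    return Fraction(sum(1 for obs in matching if obs[0] == a), len(matching))

def p_b_given_a(b, a):
    """Empirical P(b | a), estimated by counting."""
    matching = [obs for obs in OBSERVATIONS if obs[0] == a]
    return Fraction(sum(1 for obs in matching if obs[1] == b), len(matching))

def score(a, b):
    """Dependency network score: product of the two local conditionals."""
    return p_a_given_b(a, b) * p_b_given_a(b, a)

for state in [(1, 1), (3, 3)]:
    print(state, "joint =", joint(*state), "score =", score(*state))
```

The state (1, 1) has the larger empirical probability (1/2 against 1/4), yet (3, 3) gets the larger score (1 against 2/3) because a=3 and b=3 predict each other perfectly.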
Conclusions
• The use of dependency networks was very helpful for tagging
  - both left and right words and tags are used for prediction, which avoids bad independence assumptions
  - in training and test, the time/space complexity is the same as for CMMs
  - promising for other NLP sequence tasks
• More predictive features for tagging
  - Rich lexicalization further improved accuracy
  - Conjunctions of feature templates
  - Smoothing is critical