Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network

Kristina Toutanova, Dan Klein, Christopher Manning, Yoram Singer

Stanford University, The Hebrew University of Jerusalem

Highlights

• Just using P(t|w) works even better than you thought, using a better unknown word model

• You can tag really well with no sequence model at all

• Conditioning on BOTH left AND right tags yields the best published tagging performance

• If you are using a maxent model: use proper smoothing, consider more lexicalization, use conjunctions of features

Sequential Classifiers

Learn classifiers for local decisions: predict the tag of a word based on features we like (neighboring words, tags, etc.)

Combine the decisions of the classifiers using their output probabilities or scores and choose the best global tag sequence

[Figure: tags t-1, t0, t1 above words w-1, w0, w1]

When the dependencies are not cyclic and the classifier is probabilistic, this corresponds to a Bayesian network (a conditional Markov model, CMM)

Experiments for Part-of-Speech Tagging

Data – WSJ sections 0-18 training, 19-21 dev, 22-24 test

Log-linear models for local distributions

All features are binary and formed by instantiating templates

f1(h,t)=1, iff w0=“to” and t=TO (0 otherwise)

Separate feature templates targeted at unknown words: prefixes, suffixes, etc.

\[ P(t \mid h) = \frac{\exp\left(\sum_{j=1}^{m} \lambda_j f_j(h, t)\right)}{\sum_{k=1}^{K} \exp\left(\sum_{j=1}^{m} \lambda_j f_j(h, t_k)\right)} \]
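The local log-linear distribution on this slide can be sketched in a few lines. This is only an illustration: the weight value, the three-tag tagset, and the dictionary layout of the history h are assumptions made for the example, not values from the talk. Only the feature f1 comes from the slide.

```python
import math

def local_prob(history, tag, tagset, features, weights):
    """P(tag | history) for a log-linear model with binary features.

    features: list of functions f(history, tag) -> bool
    weights:  list of floats, one per feature
    """
    def activation(t):
        # Sum of the weights of all features that fire for (history, t).
        return sum(w for f, w in zip(features, weights) if f(history, t))

    # Subtract the maximum activation for numerical stability.
    top = max(activation(t) for t in tagset)
    exps = {t: math.exp(activation(t) - top) for t in tagset}
    return exps[tag] / sum(exps.values())

def f1(h, t):
    """The slide's example feature: on iff w0 = "to" and t = TO."""
    return h["w0"] == "to" and t == "TO"

# Hypothetical tagset and weight, chosen only for illustration.
TAGSET = ["TO", "IN", "NN"]
FEATURES = [f1]
WEIGHTS = [2.0]

p_to = local_prob({"w0": "to"}, "TO", TAGSET, FEATURES, WEIGHTS)
print(p_to)
```

With a single active feature of weight 2.0 and three candidate tags, the probability of TO is exp(2) / (exp(2) + 2), because the two other tags have zero activation.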

Tagging Without Sequence Information

[Figure: model graphs. Baseline: t0, w0. Three Words: t0, w-1, w0, w1.]

Model      Features   Token acc.   Unknown acc.   Sentence acc.
Baseline   56,805     93.69%       82.61%         26.74%
3Words     239,767    96.57%       86.78%         48.27%

Using words only works significantly better than using the previous two or three tags!

CMM Tagging Models - I

Independence Assumptions of Left-to-Right CMM

• ti is independent of t1…ti-2 and w1…wi-1 given ti-1

• ti is independent of all following observations

Similar assumptions in the Right-to-Left CMM

• ti is independent of all preceding observations

[Figure: left-to-right and right-to-left CMM graphs, each over tags t1, t2, t3 and words w1, w2, w3]

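CMM Tagging Models - II

The bad independence assumptions lead to label bias (Bottou 91, Lafferty 01) and observation bias (Klein & Manning 02)

Example: will {MD, NN} to {TO} fight {NN, VB, VBP}

"will" will be mis-tagged as MD, because MD is its most common tagging. Since "to" is always tagged TO, P(TO|to,t1)=1 for every t1, so the following tag cannot change the decision for "will".

[Figure: CMM over "will to fight" with t2 = TO; t1 and t3 undetermined]

P(t1=MD,t2=TO|will,to)=P(MD|will,sos)*P(TO|to,MD)=P(MD|will,sos)*1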
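The arithmetic behind this example can be made concrete with a small sketch. The probabilities below are hypothetical and chosen only to illustrate the equation. The talk does not give actual values.

```python
# Hypothetical local probabilities (assumptions for illustration only).
p_t1 = {"MD": 0.8, "NN": 0.2}                   # P(t1 | will, sos)
p_to = {("TO", "MD"): 1.0, ("TO", "NN"): 1.0}   # P(TO | to, t1): "to" is always TO

# Left-to-right CMM score of each candidate tagging of "will to".
scores = {t1: p_t1[t1] * p_to[("TO", t1)] for t1 in p_t1}
best_t1 = max(scores, key=scores.get)

print(scores)
print(best_t1)
```

Because the second factor is 1 for every t1, the ranking of the candidates is decided entirely by P(t1 | will, sos), so the unambiguous TO provides no evidence about the tag of "will".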

CMM Tagging Models - III

Example: will {MD, NN} to {TO} fight {NN, VB, VBP}

In the Right-to-Left CMM, "fight" will most likely be mis-tagged as NN

[Figure: CMM over "will to fight" with t2 = TO; t1 and t3 undetermined]

P(t2=TO,t3=NN|to,fight)=P(NN|fight,X)*P(TO|to,NN)=P(NN|fight,X)*1

Dependency Networks

Conditioning on both left and right tags fixes the problem

[Figure: dependency network over "will to fight" with t2 = TO; t1 and t3 undetermined]

Dependency Networks

\[ \mathrm{Score}(t_1, \ldots, t_n \mid w_1, \ldots, w_n) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i+1}, w_i) \]

We do not attempt to construct a joint distribution.

We classify to the highest scoring sequence

An efficient dynamic programming algorithm similar to Viterbi exists for finding the most likely sequence

[Figure: two-position example with tags t1, t2 above words w1, w2]
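As a minimal sketch of the scoring rule on this slide, the function below multiplies the local conditionals for a given tag sequence. The callable `local_prob`, the boundary symbols `<s>` and `</s>`, and the constant toy model are assumptions made for illustration. A trained local classifier would take the place of `local_prob`.

```python
import math

START, END = "<s>", "</s>"  # assumed boundary tags, not from the talk

def sequence_score(words, tags, local_prob):
    """Product over i of local_prob(t_{i-1}, t_i, t_{i+1}, w_i).

    local_prob is a hypothetical stand-in for P(t_i | t_{i-1}, t_{i+1}, w_i).
    """
    padded = [START] + list(tags) + [END]
    log_score = 0.0
    for i, word in enumerate(words):
        p = local_prob(padded[i], padded[i + 1], padded[i + 2], word)
        if p == 0.0:
            return 0.0
        log_score += math.log(p)  # sum logs to avoid underflow on long inputs
    return math.exp(log_score)

def constant_model(prev_tag, tag, next_tag, word):
    """Toy model that ignores its inputs; used only to exercise the function."""
    return 0.5

toy_score = sequence_score(["will", "to", "fight"], ["MD", "TO", "VB"], constant_model)
print(toy_score)
```

Note that the result is a score, not a probability: as the slide says, no joint distribution is constructed, so scores over all tag sequences need not sum to one.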

Inference for Linear Dependency Networks

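[Figure: chain of tags ti-1, ti, ti+1, ti+2 above words wi-1, wi, wi+1, wi+2]

\[ \mathrm{bestScore}(i+1, t_{i-1}, t_i, t_{i+1}) = \max_{t_{i-2}} \mathrm{bestScore}(i, t_{i-2}, t_{i-1}, t_i) \, P(t_i \mid t_{i-1}, t_{i+1}, w_i) \]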
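A Viterbi-style search for the highest scoring sequence might look like the sketch below. It is not the authors' implementation: it keeps a pair of adjacent tags as the dynamic programming state, which is sufficient when each local factor looks at one tag on each side, and it assumes a hypothetical callable `local_prob(prev_tag, tag, next_tag, word)` plus invented boundary symbols. An exhaustive search is included only as a sanity check.

```python
from itertools import product

START, END = "<s>", "</s>"  # assumed boundary tags

def best_sequence(words, tags, local_prob):
    """Return (score, tag list) maximizing the product of local factors."""
    n = len(words)
    if n == 0:
        return 1.0, []
    # State (a, b): a is the current tag, b the tag that follows it.
    # Value: best score of all factors up to the tag before a, plus its path.
    best = {(START, t): (1.0, []) for t in tags}
    for i, word in enumerate(words):
        next_tags = tags if i + 1 < n else [END]
        new_best = {}
        for (prev, cur), (score, path) in best.items():
            for nxt in next_tags:
                cand = score * local_prob(prev, cur, nxt, word)
                key = (cur, nxt)
                if key not in new_best or cand > new_best[key][0]:
                    new_best[key] = (cand, path + [cur])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

def brute_force_best(words, tags, local_prob):
    """Exhaustive search over all tag sequences; exponential, for checking only."""
    best = (-1.0, [])
    for seq in product(tags, repeat=len(words)):
        padded = [START] + list(seq) + [END]
        score = 1.0
        for i, word in enumerate(words):
            score *= local_prob(padded[i], padded[i + 1], padded[i + 2], word)
        if score > best[0]:
            best = (score, list(seq))
    return best

def toy_local_prob(prev, cur, nxt, word):
    """Hypothetical, unnormalized local score used only to exercise the search."""
    score = 0.2
    if cur != prev:
        score += 0.3
    if cur != nxt:
        score += 0.3
    if word.startswith(cur.lower()):
        score += 0.2
    return score

dp_score, dp_path = best_sequence(["a", "b", "a", "c"], ["A", "B", "C"], toy_local_prob)
bf_score, bf_path = brute_force_best(["a", "b", "a", "c"], ["A", "B", "C"], toy_local_prob)
print(dp_path, bf_path)
```

The dynamic program does work proportional to the sentence length times the cube of the tagset size, whereas the exhaustive check grows exponentially with sentence length.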

Using Tags: Left Context is Better

[Figure: model graphs. Baseline: t0, w0. Model L: t0, w0, t-1. Model R: t0, w0, t1.]

Model      Features   Token acc.   Unknown acc.   Sentence acc.
Baseline   56,805     93.69%       82.61%         26.74%
L          27,474     95.79%       85.49%         41.89%
R          27,648     95.14%       85.65%         36.31%

Model L has 13.4% error reduction from Model R
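The error reduction figures quoted on these slides are relative reductions in token error rate. A short sketch of the arithmetic, using the token accuracies from the tables (the function name is an illustrative choice):

```python
def error_reduction(acc_new, acc_old):
    """Relative error reduction in percent, given two accuracies in percent."""
    err_new = 100.0 - acc_new
    err_old = 100.0 - acc_old
    return 100.0 * (err_old - err_new) / err_old

# Model L (95.79%) compared with Model R (95.14%).
print(round(error_reduction(95.79, 95.14), 1))  # 13.4

# Model L+R (96.57%) compared with Model L+L2 (96.05%).
print(round(error_reduction(96.57, 96.05), 1))  # 13.2
```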

Centered Context is Better

[Figure: model graphs. L+L2: t0, w0, t-1, t-2. R+R2: t0, w0, t1, t2. L+R: t0, w0, t-1, t1.]

Model   Features   Token acc.   Unknown acc.   Sentence acc.
L+L2    32,935     96.05%       85.92%         44.04%
R+R2    33,423     95.25%       84.49%         37.20%
L+R     32,610     96.57%       87.15%         49.50%

Model L+R has 13.2% error reduction from Model L+L2

Centered Context is Better in the End

[Figure: model graphs. L+LL+LLL: t0, w0, t-1, t-2, t-3. L+LL+LR+R+RR: t0, w0, t-2, t-1, t1, t2.]

Model          Features   Token acc.   Unknown acc.   Sentence acc.
L+LL+LLL       118,752    96.20%       86.52%         45.14%
L+LL+LR+R+RR   81,049     96.92%       87.91%         53.23%

15% error reduction due to including right word tags

Lexicalization and More Unknown Word Features

[Figure: model graph for L+LL+LR+R+RR+3W: tags t-2, t-1, t0, t1, t2 and words w-1, w0, w+1]

Model                               Features   Token acc.   Unknown acc.   Sentence acc.
L+LL+LR+R+RR (TAGS)                 81,049     96.92%       87.91%         53.23%
TAGS+3W                             263,160    97.02%       88.05%         53.83%
TAGS+3W+LW0+RW0+W-1W0+W0W1 (BEST)   460,552    97.15%       88.61%         55.83%
BEST test set                       460,552    97.24%       89.04%         56.34%

Final Test Results

Model           Features   Token acc.   Unknown acc.   Sentence acc.
BEST test set   460,552    97.24%       89.04%         56.34%

[Figure: bar chart of token error rate on the test set, Us vs. Collins]

Comparison to best published results (Collins 02)

• 4.4% error reduction

• Statistically significant

Unknown Word Features

Because we use a conditional model, it is easy to define complex features of the words

• A crude company name detector: the feature is on if the word is capitalized and followed by a company name suffix like Co. or Inc. within 3 words

• Conjunctions of character-level features: capitalized, contains digit, contains dash, all capitalized, etc. (e.g., CFC-12, F/A-18)

• Prefixes and suffixes up to length 10
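The unknown word templates listed on this slide could be instantiated roughly as follows. This is a sketch under assumptions: the feature names, the suffix set, and the way the character-level properties are joined into one conjunction are illustrative choices, not the authors' exact templates.

```python
# Assumed suffix list; the slide only mentions "Co. or Inc" as examples.
COMPANY_SUFFIXES = {"Co.", "Inc", "Inc."}

def unknown_word_features(words, i, max_affix=10):
    """Return a set of string-valued features for the word at position i."""
    word = words[i]
    feats = set()

    # Conjunction of character-level properties, joined into one feature.
    shape = []
    if word[:1].isupper():
        shape.append("cap")
    if any(ch.isdigit() for ch in word):
        shape.append("digit")
    if "-" in word:
        shape.append("dash")
    if word.isupper():
        shape.append("allcap")
    if shape:
        feats.add("shape=" + "+".join(shape))

    # Prefixes and suffixes up to length max_affix.
    for k in range(1, min(max_affix, len(word)) + 1):
        feats.add("prefix=" + word[:k])
        feats.add("suffix=" + word[-k:])

    # Crude company name detector: capitalized word with a company
    # suffix among the next three words.
    if word[:1].isupper() and any(w in COMPANY_SUFFIXES for w in words[i + 1:i + 4]):
        feats.add("company")

    return feats

example = unknown_word_features(["CFC-12"], 0)
print(sorted(example))
```

Each returned string would then be paired with a candidate tag to form a binary feature of the kind used by the log-linear model.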

Regularization Helps a Lot

Higher accuracy, faster convergence, more features can be added before overfitting

\[ \mathrm{Objective}(D) = \sum_{i=1}^{n} \log P(t_i \mid h_i) - \frac{1}{2\sigma^2} \sum_{j=1}^{m} \lambda_j^2 \]
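A minimal sketch of the regularized objective: the conditional log-likelihood of the training data minus a quadratic (Gaussian prior) penalty on the feature weights. The inputs below are made-up numbers for illustration, and the variance value is an assumption rather than a setting taken from the talk.

```python
import math

def objective(log_probs, weights, sigma_sq):
    """Penalized log-likelihood.

    log_probs: log P(t_i | h_i) for each training example
    weights:   feature weights
    sigma_sq:  variance of the Gaussian prior (smaller means stronger smoothing)
    """
    log_likelihood = sum(log_probs)
    penalty = sum(w * w for w in weights) / (2.0 * sigma_sq)
    return log_likelihood - penalty

# Illustrative values only.
toy_log_probs = [math.log(0.5), math.log(0.25)]
toy_weights = [1.0, -2.0]
value = objective(toy_log_probs, toy_weights, sigma_sq=0.5)
print(value)
```

Training maximizes this quantity, so large weights are accepted only when they buy a corresponding gain in likelihood, which is what allows low-count features to be kept without overfitting.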

Regularization Helps a Lot

[Figure: Accuracy with and without Gaussian smoothing. Two bar charts, Token Accuracy and Unknown Accuracy, each comparing Smoothing vs. No Smoothing.]

[Figure: Effect of reducing feature support cutoffs in smoothed and un-smoothed models. Two bar charts: Token Accuracy of Unsmoothed Model (Cutoff=0 vs. Cutoff=5) and Token Accuracy of Smoothed Model (Cutoff=1 vs. Cutoff=5).]

Semantics of Dependency Networks

Let X=(X1,…,Xn). A dependency network for X is a pair (G,P) where G is a cyclic dependency graph and P is a set of probability distributions.

Each node in G corresponds to a variable Xi, and the parents of Xi are all nodes Pa(Xi) such that P(Xi | X1, …, Xi-1, Xi+1, …, Xn) = P(Xi | Pa(Xi))

The distributions in P are the local probability distributions p(Xi | Pa(Xi)). If there exists a joint distribution P(X) such that the conditional distributions in P are derivable from it, then the dependency network is called consistent.

For positive distributions P, we can obtain the joint distribution P(X) by Gibbs sampling

Hofmann and Tresp (1997); Heckerman (2000)

Dependency Networks - Problems

The dependency network probabilities learned from data may be inconsistent – there may not be a joint distribution having these conditionals

Even if they define a consistent network, the scoring criterion is susceptible to mutually reinforcing but unlikely sequences

Suppose we have the following sequence of observations of two variables (a, b): <11, 11, 12, 33>

The most likely state is <11>, but Score(11) = P(b=1|a=1) * P(a=1|b=1) = 2/3 * 1 = 2/3 and Score(33) = 1

[Figure: two-node dependency network over a and b]
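The numbers in this example can be reproduced by estimating both conditionals from the four observations and multiplying them. The sketch below uses exact fractions. The function names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# The slide's observations, read as pairs (a, b).
observations = [(1, 1), (1, 1), (1, 2), (3, 3)]

pair_counts = Counter(observations)
a_counts = Counter(a for a, _ in observations)
b_counts = Counter(b for _, b in observations)

def p_a_given_b(a, b):
    return Fraction(pair_counts[(a, b)], b_counts[b])

def p_b_given_a(b, a):
    return Fraction(pair_counts[(a, b)], a_counts[a])

def score(a, b):
    """Dependency network score: product of the two local conditionals."""
    return p_a_given_b(a, b) * p_b_given_a(b, a)

def empirical_joint(a, b):
    return Fraction(pair_counts[(a, b)], len(observations))

print(score(1, 1), score(3, 3))                      # 2/3 1
print(empirical_joint(1, 1), empirical_joint(3, 3))  # 1/2 1/4
```

The state <33> is seen only once, yet it outscores <11> because each of its values perfectly predicts the other. This is the mutual reinforcement the slide warns about.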

Conclusions

The use of dependency networks was very helpful for tagging

• both left and right words and tags are used for prediction, which avoids bad independence assumptions

• in training and test, the time/space complexity is the same as for CMMs

• promising for other NLP sequence tasks

More predictive features for tagging

• Rich lexicalization further improved accuracy

• Conjunctions of feature templates

• Smoothing is critical
