Transcript of *Introduction to Natural Language Processing (600.465) Tagging, Tagsets, and Morphology

Page 1: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology

10/19/1999 JHU CS 600.465/Jan Hajic 1

*Introduction to Natural Language Processing (600.465)

Tagging, Tagsets, and Morphology

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic

Page 2: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The task of (Morphological) Tagging

• Formally: A+ → T
  – A is the alphabet of phonemes (A+ denotes any non-empty sequence of phonemes)
    • often: phonemes ~ letters

• T is the set of tags (the “tagset”)

• Recall: 6 levels of language description:
  • phonetics ... phonology ... morphology ... syntax ... meaning ...

– a step aside:

• Recall: A+ → 2^(L,C1,C2,...,Cn) → T
  – the first step (A+ → 2^(L,C1,C2,...,Cn)) is morphology, the second step (→ T) is tagging: disambiguation (~ "select")

Page 3: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Tagging Examples

• Word form: A+ → 2^(L,C1,C2,...,Cn) → T
  – He always books the violin concert tickets early.

• MA: books → {(book-1,Noun,Pl,-,-),(book-2,Verb,Sg,Pres,3)}

• tagging (disambiguation): ... → (Verb,Sg,Pres,3)

– ...was pretty good. However, she did not realize...
  • MA: However → {(however-1,Conj/coord,-,-,-),(however-2,Adv,-,-,-)}

• tagging: ... → (Conj/coord,-,-,-)

– [æ n d] [g i v] [i t] [t u:] [j u:] ("and give it to you")
  • MA: [t u:] → {(to-1,Prep),(two,Num),(to-2,Part/inf),(too,Adv)}

• tagging: ... → (Prep)
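The three examples can be pictured as one lookup table (the output of MA) plus a selection step. Below is a minimal sketch of that view; the MA table and the pick-a-reading heuristic are illustrative toys, not the course's actual components.

    # Morphological analysis as word form -> set of (lemma, tag...) readings,
    # followed by tagging = picking one reading; data are illustrative only.
    MA = {
        "books":   {("book-1", "Noun", "Pl", "-", "-"), ("book-2", "Verb", "Sg", "Pres", "3")},
        "However": {("however-1", "Conj/coord", "-", "-", "-"), ("however-2", "Adv", "-", "-", "-")},
    }

    def tag(word, context=None):
        """Disambiguation: select one reading from MA(word); here just a toy rule."""
        readings = MA.get(word, set())
        # toy heuristic: prefer a verb reading after a pronoun subject ("He ... books ...")
        if context == "after_pronoun":
            verbs = [r for r in readings if r[1] == "Verb"]
            if verbs:
                return verbs[0]
        return next(iter(readings), None)

    print(tag("books", context="after_pronoun"))   # ('book-2', 'Verb', 'Sg', 'Pres', '3')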

Page 4: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Tagsets

• General definition:
  – tag ~ (c1,c2,...,cn)
  – often thought of as a flat list T = {ti}i=1..n with some assumed 1:1 mapping T ↔ (C1,C2,...,Cn)

• English tagsets (see MS):
  – Penn Treebank (45) (VBZ: Verb,Pres,3,sg; JJR: Adj. Comp.)

– Brown Corpus (87), Claws c5 (62), London-Lund (197)

Page 5: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Other Language Tagsets

• Differences:
  – size (10..10k)

– categories covered (POS, Number, Case, Negation,...)

– level of detail

– presentation (short names vs. structured (“positional”))

• Example:
  – Czech: AGFS3----1A----
    (a positional tag; the positions are: POS, SUBPOS, GENDER, NUMBER, CASE, POSSG, POSSN, PERSON, TENSE, DCOMP, NEG, VOICE, VAR)

Page 6: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Tagging Inside Morphology

• Do tagging first, then morphology:

• Formally: A+ → T → (L,C1,C2,...,Cn)

• Rationale:
  – have |T| < |(L,C1,C2,...,Cn)| (thus, less work for the tagger) and keep the mapping T → (L,C1,C2,...,Cn) unique.

• Possible for some languages only ("English-like")
• Same effect within "regular" A+ → 2^(L,C1,C2,...,Cn) → T:
  – mapping R : (C1,C2,...,Cn) → Treduced,
    then (new) unique mapping U: A+ × Treduced → (L,T)

Page 7: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Lemmatization

• Full morphological analysis:
  MA: A+ → 2^(L,C1,C2,...,Cn)
  (recall: a lemma l ∈ L is a lexical unit (~ dictionary entry ref.))

• Lemmatization: reduced MA:
  – L: A+ → 2^L: w → {l; (l,t1,t2,...,tn) ∈ MA(w)}

– again, need to disambiguate (want: A+ → L)

(special case of word sense disambiguation, WSD)

– “classic” tagging does not deal with lemmatization

(assumes lemmatization done afterwards somehow)

Page 8: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Morphological Analysis: Methods

• Word form list
  • books: book-2/VBZ, book-1/NNS

• Direct coding
  • endings: verbreg:s/VBZ, nounreg:s/NNS, adje:er/JJR, ...
  • (main) dictionary: book/verbreg, book/nounreg, nic/adje: nice

• Finite state machinery (FSM)
  • many "lexicons", with continuation links: reg-root-lex → reg-end-lex
  • phonology included but (often) clearly separated

• CFG, DATR, Unification, ...
  • address linguistic rather than computational phenomena
  • in fact, better suited for morphological synthesis (generation)

Page 9: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Word Lists

• Works for English
  – "input" problem: repetitive hand coding

• Implementation issues:
  – search trees

– hash tables (Perl!)

– (letter) trie:

• Minimization?

  [Figure: a letter trie sharing prefixes among the entries a/Art, at/Prep, and/Conj, ant/NN]
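A minimal sketch of such a letter trie, using the entries recoverable from the figure; the dict-of-dicts layout is just one possible implementation, not the lecture's code.

    # Letter trie for a word list: shared prefixes are stored once; a node's
    # "$" key holds the analyses of the word ending there.
    WORDS = {"a": "Art", "at": "Prep", "and": "Conj", "ant": "NN"}

    trie = {}
    for word, tag in WORDS.items():
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node.setdefault("$", []).append(tag)

    def lookup(word):
        node = trie
        for ch in word:
            if ch not in node:
                return None
            node = node[ch]
        return node.get("$")

    print(lookup("ant"), lookup("an"))   # ['NN'] None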

Page 10: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Word-internal1 Segmentation (Direct)

• Strip prefixes: (un-, dis-, ...)
• Repeat for all plausible endings:
  – Split rest: root + ending (for every possible ending)
  – Find root in a dictionary, keep dictionary information
    • in particular, keep inflection class (such as reg, noun-irreg-e, ...)
  – Find ending, check inflection+prefix class match
  – If match found:
    • Output root-related info (typically, the lemma(s))
    • Output ending-related information (typically, the tag(s)). (A sketch of this procedure follows below.)

1Word segmentation is a different problem (Japanese, speech in general)
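A toy sketch of the procedure above; the prefix list, ending table, and dictionary are made-up stand-ins that only cover the slides' own examples.

    # Strip a prefix, split the rest into root + ending, look the root up,
    # and check that the ending's inflection class matches the root's class.
    PREFIXES = ["un", "dis", ""]
    ENDINGS = {"s": [("verbreg", "VBZ"), ("nounreg", "NNS")],
               "er": [("adje", "JJR")],
               "": [("verbreg", "VB"), ("nounreg", "NN"), ("adje", "JJ")]}
    DICT = {"book": [("book", "verbreg"), ("book", "nounreg")],
            "nic": [("nice", "adje")]}

    def analyze(form):
        analyses = []
        for pre in PREFIXES:
            if not form.startswith(pre):
                continue
            rest = form[len(pre):]
            for ending, classes in ENDINGS.items():           # every plausible ending
                if not rest.endswith(ending) or len(rest) == len(ending):
                    continue
                root = rest[: len(rest) - len(ending)]
                for lemma, infl_class in DICT.get(root, []):   # find root, keep its class
                    for cls, tag in classes:                   # ending must match the class
                        if cls == infl_class:
                            analyses.append((pre + lemma, tag))
        return analyses

    print(analyze("books"))   # [('book', 'VBZ'), ('book', 'NNS')]
    print(analyze("nicer"))   # [('nice', 'JJR')]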

Page 11: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Finite State Machinery

• Two-level Morphology
  – phonology + "morphotactics" (= morphology)

• Both components use finite-state automata:
  – phonology: "two-level rules", converted to FSA
    • e:0 ⇔ _ +:0 e:e r:r (nice → nicer)
  – morphology: linked lexicons
    • root-dic: book/"book" end-noun-reg-dic

• end-noun-reg-dic: +s/”NNS”

• Integration of the two possible (and simple)

Page 12: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Finite State Transducer

• FST is an FSA where
  – symbols are pairs (r:s) from finite alphabets R and S.

• "Checking" run:
  – input data: sequence of pairs; output: Yes/No (accept/do not accept)
  – use as an FSA

• Analysis run:
  – input data: sequence of only s ∈ S (TLM: surface);
  – output: sequence of r ∈ R (TLM: lexical), + lexicon "glosses"

• Synthesis (generation) run:
  – same as analysis except the roles are switched: S ↔ R, no gloss

Page 13: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


FST Example

• German umlaut (greatly simplified!):
  u ↔ ü if (but not only if) c h e r follows (Buch → Bücher)
  rule: u:ü ⇐ c:c h:h e:e r:r

  [Figure: the corresponding FST; states F1-F6 are final, N1 is non-final; transitions are labeled with lexical:surface pairs u:ü, u:oth, c:c, h:h, e:e, r:r, oth(er), any]

  State sequences traversed:
    Buch/Buch:   F1 F3 F4 F5
    Buch/Bucher: F1 F3 F4 F5 F6 N1
    Buch/Buck:   F1 F3 F4 F1

Page 14: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Parallel Rules, Zero Symbols

• Parallel Rules:
  – Each rule ~ one FST
  – Run in parallel
  – If any of them fails, the path fails

• Zero symbols (one side only, even though 0:0 is o.k.)
  – behave like any other symbol

  [Figure: FST fragment with transitions e:0 and +:0 between states F5 and F6]

Page 15: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Lexicon

• Ordinary FSA ("lexical" alphabet only)
• Used for analysis only (NB: a disadvantage of TLM):
  – additional constraint:
    • the lexical string must pass the linked lexicon list

• Implemented as a FSA; compiled from lists of strings and lexicon links

• Example:

  [Figure: lexicon FSA with shared structure]
    b-o-o-k → "book"
    b-a-n-k → "bank"
    both continue with +-s → "NNS"
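A toy rendering of such linked lexicons, assuming a plain dictionary per lexicon with continuation links; the entry names follow the earlier slide, everything else is illustrative.

    # Each lexicon maps strings to a gloss plus a continuation lexicon;
    # analysis = consume the lexical string lexicon by lexicon, following links.
    LEXICONS = {
        "root-dic":         {"book": ("book", "end-noun-reg-dic"),
                             "bank": ("bank", "end-noun-reg-dic")},
        "end-noun-reg-dic": {"+s": ("NNS", None), "": ("NN", None)},
    }

    def analyze(lexical, lexicon="root-dic", glosses=()):
        """Return all gloss sequences for a lexical string, e.g. 'book+s'."""
        results = []
        for entry, (gloss, cont) in LEXICONS[lexicon].items():
            if lexical.startswith(entry):
                rest = lexical[len(entry):]
                if cont is None and rest == "":
                    results.append(glosses + (gloss,))
                elif cont is not None:
                    results.extend(analyze(rest, cont, glosses + (gloss,)))
        return results

    print(analyze("book+s"))   # [('book', 'NNS')]
    print(analyze("bank"))     # [('bank', 'NN')]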

Page 16: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


TLM: Analysis

1. Initialize set of paths to P = {}.

2. Read input symbols, one at a time.

3. At each symbol, generate all lexical symbols possibly corresponding to the 0 symbol (voilà!).

4. Prolong all paths in P by all such possible (x:0) pairs.

5. Check each new path extension against the phonological FST and the lexical FSA (lexical symbols only); delete impossible path prefixes.

6. Repeat 4-5 until the maximum number of consecutive zeros is reached.

Page 17: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


TLM: Analysis (Cont.)

7. Generate all possible lexical symbols (get from all FSTs) for the current input symbol, form pairs.

8. Extend all paths from P using all such pairs.

9. Check all paths from P (next step in FST/FSA). Delete all outright impossible paths.

10. Repeat from 3 until end of input.

11. Collect lexical “glosses” from all surviving paths.
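A much-reduced sketch of the analysis loop in steps 1-11 above; the phonological check is collapsed into a single toy predicate, the lexicon is a tiny trie for book(+s), and only one kind of zero-paired lexical symbol is tried, so this is a structural illustration rather than a faithful TLM implementation.

    from typing import List, Tuple

    # Hypothetical lexicon trie over lexical symbols; "$" marks an entry (gloss).
    LEXICON = {"b": {"o": {"o": {"k": {"$": "book",
                                       "+": {"s": {"$": "NNS"}}}}}}}

    def lexicon_prefix_ok(lexical: List[str]) -> bool:
        """True if the lexical string so far is a prefix of some lexicon path."""
        node = LEXICON
        for sym in lexical:
            if sym not in node:
                return False
            node = node[sym]
        return True

    def phonology_ok(pairs: List[Tuple[str, str]]) -> bool:
        """Stand-in for running all parallel two-level FSTs on the pair sequence.
        Here: a single toy rule, '+' (morph boundary) must map to surface 0."""
        return all(s == "0" for l, s in pairs if l == "+")

    def analyze(surface: str, max_zeros: int = 1) -> List[str]:
        """Enumerate lexical strings for a surface string (steps 1-11)."""
        results = []

        def extend(i: int, pairs: List[Tuple[str, str]], zeros: int):
            # steps 5/9: prune impossible path prefixes
            if not (phonology_ok(pairs) and lexicon_prefix_ok([l for l, _ in pairs])):
                return
            if i == len(surface):                      # steps 10-11: end of input
                lexical = [l for l, _ in pairs]
                node = LEXICON
                for sym in lexical:
                    node = node[sym]
                if "$" in node:
                    results.append("".join(lexical) + " / " + node["$"])
                return
            # steps 3-6: lexical symbols paired with a surface zero
            if zeros < max_zeros:
                for lex in ("+",):                     # candidate zero-paired symbols
                    extend(i, pairs + [(lex, "0")], zeros + 1)
            # steps 7-9: pair a lexical symbol with the current surface symbol
            extend(i + 1, pairs + [(surface[i], surface[i])], 0)

        extend(0, [], 0)
        return results

    print(analyze("books"))   # ['book+s / NNS']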

Page 18: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


TLM Analysis Example

• Bücher:

• suppose each surface letter corresponds to the same symbol at the lexical level, just ü might be ü as well as u lexically; plus zeroes (+:0), (0:0)

• Use the FST as before.

• Use lexicons:

  root:        Buch   "book"   end-reg-uml
               Bündni "union"  end-reg-s
  end-reg-uml: +0  "NNomSg"
               +er "NNomPl"

• Path development (lexical:surface), with surface ü read as lexical u or ü:
  B:B → Bu:Bü (or Bü:Bü) → Buc:Büc (or Büc:Büc) → Buch:Büch → Buch+e:Büch0e → Buch+er:Büch0er

Page 19: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


TLM: Generation

• Do not use the lexicon (well you have to put the “right” lexical strings together somehow!)

• Start with a lexical string L.
• Generate all possible pairs l:s for every symbol in L.
• Find all (hopefully only 1!) traversals through the FST which end in a final state.
• From all such traversals, print out the sequence of surface letters.

Page 20: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


TLM: Some Remarks

• Parallel FSTs (incl. the final lexicon FSA)
  – can be compiled into a (gigantic) FST
  – maybe not so gigantic (XLT - Xerox Language Tools)

• "Double-leveling" the lexicon:
  – allows for generation from (lemma, tag)
  – needs: rules with strings of unequal length

• Rule Compiler
  – Karttunen, Kay

• PC-KIMMO: free version from www.sil.org (Unix,too)

Page 21: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


*Introduction to Natural Language Processing (600.465)

Tagging: An Overview

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic

Page 22: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Rule-based Disambiguation

• Example after-morphology data (using the Penn tagset):

    I     watch   a     fly    .
    NN    NN      DT    NN     .
    PRP   VB      NN    VB
          VBP            VBP

• Rules using

– word forms, from context & current position

– tags, from context and current position

– tag sets, from context and current position

– combinations thereof

Page 23: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Example Rules

    I     watch   a     fly    .
    NN    NN      DT    NN     .
    PRP   VB      NN    VB
          VBP            VBP

• If-then style:
  • DT(eq,-1,Tag) → NN (implies NN(in,0,Set) as a condition)
  • PRP(eq,-1,Tag) and DT(eq,+1,Tag) → VBP
  • {DT,NN}(sub,0,Set) → DT
  • {VB,VBZ,VBP,VBD,VBG}(inc,+1,Tag) → not DT

• Regular expressions:
  • not(<*,*,DT>,<*,*,notNN>)
  • not(<*,*,PRP>,<*,*,notVBP>,<*,*,DT>)
  • not(<*,{DT,NN}sub,notDT>)
  • not(<*,*,DT>,<*,*,{VB,VBZ,VBP,VBD,VBG}>)
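A small sketch that treats the four regular-expression rules as constraints over complete tag sequences of the example sentence and keeps every sequence violating none of them; this is brute-force enumeration rather than the FSA machinery of the following slides, and the extra guard in R3 for single-tag words is an added assumption.

    from itertools import product

    SENT = [("I", {"NN", "PRP"}),
            ("watch", {"NN", "VB", "VBP"}),
            ("a", {"DT", "NN"}),
            ("fly", {"NN", "VB", "VBP"}),
            (".", {"."})]

    VERBS = {"VB", "VBZ", "VBP", "VBD", "VBG"}

    def violates(tags):
        for i in range(len(tags) - 1):
            # R1: not(<*,*,DT>, <*,*,notNN>)
            if tags[i] == "DT" and tags[i + 1] != "NN":
                return True
            # R4: not(<*,*,DT>, <*,*,{VB,VBZ,VBP,VBD,VBG}>)
            if tags[i] == "DT" and tags[i + 1] in VERBS:
                return True
        for i in range(len(tags) - 2):
            # R2: not(<*,*,PRP>, <*,*,notVBP>, <*,*,DT>)
            if tags[i] == "PRP" and tags[i + 1] != "VBP" and tags[i + 2] == "DT":
                return True
        for i, (_, ts) in enumerate(SENT):
            # R3: not(<*, {DT,NN}sub, notDT>); guard added so single-tag words
            # whose only tag is not DT are not blocked outright
            if ts <= {"DT", "NN"} and "DT" in ts and tags[i] != "DT":
                return True
        return False

    surviving = [tags for tags in product(*(ts for _, ts in SENT)) if not violates(tags)]
    for tags in surviving:
        print(" ".join(f"{w}/{t}" for (w, _), t in zip(SENT, tags)))
    # the reading from the slides, I/PRP watch/VBP a/DT fly/NN ./., is among the survivors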

Page 24: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Implementation

• Finite State Automata
  – parallel (each rule ~ one automaton);
    • algorithm: keep all paths that make all automata say yes
  – compile into a single FSA (intersection)

• Algorithm:
  – a version of Viterbi search, but:
    • no probabilities ("categorical" rules)
    • multiple input:
      – keep track of all possible paths

Page 25: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Example: the FSA

• R1: not(<*,*,DT>,<*,*,notNN>)
• R2: not(<*,*,PRP>,<*,*,notVBP>,<*,*,DT>)
• R3: not(<*,{DT,NN}sub,notDT>)
• R4: not(<*,*,DT>,<*,*,{VB,VBZ,VBP,VBD,VBG}>)

• R1: [Figure: automaton with states F1, F2, N3; F1 → F2 on <*,*,DT>, F2 → N3 on <*,*,notNN>; "anything else" keeps the path in/returns it to F1; N3 loops on anything]

• R3: [Figure: automaton with states F1, N2; F1 → N2 on <*,{DT,NN}sub,notDT>; "anything else" loops in F1; N2 loops on anything]

Page 26: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Applying the FSA

• R1: not(<*,*,DT>,<*,*,notNN>)
• R2: not(<*,*,PRP>,<*,*,notVBP>,<*,*,DT>)
• R3: not(<*,{DT,NN}sub,notDT>)
• R4: not(<*,*,DT>,<*,*,{VB,VBZ,VBP,VBD,VBG}>)

Input:
    I     watch   a     fly    .
    NN    NN      DT    NN     .
    PRP   VB      NN    VB
          VBP            VBP

• R1 blocks: a/DT followed by fly/VB or fly/VBP; remains: a/DT fly/NN, or a/NN fly/{NN,VB,VBP}
• R2 blocks: I/PRP watch/{NN,VB} a/DT; remains e.g.: I/PRP watch/VBP a/DT, and more
• R3 blocks: a/NN; remains only: a/DT
• R4 ~ R1 (!)

Page 27: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Applying the FSA (Cont.)

• Combine the possibilities left by R1-R4 for
    I/{NN,PRP}  watch/{NN,VB,VBP}  a/{DT,NN}  fly/{NN,VB,VBP}  ./.

• Result:

    I     watch   a     fly   .
    PRP   VBP     DT    NN    .

Page 28: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Tagging by Parsing

• Build a parse tree from the multiple input:

• Track down rules: e.g., NP → DT NN: extract (a/DT fly/NN)

• More difficult than tagging itself; results mixed

    I     watch   a     fly    .
    NN    NN      DT    NN     .
    PRP   VB      NN    VB
          VBP            VBP

  [Figure: parse tree with S over a VP, and an NP spanning (a/DT fly/NN)]

Page 29: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Statistical Methods (Overview)

• "Probabilistic":
  • HMM
    – Merialdo and many more (XLT)
  • Maximum Entropy
    – DellaPietra et al., Ratnaparkhi, and others

• Rule-based:
  • TBEDL (Transformation-Based, Error-Driven Learning)
    – Brill's tagger
  • Example-based
    – Daelemans, Zavrel, others
  • Feature-based (inflective languages)
  • Classifier Combination (Brill's ideas)

Page 30: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


*Introduction to Natural Language Processing (600.465)

HMM Tagging

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic

Page 31: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Review

• Recall:
  – tagging ~ morphological disambiguation
  – tagset VT ⊂ (C1,C2,...,Cn)
    • Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
  – mapping w → {t ∈ VT} exists
    • restriction of Morphological Analysis: A+ → 2^(L,C1,C2,...,Cn)

where A is the language alphabet, L is the set of lemmas

– extension to punctuation, sentence boundaries (treated as words)

Page 32: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Setting

• Noisy Channel setting:

    Input (tags)          The channel          Output (words)
    NNP VBZ DT ...   →    (adds "noise")   →   John drinks the ...

• Goal (as usual): discover “input” to the channel (T, the tag seq.) given the “output” (W, the word sequence)

– p(T|W) = p(W|T) p(T) / p(W)

– p(W) fixed (W given)...

argmaxT p(T|W) = argmaxT p(W|T) p(T)

Page 33: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Model

• Two models (d = |W| = |T| is the word sequence length):
  – p(W|T) = ∏i=1..d p(wi|w1,...,wi-1,t1,...,td)
  – p(T) = ∏i=1..d p(ti|t1,...,ti-1)

• Too many parameters (as always)
• Approximation using the following assumptions:
  • words do not depend on the context
  • tag depends on limited history: p(ti|t1,...,ti-1) ≅ p(ti|ti-n+1,...,ti-1)
    – n-gram tag "language" model
  • word depends on tag only: p(wi|w1,...,wi-1,t1,...,td) ≅ p(wi|ti)

Page 34: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The HMM Model Definition

• (Almost) the general HMM:
  – output (words) emitted by states (not arcs)

– states: (n-1)-tuples of tags if n-gram tag model used

– five-tuple (S, s0, Y, PS, PY), where:

• S = {s0,s1,s2,...,sT} is the set of states, s0 is the initial state,

• Y = {y1,y2,...,yV} is the output alphabet (the words),

• PS(sj|si) is the set of prob. distributions of transitions

– PS(sj|si) = p(ti|ti-n+1,...,ti-1); sj = (ti-n+2,...,ti), si = (ti-n+1,...,ti-1)

• PY(yk|si) is the set of output (emission) probability distributions

– another simplification: PY(yk|si) = PY(yk|sj) if si and sj contain the same tag as the rightmost element: PY(yk|si) = p(wi|ti)

Page 35: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Supervised Learning (Manually Annotated Data Available)

• Use MLE
  – p(wi|ti) = cwt(ti,wi) / ct(ti)
  – p(ti|ti-n+1,...,ti-1) = ctn(ti-n+1,...,ti-1,ti) / ct(n-1)(ti-n+1,...,ti-1)

• Smooth (both!)
  – p(wi|ti): "Add 1" for all possible (tag, word) pairs using a predefined dictionary (thus some zeros are kept!)
  – p(ti|ti-n+1,...,ti-1): linear interpolation:
    • e.g. for the trigram model:
      p'(ti|ti-2,ti-1) = λ3 p(ti|ti-2,ti-1) + λ2 p(ti|ti-1) + λ1 p(ti) + λ0 / |VT|
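A minimal sketch of this supervised estimation, using a bigram tag model instead of the trigram for brevity; the tiny corpus, the dictionary built from it, and the lambda values are all illustrative.

    from collections import Counter, defaultdict

    corpus = [[("John", "NNP"), ("drinks", "VBZ"), ("the", "DT"), ("milk", "NN"), (".", ".")]]

    ct = Counter()          # c_t(t)
    ctt = Counter()         # c_t2(t_{i-1}, t_i)
    cwt = Counter()         # c_wt(t_i, w_i)
    vocab_of_tag = defaultdict(set)   # "predefined dictionary": possible words per tag

    for sent in corpus:
        prev = "<s>"
        ct[prev] += 1
        for w, t in sent:
            ct[t] += 1
            ctt[(prev, t)] += 1
            cwt[(t, w)] += 1
            vocab_of_tag[t].add(w)
            prev = t

    tagset = [t for t in ct if t != "<s>"]
    V = len(tagset)

    def p_word(w, t):
        """'Add 1' for every (tag, word) pair the dictionary allows; pairs the
        dictionary rules out keep probability 0 (as on the slide)."""
        if w not in vocab_of_tag[t]:
            return 0.0
        return (cwt[(t, w)] + 1) / (ct[t] + len(vocab_of_tag[t]))

    def p_tag(t, prev, lam=(0.7, 0.2, 0.1)):
        """Linear interpolation: lam[0]*bigram + lam[1]*unigram + lam[2]*uniform."""
        total = sum(ct[x] for x in tagset)
        bigram = ctt[(prev, t)] / ct[prev] if ct[prev] else 0.0
        unigram = ct[t] / total
        return lam[0] * bigram + lam[1] * unigram + lam[2] / V

    print(p_tag("DT", "VBZ"), p_word("the", "DT"))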

Page 36: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Unsupervised Learning

• Completely unsupervised learning impossible
  – at least if we have the tagset given - how would we associate words with tags?

• Assumed (minimal) setting:
  – tagset known
  – dictionary/morph. analysis available (providing possible tags for any word)

• Use: Baum-Welch algorithm (see class 15, 10/13)
  – "tying": output (state-emitting only; the same distribution from two states with the same "final" tag)

Page 37: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Comments on Unsupervised Learning

• Initialization of Baum-Welch
  – if some annotated data are available, use them
  – keep 0 for impossible output probabilities

• Beware of:
  – degradation of accuracy (the Baum-Welch criterion is entropy, not accuracy!)

– use heldout data for cross-checking

• Supervised almost always better

Page 38: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Unknown Words

• "OOV" words (out-of-vocabulary)
  – we do not have a list of possible tags for them
  – and we certainly have no output probabilities

• Solutions:
  – try all tags (uniform distribution)

– try open-class tags (uniform, unigram distribution)

– try to “guess” possible tags (based on suffix/ending) - use different output distribution based on the ending (and/or other factors, such as capitalization)

Page 39: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Running the Tagger

• Use Viterbi
  – remember to handle unknown words
  – single-best, n-best possible

• Another option:
  – always assign the best tag at each word, but consider all possibilities for the previous tags (no back pointers, no backward pass over the path)
  – introduces random errors and implausible sequences, but might get higher accuracy (fewer secondary errors)
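A compact sketch of bigram Viterbi decoding in this setting; the transition/emission tables, the dictionary of possible tags, and the tiny floor used in place of true zero probabilities are toy assumptions.

    import math

    TAGS = ["DT", "NN", "VBZ"]
    trans = {("<s>", "DT"): 0.5, ("<s>", "NN"): 0.3, ("<s>", "VBZ"): 0.2,
             ("DT", "NN"): 0.8, ("DT", "DT"): 0.1, ("DT", "VBZ"): 0.1,
             ("NN", "VBZ"): 0.5, ("NN", "NN"): 0.3, ("NN", "DT"): 0.2,
             ("VBZ", "DT"): 0.6, ("VBZ", "NN"): 0.3, ("VBZ", "VBZ"): 0.1}
    emit = {("DT", "the"): 0.9, ("NN", "dog"): 0.4, ("NN", "barks"): 0.1,
            ("VBZ", "barks"): 0.5}
    possible_tags = {"the": ["DT"], "dog": ["NN"], "barks": ["NN", "VBZ"]}

    def viterbi(words):
        # delta[t] = best log-prob of a tag path ending in t; psi keeps back pointers
        delta = {"<s>": 0.0}
        psi = []
        for w in words:
            cands = possible_tags.get(w, TAGS)          # OOV: try all tags
            new_delta, back = {}, {}
            for t in cands:
                e = emit.get((t, w), 1e-6)              # tiny floor instead of 0
                best_prev, best = max(
                    ((p, delta[p] + math.log(trans.get((p, t), 1e-6)) + math.log(e))
                     for p in delta), key=lambda x: x[1])
                new_delta[t], back[t] = best, best_prev
            psi.append(back)
            delta = new_delta
        # follow back pointers from the best final tag
        tag = max(delta, key=delta.get)
        path = [tag]
        for back in reversed(psi[1:]):
            tag = back[tag]
            path.append(tag)
        return list(reversed(path))

    print(viterbi(["the", "dog", "barks"]))   # expected: ['DT', 'NN', 'VBZ']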

Page 40: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


(Tagger) Evaluation

• A must: Test data (S), previously unseen (in training)
  – change test data often if at all possible! ("feedback cheating")

– Error-rate based

• Formally: – Out(w) = set of output “items” for an input “item” w

– True(w) = single correct output (annotation) for w

– Errors(S) = ∑i=1..|S| (Out(wi) ≠ True(wi))

– Correct(S) = ∑i=1..|S| (True(wi) ∈ Out(wi))

– Generated(S) = ∑i=1..|S| |Out(wi)|

Page 41: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Evaluation Metrics

• Accuracy: single output (tagging: each word gets a single tag)
  – Error rate: Err(S) = Errors(S) / |S|
  – Accuracy: Acc(S) = 1 - (Errors(S) / |S|) = 1 - Err(S)

• What if multiple (or no) output?
  – Recall: R(S) = Correct(S) / |S|
  – Precision: P(S) = Correct(S) / Generated(S)
  – Combination: F measure: F = 1 / (α/P + (1-α)/R)
    • α is a weight given to precision vs. recall; for α = .5, F = 2PR/(R+P)
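A short sketch of these metrics for a system that may output a set of tags per word; OUT and TRUE are made-up toy data.

    OUT = [{"DT"}, {"NN", "VB"}, {"VBZ"}]        # Out(w_i)
    TRUE = ["DT", "NN", "NN"]                    # True(w_i)

    S = len(TRUE)
    correct = sum(1 for out, t in zip(OUT, TRUE) if t in out)     # Correct(S)
    generated = sum(len(out) for out in OUT)                      # Generated(S)
    errors = sum(1 for out, t in zip(OUT, TRUE) if out != {t})    # Errors(S), single-output view

    recall = correct / S
    precision = correct / generated
    alpha = 0.5
    f = 1.0 / (alpha / precision + (1 - alpha) / recall)          # = 2PR/(P+R) for alpha=.5

    print(f"Err={errors/S:.2f}  Acc={1-errors/S:.2f}  R={recall:.2f}  P={precision:.2f}  F={f:.2f}")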

Page 42: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


*Introduction to Natural Language Processing (600.465)

Transformation-Based Tagging

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic

Page 43: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Task, Again

• Recall:
  – tagging ~ morphological disambiguation
  – tagset VT ⊂ (C1,C2,...,Cn)
    • Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
  – mapping w → {t ∈ VT} exists
    • restriction of Morphological Analysis: A+ → 2^(L,C1,C2,...,Cn)

where A is the language alphabet, L is the set of lemmas

– extension to punctuation, sentence boundaries (treated as words)

Page 44: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Setting

• Not a source-channel view
• Not even a probabilistic model (no "numbers" used when tagging a text after a model is developed)
• Statistical, yes:

• uses training data (combination of supervised [manually annotated data available] and unsupervised [plain text, large volume] training)

• learning [rules]

• criterion: accuracy (that’s what we are interested in in the end, after all!)

Page 45: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The General Scheme

  Training: annotated training data (plus plain-text / partially annotated training data) → LEARNER → rules learned (over a number of training iterations)

  Tagging: data to annotate + rules learned → TAGGER → automatically tagged data

Page 46: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Learner

  [Diagram] Annotated training data (THE "TRUTH") → remove tags → ATD without annotation → assign initial tags → interim annotation.
  Iteration 1: add rule 1 → new interim annotation; Iteration 2: add rule 2 → ...; Iteration n: add rule n.
  Each interim annotation is compared against the TRUTH; the accumulated RULES are the learner's output.

Page 47: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The I/O of an Iteration

• In (iteration i):
  – Intermediate data (initial, or the result of the previous iteration)

– The TRUTH (the annotated training data)

– [pool of possible rules]

• Out:
  – One rule rselected(i) to enhance the set of rules learned so far

– Intermediate data (input data transformed by the rule learned in this iteration, rselected(i))

Page 48: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Initial Assignment of Tags

• One possibility:
  – NN

• Another:
  – the most frequent tag for a given word form

• Even:
  – use an HMM tagger for the initial assignment

• The method is not particularly sensitive to the initial assignment

Page 49: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Criterion

• Error rate (or Accuracy):
  – beginning of an iteration: some error rate Ein
  – each possible rule r, when applied at every data position:
    • makes an improvement somewhere in the data (cimproved(r))
    • makes it worse at some places (cworsened(r))
    • and, of course, does not touch the remaining data

• Rule contribution to the improvement of the error rate:
  • contrib(r) = cimproved(r) - cworsened(r)

• Rule selection at iteration i:
  • rselected(i) = argmaxr contrib(r)

• New error rate: Eout = Ein - contrib(rselected(i))

Page 50: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Stopping Criterion

• Obvious:
  – no improvement can be made
    • contrib(r) ≤ 0
  – or improvement too small
    • contrib(r) ≤ Threshold

• NB: prone to overtraining!
  – therefore, setting a reasonable threshold is advisable

• Heldout?
  – maybe: remove rules which degrade performance on H

Page 51: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Pool of Rules (Templates)

• Format: change tag at position i from a to b / condition

• Context rules (condition definition - "template"):

wi-3 wi-2 wi-1 wi wi+1 wi+2 wi+3

ti-3 ti-2 ti-1 ti ti+1 ti+2 ti+3

Instantiation: w, t permitted

Page 52: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Lexical Rules

• Other type: lexical rules

• Example: – wi has suffix -ied

– wi has prefix ge-

wi-3 wi-2 wi-1 wi wi+1 wi+2 wi+3

ti-3 ti-2 ti-1 ti ti+1 ti+2 ti+3

“look inside the word”

Page 53: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Rule Application

• Two possibilities:
  – immediate consequences (left-to-right):

• data: DT NN VBP NN VBP NN...

• rule: NN → NNS / preceded by NN VBP

• apply rule at position 4:

DT NN VBP NN VBP NN...

DT NN VBP NNS VBP NN...

• ...then rule cannot apply at position 6 (context not NN VBP).

– delayed ("fixed input"):
  • use original input for context

• the above rule then applies twice.
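A tiny sketch contrasting the two application modes, using the slide's rule NN → NNS / preceded by NN VBP.

    def apply_rule(tags, immediate=True):
        src = tags[:]                      # "delayed" mode reads the original input
        out = tags[:]
        for i in range(2, len(tags)):
            context = out if immediate else src
            if out[i] == "NN" and context[i - 2] == "NN" and context[i - 1] == "VBP":
                out[i] = "NNS"
        return out

    data = ["DT", "NN", "VBP", "NN", "VBP", "NN"]
    print(apply_rule(data, immediate=True))    # ['DT','NN','VBP','NNS','VBP','NN']  (applies once)
    print(apply_rule(data, immediate=False))   # ['DT','NN','VBP','NNS','VBP','NNS'] (applies twice)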

Page 54: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


In Other Words...

• 1. Strip the tags off the truth, keep the original truth.
• 2. Initialize the stripped data by some simple method.
• 3. Start with an empty set of selected rules S.
• 4. Repeat until the stopping criterion applies:
  – compute the contribution of each rule r: contrib(r) = cimproved(r) - cworsened(r)
  – select the r with the biggest contribution contrib(r), apply it to the intermediate data, and add it to the final set of selected rules S (see the sketch below).

• 5. Output the set S.
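A minimal sketch of steps 1-5; the rule template (condition on the previous tag only), the initializer, and the three-word "truth" are deliberately tiny stand-ins, not Brill's actual templates.

    TRUTH = [("the", "DT"), ("fly", "NN"), ("flies", "VBZ")]

    def initialize(words):                      # step 2: e.g. everything gets "NN"
        return ["NN" for _ in words]

    def candidate_rules():
        # template "change a to b if the previous tag is c", over a toy tagset
        tags = ["DT", "NN", "VBZ"]
        return [("prev", c, a, b) for a in tags for b in tags for c in tags + ["<s>"] if a != b]

    def apply_rule(rule, tags):
        _, c, a, b = rule
        out = tags[:]
        for i, t in enumerate(tags):
            prev = tags[i - 1] if i > 0 else "<s>"
            if t == a and prev == c:
                out[i] = b
        return out

    def contrib(rule, tags, truth):
        new = apply_rule(rule, tags)
        improved = sum(1 for o, n, t in zip(tags, new, truth) if o != t and n == t)
        worsened = sum(1 for o, n, t in zip(tags, new, truth) if o == t and n != t)
        return improved - worsened

    words = [w for w, _ in TRUTH]
    gold = [t for _, t in TRUTH]
    tags = initialize(words)                    # steps 1-2
    selected = []                               # step 3
    while True:                                 # step 4
        best = max(candidate_rules(), key=lambda r: contrib(r, tags, gold))
        if contrib(best, tags, gold) <= 0:      # stopping criterion
            break
        selected.append(best)
        tags = apply_rule(best, tags)
    print(selected, tags)                       # step 5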

Page 55: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Tagger

• Input:
  – untagged data
  – rules (S) learned by the learner

• Tagging:
  – use the same initialization as the learner did
  – for i = 1..n (n - the number of rules learnt)
    • apply rule i to the whole intermediate data, changing (some) tags

– the last intermediate data is the output.

Page 56: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


N-best & Unsupervised Modifications

• N-best modification
  – allow adding tags by rules
  – criterion: an optimal combination of accuracy and the number of tags per word (we want it close to 1)

• Unsupervised modification
  – use only unambiguous words for the evaluation criterion
  – works extremely well for English

– does not work for languages with few unambiguous words