Transcript of *Introduction to Natural Language Processing (600.465) Tagging, Tagsets, and Morphology

Page 1: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology

10/19/1999 JHU CS 600.465/Jan Hajic 1

*Introduction to Natural Language Processing (600.465)

Tagging, Tagsets, and Morphology

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic

Page 2: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The task of (Morphological) Tagging

• Formally: A+ → T
  – A is the alphabet of phonemes (A+ denotes any non-empty sequence of phonemes)
    • often: phonemes ~ letters

• T is the set of tags (the “tagset”)

• Recall: 6 levels of language description:
  • phonetics ... phonology ... morphology ... syntax ... meaning ...

– a step aside:

• Recall: A+ → 2^(L,C1,C2,...,Cn) → T
  – the first step (A+ → 2^(L,C1,C2,...,Cn)) is morphology, the second step (→ T) is tagging: disambiguation (~ "select")

Page 3: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Tagging Examples

• Word form: A+ → 2^(L,C1,C2,...,Cn) → T
  – He always books the violin concert tickets early.

• MA: books → {(book-1,Noun,Pl,-,-),(book-2,Verb,Sg,Pres,3)}

• tagging (disambiguation): ... → (Verb,Sg,Pres,3)

– ...was pretty good. However, she did not realize...
  • MA: However → {(however-1,Conj/coord,-,-,-),(however-2,Adv,-,-,-)}

• tagging: ... → (Conj/coord,-,-,-)

– [æ n d] [g i v] [i t] [t u:] [j u:] ("and give it to you")
  • MA: [t u:] → {(to-1,Prep),(two,Num),(to-2,Part/inf),(too,Adv)}

• tagging: ... → (Prep)
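The three examples can be pictured as one lookup table (the output of MA) plus a selection step. Below is a minimal sketch of that view; the MA table and the pick-a-reading heuristic are illustrative toys, not the course's actual components.

    # Morphological analysis as word form -> set of (lemma, tag...) readings,
    # followed by tagging = picking one reading; data are illustrative only.
    MA = {
        "books":   {("book-1", "Noun", "Pl", "-", "-"), ("book-2", "Verb", "Sg", "Pres", "3")},
        "However": {("however-1", "Conj/coord", "-", "-", "-"), ("however-2", "Adv", "-", "-", "-")},
    }

    def tag(word, context=None):
        """Disambiguation: select one reading from MA(word); here just a toy rule."""
        readings = MA.get(word, set())
        # toy heuristic: prefer a verb reading after a pronoun subject ("He ... books ...")
        if context == "after_pronoun":
            verbs = [r for r in readings if r[1] == "Verb"]
            if verbs:
                return verbs[0]
        return next(iter(readings), None)

    print(tag("books", context="after_pronoun"))   # ('book-2', 'Verb', 'Sg', 'Pres', '3')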

Page 4: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Tagsets

• General definition:
  – tag ~ (c1,c2,...,cn)
  – often thought of as a flat list T = {ti}i=1..n with some assumed 1:1 mapping T ↔ (C1,C2,...,Cn)

• English tagsets (see MS):
  – Penn Treebank (45) (VBZ: Verb,Pres,3,sg; JJR: Adj. Comp.)

– Brown Corpus (87), Claws c5 (62), London-Lund (197)

Page 5: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Other Language Tagsets

• Differences:
  – size (10..10k)

– categories covered (POS, Number, Case, Negation,...)

– level of detail

– presentation (short names vs. structured (“positional”))

• Example:
  – Czech: AGFS3----1A----
    (a positional tag; the positions are: POS, SUBPOS, GENDER, NUMBER, CASE, POSSG, POSSN, PERSON, TENSE, DCOMP, NEG, VOICE, VAR)

Page 6: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Tagging Inside Morphology

• Do tagging first, then morphology:

• Formally: A+ → T → (L,C1,C2,...,Cn)

• Rationale:
  – have |T| < |(L,C1,C2,...,Cn)| (thus, less work for the tagger) and keep the mapping T → (L,C1,C2,...,Cn) unique.

• Possible for some languages only ("English-like")
• Same effect within "regular" A+ → 2^(L,C1,C2,...,Cn) → T:
  – mapping R : (C1,C2,...,Cn) → Treduced,
    then (new) unique mapping U: A+ × Treduced → (L,T)

Page 7: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Lemmatization

• Full morphological analysis:
  MA: A+ → 2^(L,C1,C2,...,Cn)
  (recall: a lemma l ∈ L is a lexical unit (~ dictionary entry ref.))

• Lemmatization: reduced MA:
  – L: A+ → 2^L: w → {l; (l,t1,t2,...,tn) ∈ MA(w)}

– again, need to disambiguate (want: A+ → L)

(special case of word sense disambiguation, WSD)

– “classic” tagging does not deal with lemmatization

(assumes lemmatization done afterwards somehow)

Page 8: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Morphological Analysis: Methods

• Word form list
  • books: book-2/VBZ, book-1/NNS

• Direct coding
  • endings: verbreg:s/VBZ, nounreg:s/NNS, adje:er/JJR, ...
  • (main) dictionary: book/verbreg, book/nounreg, nic/adje: nice

• Finite state machinery (FSM)
  • many "lexicons", with continuation links: reg-root-lex → reg-end-lex
  • phonology included but (often) clearly separated

• CFG, DATR, Unification, ...
  • address linguistic rather than computational phenomena
  • in fact, better suited for morphological synthesis (generation)

Page 9: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Word Lists

• Works for English
  – "input" problem: repetitive hand coding

• Implementation issues:
  – search trees

– hash tables (Perl!)

– (letter) trie:

• Minimization?

  [Figure: a letter trie sharing prefixes among the entries a/Art, at/Prep, and/Conj, ant/NN]
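A minimal sketch of such a letter trie, using the entries recoverable from the figure; the dict-of-dicts layout is just one possible implementation, not the lecture's code.

    # Letter trie for a word list: shared prefixes are stored once; a node's
    # "$" key holds the analyses of the word ending there.
    WORDS = {"a": "Art", "at": "Prep", "and": "Conj", "ant": "NN"}

    trie = {}
    for word, tag in WORDS.items():
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node.setdefault("$", []).append(tag)

    def lookup(word):
        node = trie
        for ch in word:
            if ch not in node:
                return None
            node = node[ch]
        return node.get("$")

    print(lookup("ant"), lookup("an"))   # ['NN'] None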

Page 10: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Word-internal1 Segmentation (Direct)

• Strip prefixes: (un-, dis-, ...)
• Repeat for all plausible endings:
  – Split rest: root + ending (for every possible ending)
  – Find root in a dictionary, keep dictionary information
    • in particular, keep inflection class (such as reg, noun-irreg-e, ...)
  – Find ending, check inflection+prefix class match
  – If match found:
    • Output root-related info (typically, the lemma(s))
    • Output ending-related information (typically, the tag(s)). (A sketch of this procedure follows below.)

1Word segmentation is a different problem (Japanese, speech in general)
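A toy sketch of the procedure above; the prefix list, ending table, and dictionary are made-up stand-ins that only cover the slides' own examples.

    # Strip a prefix, split the rest into root + ending, look the root up,
    # and check that the ending's inflection class matches the root's class.
    PREFIXES = ["un", "dis", ""]
    ENDINGS = {"s": [("verbreg", "VBZ"), ("nounreg", "NNS")],
               "er": [("adje", "JJR")],
               "": [("verbreg", "VB"), ("nounreg", "NN"), ("adje", "JJ")]}
    DICT = {"book": [("book", "verbreg"), ("book", "nounreg")],
            "nic": [("nice", "adje")]}

    def analyze(form):
        analyses = []
        for pre in PREFIXES:
            if not form.startswith(pre):
                continue
            rest = form[len(pre):]
            for ending, classes in ENDINGS.items():           # every plausible ending
                if not rest.endswith(ending) or len(rest) == len(ending):
                    continue
                root = rest[: len(rest) - len(ending)]
                for lemma, infl_class in DICT.get(root, []):   # find root, keep its class
                    for cls, tag in classes:                   # ending must match the class
                        if cls == infl_class:
                            analyses.append((pre + lemma, tag))
        return analyses

    print(analyze("books"))   # [('book', 'VBZ'), ('book', 'NNS')]
    print(analyze("nicer"))   # [('nice', 'JJR')]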

Page 11: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Finite State Machinery

• Two-level Morphology
  – phonology + "morphotactics" (= morphology)

• Both components use finite-state automata:
  – phonology: "two-level rules", converted to FSA
    • e:0 ⇔ _ +:0 e:e r:r (nice → nicer)
  – morphology: linked lexicons
    • root-dic: book/"book" end-noun-reg-dic

• end-noun-reg-dic: +s/”NNS”

• Integration of the two possible (and simple)

Page 12: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Finite State Transducer

• FST is an FSA where
  – symbols are pairs (r:s) from finite alphabets R and S.

• "Checking" run:
  – input data: sequence of pairs; output: Yes/No (accept/do not accept)
  – use as an FSA

• Analysis run:
  – input data: sequence of only s ∈ S (TLM: surface);
  – output: sequence of r ∈ R (TLM: lexical), + lexicon "glosses"

• Synthesis (generation) run:
  – same as analysis except the roles are switched: S ↔ R, no gloss

Page 13: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


FST Example

• German umlaut (greatly simplified!):
  u ↔ ü if (but not only if) c h e r follows (Buch → Bücher)
  rule: u:ü ⇐ c:c h:h e:e r:r

  [Figure: the corresponding FST; states F1-F6 are final, N1 is non-final; transitions are labeled with lexical:surface pairs u:ü, u:oth, c:c, h:h, e:e, r:r, oth(er), any]

  State sequences traversed:
    Buch/Buch:   F1 F3 F4 F5
    Buch/Bucher: F1 F3 F4 F5 F6 N1
    Buch/Buck:   F1 F3 F4 F1

Page 14: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Parallel Rules, Zero Symbols

• Parallel Rules:
  – Each rule ~ one FST
  – Run in parallel
  – If any of them fails, the path fails

• Zero symbols (one side only, even though 0:0 is o.k.)
  – behave like any other symbol

  [Figure: FST fragment with transitions e:0 and +:0 between states F5 and F6]

Page 15: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Lexicon

• Ordinary FSA ("lexical" alphabet only)
• Used for analysis only (NB: a disadvantage of TLM):
  – additional constraint:
    • the lexical string must pass the linked lexicon list

• Implemented as a FSA; compiled from lists of strings and lexicon links

• Example:

  [Figure: lexicon FSA with shared structure]
    b-o-o-k → "book"
    b-a-n-k → "bank"
    both continue with +-s → "NNS"
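A toy rendering of such linked lexicons, assuming a plain dictionary per lexicon with continuation links; the entry names follow the earlier slide, everything else is illustrative.

    # Each lexicon maps strings to a gloss plus a continuation lexicon;
    # analysis = consume the lexical string lexicon by lexicon, following links.
    LEXICONS = {
        "root-dic":         {"book": ("book", "end-noun-reg-dic"),
                             "bank": ("bank", "end-noun-reg-dic")},
        "end-noun-reg-dic": {"+s": ("NNS", None), "": ("NN", None)},
    }

    def analyze(lexical, lexicon="root-dic", glosses=()):
        """Return all gloss sequences for a lexical string, e.g. 'book+s'."""
        results = []
        for entry, (gloss, cont) in LEXICONS[lexicon].items():
            if lexical.startswith(entry):
                rest = lexical[len(entry):]
                if cont is None and rest == "":
                    results.append(glosses + (gloss,))
                elif cont is not None:
                    results.extend(analyze(rest, cont, glosses + (gloss,)))
        return results

    print(analyze("book+s"))   # [('book', 'NNS')]
    print(analyze("bank"))     # [('bank', 'NN')]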

Page 16: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


TLM: Analysis

1. Initialize set of paths to P = {}.

2. Read input symbols, one at a time.

3. At each symbol, generate all lexical symbols possibly corresponding to the 0 symbol (voilà!).

4. Prolong all paths in P by all such possible (x:0) pairs.

5. Check each new path extension against the phonological FST and the lexical FSA (lexical symbols only); delete impossible path prefixes.

6. Repeat 4-5 until the maximum number of consecutive zeros is reached.

Page 17: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


TLM: Analysis (Cont.)

7. Generate all possible lexical symbols (get from all FSTs) for the current input symbol, form pairs.

8. Extend all paths from P using all such pairs.

9. Check all paths from P (next step in FST/FSA). Delete all outright impossible paths.

10. Repeat from 3 until end of input.

11. Collect lexical “glosses” from all surviving paths.
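A much-reduced sketch of the analysis loop in steps 1-11 above; the phonological check is collapsed into a single toy predicate, the lexicon is a tiny trie for book(+s), and only one kind of zero-paired lexical symbol is tried, so this is a structural illustration rather than a faithful TLM implementation.

    from typing import List, Tuple

    # Hypothetical lexicon trie over lexical symbols; "$" marks an entry (gloss).
    LEXICON = {"b": {"o": {"o": {"k": {"$": "book",
                                       "+": {"s": {"$": "NNS"}}}}}}}

    def lexicon_prefix_ok(lexical: List[str]) -> bool:
        """True if the lexical string so far is a prefix of some lexicon path."""
        node = LEXICON
        for sym in lexical:
            if sym not in node:
                return False
            node = node[sym]
        return True

    def phonology_ok(pairs: List[Tuple[str, str]]) -> bool:
        """Stand-in for running all parallel two-level FSTs on the pair sequence.
        Here: a single toy rule, '+' (morph boundary) must map to surface 0."""
        return all(s == "0" for l, s in pairs if l == "+")

    def analyze(surface: str, max_zeros: int = 1) -> List[str]:
        """Enumerate lexical strings for a surface string (steps 1-11)."""
        results = []

        def extend(i: int, pairs: List[Tuple[str, str]], zeros: int):
            # steps 5/9: prune impossible path prefixes
            if not (phonology_ok(pairs) and lexicon_prefix_ok([l for l, _ in pairs])):
                return
            if i == len(surface):                      # steps 10-11: end of input
                lexical = [l for l, _ in pairs]
                node = LEXICON
                for sym in lexical:
                    node = node[sym]
                if "$" in node:
                    results.append("".join(lexical) + " / " + node["$"])
                return
            # steps 3-6: lexical symbols paired with a surface zero
            if zeros < max_zeros:
                for lex in ("+",):                     # candidate zero-paired symbols
                    extend(i, pairs + [(lex, "0")], zeros + 1)
            # steps 7-9: pair a lexical symbol with the current surface symbol
            extend(i + 1, pairs + [(surface[i], surface[i])], 0)

        extend(0, [], 0)
        return results

    print(analyze("books"))   # ['book+s / NNS']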

Page 18: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


TLM Analysis Example

• Bücher:

• suppose each surface letter corresponds to the same symbol at the lexical level, just ü might be ü as well as u lexically; plus zeroes (+:0), (0:0)

• Use the FST as before.

• Use lexicons:

  root:        Buch   "book"   end-reg-uml
               Bündni "union"  end-reg-s
  end-reg-uml: +0  "NNomSg"
               +er "NNomPl"

• Path development (lexical:surface), with surface ü read as lexical u or ü:
  B:B → Bu:Bü (or Bü:Bü) → Buc:Büc (or Büc:Büc) → Buch:Büch → Buch+e:Büch0e → Buch+er:Büch0er

Page 19: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


TLM: Generation

• Do not use the lexicon (well you have to put the “right” lexical strings together somehow!)

• Start with a lexical string L.
• Generate all possible pairs l:s for every symbol in L.
• Find all (hopefully only 1!) traversals through the FST which end in a final state.
• From all such traversals, print out the sequence of surface letters.

Page 20: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


TLM: Some Remarks

• Parallel FSTs (incl. the final lexicon FSA)
  – can be compiled into a (gigantic) FST
  – maybe not so gigantic (XLT - Xerox Language Tools)

• "Double-leveling" the lexicon:
  – allows for generation from (lemma, tag)
  – needs: rules with strings of unequal length

• Rule Compiler
  – Karttunen, Kay

• PC-KIMMO: free version from www.sil.org (Unix,too)

Page 21: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


*Introduction to Natural Language Processing (600.465)

Tagging: An Overview

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic

Page 22: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Rule-based Disambiguation

• Example after-morphology data (using the Penn tagset):

    I     watch   a     fly    .
    NN    NN      DT    NN     .
    PRP   VB      NN    VB
          VBP            VBP

• Rules using

– word forms, from context & current position

– tags, from context and current position

– tag sets, from context and current position

– combinations thereof

Page 23: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Example Rules

    I     watch   a     fly    .
    NN    NN      DT    NN     .
    PRP   VB      NN    VB
          VBP            VBP

• If-then style:
  • DT(eq,-1,Tag) → NN (implies NN(in,0,Set) as a condition)
  • PRP(eq,-1,Tag) and DT(eq,+1,Tag) → VBP
  • {DT,NN}(sub,0,Set) → DT
  • {VB,VBZ,VBP,VBD,VBG}(inc,+1,Tag) → not DT

• Regular expressions:
  • not(<*,*,DT>,<*,*,notNN>)
  • not(<*,*,PRP>,<*,*,notVBP>,<*,*,DT>)
  • not(<*,{DT,NN}sub,notDT>)
  • not(<*,*,DT>,<*,*,{VB,VBZ,VBP,VBD,VBG}>)
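A small sketch that treats the four regular-expression rules as constraints over complete tag sequences of the example sentence and keeps every sequence violating none of them; this is brute-force enumeration rather than the FSA machinery of the following slides, and the extra guard in R3 for single-tag words is an added assumption.

    from itertools import product

    SENT = [("I", {"NN", "PRP"}),
            ("watch", {"NN", "VB", "VBP"}),
            ("a", {"DT", "NN"}),
            ("fly", {"NN", "VB", "VBP"}),
            (".", {"."})]

    VERBS = {"VB", "VBZ", "VBP", "VBD", "VBG"}

    def violates(tags):
        for i in range(len(tags) - 1):
            # R1: not(<*,*,DT>, <*,*,notNN>)
            if tags[i] == "DT" and tags[i + 1] != "NN":
                return True
            # R4: not(<*,*,DT>, <*,*,{VB,VBZ,VBP,VBD,VBG}>)
            if tags[i] == "DT" and tags[i + 1] in VERBS:
                return True
        for i in range(len(tags) - 2):
            # R2: not(<*,*,PRP>, <*,*,notVBP>, <*,*,DT>)
            if tags[i] == "PRP" and tags[i + 1] != "VBP" and tags[i + 2] == "DT":
                return True
        for i, (_, ts) in enumerate(SENT):
            # R3: not(<*, {DT,NN}sub, notDT>); guard added so single-tag words
            # whose only tag is not DT are not blocked outright
            if ts <= {"DT", "NN"} and "DT" in ts and tags[i] != "DT":
                return True
        return False

    surviving = [tags for tags in product(*(ts for _, ts in SENT)) if not violates(tags)]
    for tags in surviving:
        print(" ".join(f"{w}/{t}" for (w, _), t in zip(SENT, tags)))
    # the reading from the slides, I/PRP watch/VBP a/DT fly/NN ./., is among the survivors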

Page 24: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Implementation

• Finite State Automata
  – parallel (each rule ~ one automaton);
    • algorithm: keep all paths that make all automata say yes
  – compile into a single FSA (intersection)

• Algorithm:
  – a version of Viterbi search, but:
    • no probabilities ("categorical" rules)
    • multiple input:
      – keep track of all possible paths

Page 25: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Example: the FSA

• R1: not(<*,*,DT>,<*,*,notNN>)
• R2: not(<*,*,PRP>,<*,*,notVBP>,<*,*,DT>)
• R3: not(<*,{DT,NN}sub,notDT>)
• R4: not(<*,*,DT>,<*,*,{VB,VBZ,VBP,VBD,VBG}>)

• R1: [Figure: automaton with states F1, F2, N3; F1 → F2 on <*,*,DT>, F2 → N3 on <*,*,notNN>; "anything else" keeps the path in/returns it to F1; N3 loops on anything]

• R3: [Figure: automaton with states F1, N2; F1 → N2 on <*,{DT,NN}sub,notDT>; "anything else" loops in F1; N2 loops on anything]

Page 26: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Applying the FSA

• R1: not(<*,*,DT>,<*,*,notNN>)
• R2: not(<*,*,PRP>,<*,*,notVBP>,<*,*,DT>)
• R3: not(<*,{DT,NN}sub,notDT>)
• R4: not(<*,*,DT>,<*,*,{VB,VBZ,VBP,VBD,VBG}>)

Input:
    I     watch   a     fly    .
    NN    NN      DT    NN     .
    PRP   VB      NN    VB
          VBP            VBP

• R1 blocks: a/DT followed by fly/VB or fly/VBP; remains: a/DT fly/NN, or a/NN fly/{NN,VB,VBP}
• R2 blocks: I/PRP watch/{NN,VB} a/DT; remains e.g.: I/PRP watch/VBP a/DT, and more
• R3 blocks: a/NN; remains only: a/DT
• R4 ~ R1 (!)

Page 27: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Applying the FSA (Cont.)

• Combine the possibilities left by R1-R4 for
    I/{NN,PRP}  watch/{NN,VB,VBP}  a/{DT,NN}  fly/{NN,VB,VBP}  ./.

• Result:

    I     watch   a     fly   .
    PRP   VBP     DT    NN    .

Page 28: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Tagging by Parsing

• Build a parse tree from the multiple input:

• Track down rules: e.g., NP → DT NN: extract (a/DT fly/NN)

• More difficult than tagging itself; results mixed

    I     watch   a     fly    .
    NN    NN      DT    NN     .
    PRP   VB      NN    VB
          VBP            VBP

  [Figure: parse tree with S over a VP, and an NP spanning (a/DT fly/NN)]

Page 29: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Statistical Methods (Overview)

• "Probabilistic":
  • HMM
    – Merialdo and many more (XLT)
  • Maximum Entropy
    – DellaPietra et al., Ratnaparkhi, and others

• Rule-based:
  • TBEDL (Transformation-Based, Error-Driven Learning)
    – Brill's tagger
  • Example-based
    – Daelemans, Zavrel, others
  • Feature-based (inflective languages)
  • Classifier Combination (Brill's ideas)

Page 30: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


*Introduction to Natural Language Processing (600.465)

HMM Tagging

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic

Page 31: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Review

• Recall:
  – tagging ~ morphological disambiguation
  – tagset VT ⊂ (C1,C2,...,Cn)
    • Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
  – mapping w → {t ∈ VT} exists
    • restriction of Morphological Analysis: A+ → 2^(L,C1,C2,...,Cn)

where A is the language alphabet, L is the set of lemmas

– extension to punctuation, sentence boundaries (treated as words)

Page 32: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Setting

• Noisy Channel setting:

    Input (tags)          The channel          Output (words)
    NNP VBZ DT ...   →    (adds "noise")   →   John drinks the ...

• Goal (as usual): discover “input” to the channel (T, the tag seq.) given the “output” (W, the word sequence)

– p(T|W) = p(W|T) p(T) / p(W)

– p(W) fixed (W given)...

argmaxT p(T|W) = argmaxT p(W|T) p(T)

Page 33: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Model

• Two models (d = |W| = |T| is the word sequence length):
  – p(W|T) = ∏i=1..d p(wi|w1,...,wi-1,t1,...,td)
  – p(T) = ∏i=1..d p(ti|t1,...,ti-1)

• Too many parameters (as always)
• Approximation using the following assumptions:
  • words do not depend on the context
  • tag depends on limited history: p(ti|t1,...,ti-1) ≅ p(ti|ti-n+1,...,ti-1)
    – n-gram tag "language" model
  • word depends on tag only: p(wi|w1,...,wi-1,t1,...,td) ≅ p(wi|ti)

Page 34: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The HMM Model Definition

• (Almost) the general HMM:
  – output (words) emitted by states (not arcs)

– states: (n-1)-tuples of tags if n-gram tag model used

– five-tuple (S, s0, Y, PS, PY), where:

• S = {s0,s1,s2,...,sT} is the set of states, s0 is the initial state,

• Y = {y1,y2,...,yV} is the output alphabet (the words),

• PS(sj|si) is the set of prob. distributions of transitions

– PS(sj|si) = p(ti|ti-n+1,...,ti-1); sj = (ti-n+2,...,ti), si = (ti-n+1,...,ti-1)

• PY(yk|si) is the set of output (emission) probability distributions

– another simplification: PY(yk|si) = PY(yk|sj) if si and sj contain the same tag as the rightmost element: PY(yk|si) = p(wi|ti)

Page 35: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Supervised Learning (Manually Annotated Data Available)

• Use MLE
  – p(wi|ti) = cwt(ti,wi) / ct(ti)
  – p(ti|ti-n+1,...,ti-1) = ctn(ti-n+1,...,ti-1,ti) / ct(n-1)(ti-n+1,...,ti-1)

• Smooth (both!)
  – p(wi|ti): "Add 1" for all possible (tag, word) pairs using a predefined dictionary (thus some zeros are kept!)
  – p(ti|ti-n+1,...,ti-1): linear interpolation:
    • e.g. for the trigram model:
      p'(ti|ti-2,ti-1) = λ3 p(ti|ti-2,ti-1) + λ2 p(ti|ti-1) + λ1 p(ti) + λ0 / |VT|
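A minimal sketch of this supervised estimation, using a bigram tag model instead of the trigram for brevity; the tiny corpus, the dictionary built from it, and the lambda values are all illustrative.

    from collections import Counter, defaultdict

    corpus = [[("John", "NNP"), ("drinks", "VBZ"), ("the", "DT"), ("milk", "NN"), (".", ".")]]

    ct = Counter()          # c_t(t)
    ctt = Counter()         # c_t2(t_{i-1}, t_i)
    cwt = Counter()         # c_wt(t_i, w_i)
    vocab_of_tag = defaultdict(set)   # "predefined dictionary": possible words per tag

    for sent in corpus:
        prev = "<s>"
        ct[prev] += 1
        for w, t in sent:
            ct[t] += 1
            ctt[(prev, t)] += 1
            cwt[(t, w)] += 1
            vocab_of_tag[t].add(w)
            prev = t

    tagset = [t for t in ct if t != "<s>"]
    V = len(tagset)

    def p_word(w, t):
        """'Add 1' for every (tag, word) pair the dictionary allows; pairs the
        dictionary rules out keep probability 0 (as on the slide)."""
        if w not in vocab_of_tag[t]:
            return 0.0
        return (cwt[(t, w)] + 1) / (ct[t] + len(vocab_of_tag[t]))

    def p_tag(t, prev, lam=(0.7, 0.2, 0.1)):
        """Linear interpolation: lam[0]*bigram + lam[1]*unigram + lam[2]*uniform."""
        total = sum(ct[x] for x in tagset)
        bigram = ctt[(prev, t)] / ct[prev] if ct[prev] else 0.0
        unigram = ct[t] / total
        return lam[0] * bigram + lam[1] * unigram + lam[2] / V

    print(p_tag("DT", "VBZ"), p_word("the", "DT"))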

Page 36: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Unsupervised Learning

• Completely unsupervised learning impossible
  – at least if we have the tagset given - how would we associate words with tags?

• Assumed (minimal) setting:
  – tagset known
  – dictionary/morph. analysis available (providing possible tags for any word)

• Use: Baum-Welch algorithm (see class 15, 10/13)
  – "tying": output (state-emitting only; the same distribution from two states with the same "final" tag)

Page 37: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Comments on Unsupervised Learning

• Initialization of Baum-Welch
  – if some annotated data are available, use them
  – keep 0 for impossible output probabilities

• Beware of:
  – degradation of accuracy (the Baum-Welch criterion is entropy, not accuracy!)

– use heldout data for cross-checking

• Supervised almost always better

Page 38: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Unknown Words

• "OOV" words (out-of-vocabulary)
  – we do not have a list of possible tags for them
  – and we certainly have no output probabilities

• Solutions:
  – try all tags (uniform distribution)

– try open-class tags (uniform, unigram distribution)

– try to “guess” possible tags (based on suffix/ending) - use different output distribution based on the ending (and/or other factors, such as capitalization)

Page 39: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Running the Tagger

• Use Viterbi
  – remember to handle unknown words
  – single-best, n-best possible

• Another option:
  – always assign the best tag at each word, but consider all possibilities for the previous tags (no back pointers, no backward pass over the path)
  – introduces random errors and implausible sequences, but might get higher accuracy (fewer secondary errors)
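A compact sketch of bigram Viterbi decoding in this setting; the transition/emission tables, the dictionary of possible tags, and the tiny floor used in place of true zero probabilities are toy assumptions.

    import math

    TAGS = ["DT", "NN", "VBZ"]
    trans = {("<s>", "DT"): 0.5, ("<s>", "NN"): 0.3, ("<s>", "VBZ"): 0.2,
             ("DT", "NN"): 0.8, ("DT", "DT"): 0.1, ("DT", "VBZ"): 0.1,
             ("NN", "VBZ"): 0.5, ("NN", "NN"): 0.3, ("NN", "DT"): 0.2,
             ("VBZ", "DT"): 0.6, ("VBZ", "NN"): 0.3, ("VBZ", "VBZ"): 0.1}
    emit = {("DT", "the"): 0.9, ("NN", "dog"): 0.4, ("NN", "barks"): 0.1,
            ("VBZ", "barks"): 0.5}
    possible_tags = {"the": ["DT"], "dog": ["NN"], "barks": ["NN", "VBZ"]}

    def viterbi(words):
        # delta[t] = best log-prob of a tag path ending in t; psi keeps back pointers
        delta = {"<s>": 0.0}
        psi = []
        for w in words:
            cands = possible_tags.get(w, TAGS)          # OOV: try all tags
            new_delta, back = {}, {}
            for t in cands:
                e = emit.get((t, w), 1e-6)              # tiny floor instead of 0
                best_prev, best = max(
                    ((p, delta[p] + math.log(trans.get((p, t), 1e-6)) + math.log(e))
                     for p in delta), key=lambda x: x[1])
                new_delta[t], back[t] = best, best_prev
            psi.append(back)
            delta = new_delta
        # follow back pointers from the best final tag
        tag = max(delta, key=delta.get)
        path = [tag]
        for back in reversed(psi[1:]):
            tag = back[tag]
            path.append(tag)
        return list(reversed(path))

    print(viterbi(["the", "dog", "barks"]))   # expected: ['DT', 'NN', 'VBZ']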

Page 40: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


(Tagger) Evaluation

• A must: Test data (S), previously unseen (in training)
  – change test data often if at all possible! ("feedback cheating")

– Error-rate based

• Formally: – Out(w) = set of output “items” for an input “item” w

– True(w) = single correct output (annotation) for w

– Errors(S) = ∑i=1..|S| (Out(wi) ≠ True(wi))

– Correct(S) = ∑i=1..|S| (True(wi) ∈ Out(wi))

– Generated(S) = ∑i=1..|S| |Out(wi)|

Page 41: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Evaluation Metrics

• Accuracy: single output (tagging: each word gets a single tag)
  – Error rate: Err(S) = Errors(S) / |S|
  – Accuracy: Acc(S) = 1 - (Errors(S) / |S|) = 1 - Err(S)

• What if multiple (or no) output?
  – Recall: R(S) = Correct(S) / |S|
  – Precision: P(S) = Correct(S) / Generated(S)
  – Combination: F measure: F = 1 / (α/P + (1-α)/R)
    • α is a weight given to precision vs. recall; for α = .5, F = 2PR/(R+P)
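A short sketch of these metrics for a system that may output a set of tags per word; OUT and TRUE are made-up toy data.

    OUT = [{"DT"}, {"NN", "VB"}, {"VBZ"}]        # Out(w_i)
    TRUE = ["DT", "NN", "NN"]                    # True(w_i)

    S = len(TRUE)
    correct = sum(1 for out, t in zip(OUT, TRUE) if t in out)     # Correct(S)
    generated = sum(len(out) for out in OUT)                      # Generated(S)
    errors = sum(1 for out, t in zip(OUT, TRUE) if out != {t})    # Errors(S), single-output view

    recall = correct / S
    precision = correct / generated
    alpha = 0.5
    f = 1.0 / (alpha / precision + (1 - alpha) / recall)          # = 2PR/(P+R) for alpha=.5

    print(f"Err={errors/S:.2f}  Acc={1-errors/S:.2f}  R={recall:.2f}  P={precision:.2f}  F={f:.2f}")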

Page 42: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


*Introduction to Natural Language Processing (600.465)

Transformation-Based Tagging

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic

Page 43: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Task, Again

• Recall:
  – tagging ~ morphological disambiguation
  – tagset VT ⊂ (C1,C2,...,Cn)
    • Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
  – mapping w → {t ∈ VT} exists
    • restriction of Morphological Analysis: A+ → 2^(L,C1,C2,...,Cn)

where A is the language alphabet, L is the set of lemmas

– extension to punctuation, sentence boundaries (treated as words)

Page 44: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Setting

• Not a source-channel view
• Not even a probabilistic model (no "numbers" used when tagging a text after a model is developed)
• Statistical, yes:

• uses training data (combination of supervised [manually annotated data available] and unsupervised [plain text, large volume] training)

• learning [rules]

• criterion: accuracy (that’s what we are interested in in the end, after all!)

Page 45: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The General Scheme

  Training: annotated training data (plus plain-text / partially annotated training data) → LEARNER → rules learned (over a number of training iterations)

  Tagging: data to annotate + rules learned → TAGGER → automatically tagged data

Page 46: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Learner

  [Diagram] Annotated training data (THE "TRUTH") → remove tags → ATD without annotation → assign initial tags → interim annotation.
  Iteration 1: add rule 1 → new interim annotation; Iteration 2: add rule 2 → ...; Iteration n: add rule n.
  Each interim annotation is compared against the TRUTH; the accumulated RULES are the learner's output.

Page 47: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The I/O of an Iteration

• In (iteration i):
  – Intermediate data (initial, or the result of the previous iteration)

– The TRUTH (the annotated training data)

– [pool of possible rules]

• Out:
  – One rule rselected(i) to enhance the set of rules learned so far

– Intermediate data (input data transformed by the rule learned in this iteration, rselected(i))

Page 48: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Initial Assignment of Tags

• One possibility:
  – NN

• Another:
  – the most frequent tag for a given word form

• Even:
  – use an HMM tagger for the initial assignment

• The method is not particularly sensitive to the initial assignment

Page 49: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Criterion

• Error rate (or Accuracy):
  – beginning of an iteration: some error rate Ein
  – each possible rule r, when applied at every data position:
    • makes an improvement somewhere in the data (cimproved(r))
    • makes it worse at some places (cworsened(r))
    • and, of course, does not touch the remaining data

• Rule contribution to the improvement of the error rate:
  • contrib(r) = cimproved(r) - cworsened(r)

• Rule selection at iteration i:
  • rselected(i) = argmaxr contrib(r)

• New error rate: Eout = Ein - contrib(rselected(i))

Page 50: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Stopping Criterion

• Obvious:
  – no improvement can be made
    • contrib(r) ≤ 0
  – or improvement too small
    • contrib(r) ≤ Threshold

• NB: prone to overtraining!
  – therefore, setting a reasonable threshold is advisable

• Heldout?
  – maybe: remove rules which degrade performance on H

Page 51: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Pool of Rules (Templates)

• Format: change tag at position i from a to b / condition

• Context rules (condition definition - "template"):

wi-3 wi-2 wi-1 wi wi+1 wi+2 wi+3

ti-3 ti-2 ti-1 ti ti+1 ti+2 ti+3

Instantiation: w, t permitted

Page 52: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Lexical Rules

• Other type: lexical rules

• Example: – wi has suffix -ied

– wi has prefix ge-

wi-3 wi-2 wi-1 wi wi+1 wi+2 wi+3

ti-3 ti-2 ti-1 ti ti+1 ti+2 ti+3

“look inside the word”

Page 53: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


Rule Application

• Two possibilities:
  – immediate consequences (left-to-right):

• data: DT NN VBP NN VBP NN...

• rule: NN → NNS / preceded by NN VBP

• apply rule at position 4:

DT NN VBP NN VBP NN...

DT NN VBP NNS VBP NN...

• ...then rule cannot apply at position 6 (context not NN VBP).

– delayed ("fixed input"):
  • use original input for context

• the above rule then applies twice.
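A tiny sketch contrasting the two application modes, using the slide's rule NN → NNS / preceded by NN VBP.

    def apply_rule(tags, immediate=True):
        src = tags[:]                      # "delayed" mode reads the original input
        out = tags[:]
        for i in range(2, len(tags)):
            context = out if immediate else src
            if out[i] == "NN" and context[i - 2] == "NN" and context[i - 1] == "VBP":
                out[i] = "NNS"
        return out

    data = ["DT", "NN", "VBP", "NN", "VBP", "NN"]
    print(apply_rule(data, immediate=True))    # ['DT','NN','VBP','NNS','VBP','NN']  (applies once)
    print(apply_rule(data, immediate=False))   # ['DT','NN','VBP','NNS','VBP','NNS'] (applies twice)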

Page 54: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


In Other Words...

• 1. Strip the tags off the truth, keep the original truth.
• 2. Initialize the stripped data by some simple method.
• 3. Start with an empty set of selected rules S.
• 4. Repeat until the stopping criterion applies:
  – compute the contribution of each rule r: contrib(r) = cimproved(r) - cworsened(r)
  – select the r with the biggest contribution contrib(r), apply it to the intermediate data, and add it to the final set of selected rules S (see the sketch below).

• 5. Output the set S.
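A minimal sketch of steps 1-5; the rule template (condition on the previous tag only), the initializer, and the three-word "truth" are deliberately tiny stand-ins, not Brill's actual templates.

    TRUTH = [("the", "DT"), ("fly", "NN"), ("flies", "VBZ")]

    def initialize(words):                      # step 2: e.g. everything gets "NN"
        return ["NN" for _ in words]

    def candidate_rules():
        # template "change a to b if the previous tag is c", over a toy tagset
        tags = ["DT", "NN", "VBZ"]
        return [("prev", c, a, b) for a in tags for b in tags for c in tags + ["<s>"] if a != b]

    def apply_rule(rule, tags):
        _, c, a, b = rule
        out = tags[:]
        for i, t in enumerate(tags):
            prev = tags[i - 1] if i > 0 else "<s>"
            if t == a and prev == c:
                out[i] = b
        return out

    def contrib(rule, tags, truth):
        new = apply_rule(rule, tags)
        improved = sum(1 for o, n, t in zip(tags, new, truth) if o != t and n == t)
        worsened = sum(1 for o, n, t in zip(tags, new, truth) if o == t and n != t)
        return improved - worsened

    words = [w for w, _ in TRUTH]
    gold = [t for _, t in TRUTH]
    tags = initialize(words)                    # steps 1-2
    selected = []                               # step 3
    while True:                                 # step 4
        best = max(candidate_rules(), key=lambda r: contrib(r, tags, gold))
        if contrib(best, tags, gold) <= 0:      # stopping criterion
            break
        selected.append(best)
        tags = apply_rule(best, tags)
    print(selected, tags)                       # step 5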

Page 55: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


The Tagger

• Input:
  – untagged data
  – rules (S) learned by the learner

• Tagging:
  – use the same initialization as the learner did
  – for i = 1..n (n - the number of rules learnt)
    • apply rule i to the whole intermediate data, changing (some) tags

– the last intermediate data is the output.

Page 56: *Introduction to  Natural Language Processing (600.465) Tagging, Tagsets, and Morphology


N-best & Unsupervised Modifications

• N-best modification
  – allow adding tags by rules
  – criterion: an optimal combination of accuracy and the number of tags per word (we want it close to 1)

• Unsupervised modification
  – use only unambiguous words for the evaluation criterion
  – works extremely well for English

– does not work for languages with few unambiguous words