Chapter 5: POS Tagging Heshaam Faili [email protected] University of Tehran.
Part-of-speech tags: POS, word classes, morphological classes, lexical tags
Verb vs. noun: noun, verb, pronoun, preposition, adverb, conjunction, participle, and article
Possessive pronouns vs. personal pronouns
Useful for pronunciation: conTENT vs. CONtent
Useful for stemming in information retrieval
Useful for word sense disambiguation
Useful for shallow parsing to detect names, dates, places, …
POS tags: based on syntactic and morphological function, divided into
Closed-class types: pronouns, prepositions, determiners (articles), conjunctions, auxiliary verbs, particles, numerals
Open-class types: nouns, verbs, adjectives, adverbs
Proper noun vs. common noun; count noun vs. mass noun
Main verb vs. auxiliary verb
Directional (locative), degree, manner, and temporal adverbs
Particles and phrasal verbs
Statistics: the 1,071,676; a 413,887; an 59,359
Statistics
English language
Pronouns: personal, possessive, WH-pronouns
Auxiliary verbs: modal verbs, copula
Example English part-of-speech tagsets
Brown corpus: 87 tags; allows compound tags
"I'm" tagged as PPSS+BEM: PPSS for "non-3rd person nominative personal pronoun" and BEM for "am, 'm"
Others have derived their tagsets from the Brown Corpus
LOB Corpus: 135 tags
Lancaster UCREL Group: 165 tags
London-Lund Corpus: 197 tags
BNC: 61 tags (C5)
PTB: 45 tags
Other languages have developed other tagsets
PTB Tagset (36 main tags + punctuation tags)
Some examples from Brown
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
There/EX are/VBP 70/CD children/NNS there/RB
Although/IN preliminary/JJ findings/NNS were/VBD reported/VBN more/RBR than/IN a/DT year/NN ago/IN ,/, the/DT latest/JJS results/NNS appear/VBP in/IN today/NN 's/POS New/NNP England/NNP Journal/NNP of/IN Medicine/NNP ,/,
Difficulty in tagging 'around': particle, preposition, adverb
Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG
All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN
Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD
POS Tagging Problem: given a sentence W1…Wn and a tagset of lexical categories, find the most likely tags C1…Cn for the words in the sentence
Note that many of the words may have unambiguous tags
Example:
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
But enough words are either ambiguous or unknown that it's a nontrivial task
Tokenization: punctuation marks (period, comma, etc.) must be separated off from the words
More details of the problem
How ambiguous? Most words in English have only one Brown Corpus tag:
Unambiguous (1 tag): 35,340 word types
Ambiguous (2-7 tags): 4,100 word types = 11.5%
7 tags: 1 word type ("still")
But many of the most common words are ambiguous: over 40% of Brown corpus tokens are ambiguous
Obvious strategies may be suggested based on intuition:
to/TO race/VB; the/DT race/NN; will/MD race/VB
This leads to hand-crafted rule-based POS tagging (J&M, 8.4)
Sentences can also contain unknown words for which tags have to be guessed: Secretariat/NNP is/VBZ
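The type-ambiguity counts above can be computed directly from any tagged corpus. A minimal sketch, using a tiny hypothetical tagged word list in place of the real Brown corpus:

```python
from collections import defaultdict

# Toy tagged corpus: (word, tag) pairs. In practice this would be the
# full Brown corpus; the pairs here are hypothetical examples.
tagged = [("the", "DT"), ("race", "NN"), ("will", "MD"), ("race", "VB"),
          ("still", "RB"), ("still", "JJ"), ("dog", "NN")]

# Collect the set of tags observed for each word type.
tags_for = defaultdict(set)
for word, tag in tagged:
    tags_for[word].add(tag)

# A word type is ambiguous if it was seen with more than one tag.
ambiguous = {w for w, ts in tags_for.items() if len(ts) > 1}
print(ambiguous)  # {'race', 'still'}
```

Counting `len(ambiguous)` against the total number of word types reproduces the 11.5% style of statistic on real data.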
POS tagging
POS tagging methods
First, we'll look at how POS-annotated corpora are constructed
Then, we'll narrow in on various POS tagging methods, which rely on POS-annotated corpora
Supervised learning: use POS-annotated corpora as training material (HMMs, TBL, Maximum Entropy, MBL)
Unsupervised learning: use training material which is unannotated (HMMs, TBL)
Two main categories for tagging: rule-based, stochastic
Constructing POS-annotated corpora
By examining how POS-annotated corpora are created, we will understand the task even better
Basic steps 1. Tokenize the corpus 2. Design a tagset, along with guidelines 3. Apply the tagset to the text
Knowing what issues the corpus annotators had to deal with in hand-tagging tells us the kind of issues we’ll have to deal with in automatic tagging
1. Tokenize the corpus: the corpus is first segmented into sentences, and then each sentence is tokenized into individual tokens (words)
Punctuation is usually split off
Distinguishing sentence-ending periods from abbreviation periods is not always easy
Can become tricky:
Proper nouns and other multi-word units: one token or separate tokens? e.g., Jimmy James, in spite of
Contractions are often split into separate words: can't → ca + n't (because these are two different POSs)
Hyphenated words are sometimes split, sometimes kept together
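The punctuation- and contraction-splitting steps above can be sketched with a couple of regular expressions. This is a deliberately minimal toy tokenizer, not the PTB tokenizer itself; real tokenizers handle many more cases (abbreviations, quotes, hyphens):

```python
import re

def tokenize(sentence):
    # Split punctuation off the preceding word.
    sentence = re.sub(r"([.,!?])", r" \1", sentence)
    # Split n't contractions into two tokens (two different POSs),
    # e.g. can't -> ca + n't.
    sentence = re.sub(r"n't\b", " n't", sentence)
    return sentence.split()

print(tokenize("He can't race, yet."))
# ['He', 'ca', "n't", 'race', ',', 'yet', '.']
```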
Some other preprocessing issues
Should other word-form information be included?
Lemma: the base, or root, form of a word; cat and cats have the same lemma, cat
Sometimes words are lemmatized by stemming, other times by morphological analysis, using a dictionary and morphological rules
Sometimes it makes a big difference: roughly vs. rough
Fold case or not (usually folded)? The / the / THE; Mark versus mark
One may need, however, to regenerate the original case when presenting text to the user
2. Design the Tagset
It is not obvious what the parts of speech are in any language, so a tagset has to be designed. Criteria:
External: what do we need to capture the linguistic properties of the text?
Internal: what distinctions can be made reliably in a text?
Some tagsets are compositional: each character in the tag has a particular meaning, e.g., Vr3s-f = verb-present-3rd-singular-<undefined for case>-feminine
What do you do with multi-token words, e.g. in terms of?
Tagging Guidelines The tagset should be clearly documented and
explained, with a “case law” of what to do in problematic cases
If you don’t do this, you will have problems in applying the tagset to the text
If you have good documentation, you can have high interannotator agreement and a corpus of high quality.
3. Apply Tagset to Text: PTB Tagging Process
1. Tagset developed
2. Automatic tagging by rule-based and statistical POS taggers, with error rates of 2-6%
3. Human correction using a text editor
Takes under a month for humans to learn this (at 15 hours a week), and annotation speeds after a month exceed 3,000 words/hour
Inter-annotator disagreement (4 annotators, eight 2,000-word docs) was 7.2% for the tagging task and 4.1% for the correcting task
POS Tagging Methods Two basic ideas to build from:
Establishing a simple baseline with unigrams Hand-coded rules
Machine learning techniques: Supervised learning techniques Unsupervised learning techniques
A Simple Strategy for POS Tagging
Choose the most likely tag for each ambiguous word, independent of previous words
i.e., assign each token the POS category it occurred as most often in the training set
E.g., race – which POS is more likely in a corpus?
This strategy gives you 90% accuracy in controlled tests
So, this “unigram baseline” must always be compared against
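The unigram baseline is a few lines of code: count (word, tag) pairs in training, then always emit each word's most frequent tag. A minimal sketch over a hypothetical toy training set (the fallback tag for unseen words is an assumption; NN is a common choice):

```python
from collections import Counter, defaultdict

# Toy training data: (word, tag) pairs; hypothetical counts.
train = [("the", "DT"), ("race", "NN"), ("race", "NN"), ("race", "VB"),
         ("will", "MD"), ("to", "TO")]

# Count how often each word type received each tag.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def unigram_tag(word):
    # Most frequent tag for known words; guess NN for unknown words.
    return counts[word].most_common(1)[0][0] if word in counts else "NN"

print([unigram_tag(w) for w in ["the", "race", "will"]])
# ['DT', 'NN', 'MD']
```

Note that 'race' gets NN regardless of context, which is exactly the weakness the sequence models below address.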
Example of the Simple Strategy
Which POS is more likely in a corpus of 1,273,000 tokens?
        NN    VB    Total
race    400   600   1000
P(NN|race) = P(race&NN) / P(race), by the definition of conditional probability
P(race) = 1000/1,273,000 ≈ .0008
P(race&NN) = 400/1,273,000 ≈ .0003
P(race&VB) = 600/1,273,000 ≈ .0005
And so we obtain:
P(NN|race) = P(race&NN)/P(race) = .0003/.0008 = .4
P(VB|race) = P(race&VB)/P(race) = .0005/.0008 = .6
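The slide's arithmetic can be checked directly from the raw counts; note that the corpus size N cancels, so P(NN|race) is just 400/1000:

```python
# Counts from the slide: 'race' tagged NN 400 times, VB 600 times,
# in a corpus of 1,273,000 tokens.
N = 1_273_000
c_race_nn, c_race_vb = 400, 600

p_race = (c_race_nn + c_race_vb) / N        # P(race)
p_nn_given_race = (c_race_nn / N) / p_race  # = 400/1000
p_vb_given_race = (c_race_vb / N) / p_race  # = 600/1000

print(round(p_nn_given_race, 2), round(p_vb_given_race, 2))  # 0.4 0.6
```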
Hand-coded rules Two-stage system:
Dictionary assigns all possible tags to a word
Rules winnow the list down to a single tag (or sometimes multiple tags are left, if it cannot be determined)
Can also use some probabilistic information
These systems can be highly effective, but writing the rules of course takes time
We'll see later an example of automatically learning the rules (transformation-based learning)
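The two-stage idea can be sketched as: a dictionary proposes all tags, then hand-written constraints winnow the list. The lexicon entries and the single constraint below are hypothetical toy examples, not rules from any real system:

```python
# Stage 1 dictionary: every possible tag for each word (toy entries).
lexicon = {"the": {"DT"}, "race": {"NN", "VB"}, "to": {"TO"}}

def winnow(prev_tags, candidates):
    # Stage 2, a single toy constraint: after TO, keep the verb reading.
    if "TO" in prev_tags and "VB" in candidates:
        return {"VB"}
    return candidates

sent = ["to", "race"]
tags, prev = [], set()
for w in sent:
    cand = winnow(prev, lexicon[w])
    tags.append(cand)
    prev = cand
print(tags)  # [{'TO'}, {'VB'}]
```

A real system like ENGTWOL applies thousands of such constraints, each licensed by the surrounding context.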
Hand-coded Rules: ENGTWOL System
Based on two-level morphology
Uses a 56,000-word lexicon which lists the parts of speech for each word
Each entry is annotated with a set of morphological and syntactic features
ENGTWOL Lexicon
ENGTWOL tagger: the first stage returns all possible POS tags
Input: "Pavlov had shown that salivation"
Output:
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG
ENGTWOL tagger: a large set of 3,744 constraints winnows the candidate tags
Machine Learning Machines can learn from examples Learning can be supervised or unsupervised Given training data, machines analyze the data,
and learn rules which generalize to new examples Can be sub-symbolic (rule may be a mathematical
function) e.g., neural nets Or it can be symbolic (rules are in a representation that
is similar to representation used for hand-coded rules) In general, machine learning approaches allow for
more tuning to the needs of a corpus, and can be reused across corpora
Ways of learning Supervised learning:
A machine learner learns the patterns found in an annotated corpus
Unsupervised learning: A machine learner learns the patterns found in an
unannotated corpus Often uses another database of knowledge, e.g., a
dictionary of possible tags Techniques used in supervised learning can be
adapted to unsupervised learning, as we will see.
HMMs
Using probabilities in tagging was first tried in 1965, and fully developed in the 1980s (Marshall, 1983; Garside, 1987; Church, 1988; DeRose, 1988)
HMM tagging is a special case of Bayesian inference
Transition/word likelihood probabilities
P(NN|DT) and P(JJ|DT) are high, while P(DT|JJ) is low
HMMs: A Probabilistic Approach
What we want to do is find the "best sequence" of POS tags T = T1…Tn for a sentence W = W1…Wn (here Ti is the POS tag of Wi)
i.e., find the sequence of POS tags T that maximizes P(T|W)
Using Bayes' Rule, we can say: P(T|W) = P(W|T) * P(T) / P(W)
We want to find the value of T which maximizes the RHS; the denominator can be discarded (it's the same for every T)
So: find T which maximizes P(W|T) * P(T)
Example: He will race
W = W1 W2 W3 = He will race; T = T1 T2 T3
Possible sequences:
He/PP will/MD race/NN
He/PP will/NN race/NN
He/PP will/MD race/VB
He/PP will/NN race/VB
P(W|T) is the word likelihood; P(T) is the transition (prior) probability
N-gram Models: POS problem formulation
Given a sequence of words, find a sequence of categories that maximizes P(T1…Tn | W1…Wn)
i.e., that maximizes P(W1…Wn | T1…Tn) * P(T1…Tn) (by Bayes' Rule)
Chain rule of probability:
P(W|T) = ∏(i=1..n) P(Wi | W1…Wi-1, T1…Ti)
  (prob. of this word based on previous words & tags)
P(T) = ∏(i=1..n) P(Ti | W1…Wi-1, T1…Ti-1)
  (prob. of this tag based on previous words & tags)
But we don't have sufficient data for this, and we would likely overfit the data, so we make some assumptions to simplify the problem …
Independence Assumptions
Assume that the current event depends only on the previous n-1 events (i.e., for a bigram model, on the previous event)
P(T1…Tn) ≈ ∏(i=1..n) P(Ti | Ti-1)
  assumes that the event of a POS tag occurring is independent of the event of any other POS tag occurring, except for the immediately previous POS tag
From a linguistic standpoint, this seems an unreasonable assumption, due to long-distance dependencies
P(W1…Wn | T1…Tn) ≈ ∏(i=1..n) P(Wi | Ti)
  assumes that the event of a word appearing in a category is independent of the event of any surrounding word or tag, except for the tag at this position
Independence Assumptions
P(Ti|Ti-1) represents the probability of a tag given the previous tag
Determiners are very likely to precede adjectives and nouns, as in sequences like that/DT flight/NN and the/DT yellow/JJ hat/NN
So P(NN|DT) and P(JJ|DT) are high, but P(DT|JJ) is low
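Both kinds of probability are estimated from a tagged corpus by maximum likelihood: count tag bigrams and (word, tag) pairs, then divide by the count of the conditioning tag. A minimal sketch over a single toy tagged sentence (hypothetical data, not real Brown counts):

```python
from collections import Counter

# Toy tagged sentence with a start-of-sentence sentinel.
tagged = [("<s>", "<s>"), ("the", "DT"), ("yellow", "JJ"), ("hat", "NN")]

trans, tag_counts, emit = Counter(), Counter(), Counter()
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    trans[(t1, t2)] += 1          # tag-bigram counts for P(Ti|Ti-1)
for w, t in tagged[1:]:
    tag_counts[t] += 1            # unigram tag counts (the denominators)
    emit[(w, t)] += 1             # (word, tag) counts for P(Wi|Ti)

p_jj_given_dt = trans[("DT", "JJ")] / tag_counts["DT"]
p_yellow_given_jj = emit[("yellow", "JJ")] / tag_counts["JJ"]
print(p_jj_given_dt, p_yellow_given_jj)  # 1.0 1.0
```

On this one-sentence corpus every estimate is 1.0; with real data these become the fractional transition and emission probabilities used below.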
Hidden Markov Models Linguists know both these assumptions are
incorrect! But, nevertheless, statistical approaches based
on these assumptions work pretty well for part-of-speech tagging
In particular, one called a Hidden Markov Model (HMM)
Very widely used in both POS-tagging and speech recognition, among other problems
A Markov model, or Markov chain, is just a weighted Finite State Automaton
HMMs = Weighted FSAs Nodes (or states): tag options Edges: Each edge from a node has a (transition) probability to
another state The sum of transition probabilities of all edges going out from a
node is 1 The probability of a path is the product of the probabilities of
the edges along the path. Why multiply? Because of the assumption of independence
(Markov assumption). The probability of a string is the sum of the probabilities of all
the paths for the string. Why add? Because we are considering the union of all
interpretations. A string that isn’t accepted has probability zero.
Example
HMM
A weighted finite-state automaton
Markov chain: a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through
Uses only observed events; in the POS tagging problem, only words
It's not useful for POS tagging, because of ambiguity
Hidden Markov Model: uses both
observed events (the words that we see in the sentence)
hidden events (the part-of-speech tags)
Hidden Markov Models Why the word “hidden”? Called “hidden” because one cannot tell what
state the model is in for a given sequence of words.
e.g., will could be generated from NN state, or from MD state, with different probabilities.
That is, looking at the words will not tell you directly what tag state you’re in.
POS Tagging Based on Bigrams
Problem: find T which maximizes P(W|T) * P(T), where W = W1…Wn and T = T1…Tn
Using the bigram model, we get:
Transition probabilities (prob. of transitioning from one state/tag to another): P(T1…Tn) ≈ ∏(i=1..n) P(Ti | Ti-1)
Emission probabilities (prob. of emitting a word at a given state): P(W1…Wn | T1…Tn) ≈ ∏(i=1..n) P(Wi | Ti)
So, we want to find the value of T1…Tn which maximizes: ∏(i=1..n) P(Wi | Ti) * P(Ti | Ti-1)
Using POS bigram probabilities: transitions
P(T1…Tn) ≈ ∏(i=1..n) P(Ti|Ti-1)
Example: He will race
Choices for T = T1..T3:
T = PRP MD NN; T = PRP NN VB; T = PRP MD VB; T = PRP NN NN
POS bigram probs from a training corpus can be used for P(T):
P(PRP-MD-NN) = 1 * .8 * .4 = .32
POS bigram probs, P(column|row):
        MD    NN    VB    PRP
MD            .4    .6
NN            .3    .7
PRP     .8    .2
<s>                       1
[FSA diagram over states PRP, MD, NN, VB labeled with these transition probabilities]
Factoring in lexical generation probabilities
From the training corpus, we need to find the T1…Tn which maximizes ∏(i=1..n) P(Wi|Ti) * P(Ti|Ti-1)
So, we'll need to factor in the lexical generation (emission) probabilities, somehow:
Lexical generation probs:
        MD    NN    VB    PRP
he      0     0     0     1
will    .8    .2    0     0
race    0     .4    .6    0
[FSA diagram as before, with states labeled A-F, combining transitions with lexical generation]
Adding emission probabilities
Like a weighted FSA, except that there are output (emission) probabilities associated with each node:
he|PRP .3; will|MD .8; will|NN .2; race|NN .4; race|VB .6; <s> 1
Lexical generation probs:
        MD    NN    VB    PRP
he      0     0     0     1
will    .8    .2    0     0
race    0     .4    .6    0
POS bigram probs, P(column|row):
        MD    NN    VB    PRP
MD            .4    .6
NN            .3    .7
PRP     .8    .2
<s>                       1
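With both tables in hand, the brute-force approach is to score every candidate tag sequence as the product of transition and emission probabilities. A sketch using the transition and lexical generation tables from the slides (taking he|PRP = 1 from the table):

```python
# Transition probabilities P(tag | previous tag), from the slides.
trans = {("<s>", "PRP"): 1.0, ("PRP", "MD"): 0.8, ("PRP", "NN"): 0.2,
         ("MD", "NN"): 0.4, ("MD", "VB"): 0.6,
         ("NN", "NN"): 0.3, ("NN", "VB"): 0.7}
# Emission probabilities P(word | tag), from the slides.
emit = {("he", "PRP"): 1.0, ("will", "MD"): 0.8, ("will", "NN"): 0.2,
        ("race", "NN"): 0.4, ("race", "VB"): 0.6}

words = ["he", "will", "race"]
candidates = [("PRP", "MD", "NN"), ("PRP", "NN", "NN"),
              ("PRP", "MD", "VB"), ("PRP", "NN", "VB")]

def score(tags):
    # Product over positions of P(Wi|Ti) * P(Ti|Ti-1).
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= trans[(prev, t)] * emit[(w, t)]
        prev = t
    return p

best = max(candidates, key=score)
print(best, round(score(best), 4))  # ('PRP', 'MD', 'VB') 0.2304
```

Enumerating all sequences is exponential in sentence length, which is exactly why the Viterbi algorithm below is used instead.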
HMM formulation
Emission & Transition probability
Dynamic Programming In order to find the most likely sequence of
categories for a sequence of words, we don’t need to enumerate all possible sequences of categories.
Because of the Markov assumption, if you keep track of the most likely sequence found so far for each possible ending category, you can ignore all the other less likely sequences.
i.e., multiple edges coming into a state, but only keep the value of the most likely path
This is another use of dynamic programming The algorithm to do this is called the Viterbi
algorithm.
The Viterbi algorithm
Assume we're at state I in the HMM. Obtain:
the probability of each previous state H1…Hm
the transition probabilities H1→I, …, Hm→I
the emission probability for word w at I
Multiply the probabilities for each new path: P(H1,I) = Score(H1) * P(I|H1) * P(w|I), …
One of these states (H1…Hm) will give the highest probability
Only keep the highest probability when using I for the next state
Finding the best path through an HMM
Score(I) = Max over predecessors H of I of [Score(H) * transition(I|H)] * lex(I)
Score(B) = P(PP|<s>) * P(he|PP) = 1 * .3 = .3
Score(C) = Score(B) * P(MD|PP) * P(will|MD) = .3 * .8 * .8 = .19
Score(D) = Score(B) * P(NN|PP) * P(will|NN) = .3 * .2 * .2 = .012
Score(E) = Max [Score(C)*P(NN|MD), Score(D)*P(NN|NN)] * P(race|NN)
Score(F) = Max [Score(C)*P(VB|MD), Score(D)*P(VB|NN)] * P(race|VB)
Emission probs: he|PP .3; will|MD .8; will|NN .2; race|NN .4; race|VB .6; <s> 1
Lexical generation probs:
        MD    NN    VB    PP
he      0     0     0     .3
will    .8    .2    0     0
race    0     .4    .6    0
[Viterbi trellis diagram over states A (<s>), B (PP), C (MD), D (NN), E (NN), F (VB), with the POS bigram probabilities as edge weights]
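The Score(I) recurrence can be turned into a short dynamic program. A minimal Viterbi sketch over the slide's toy model (using he|PP = .3 as in this slide's table; real implementations work in log space to avoid underflow):

```python
# Transition probs P(tag | previous tag) and emission probs P(word | tag)
# from the slide's toy model.
trans = {("<s>", "PP"): 1.0, ("PP", "MD"): 0.8, ("PP", "NN"): 0.2,
         ("MD", "NN"): 0.4, ("MD", "VB"): 0.6,
         ("NN", "NN"): 0.3, ("NN", "VB"): 0.7}
emit = {("he", "PP"): 0.3, ("will", "MD"): 0.8, ("will", "NN"): 0.2,
        ("race", "NN"): 0.4, ("race", "VB"): 0.6}
tags = ["PP", "MD", "NN", "VB"]

def viterbi(words):
    # Per the Markov assumption, keep only the best-scoring path into
    # each state at each position.
    scores = {"<s>": (1.0, [])}
    for w in words:
        new = {}
        for t in tags:
            if (w, t) not in emit:
                continue
            prev, best = max(
                ((h, s * trans.get((h, t), 0.0)) for h, (s, _) in scores.items()),
                key=lambda x: x[1])
            new[t] = (best * emit[(w, t)], scores[prev][1] + [t])
        scores = new
    return max(scores.values(), key=lambda x: x[0])

prob, path = viterbi(["he", "will", "race"])
print(path, round(prob, 4))  # ['PP', 'MD', 'VB'] 0.0691
```

This reproduces the hand computation above: Score(F) = .192 * .6 * .6 ≈ .069, reached via PP → MD → VB.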
Smoothing
Lexical generation probabilities will lack observations for low-frequency and unknown words
Most systems do one of the following:
Smooth the counts, e.g., add a small number to unseen data (to zero counts); for example, assume a bigram not seen in the data has a very small probability, e.g., .0001
Use lots more data (but you'll still need to smooth)
Group items into classes, thus increasing class frequency
e.g., group words into ambiguity classes, based on their set of tags; for counting, all words in an ambiguity class are treated as variants of the same 'word'
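The count-smoothing option can be sketched as add-k smoothing of the lexical generation probabilities: every (word, tag) pair, seen or not, gets k added to its count. The counts and vocabulary below are hypothetical toy values:

```python
from collections import Counter

k = 0.5
vocab = ["the", "race", "will", "dunes"]   # 'dunes' never seen with NN
counts = Counter({("the", "DT"): 3, ("race", "NN"): 2, ("will", "MD"): 2})
tag_total = {"DT": 3, "NN": 2, "MD": 2}

def p_word_given_tag(word, tag):
    # Add k to every count; the denominator grows by k per vocab word
    # so the distribution still sums to 1 over the vocabulary.
    return (counts[(word, tag)] + k) / (tag_total[tag] + k * len(vocab))

print(round(p_word_given_tag("race", "NN"), 3))   # seen:   0.625
print(round(p_word_given_tag("dunes", "NN"), 3))  # unseen: 0.125, nonzero
```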
Extending the HMM algorithm to trigrams
Sparseness
2. TBL: A Symbolic Learning Method
HMMs are subsymbolic: they don't give you rules that you can inspect
A method called error-driven Transformation-Based Learning (TBL), the Brill algorithm, can be used for symbolic learning
The rules (actually, a sequence of rules) are learned from an annotated corpus
Performs at least as accurately as other statistical approaches
Has better treatment of context compared to HMMs:
rules which use the next (or previous) POS; HMMs just use P(Ci|Ci-1) or P(Ci|Ci-2,Ci-1)
rules which use the previous (next) word; HMMs just use P(Wi|Ci)
Brill Algorithm (Overview)
Assume you are given a training corpus G (for gold standard)
First, create a tag-free version V of it
1. Label every word token in V with the most likely tag for that word type from G. If this 'initial state annotator' is perfect, you're done!
2. Then consider every possible transformational rule, selecting the one that leads to the most improvement in V, using G to measure the error
3. Retag V based on this rule
4. Go back to 2, until there is no significant improvement in accuracy over the previous iteration
Notes: as the algorithm proceeds, each successive rule becomes narrower (covering fewer examples, i.e., changing fewer tags), but also potentially more accurate
Some later rules may change tags changed by earlier rules
Rule Templates
Brill's method learns transformations which fit different templates:
Change tag X to tag Y when previous word is W: NN → VB when previous word = to
Change tag X to tag Y when next tag is Z: NN → NNP when next tag = NNP
Change tag X to tag Y when one of the previous 1st, 2nd, or 3rd words is W: VBP → VB when one of previous 3 words = has
The learning process is guided by a small number of templates (e.g., 26) to learn specific rules from the corpus
Note how these rules roughly match linguistic intuition
Error-driven method
How does one learn the rules? The TBL method is error-driven:
The rule which is learned on a given iteration is the one which reduces the error rate of the corpus the most, e.g.:
Rule 1 fixes 50 errors but introduces 25 more → net decrease of 25
Rule 2 fixes 45 errors but introduces 15 more → net decrease of 30
Choose rule 2 in this case
We set a stopping criterion, or threshold: once we stop reducing the error rate by a big enough margin, learning is stopped
Brill Algorithm (More Detailed)
1. Label every word token with its most likely tag (based on lexical generation probabilities).
2. List the positions of tagging errors and their counts, by comparing with “truth” (T).
3. For each error position, consider each instantiation I of X, Y, and Z in the rule template. If Y = T, increment improvements[I]; else increment errors[I].
4. Pick the I which results in the greatest error reduction and add it to the output. E.g., VB → NN PREV1OR2TAG DT fixes 98 errors but produces 18 new errors, for a net decrease of 80 errors.
5. Apply that I to the corpus.
6. Go to 2, unless the stopping criterion is reached.

Example:
- Most likely tags: P(NN|race) = .98, P(VB|race) = .02
- Initial tagging: Is/VBZ expected/VBN to/TO race/NN tomorrow/NN
- Rule template: change a word’s tag from X to Y when the previous tag is Z
- Rule instantiation for the error above: NN → VB PREV1OR2TAG TO
- Applying this rule yields: Is/VBZ expected/VBN to/TO race/VB tomorrow/NN
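Applying one learned transformation can be sketched like this (a toy illustration, not Brill's implementation; `apply_rule` and its `window` parameter are invented here — `window=1` plays the role of a PREVTAG template, `window=2` of PREV1OR2TAG). For simplicity the sketch reads the original tags rather than updating left to right:

```python
def apply_rule(tags, from_tag, to_tag, trigger_tag, window=1):
    """Change `from_tag` to `to_tag` wherever `trigger_tag` appears among
    the previous `window` tags. Reads the original tag sequence; a real
    TBL tagger applies changes left to right over the current tags."""
    out = list(tags)
    for i, t in enumerate(tags):
        if t == from_tag and trigger_tag in tags[max(0, i - window):i]:
            out[i] = to_tag
    return out
```

On the slide's sentence, a previous-tag-is-TO rule retags race while leaving tomorrow alone:

```python
apply_rule(['VBZ', 'VBN', 'TO', 'NN', 'NN'], 'NN', 'VB', 'TO', window=1)
```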
Example of Error Reduction
From Eric Brill (1995): Computational Linguistics, 21(4), p. 7
Rule ordering
One rule is learned with every pass through the corpus; the final output is the ordered sequence of learned rules. Unlike HMMs, such a representation allows a linguist to look through the rules and make sense of them. Because the rules are learned iteratively, they must also be applied in the same iterative order: at one stage, it may make sense to change NN to VB after to, but at a later stage it may make sense to change VB back to NN in the same context, e.g., if the current word is school.
Example of Learned Rule Sequence
1. NN → VB PREVTAG TO: to/TO race/NN → VB
2. VBP → VB PREV1OR2OR3TAG MD: might/MD vanish/VBP → VB
3. NN → VB PREV1OR2TAG MD: might/MD not/RB reply/NN → VB
4. VB → NN PREV1OR2TAG DT: the/DT great/JJ feast/VB → NN
5. VBD → VBN PREV1OR2OR3TAG VBZ: He/PP was/VBZ killed/VBD → VBN by/IN Chapman/NNP
Handling Unknown Words
Can also use the Brill method to learn how to tag unknown words. Instead of using surrounding words and tags, use affix information, capitalization, etc.:
- Guess NNP if capitalized, NN otherwise.
- Or use the tag most common for words ending in the same last 3 letters.
- etc.
TBL has also been applied to some parsing tasks.
Example Learned Rule Sequence for Unknown Words
Insights on TBL
- TBL takes a long time to train, but is relatively fast at tagging once the rules are learned.
- The rules in the sequence may be decomposed into non-interacting subsets; e.g., to focus only on VB tagging, one need only look at the rules which affect it.
- In cases where the data is sparse, the initial guess needs to be weak enough to allow for learning.
- Rules become increasingly specific as you go down the sequence; however, the more specific rules don’t overfit because they cover just a few cases.
Tag Indeterminacy
Some corpus tokens carry an indeterminate tag when the annotators could not decide between two tags. Options for handling them:
1. Somehow replace the indeterminate tags with only one tag.
2. In testing, count a tagger as having correctly tagged an indeterminate token if it gives either of the correct tags; in training, somehow choose only one of the tags for the word.
3. Treat the indeterminate tag as a single complex tag.
Tokenization
- Word tokens: period detection, word splitting (would/MD n’t/RB)
- Multi-part words: New York
Unknown words
- Pretend that each unknown word is ambiguous among all possible tags, with equal probability; contextual POS trigrams then suggest the proper tag.
- The probability distribution of tags over unknown words is very similar to the distribution of tags over words that occurred only once in the training set (hapax legomena: words occurring once); this is the insight behind the Good-Turing smoothing algorithm.
- Unknown words and hapax legomena are similar in that both are most likely to be nouns, followed by verbs, but are very unlikely to be determiners or interjections.
Unknown words
Use more information — morphology:
- ends in -s → plural noun
- ends in -ed → past tense or past participle verb
- ends in -able → adjective
Use orthographic information: words starting with a capital letter are likely proper nouns.
One feature set: 3 inflectional endings (-ed, -s, -ing), 32 derivational endings (such as -ion, -al, -ive, and -ly), 4 values of capitalization depending on whether a word is sentence-initial (+/− capitalization, +/− initial), and whether the word is hyphenated:

P(wi|ti) = P(unknown-word|ti) · P(capital|ti) · P(endings/hyph|ti)    (5.50)
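The product above can be computed directly. A minimal sketch with made-up probability tables — only the structure follows the equation; the suffix list is truncated to three endings and all numbers are illustrative:

```python
def unknown_word_likelihood(word, tag, p_unk, p_cap, p_suffix):
    """P(w|t) for an unknown word as a product of independent feature
    likelihoods, in the spirit of equation 5.50."""
    capitalized = word[0].isupper()
    # pick the first matching ending from a (truncated) suffix list
    suffix = next((s for s in ('ing', 'ed', 's') if word.endswith(s)), '')
    p = p_unk.get(tag, 0.0)                   # P(unknown-word | t)
    p *= p_cap.get(tag, 0.0) if capitalized else 1.0 - p_cap.get(tag, 0.0)
    p *= p_suffix.get((suffix, tag), 1e-6)    # small floor for unseen pairs
    return p
```

With toy tables where NNP strongly prefers capitalization, a capitalized unknown word like Chapman scores higher as NNP than as NN, as the orthographic heuristic above predicts.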
Unknown words
HMM-based approach: estimate P(ti | ln−i+1 … ln), i.e., condition the tag on the final letters (suffix) of the word.
- Train on open classes only, and only on words occurring fewer than 10 times.
- Use Bayes’ rule to calculate P(wi|ti).
3. Maximum Entropy
Maximum entropy: model what you know as well as you can; remain uncertain about the rest.
- Set up features, similar to Brill’s templates.
- When tagging, the set of features which are true determines the probability of a tag in a particular context.
- Make the probability model match the training data as well as it can, i.e., if an example matches features we’ve seen, we know exactly what to do.
- But the probability model also maximizes the entropy (uncertainty), so we remain noncommittal about anything not seen in the training data.
E.g., MXPOST.
4. Decision Tree Tagging
- Each tagging decision can be viewed as a series of questions.
- Each question should make a maximal division in the data, i.e., ask the question which gives the most information at that point; different metrics can be used here.
- Questions become more specific as you go down the tree.
- Training consists of constructing a decision tree. Tagging consists of using the decision tree to give us transition probabilities, and the rest plays out like an HMM.
Decision Tree example — a tree for nouns:

Tag-1 = JJ?
  YES: Tag-2 = DT?
    YES: NN = .85, JJ = .14, …
    NO:  …
  NO:  …
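The tree above can be rendered as a small data structure; walking it with the tagging context yields the leaf's tag distribution, which an HMM-style decoder would then use as transition probabilities. Everything here is a sketch — the only numbers taken from the slide are the .85/.14 leaf; the rest are invented:

```python
# Question nodes test the previous tags; leaves hold a tag distribution.
tree = {
    'question': lambda ctx: ctx['tag-1'] == 'JJ',
    'yes': {
        'question': lambda ctx: ctx['tag-2'] == 'DT',
        'yes': {'leaf': {'NN': 0.85, 'JJ': 0.14, 'OTHER': 0.01}},
        'no':  {'leaf': {'NN': 0.60, 'JJ': 0.30, 'OTHER': 0.10}},
    },
    'no': {'leaf': {'NN': 0.3, 'VB': 0.4, 'OTHER': 0.3}},
}

def tree_probs(node, ctx):
    """Walk the tree with the tagging context until a leaf distribution."""
    while 'leaf' not in node:
        node = node['yes'] if node['question'](ctx) else node['no']
    return node['leaf']
```

For a context where the previous tag is JJ and the one before it is DT, the walk reaches the NN = .85 leaf from the slide.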
Other techniques
- Neural networks
- Conditional random fields
- A network of linear functions
Basically, whatever techniques are being used in machine learning are appropriate to try for POS tagging.
Unsupervised learning
- Use an unannotated corpus as training data.
- Instead of annotations, another source of knowledge is needed, such as a dictionary of possible tags.
- Unsupervised learning uses the same general techniques as supervised learning, but there are differences.
- Advantage: there is much more unannotated data to learn from, and annotated data isn’t always available.
Unsupervised Learning #1: HMMs
We still want to use transition (P(ti|ti−1)) and emission (P(w|t)) probabilities, but how do we obtain them?
- During training, the values of the states (i.e., the tags) are unknown, or hidden.
- Start by using a dictionary, to at least estimate the emission probabilities.
- Using the forward-backward algorithm, iteratively derive better probability estimates; the best tag may change with each iteration.
- Stop when the probability of the sequence of words has been (locally) maximized.
Forward-Backward Algorithm
The Forward-Backward (or Baum-Welch) algorithm for POS tagging works roughly as follows:
1. Initialize all parameters (forward and backward probability estimates).
2. Calculate new probabilities:
   a. Re-estimate the probability for a given word in the sentence from the forward and backward probabilities.
   b. Use those to re-estimate the forward and backward parameters.
3. Terminate when a local maximum for the string is reached.
NB: the initial probabilities for words are based on dictionaries, which you probably need tagged text to obtain.
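The re-estimation core (step 2a) can be sketched for a toy two-tag HMM. This computes the posterior tag probabilities (gamma in the literature) from the forward and backward passes; the function name and dict-of-dicts parameter layout are illustrative, and the step that updates the transition and emission tables from these posteriors is omitted:

```python
def forward_backward_posteriors(obs, states, init, trans, emit):
    """For each position, the posterior probability of each hidden tag
    given the whole word sequence."""
    n = len(obs)
    # forward pass: fwd[t][s] = P(obs[0..t], state_t = s)
    fwd = [{s: init[s] * emit[s][obs[0]] for s in states}]
    for t in range(1, n):
        fwd.append({s: emit[s][obs[t]] *
                       sum(fwd[t - 1][r] * trans[r][s] for r in states)
                    for s in states})
    # backward pass: bwd[t][s] = P(obs[t+1..n-1] | state_t = s)
    bwd = [{} for _ in range(n)]
    bwd[n - 1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        bwd[t] = {s: sum(trans[s][r] * emit[r][obs[t + 1]] * bwd[t + 1][r]
                         for r in states)
                  for s in states}
    total = sum(fwd[n - 1][s] for s in states)   # P(obs)
    return [{s: fwd[t][s] * bwd[t][s] / total for s in states} for t in range(n)]
```

With a dictionary-derived emission table that prefers noun readings for dog and verb readings for runs, the posteriors come out strongly in favor of the intuitive tagging even under uninformative transitions.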
Unsupervised Learning #2: TBL
With TBL, we want to learn transformation rules, but how can we learn the rules if there’s no annotated data?
Main idea: look at the distribution of unambiguous words to guide the disambiguation of ambiguous words.
Example: the can, where can can be a noun, modal, or verb.
- Take unambiguous words from the dictionary and count their occurrences after the: the elephant, the guardian, …
- Conclusion: immediately after the, nouns are more common than verbs or modals.
Unsupervised TBL
Initial state annotator:
- Supervised: assign each word its most likely tag.
- Unsupervised: for each word, list all tags in the dictionary.
The templates change accordingly. Transformation template:
- Change the tag χ of a word to tag Y if the previous (next) tag (word) is Z, where χ is a set of 1 or more tags.
- Don’t change any other tags.
Error Reduction in Unsupervised Method
Let a rule changing χ to Y in context C be represented as Rule(χ, Y, C).
- Rule 1: {VB, MD, NN} → NN PREVWORD the
- Rule 2: {VB, MD, NN} → VB PREVWORD the
Idea: since annotated data isn’t available, score rules so as to prefer those where Y appears much more frequently in the context C than all the other tags in χ; frequency is measured by counting unambiguously tagged words.
So, prefer {VB, MD, NN} → NN PREVWORD the over {VB, MD, NN} → VB PREVWORD the, since dictionary-unambiguous nouns are more common after the in a corpus than dictionary-unambiguous verbs.
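The scoring idea can be sketched by counting dictionary-unambiguous words in the context; the function name, corpus, and lexicon below are all toy illustrations:

```python
from collections import Counter

def best_rule(corpus, lexicon, context_word, ambiguity_class):
    """Pick Y for Rule(class, Y, PREVWORD context_word) by counting words
    that are unambiguous in the dictionary and occur after context_word."""
    counts = Counter()
    for prev, word in zip(corpus, corpus[1:]):
        tags = lexicon.get(word, set())
        if prev == context_word and len(tags) == 1:
            counts[next(iter(tags))] += 1     # an unambiguous vote for this tag
    # prefer the tag of the ambiguity class seen most often in this context
    return max(ambiguity_class, key=lambda y: counts[y])
```

On the elephant/guardian corpus above, unambiguous nouns dominate after the, so the rule rewrites {VB, MD, NN} to NN in that context.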
Using weakly supervised methods
Can mix supervised and unsupervised learning to improve performance. For TBL tagging:
- Use the unsupervised method for the initial-state tagger.
- Use the supervised method to adjust the tags.
Advantage of weakly supervised over purely supervised learning: lots of training data is available for the unsupervised part, while less is available for the supervised part.
Voting Taggers
A more recent technique which seems to improve performance is tagger voting: train multiple taggers on a corpus and learn not only about their individual performance, but about how they interact.
Can vote in different ways:
- Each tagger gets a fraction of the vote.
- For certain tagging decisions, one tagger is always right.
Works best when the taggers look at different, or complementary, information. Intuition: if taggers A, B, and C all look at different information and two of them agree, that outweighs the one that disagrees.
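A minimal weighted-vote combiner, as a sketch (real combination systems learn the weights and tagger interactions from held-out data; here they are passed in by hand):

```python
from collections import Counter

def vote(tagger_outputs, weights=None):
    """Combine per-token tag decisions from several taggers by weighted vote."""
    weights = weights or [1.0] * len(tagger_outputs)
    combined = []
    for token_tags in zip(*tagger_outputs):          # one tuple per token
        scores = Counter()
        for w, tag in zip(weights, token_tags):
            scores[tag] += w                         # each tagger's share of the vote
        combined.append(scores.most_common(1)[0][0])
    return combined
```

With equal weights, two agreeing taggers outvote the third, matching the intuition above; unequal weights let a trusted tagger dominate particular decisions.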
Ambiguity classes
Another way POS tagging methods have been improved is by grouping words into classes: words are grouped together if they have the same set of possible tags. New probabilities can then be calculated over these classes, which helps when a word occurs very rarely but other members of its class are more frequent.
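Grouping by tag set takes only a few lines (a sketch; the lexicon is a toy one):

```python
from collections import defaultdict

def ambiguity_classes(lexicon):
    """Group words that share the same set of possible tags; statistics can
    then be pooled over a class instead of a single rare word."""
    classes = defaultdict(list)
    for word, tags in lexicon.items():
        classes[frozenset(tags)].append(word)
    return classes
```

A rare word like walk then inherits tag statistics from its class-mates (e.g., run), instead of relying on its own sparse counts.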
Summary: POS tagging
Part-of-speech tagging can use:
- Hand-crafted rules based on inspecting a corpus
- Machine-learning approaches based on corpus statistics
- Machine-learning approaches using rules derived automatically from a corpus
- Almost any other machine learning technique you can think of
When machine learning is used, it can be supervised (using an annotated corpus) or unsupervised (using an unannotated corpus). Combinations of different methods often improve performance.
Practice
Exercises 5.1 and 5.2. Small project: exercise 5.8 on Persian (delivery on 30 Azar).