Part of Speech Tagging in Context
month day, year
Alex Cheng, Ling 575 Winter 08
Michele Banko, Robert Moore
Overview
• Comparison of previous methods
• Using context from both sides
• Lexicon construction
• Sequential EM for tag-sequence and lexical probabilities
• Discussion questions
Previous methods
• Trigram model P(t_i | t_{i-1}, t_{i-2})
• Kupiec (1992): divides the lexicon into word classes
  – Words contained within the same equivalence class possess the same set of POS tags
• Brill (1995): UTBL
  – Uses information from the distribution of unambiguously tagged data to make labeling decisions
  – Considers both left and right context
• Toutanova (2003): conditional Markov model
  – Supervised learning method
  – Increases accuracy from 96.10% to 96.55%
• Lafferty (2001)
  – Compared HMMs with MEMMs and CRFs
Contextualized HMM
• Estimate the probability of a word w_i based on t_{i-1}, t_i, and t_{i+1}
• Leads to higher dimensionality in the parameters
• Smoothed with a standard absolute-discounting scheme
Lexicon construction
• Lexicons are provided for both training and testing
• Initialize with a uniform distribution over all possible tags for each word
• Experiments with using word classes as in the Kupiec model
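The uniform initialization above can be sketched as follows; the dictionary-of-dictionaries layout is an illustrative choice, not the paper's data structure:

```python
def init_lexical_probs(lexicon):
    """Give each word a uniform distribution over its possible tags.

    lexicon: dict mapping word -> set of tags the lexicon allows for it.
    Returns: dict mapping word -> {tag: probability}.
    """
    return {word: {tag: 1.0 / len(tags) for tag in tags}
            for word, tags in lexicon.items()}
```

An ambiguous word such as "bank" (NN or VB) starts at 0.5 for each tag, while an unambiguous word starts at 1.0.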
Problems
• Limiting the possible tags per lexicon entry
  – Tags that appeared less than X% of the time for each word are omitted.
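The pruning step might look like the sketch below; the cutoff X is left unspecified in the slides, so the 10% default is only an example:

```python
def prune_lexicon(tag_counts, threshold=0.10):
    """Drop tags seen for less than `threshold` of a word's occurrences.

    tag_counts: dict mapping word -> {tag: count} from tagged data.
    Returns: dict mapping word -> set of surviving tags.
    """
    pruned = {}
    for word, counts in tag_counts.items():
        total = sum(counts.values())
        kept = {t for t, c in counts.items() if c / total >= threshold}
        pruned[word] = kept or set(counts)  # never leave a word tagless
    return pruned
```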
HMM Model Training
• Extract unambiguous tag sequences
  – Use these n-grams and their counts to bias the initial estimates of the HMM state transitions
• Sequential training
  – First train the transition probabilities, keeping the lexical probabilities fixed.
  – Then train the lexical probabilities, keeping the transition probabilities fixed.
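Sequential training can be pictured as alternating Baum-Welch phases in which one parameter block is re-estimated while the other stays frozen. The toy implementation below (NumPy, one observation sequence, fixed initial distribution) illustrates the schedule, not the paper's actual code:

```python
import numpy as np

def forward_backward(obs, A, B, pi):
    """Scaled forward-backward; returns state posteriors (gamma) and
    expected transition counts (xi) for one observation sequence."""
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((N, N))
    for t in range(T - 1):
        x = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi += x / x.sum()
    return gamma, xi

def sequential_em(obs, A, B, pi, rounds=2, iters=3):
    """Alternate: update transitions A with emissions B frozen,
    then update B with A frozen."""
    for _ in range(rounds):
        for _ in range(iters):               # transition phase
            _, xi = forward_backward(obs, A, B, pi)
            A = xi / xi.sum(axis=1, keepdims=True)
        for _ in range(iters):               # lexical phase
            gamma, _ = forward_backward(obs, A, B, pi)
            Bn = np.zeros_like(B)
            for t, o in enumerate(obs):
                Bn[:, o] += gamma[t]
            B = Bn / Bn.sum(axis=1, keepdims=True)
    return A, B
```

Each phase is an ordinary EM update restricted to one parameter block, so the data likelihood is still non-decreasing within a phase.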
Discussion
• Sequential training of the HMM trains the parameters separately. Is there any theoretical significance? What is the computational cost?
• What are the effects if we model the tag context differently, using P(t_i | t_{i-1}, t_{i+1})?
Improved Estimation for Unsupervised POS Tagging
month day, year
Alex Cheng, Ling 575 Winter 08
Qin Iris Wang, Dale Schuurmans
Overview
• Focus on parameter estimation
  – Considers only simple models with limited context (a standard bigram HMM)
• Constraint on marginal tag probabilities
• Smooth lexical parameters using word similarities
• Discussion questions
Parameter Estimation
• Banko and Moore (2004) reduce the error rate from 22.8% to 4.1% by reducing the set of possible tags for each word.
  – Requires tagged data to build the artificially reduced lexicon.
• EM is guaranteed to converge to a local maximum.
• HMMs tend to have multiple local maxima.
  – As a result, the quality of the resulting parameters may depend more on the initial parameter estimates than on the EM procedure itself.
Estimation problems
• Using the standard model:
  – Tag → tag: uniform over all tags
  – Tag → word: uniform over all possible tags for the word (as specified in the complete lexicon)
• The estimated transition probabilities are quite poor.
  – 'a' is always tagged LS.
• The estimated lexical probabilities are also quite poor.
  – Each parameter b_t(w_1), b_t(w_2) is treated as independent.
  – EM tends to over-fit the lexical model and ignore similarity between words.
Marginally Constrained HMMs: Tag → Tag probabilities
• Maintain a specific marginal distribution over the tag probabilities.
  – Assumes we are given a target distribution over tags (raw tag frequencies)
    • Can be obtained from tagged data
    • Can be approximated (see Toutanova, 2003)
Similarity-Based Smoothing: Tag → Word probabilities
• Use a feature vector f for each word w, consisting of the contexts (left and right words) of w.
• The 100,000 most frequent words are used as features.
Result
Discussion
• Compared to Banko and Moore, are the methods used here "more or less" unsupervised?
  – Banko and Moore use lexicon ablation
  – Here, raw tag frequencies are used