ISIUSC INFORMATION SCIENCES INSTITUTE Daniel Marcu
Learning To Deal With Little Or No Annotated Data
Daniel Marcu
Information Sciences Institute and Department of Computer Science
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
http://www.isi.edu/~marcu/
Overview
• “There is no better data than more data.”
• Annotating data is more cost-effective than writing rules manually.
  – Still, annotating data is expensive.
• How can we annotate as little data as possible?
  – Active Learning
  – Bootstrapping
  – Co-training
• Unsupervised Learning
  – Pattern Discovery
  – Hidden Variables (the EM algorithm)
• Corpus Exploitation for Summarization
Choosing between confusables [Banko and Brill, ACL-2001]
• (two, too, to) (principal, principle) (then, than)
Base Noun Phrase Chunking [Ngai and Yarowsky, ACL-2000]
• Asked human judges to write rules that can be used to identify base noun phrases and automatically integrated those rules into a rule-based chunker.
• Asked human judges to annotate base noun phrases in naturally occurring text and trained a ML-based system to recognize these phrases.
• Compared the performance of the two rule- and ML-based systems.
It pays off to annotate data
It matters who annotates the data
How can we do well while annotating less data?
• Active learning
  – Active learning with one classifier
  – Active learning with a committee of classifiers
• Bootstrapping
  – Bootstrapping with one classifier
  – Bootstrapping with a committee of classifiers
• Co-training
Active learning with one classifier
Input: small annotated corpus + large un-annotated corpus.
1. Train classifier on annotated data.
2. Apply classifier to unlabeled examples.
3. Elicit human judgments for the examples on which the classifier had the lowest confidence.
4. Add the newly labeled data to the annotated corpus.
5. Retrain the classifier and test on held-out data.
6. If performance improves, go to 2.
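The loop above can be sketched in a few lines. The count-based toy "classifier" and the word-sense examples below are hypothetical stand-ins for illustration, not any system from the slides; the key point is step 3, selecting the lowest-confidence examples for human annotation.

```python
def train(labeled):
    """Count-based classifier: for each feature, count labels seen with it."""
    counts = {}
    for feature, label in labeled:
        counts.setdefault(feature, {}).setdefault(label, 0)
        counts[feature][label] += 1
    return counts

def predict(model, feature):
    """Return (best_label, confidence); unseen features get a uniform guess."""
    dist = model.get(feature, {"A": 1, "B": 1})
    total = sum(dist.values())
    label = max(dist, key=dist.get)
    return label, dist[label] / total

def select_queries(model, unlabeled, k):
    """Step 3: pick the k examples the classifier is least confident about."""
    return sorted(unlabeled, key=lambda f: predict(model, f)[1])[:k]

labeled = [("plant life", "A"), ("car plant", "B")]
unlabeled = ["plant life", "truck plant", "plant species"]
model = train(labeled)
# The two unseen contexts rank lowest in confidence, so they get queried:
queries = select_queries(model, unlabeled, 2)
```

In a real system the human labels for `queries` would be added to `labeled` and the classifier retrained, repeating while held-out performance improves.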
Active learning with multiple classifiers
Input: small annotated corpus + large un-annotated corpus.
1. Train multiple classifiers on annotated data.
2. Apply the classifiers to unlabeled examples.
3. Elicit human judgments for the examples on which the classifiers agree the least.
4. Add the newly labeled data to the annotated corpus.
5. Retrain the classifiers and test on held-out data.
6. If performance improves, go to 2.
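A hedged sketch of the committee variant: vote entropy is one common way to quantify "agree the least" (the slides do not commit to a specific disagreement measure), and the three toy classifiers below are invented for illustration.

```python
from collections import Counter
import math

def vote_entropy(votes):
    """Entropy of the committee's label votes; higher means more disagreement."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def select_queries(committee, unlabeled, k):
    """Query the k examples with the highest vote entropy."""
    return sorted(unlabeled,
                  key=lambda x: vote_entropy([clf(x) for clf in committee]),
                  reverse=True)[:k]

# Three toy classifiers that disagree only near the decision boundary:
committee = [lambda x: "pos" if x > 0 else "neg",
             lambda x: "pos" if x > -1 else "neg",
             lambda x: "pos" if x > -2 else "neg"]
queries = select_queries(committee, [5, -1, 10], 1)
```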
Active learning helps [Banko and Brill, ACL-2001]
Active learning helps [Ngai and Yarowsky, NAACL-2000]
Active learning worked in all cases that I know of.
Bootstrapping with one classifier
Input: small annotated corpus + large un-annotated corpus.
1. Train classifier on annotated data.
2. Apply classifier to unlabeled examples.
3. Add to the training corpus the examples that are labeled with high confidence.
4. Retrain the classifier (and test on held-out data).
5. If performance improves, go to 2.
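A minimal self-training sketch of the loop above. The distance-based toy classifier and the 0.5 confidence threshold are illustrative assumptions; the essential difference from active learning is step 3, where the classifier's own confident labels are added without human judgment.

```python
def train(labeled):
    """Toy model: a lookup table from labeled feature to label."""
    return dict(labeled)

def predict(model, x):
    """Label x by its closest labeled feature; confidence decays with distance."""
    nearest = min(model, key=lambda f: abs(f - x))
    return model[nearest], 1.0 / (1.0 + abs(nearest - x))

labeled = [(0, "A"), (10, "B")]
unlabeled = [1, 5, 9]
for _ in range(3):                                # a few bootstrapping rounds
    model = train(labeled)
    confident = [(x, predict(model, x)[0]) for x in unlabeled
                 if predict(model, x)[1] >= 0.5]  # step 3: keep only confident labels
    labeled += confident
    unlabeled = [x for x in unlabeled if predict(model, x)[1] < 0.5]
```

The ambiguous midpoint (5) never clears the threshold and stays unlabeled, which mirrors the intended behavior: bootstrapping should only commit to easy cases.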
Bootstrapping with multiple classifiers
Input: small annotated corpus + large un-annotated corpus.
1. Train classifiers on annotated data.
2. Apply the classifiers to unlabeled examples.
3. Add to the training corpus the examples that are given the same label by all (or most of) the classifiers.
4. Retrain the classifiers (and test on held-out data).
5. If performance improves, go to 2.
Bootstrapping example [Yarowsky – ACL-95]
• Extract from a corpus all instances of a polysemous word (7538 instances of plant).
Sense Training Examples
? company said the plant is still operating
? Although thousands of plant and animal species
? zonal distribution of plant life
? to strain microscopic plant life from the
? Nissan car and truck plant in Japan
? discovered at a St. Louis plant manufacturing
? automated manufacturing plant in Fremont
Start with a simple classifier and create a seed corpus

Sense  Training Examples
A      zonal distribution of plant life
A      to strain microscopic plant life from the …
?      Nissan car and truck plant in Japan
?      company said the plant is still operating
?      Although thousands of plant and animal species …
B      discovered at a St. Louis plant manufacturing
B      automated manufacturing plant in Fremont …

• Start with a simple classifier: plant life: A; manufacturing plant: B
  – 82 examples of living plants (1%)
  – 106 examples of manufacturing plants (1%)
  – 7360 residual examples
Seed corpus
1. Train supervised classifier on seed corpus
Collocation Sense
plant life A
manufacturing plant B
life (within 2-10 words) A
manufacturing (within 2-10 words) B
animal (within 2-10 words) A
equipment (within 2-10 words) B
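The collocation table above is essentially a decision list: rules are tried in rank order, and the first one that matches decides the sense. A minimal sketch with the slide's rules; the ranking by log-likelihood ratio used in Yarowsky's paper is assumed but not implemented here, and the window-distance tests are simplified to plain membership checks.

```python
# Decision-list rules, best first: (collocation test, sense).
rules = [
    (lambda words: "life" in words,          "A"),
    (lambda words: "manufacturing" in words, "B"),
    (lambda words: "animal" in words,        "A"),
    (lambda words: "equipment" in words,     "B"),
]

def classify(context_words, default="A"):
    """Return the sense of the first matching rule, else a default."""
    for test, sense in rules:
        if test(context_words):
            return sense
    return default

sense = classify(["zonal", "distribution", "of", "plant", "life"])
```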
2. Apply classifier on entire data
Rest of the algorithm
1. Optionally use one-sense-per-discourse filter and augment labeled data
2. Repeat steps 1, 2, 3 iteratively.
Evaluation:
• Baseline: 63.9%
• Supervised: 96.1%
• Bootstrapping: 96.5%
Bootstrapping does not work in all cases (than vs. then [Banko and Brill, ACL-2001])
Training data                       Test accuracy   % Total training data
10^6 with labeled seed corpus       0.9624          0.1
Seed + 5 x 10^6 unsupervised        0.9588          0.6
Seed + 10^7 unsupervised            0.9620          1.2
Seed + 10^8 unsupervised            0.9715          12.2
Seed + 5 x 10^8 unsupervised        0.9588          61.1
10^9 supervised                     0.9878          100.0
Co-training [Blum and Mitchell, COLT-1998]
[Figure: two hyperlinked web pages. Page 1, "Professor John Smith": I teach computer courses and advise students including Mary Kae, Bill Blue. Page 2, a student page: I work on the following projects: machine learning for web classification; active learning for NLP; software engineering. My advisor: Professor Smith.]
Co-training
• Input: L – set of labeled training examples
         U – set of unlabeled examples
• Loop:
  – Learn hyperlink-based classifier H from L.
  – Learn full-text classifier F from L.
  – Allow H to label p positive and n negative examples from U (same distribution as in L).
  – Allow F to label p positive and n negative examples from U.
  – Add these self-labeled examples to L.
• Why does this work?
  – Examples that are easy to label for classifier X may be hard cases for classifier Y. Classifier Y may learn something new from the examples labeled by X.
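The loop above can be sketched with two toy view-specific classifiers. The features, labels, and the "confident if this view's value was seen in training" rule are assumptions for illustration; Blum and Mitchell's original setup used naive Bayes classifiers over the two views.

```python
def train(labeled, view):
    """Majority label per feature value in one view (0 = hyperlink, 1 = text)."""
    table = {}
    for x, y in labeled:
        table.setdefault(x[view], []).append(y)
    return {f: max(set(ys), key=ys.count) for f, ys in table.items()}

def co_train(labeled, unlabeled, rounds=2):
    labeled = list(labeled)
    for _ in range(rounds):
        h = train(labeled, 0)                   # hyperlink-view classifier H
        f = train(labeled, 1)                   # full-text-view classifier F
        newly = []
        for x in unlabeled:
            if x[0] in h:                       # H recognizes this view: let it label
                newly.append((x, h[x[0]]))
            elif x[1] in f:                     # otherwise let F try the other view
                newly.append((x, f[x[1]]))
        labeled += newly                        # add self-labeled examples to L
        done = {e for e, _ in newly}
        unlabeled = [x for x in unlabeled if x not in done]
    return labeled

labeled = [(("cs-link", "syllabus"), "course")]
unlabeled = [("cs-link", "homework"), ("blog-link", "syllabus")]
result = co_train(labeled, unlabeled)
```

Each example is a (hyperlink-view, text-view) pair: the first unlabeled page is caught by H via its link text, the second by F via its page text, so each view teaches the other.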
Example: Error rates for a web classifier

Problem: classify web pages as academic course pages (yes or no).
Data: 16 labeled examples and 800 unlabeled examples taken from one department.

                      Page-based   Hyperlink-based   Combined (equal votes)
Supervised training   12.9         12.4              11.1
Co-training            6.2         11.6               5.0
Co-training does not work in all cases [Pierce and Cardie, EMNLP-2001]
Task: identification of base nouns based on left and right context words.
Unsupervised learning
• Pattern discovery.
  – Language modeling (text as a sequence of words)
  – Unsupervised induction of syntactic structure.
  – Unsupervised induction of POS taggers and base noun identifiers for non-English languages.
• Hidden variables: the EM algorithm.
Language modeling
• as soon as _______
• I would like ______
• P(w1, w2, w3, …, wn)
• Useful in
  – Speech recognition
  – Machine translation
  – Summarization/Generation
  – Any application in which we produce text
N-gram models
• P(w1, w2, …, wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) P(w4 | w1, w2, w3) … P(wn | w1, w2, …, wn-1)
• Trigram approximation:
  P(w1, w2, …, wn) ≈ P(w1) P(w2 | w1) P(w3 | w1, w2) P(w4 | w2, w3) … P(wn | wn-2, wn-1)
• Estimation: P(c | a, b) = count(a, b, c) / count(a, b)
• Smoothing when count(a, b) or count(a, b, c) is 0.
• Still the most popular language model: never underestimate the power of n-grams.
• Syntax-based language models: [Chelba and Jelinek, ACL-1998], [Roark, CL-2001], [Charniak, ACL-2001].
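The estimation and smoothing bullets above can be made concrete. The tiny corpus is invented, and add-one (Laplace) smoothing stands in for the unspecified smoothing method mentioned on the slide.

```python
from collections import Counter

corpus = "as soon as possible as soon as we can".split()
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
V = len(set(corpus))               # vocabulary size, used by add-one smoothing

def p(c, a, b, smooth=True):
    """Trigram estimate P(c | a, b) = count(a,b,c) / count(a,b)."""
    if smooth:                     # add-one: never returns zero probability
        return (trigrams[(a, b, c)] + 1) / (bigrams[(a, b)] + V)
    return trigrams[(a, b, c)] / bigrams[(a, b)]

prob = p("possible", "soon", "as")
```

In the toy corpus "soon as" occurs twice and "soon as possible" once, so the unsmoothed estimate is 1/2; smoothing shifts some mass to unseen continuations like "soon as can".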
Unsupervised induction of syntactic structures [van Zaanen – ICML-2000]
• [Harris, 1951] Methods in Structural Linguistics. University of Chicago Press: “Two constituents of the same type can be replaced.”
• IDEA:
  – Find in a corpus parts of sentences that can be replaced and assume that these parts are syntactic constituents.
• Example:
  – Show me (flights from Atlanta to Boston).
  – Show me (the rates for flight 1943).
  – (Book Delta 128) from Dallas to Boston.
  – (Give me all flights) from Dallas to Boston.
Algorithm
1. Find overlapping segments in all sentence pairs (string edit distance).
• Dissimilar parts are considered possible constituents and are assigned unique types (labels: X1, X2…).
2. When multiple overlaps occur use various selection criteria
• The first learned constituent is good.
• The constituent that occurs most often is good.
3. Apply steps 1 and 2 recursively.
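Step 1 above can be sketched as follows. Real alignment-based learning computes full edit-distance alignments between sentence pairs; this simplification only extracts a single differing span between a shared prefix and a shared suffix, which is enough to reproduce the slide's example.

```python
def candidate_constituents(s1, s2):
    """Align two sentences; the differing middles are candidate constituents."""
    a, b = s1.split(), s2.split()
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1                                   # shared prefix length
    j = 0
    while j < min(len(a), len(b)) - i and a[-1 - j] == b[-1 - j]:
        j += 1                                   # shared suffix length
    return (" ".join(a[i:len(a) - j]), " ".join(b[i:len(b) - j]))

pair = candidate_constituents(
    "Show me flights from Atlanta to Boston",
    "Show me the rates for flight 1943")
```

The differing spans would each be assigned the same constituent type (X1, X2, …), since they occur in interchangeable contexts.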
Evaluation
• ATIS corpus: 716 sentences with 11,777 constituents.
• Examples:
  – Corpus:  What is (the name of (the airport in Boston)NP )NP
  – Learned: What is the (name of the (airport in Boston)C )C
  – Corpus:  Explain classes ((QW)NP and (QX)NP and (Y)NP )NP
  – Learned: Explain classes QW and (QX and (Y)C )C
• Non-crossing bracket precision: 86.47
• Non-crossing bracket recall: 86.78
• Lots of room for improvement:
  – Weakening the exact match.
  – Large-scale experiments.
Induction of POS taggers and base noun identifiers for
non-English languages [Yarowsky and Ngai – NAACL-2001]
• For many languages, no NLP analyzers exist.
• Bottleneck: lack of labeled data.
• IDEA: use parallel corpora and existing statistical machine translation software/techniques to automatically label non-English texts.
Projecting POS tag and base noun-phrase structure across languages
Difficulties
• Statistical MT alignment programs yield relatively low accuracy word alignments.
• Very often translations are not literal.
• Mismatch between the annotation needs of two languages (gender in French and English).
POS tagger induction
• Run GIZA (www.isi.edu/natural-language/projects/rewrite) on a parallel corpus of 2M words.
• Run a POS tagger on the English text.
• Automatically induce tags for the French text.
• Train a probabilistic noisy-channel tagger on the automatically induced French tags.
  – Downweight or exclude from the training data the segments that are likely to be aligned poorly.
  – Train lexical priors P(t | w) and tag-sequence models P(t2 | t1) using aggressive generalization techniques (most words have one possible core tag).
• Test performance on held-out data and on out-of-domain manually annotated data (100k words: U. Montreal).
Evaluation
• E-F aligned French:
– Direct transfer: 76%
– Standard noisy-channel: 86%
– Noise-robust noisy-channel: 96%
– Upperbound (trained on heldout goldstandard): 97%
• Out-of-domain data:– Standard noisy-channel: 82%
– Noise-robust noisy-channel: 94%
– Upperbound (trained on heldout goldstandard): 98%
NP bracketer induction
• Tag and bracket English text [Brill,CL-1999; Ramshaw and Marcus, VLC-1999]
• Induce maximal brackets in French/Chinese.
• Train transformation-based learning (TBL) bracketer on French/Chinese data.
• Test performance on small corpus of held out sentences (no French or Chinese NP bracketer exists).
Evaluation on 50 French sentences
– Direct projection, F-measure: Exact 45%, Acceptable 59%
– TBL, F-measure: Exact 81%, Acceptable 91%
Hidden variables: the EM algorithm [Knight, AI Magazine, 1997]
1a. ok-voon ororok sprok .
1b. at-voon bichat dat .
2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .
3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .
4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .
5a. wiwok farok izok stok .
5b. totat jjat quat cat .
6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .
7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .
8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .
9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .
10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .
11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .
12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .
EM (Expectation Maximization) [Dempster, Laird, Rubin; JRSS-B, 1977]
• EM is good for solving “chicken and egg” problems.
  – Translation:
    • If we knew the word-level alignments in a corpus, we would know how to estimate t(f | e).
    • If we knew t(f | e), we would be able to find the word-level alignments in a corpus.
• Problem to solve: find the word-level alignments and the translation probabilities given this corpus:
1e: b c
1f: x y
-----------
2e: b
2f: y

P(a, f | e) = Π_{j=1..m} t(f_j | e_{a_j})
The EM algorithm [Knight, 1999 SMT Tutorial Book]
Step 1: Set parameters uniformly.
t(x | b) = 1/2    t(y | b) = 1/2    t(x | c) = 1/2    t(y | c) = 1/2

Step 2: Compute P(a, f | e) for all alignments.

Pair 1 (b c / x y), alignment b-x, c-y:  P(a, f | e) = 1/2 * 1/2 = 1/4
Pair 1 (b c / x y), alignment b-y, c-x:  P(a, f | e) = 1/2 * 1/2 = 1/4
Pair 2 (b / y), alignment b-y:           P(a, f | e) = 1/2
The EM algorithm
Step 3: Normalize P(a, f | e) values to yield P(a | e, f).

Pair 1, alignment b-x, c-y:  P(a | e, f) = (1/4) / (1/4 + 1/4) = 1/2
Pair 1, alignment b-y, c-x:  P(a | e, f) = (1/4) / (1/4 + 1/4) = 1/2
Pair 2, alignment b-y:       P(a | e, f) = (1/2) / (1/2) = 1

Step 4: Collect fractional counts.

tc(x | b) = 1/2
tc(y | b) = 1/2 + 1 = 3/2
tc(x | c) = 1/2
tc(y | c) = 1/2
The EM algorithm
Step 5: Normalize fractional counts to get revised parameter values.

t(x | b) = (1/2) / (3/2 + 1/2) = 1/4
t(y | b) = (3/2) / (3/2 + 1/2) = 3/4
t(x | c) = (1/2) / 1 = 1/2
t(y | c) = (1/2) / 1 = 1/2

Repeat step 2: Compute P(a, f | e) for all alignments.

Pair 1, alignment b-x, c-y:  P(a, f | e) = 1/4 * 1/2 = 1/8
Pair 1, alignment b-y, c-x:  P(a, f | e) = 3/4 * 1/2 = 3/8
Pair 2, alignment b-y:       P(a, f | e) = 3/4
The EM algorithm
Repeat step 3: Normalize P(a, f | e) values to yield P(a | e, f).

Pair 1, alignment b-x, c-y:  P(a | e, f) = (1/8) / (1/8 + 3/8) = 1/4
Pair 1, alignment b-y, c-x:  P(a | e, f) = (3/8) / (1/8 + 3/8) = 3/4
Pair 2, alignment b-y:       P(a | e, f) = 1

Repeat step 4: Collect fractional counts.

tc(x | b) = 1/4
tc(y | b) = 3/4 + 1 = 7/4
tc(x | c) = 3/4
tc(y | c) = 1/4
The EM algorithm
Step 5: Normalize fractional counts to get revised parameter values.

t(x | b) = 1/8    t(y | b) = 7/8    t(x | c) = 3/4    t(y | c) = 1/4

Repeat steps 2-5 many times:

t(x | b) = 0.0001    t(y | b) = 0.9999    t(x | c) = 0.9999    t(y | c) = 0.0001
At each step, the EM algorithm improves P(f | e) for the whole corpus.
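The worked example above can be run to convergence. This sketch enumerates both one-to-one alignments of the two-word pair exactly as in the slides; after one pass it reproduces the slides' values (t(x|b) = 1/4, t(y|b) = 3/4, …), and after many passes the parameters converge.

```python
# Corpus from the slides: (b c -> x y) and (b -> y).
corpus = [(["b", "c"], ["x", "y"]), (["b"], ["y"])]

# Step 1: set parameters t(f | e) uniformly.
t = {(f, e): 0.5 for e in "bc" for f in "xy"}

for _ in range(50):
    counts = {k: 0.0 for k in t}
    for es, fs in corpus:
        if len(es) == 2:
            # Steps 2-3: score both alignments, then normalize.
            a1 = t[(fs[0], es[0])] * t[(fs[1], es[1])]   # straight: b-x, c-y
            a2 = t[(fs[0], es[1])] * t[(fs[1], es[0])]   # crossed:  b-y, c-x
            z = a1 + a2
            # Step 4: collect fractional counts weighted by P(a | e, f).
            counts[(fs[0], es[0])] += a1 / z
            counts[(fs[1], es[1])] += a1 / z
            counts[(fs[0], es[1])] += a2 / z
            counts[(fs[1], es[0])] += a2 / z
        else:
            counts[(fs[0], es[0])] += 1.0                # only one alignment
    # Step 5: renormalize counts into new parameter values.
    for e in "bc":
        total = sum(counts[(f, e)] for f in "xy")
        for f in "xy":
            t[(f, e)] = counts[(f, e)] / total
```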
EM allows one to make MLEs under adverse circumstances [Pedersen, EMNLP-2001 EM Panel]
• MLE (Maximum Likelihood Estimate)
  – Parameters describe the characteristics of a population. Their values are estimated from samples collected from that population.
  – An MLE is the parameter estimate that is most consistent with the sampled data: it maximizes the likelihood of the data, L(X | Θ).
• Θ_ML = argmax_Θ L(X | Θ)
Trivial example: coin tossing
• 10 trials: h, t, t, t, h, t, t, h, t, t
• One parameter: Θ = P(h)
• The MLE is 3/10.
• Explanation:
  – Given 10 tosses, how likely is it to get 3 heads?
      L(Θ) = C(10,3) Θ^3 (1 - Θ)^7
  – Take the derivative of log L(Θ), set it to 0, and solve: Θ = 3/10.
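The same answer can be checked numerically by evaluating the likelihood on a grid instead of taking the derivative:

```python
from math import comb

def likelihood(theta):
    """L(theta) = C(10,3) * theta^3 * (1 - theta)^7 for 3 heads in 10 tosses."""
    return comb(10, 3) * theta**3 * (1 - theta)**7

# Evaluate on a fine grid over (0, 1) and take the argmax.
grid = [i / 1000 for i in range(1, 1000)]
mle = max(grid, key=likelihood)
```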
EM: a more complex example
• Most often, for multinomial distributions it is not possible to find the MLE using closed-form formulas.
• E-step: Find the expected values of the complete data, given the incomplete data and the current parameter estimates (steps 2 and 3).
• M-step: Compute the MLE as usual (steps 4 and 5).

1e: b c        Parameters: Θ = {t(x | b), t(x | c), t(y | b), t(y | c)}
1f: x y        L(X | Θ) = P(f | e) = Σ_a P(a, f | e)
-----------
2e: b
2f: y

Maximizing L(X | Θ) has no closed-form solution in this case.