ISIUSC INFORMATION SCIENCES INSTITUTE Daniel Marcu
Learning To Deal With Little Or No Annotated Data
Daniel Marcu
Information Sciences Institute and Department of Computer Science
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
http://www.isi.edu/~marcu/
Overview
• “There is no better data than more data.”
• Annotating data is more cost-effective than writing rules manually.
  – Still, annotating data is expensive.
• How can we annotate as little data as possible?
  – Active Learning
  – Bootstrapping
  – Co-training
• Unsupervised Learning
  – Pattern Discovery
  – Hidden Variables (the EM algorithm)
• Corpus Exploitation for Summarization
Choosing between confusables [Banko and Brill, ACL-2001]
• (two, too, to) (principal, principle) (then, than)
Base Noun Phrase Chunking [Ngai and Yarowsky, ACL-2000]
• Asked human judges to write rules that can be used to identify base noun phrases and automatically integrated those rules into a rule-based chunker.
• Asked human judges to annotate base noun phrases in naturally occurring text and trained a ML-based system to recognize these phrases.
• Compared the performance of the two rule- and ML-based systems.
It pays off to annotate data
It matters who annotates the data
How can we do well while annotating less data?
• Active learning
  – Active learning with one classifier
  – Active learning with a committee of classifiers
• Bootstrapping
  – Bootstrapping with one classifier
  – Bootstrapping with a committee of classifiers
• Co-training
Active learning with one classifier
Input: small annotated corpus + large un-annotated corpus.
1. Train classifier on annotated data.
2. Apply classifier to unlabeled examples.
3. Elicit human judgments for the examples on which the classifier had the lowest confidence.
4. Add the newly labeled data to the annotated corpus.
5. Retrain the classifier and test on held-out data.
6. If performance improves, go to 2.
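The loop above can be sketched in a few lines. The count-based toy "classifier" and the word-sense examples below are hypothetical stand-ins for illustration, not any system from the slides; the key point is step 3, selecting the lowest-confidence examples for human annotation.

```python
def train(labeled):
    """Count-based classifier: for each feature, count labels seen with it."""
    counts = {}
    for feature, label in labeled:
        counts.setdefault(feature, {}).setdefault(label, 0)
        counts[feature][label] += 1
    return counts

def predict(model, feature):
    """Return (best_label, confidence); unseen features get a uniform guess."""
    dist = model.get(feature, {"A": 1, "B": 1})
    total = sum(dist.values())
    label = max(dist, key=dist.get)
    return label, dist[label] / total

def select_queries(model, unlabeled, k):
    """Step 3: pick the k examples the classifier is least confident about."""
    return sorted(unlabeled, key=lambda f: predict(model, f)[1])[:k]

labeled = [("plant life", "A"), ("car plant", "B")]
unlabeled = ["plant life", "truck plant", "plant species"]
model = train(labeled)
# The two unseen contexts rank lowest in confidence, so they get queried:
queries = select_queries(model, unlabeled, 2)
```

In a real system the human labels for `queries` would be added to `labeled` and the classifier retrained, repeating while held-out performance improves.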
Active learning with multiple classifiers
Input: small annotated corpus + large un-annotated corpus.
1. Train multiple classifiers on annotated data.
2. Apply the classifiers to unlabeled examples.
3. Elicit human judgments for the examples on which the classifiers agree the least.
4. Add the newly labeled data to the annotated corpus.
5. Retrain the classifiers and test on held-out data.
6. If performance improves, go to 2.
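A hedged sketch of the committee variant: vote entropy is one common way to quantify "agree the least" (the slides do not commit to a specific disagreement measure), and the three toy classifiers below are invented for illustration.

```python
from collections import Counter
import math

def vote_entropy(votes):
    """Entropy of the committee's label votes; higher means more disagreement."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def select_queries(committee, unlabeled, k):
    """Query the k examples with the highest vote entropy."""
    return sorted(unlabeled,
                  key=lambda x: vote_entropy([clf(x) for clf in committee]),
                  reverse=True)[:k]

# Three toy classifiers that disagree only near the decision boundary:
committee = [lambda x: "pos" if x > 0 else "neg",
             lambda x: "pos" if x > -1 else "neg",
             lambda x: "pos" if x > -2 else "neg"]
queries = select_queries(committee, [5, -1, 10], 1)
```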
Active learning helps [Banko and Brill, ACL-2001]
Active learning helps [Ngai and Yarowsky, NAACL-2000]
Active learning worked in all cases that I know of.
Bootstrapping with one classifier
Input: small annotated corpus + large un-annotated corpus.
1. Train classifier on annotated data.
2. Apply classifier to unlabeled examples.
3. Add to the training corpus the examples that are labeled with high confidence.
4. Retrain the classifier (and test on held-out data).
5. If performance improves, go to 2.
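A minimal self-training sketch of the loop above. The distance-based toy classifier and the 0.5 confidence threshold are illustrative assumptions; the essential difference from active learning is step 3, where the classifier's own confident labels are added without human judgment.

```python
def train(labeled):
    """Toy model: a lookup table from labeled feature to label."""
    return dict(labeled)

def predict(model, x):
    """Label x by its closest labeled feature; confidence decays with distance."""
    nearest = min(model, key=lambda f: abs(f - x))
    return model[nearest], 1.0 / (1.0 + abs(nearest - x))

labeled = [(0, "A"), (10, "B")]
unlabeled = [1, 5, 9]
for _ in range(3):                                # a few bootstrapping rounds
    model = train(labeled)
    confident = [(x, predict(model, x)[0]) for x in unlabeled
                 if predict(model, x)[1] >= 0.5]  # step 3: keep only confident labels
    labeled += confident
    unlabeled = [x for x in unlabeled if predict(model, x)[1] < 0.5]
```

The ambiguous midpoint (5) never clears the threshold and stays unlabeled, which mirrors the intended behavior: bootstrapping should only commit to easy cases.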
Bootstrapping with multiple classifiers
Input: small annotated corpus + large un-annotated corpus.
1. Train classifiers on annotated data.
2. Apply the classifiers to unlabeled examples.
3. Add to the training corpus the examples that are given the same label by all (or most of) the classifiers.
4. Retrain the classifiers (and test on held-out data).
5. If performance improves, go to 2.
Bootstrapping example [Yarowsky – ACL-95]
• Extract from a corpus all instances of a polysemous word (7538 instances of plant).
Sense Training Examples
? company said the plant is still operating
? Although thousands of plant and animal species
? zonal distribution of plant life
? to strain microscopic plant life from the
? Nissan car and truck plant in Japan
? discovered at a St. Louis plant manufacturing
? automated manufacturing plant in Fremont
Start with a simple classifier and create a seed corpus

Sense  Training Examples
A      zonal distribution of plant life
A      to strain microscopic plant life from the …
?      Nissan car and truck plant in Japan
?      company said the plant is still operating
?      Although thousands of plant and animal species …
B      discovered at a St. Louis plant manufacturing
B      automated manufacturing plant in Fremont …

• Start with a simple classifier: plant life: A; manufacturing plant: B
  – 82 examples of living plants (1%)
  – 106 examples of manufacturing plants (1%)
  – 7360 residual examples
Seed corpus
1. Train supervised classifier on seed corpus
Collocation Sense
plant life A
manufacturing plant B
life (within 2-10 words) A
manufacturing (within 2-10 words) B
animal (within 2-10 words) A
equipment (within 2-10 words) B
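The collocation table above is essentially a decision list: rules are tried in rank order, and the first one that matches decides the sense. A minimal sketch with the slide's rules; the ranking by log-likelihood ratio used in Yarowsky's paper is assumed but not implemented here, and the window-distance tests are simplified to plain membership checks.

```python
# Decision-list rules, best first: (collocation test, sense).
rules = [
    (lambda words: "life" in words,          "A"),
    (lambda words: "manufacturing" in words, "B"),
    (lambda words: "animal" in words,        "A"),
    (lambda words: "equipment" in words,     "B"),
]

def classify(context_words, default="A"):
    """Return the sense of the first matching rule, else a default."""
    for test, sense in rules:
        if test(context_words):
            return sense
    return default

sense = classify(["zonal", "distribution", "of", "plant", "life"])
```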
2. Apply classifier on entire data
Rest of the algorithm
1. Optionally use one-sense-per-discourse filter and augment labeled data
2. Repeat steps 1, 2, 3 iteratively.
Evaluation:
• Baseline: 63.9%
• Supervised: 96.1%
• Bootstrapping: 96.5%
Bootstrapping does not work in all cases (than vs. then [Banko and Brill, ACL-2001])
Training data                       Test accuracy   % Total training data
10^6 with labeled seed corpus       0.9624          0.1
Seed + 5 x 10^6 unsupervised        0.9588          0.6
Seed + 10^7 unsupervised            0.9620          1.2
Seed + 10^8 unsupervised            0.9715          12.2
Seed + 5 x 10^8 unsupervised        0.9588          61.1
10^9 supervised                     0.9878          100.0
Co-training [Blum and Mitchell, COLT-1998]
[Figure: two hyperlinked web pages. Page 1, "Professor John Smith": I teach computer courses and advise students including Mary Kae, Bill Blue. Page 2, a student page: I work on the following projects: machine learning for web classification; active learning for NLP; software engineering. My advisor: Professor Smith.]
Co-training
• Input: L – set of labeled training examples
         U – set of unlabeled examples
• Loop:
  – Learn hyperlink-based classifier H from L.
  – Learn full-text classifier F from L.
  – Allow H to label p positive and n negative examples from U (same distribution as in L).
  – Allow F to label p positive and n negative examples from U.
  – Add these self-labeled examples to L.
• Why does this work?
  – Examples that are easy to label for classifier X may be hard cases for classifier Y. Classifier Y may learn something new from the examples labeled by X.
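The loop above can be sketched with two toy view-specific classifiers. The features, labels, and the "confident if this view's value was seen in training" rule are assumptions for illustration; Blum and Mitchell's original setup used naive Bayes classifiers over the two views.

```python
def train(labeled, view):
    """Majority label per feature value in one view (0 = hyperlink, 1 = text)."""
    table = {}
    for x, y in labeled:
        table.setdefault(x[view], []).append(y)
    return {f: max(set(ys), key=ys.count) for f, ys in table.items()}

def co_train(labeled, unlabeled, rounds=2):
    labeled = list(labeled)
    for _ in range(rounds):
        h = train(labeled, 0)                   # hyperlink-view classifier H
        f = train(labeled, 1)                   # full-text-view classifier F
        newly = []
        for x in unlabeled:
            if x[0] in h:                       # H recognizes this view: let it label
                newly.append((x, h[x[0]]))
            elif x[1] in f:                     # otherwise let F try the other view
                newly.append((x, f[x[1]]))
        labeled += newly                        # add self-labeled examples to L
        done = {e for e, _ in newly}
        unlabeled = [x for x in unlabeled if x not in done]
    return labeled

labeled = [(("cs-link", "syllabus"), "course")]
unlabeled = [("cs-link", "homework"), ("blog-link", "syllabus")]
result = co_train(labeled, unlabeled)
```

Each example is a (hyperlink-view, text-view) pair: the first unlabeled page is caught by H via its link text, the second by F via its page text, so each view teaches the other.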
Example: Error rates for a web classifier

Problem: classify web pages as academic course pages (yes or no).
Data: 16 labeled examples and 800 unlabeled examples taken from one department.

                      Page-based   Hyperlink-based   Combined (equal votes)
Supervised training   12.9         12.4              11.1
Co-training            6.2         11.6               5.0
Co-training does not work in all cases [Pierce and Cardie, EMNLP-2001]
Task: identification of base nouns based on left and right context words.
Unsupervised learning
• Pattern discovery.
  – Language modeling (text as a sequence of words)
  – Unsupervised induction of syntactic structure.
  – Unsupervised induction of POS taggers and base noun identifiers for non-English languages.
• Hidden variables: the EM algorithm.
Language modeling
• as soon as _______
• I would like ______
• P(w1, w2, w3, …, wn)
• Useful in
  – Speech recognition
  – Machine translation
  – Summarization/Generation
  – Any application in which we produce text
N-gram models
• P(w1, w2, …, wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) P(w4 | w1, w2, w3) … P(wn | w1, w2, …, wn-1)
• Trigram approximation:
  P(w1, w2, …, wn) ≈ P(w1) P(w2 | w1) P(w3 | w1, w2) P(w4 | w2, w3) … P(wn | wn-2, wn-1)
• Estimation: P(c | a, b) = count(a, b, c) / count(a, b)
• Smoothing when count(a, b) or count(a, b, c) is 0.
• Still the most popular language model: never underestimate the power of n-grams.
• Syntax-based language models: [Chelba and Jelinek, ACL-1998], [Roark, CL-2001], [Charniak, ACL-2001].
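The estimation and smoothing bullets above can be made concrete. The tiny corpus is invented, and add-one (Laplace) smoothing stands in for the unspecified smoothing method mentioned on the slide.

```python
from collections import Counter

corpus = "as soon as possible as soon as we can".split()
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
V = len(set(corpus))               # vocabulary size, used by add-one smoothing

def p(c, a, b, smooth=True):
    """Trigram estimate P(c | a, b) = count(a,b,c) / count(a,b)."""
    if smooth:                     # add-one: never returns zero probability
        return (trigrams[(a, b, c)] + 1) / (bigrams[(a, b)] + V)
    return trigrams[(a, b, c)] / bigrams[(a, b)]

prob = p("possible", "soon", "as")
```

In the toy corpus "soon as" occurs twice and "soon as possible" once, so the unsmoothed estimate is 1/2; smoothing shifts some mass to unseen continuations like "soon as can".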
Unsupervised induction of syntactic structures [van Zaanen – ICML-2000]
• [Harris, 1951] Methods in Structural Linguistics. University of Chicago Press: “Two constituents of the same type can be replaced.”
• IDEA:
  – Find in a corpus parts of sentences that can be replaced and assume that these parts are syntactic constituents.
• Example:
  – Show me (flights from Atlanta to Boston).
  – Show me (the rates for flight 1943).
  – (Book Delta 128) from Dallas to Boston.
  – (Give me all flights) from Dallas to Boston.
Algorithm
1. Find overlapping segments in all sentence pairs (string edit distance).
• Dissimilar parts are considered possible constituents and are assigned unique types (labels: X1, X2…).
2. When multiple overlaps occur use various selection criteria
• The first learned constituent is good.
• The constituent that occurs most often is good.
3. Apply steps 1 and 2 recursively.
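Step 1 above can be sketched as follows. Real alignment-based learning computes full edit-distance alignments between sentence pairs; this simplification only extracts a single differing span between a shared prefix and a shared suffix, which is enough to reproduce the slide's example.

```python
def candidate_constituents(s1, s2):
    """Align two sentences; the differing middles are candidate constituents."""
    a, b = s1.split(), s2.split()
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1                                   # shared prefix length
    j = 0
    while j < min(len(a), len(b)) - i and a[-1 - j] == b[-1 - j]:
        j += 1                                   # shared suffix length
    return (" ".join(a[i:len(a) - j]), " ".join(b[i:len(b) - j]))

pair = candidate_constituents(
    "Show me flights from Atlanta to Boston",
    "Show me the rates for flight 1943")
```

The differing spans would each be assigned the same constituent type (X1, X2, …), since they occur in interchangeable contexts.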
Evaluation
• ATIS corpus: 716 sentences with 11,777 constituents.
• Examples:
  – Corpus:  What is (the name of (the airport in Boston)NP )NP
  – Learned: What is the (name of the (airport in Boston)C )C
  – Corpus:  Explain classes ((QW)NP and (QX)NP and (Y)NP )NP
  – Learned: Explain classes QW and (QX and (Y)C )C
• Non-crossing bracket precision: 86.47
• Non-crossing bracket recall: 86.78
• Lots of room for improvement:
  – Weakening the exact match.
  – Large-scale experiments.
Induction of POS taggers and base noun identifiers for
non-English languages [Yarowsky and Ngai – NAACL-2001]
• For many languages, no NLP analyzers exist.
• Bottleneck: lack of labeled data.
• IDEA: use parallel corpora and existing statistical machine translation software/techniques to automatically label non-English texts.
Projecting POS tag and base noun-phrase structure across languages
Difficulties
• Statistical MT alignment programs yield relatively low accuracy word alignments.
• Very often translations are not literal.
• Mismatch between the annotation needs of two languages (gender in French and English).
POS tagger induction
• Run GIZA (www.isi.edu/natural-language/projects/rewrite) on a parallel corpus of 2M words.
• Run a POS tagger on the English text.
• Automatically induce tags for the French text.
• Train a probabilistic noisy-channel tagger on the automatically induced French tags.
  – Downweight or exclude from the training data the segments that are likely to be aligned poorly.
  – Train lexical priors P(t | w) and tag-sequence models P(t2 | t1) using aggressive generalization techniques (most words have one possible core tag).
• Test performance on held-out data and on out-of-domain manually annotated data (100k words: U. Montreal).
Evaluation
• E-F aligned French:
– Direct transfer: 76%
– Standard noisy-channel: 86%
– Noise-robust noisy-channel: 96%
– Upperbound (trained on heldout goldstandard): 97%
• Out-of-domain data:– Standard noisy-channel: 82%
– Noise-robust noisy-channel: 94%
– Upperbound (trained on heldout goldstandard): 98%
NP bracketer induction
• Tag and bracket English text [Brill,CL-1999; Ramshaw and Marcus, VLC-1999]
• Induce maximal brackets in French/Chinese.
• Train transformation-based learning (TBL) bracketer on French/Chinese data.
• Test performance on small corpus of held out sentences (no French or Chinese NP bracketer exists).
Evaluation on 50 French sentences
– Direct projection, F-measure: Exact 45%, Acceptable 59%
– TBL, F-measure: Exact 81%, Acceptable 91%
Hidden variables: the EM algorithm [Knight, AI Magazine, 1997]
1a. ok-voon ororok sprok .
1b. at-voon bichat dat .
2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .
3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .
4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .
5a. wiwok farok izok stok .
5b. totat jjat quat cat .
6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .
7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .
8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .
9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .
10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .
11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .
12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .
EM (Expectation Maximization) [Dempster, Laird, Rubin; JRSS-B, 1977]
• EM is good for solving “chicken and egg” problems.
  – Translation:
    • If we knew the word-level alignments in a corpus, we would know how to estimate t(f | e).
    • If we knew t(f | e), we would be able to find the word-level alignments in a corpus.
• Problem to solve: find the word-level alignments and the translation probabilities given this corpus:
1e: b c
1f: x y
-----------
2e: b
2f: y

P(a, f | e) = Π_{j=1..m} t(f_j | e_{a_j})
The EM algorithm [Knight, 1999 SMT Tutorial Book]
Step 1: Set parameters uniformly.
t(x | b) = 1/2    t(y | b) = 1/2    t(x | c) = 1/2    t(y | c) = 1/2

Step 2: Compute P(a, f | e) for all alignments.

Pair 1 (b c / x y), alignment b-x, c-y:  P(a, f | e) = 1/2 * 1/2 = 1/4
Pair 1 (b c / x y), alignment b-y, c-x:  P(a, f | e) = 1/2 * 1/2 = 1/4
Pair 2 (b / y), alignment b-y:           P(a, f | e) = 1/2
The EM algorithm
Step 3: Normalize P(a, f | e) values to yield P(a | e, f).

Pair 1, alignment b-x, c-y:  P(a | e, f) = (1/4) / (1/4 + 1/4) = 1/2
Pair 1, alignment b-y, c-x:  P(a | e, f) = (1/4) / (1/4 + 1/4) = 1/2
Pair 2, alignment b-y:       P(a | e, f) = (1/2) / (1/2) = 1

Step 4: Collect fractional counts.

tc(x | b) = 1/2
tc(y | b) = 1/2 + 1 = 3/2
tc(x | c) = 1/2
tc(y | c) = 1/2
The EM algorithm
Step 5: Normalize fractional counts to get revised parameter values.

t(x | b) = (1/2) / (3/2 + 1/2) = 1/4
t(y | b) = (3/2) / (3/2 + 1/2) = 3/4
t(x | c) = (1/2) / 1 = 1/2
t(y | c) = (1/2) / 1 = 1/2

Repeat step 2: Compute P(a, f | e) for all alignments.

Pair 1, alignment b-x, c-y:  P(a, f | e) = 1/4 * 1/2 = 1/8
Pair 1, alignment b-y, c-x:  P(a, f | e) = 3/4 * 1/2 = 3/8
Pair 2, alignment b-y:       P(a, f | e) = 3/4
The EM algorithm
Repeat step 3: Normalize P(a, f | e) values to yield P(a | e, f).

Pair 1, alignment b-x, c-y:  P(a | e, f) = (1/8) / (1/8 + 3/8) = 1/4
Pair 1, alignment b-y, c-x:  P(a | e, f) = (3/8) / (1/8 + 3/8) = 3/4
Pair 2, alignment b-y:       P(a | e, f) = 1

Repeat step 4: Collect fractional counts.

tc(x | b) = 1/4
tc(y | b) = 3/4 + 1 = 7/4
tc(x | c) = 3/4
tc(y | c) = 1/4
The EM algorithm
Step 5: Normalize fractional counts to get revised parameter values.

t(x | b) = 1/8    t(y | b) = 7/8    t(x | c) = 3/4    t(y | c) = 1/4

Repeat steps 2-5 many times:

t(x | b) = 0.0001    t(y | b) = 0.9999    t(x | c) = 0.9999    t(y | c) = 0.0001
At each step, the EM algorithm improves P(f | e) for the whole corpus.
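The worked example above can be run to convergence. This sketch enumerates both one-to-one alignments of the two-word pair exactly as in the slides; after one pass it reproduces the slides' values (t(x|b) = 1/4, t(y|b) = 3/4, …), and after many passes the parameters converge.

```python
# Corpus from the slides: (b c -> x y) and (b -> y).
corpus = [(["b", "c"], ["x", "y"]), (["b"], ["y"])]

# Step 1: set parameters t(f | e) uniformly.
t = {(f, e): 0.5 for e in "bc" for f in "xy"}

for _ in range(50):
    counts = {k: 0.0 for k in t}
    for es, fs in corpus:
        if len(es) == 2:
            # Steps 2-3: score both alignments, then normalize.
            a1 = t[(fs[0], es[0])] * t[(fs[1], es[1])]   # straight: b-x, c-y
            a2 = t[(fs[0], es[1])] * t[(fs[1], es[0])]   # crossed:  b-y, c-x
            z = a1 + a2
            # Step 4: collect fractional counts weighted by P(a | e, f).
            counts[(fs[0], es[0])] += a1 / z
            counts[(fs[1], es[1])] += a1 / z
            counts[(fs[0], es[1])] += a2 / z
            counts[(fs[1], es[0])] += a2 / z
        else:
            counts[(fs[0], es[0])] += 1.0                # only one alignment
    # Step 5: renormalize counts into new parameter values.
    for e in "bc":
        total = sum(counts[(f, e)] for f in "xy")
        for f in "xy":
            t[(f, e)] = counts[(f, e)] / total
```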
EM allows one to make MLEs under adverse circumstances [Pedersen, EMNLP-2001 EM Panel]
• MLE (Maximum Likelihood Estimate)
  – Parameters describe the characteristics of a population. Their values are estimated from samples collected from that population.
  – An MLE is the parameter estimate that is most consistent with the sampled data: it maximizes the likelihood of the data, L(X | Θ).
• Θ_ML = argmax_Θ L(X | Θ)
Trivial example: coin tossing
• 10 trials: h, t, t, t, h, t, t, h, t, t
• One parameter: Θ = P(h)
• The MLE is 3/10.
• Explanation:
  – Given 10 tosses, how likely is it to get 3 heads?
      L(Θ) = C(10,3) Θ^3 (1 - Θ)^7
  – Take the derivative of log L(Θ), set it to 0, and solve: Θ = 3/10.
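The same answer can be checked numerically by evaluating the likelihood on a grid instead of taking the derivative:

```python
from math import comb

def likelihood(theta):
    """L(theta) = C(10,3) * theta^3 * (1 - theta)^7 for 3 heads in 10 tosses."""
    return comb(10, 3) * theta**3 * (1 - theta)**7

# Evaluate on a fine grid over (0, 1) and take the argmax.
grid = [i / 1000 for i in range(1, 1000)]
mle = max(grid, key=likelihood)
```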
EM: a more complex example
• Most often, for multinomial distributions it is not possible to find the MLE using closed-form formulas.
• E-step: Find the expected values of the complete data, given the incomplete data and the current parameter estimates (steps 2 and 3).
• M-step: Compute the MLE as usual (steps 4 and 5).

1e: b c        Parameters: Θ = {t(x | b), t(x | c), t(y | b), t(y | c)}
1f: x y        L(X | Θ) = P(f | e) = Σ_a P(a, f | e)
-----------
2e: b
2f: y

Maximizing L(X | Θ) has no closed-form solution in this case.