Transcript of "Learning To Deal With Little Or No Annotated Data" (Daniel Marcu, ISI/USC)

Page 1:

Learning To Deal With Little Or No Annotated Data

Daniel Marcu
Information Sciences Institute and Department of Computer Science

University of Southern California

4676 Admiralty Way, Suite 1001

Marina del Rey, CA 90292

[email protected]

http://www.isi.edu/~marcu/

Page 2:

Overview

• "There is no better data than more data."
• Annotating data is more cost-effective than writing rules manually.
– Still, annotating data is expensive.
• How can we annotate as little data as possible?
– Active Learning
– Bootstrapping
– Co-training
• Unsupervised Learning
– Pattern Discovery
– Hidden Variables (the EM algorithm)
• Corpus Exploitation for Summarization.

Page 3:

Choosing between confusables [Banko and Brill, ACL-2001]

• (two, too, to) (principal, principle) (then, than)

Page 4:

Base Noun Phrase Chunking [Ngai and Yarowsky, ACL-2000]

• Asked human judges to write rules that can be used to identify base noun phrases and automatically integrated those rules into a rule-based chunker.

• Asked human judges to annotate base noun phrases in naturally occurring text and trained a ML-based system to recognize these phrases.

• Compared the performance of the two rule- and ML-based systems.

Page 5:

It pays off to annotate data

Page 6:

It matters who annotates the data

Page 7:

How can we do well while annotating less data?

• Active learning
– Active learning with one classifier
– Active learning with a committee of classifiers
• Bootstrapping
– Bootstrapping with one classifier
– Bootstrapping with a committee of classifiers
• Co-training

Page 8:

Active learning with one classifier

Input: small annotated corpus + large un-annotated corpus.

1. Train a classifier on the annotated data.
2. Apply the classifier to the unlabeled examples.
3. Elicit human judgments for the examples on which the classifier had the lowest confidence.
4. Add the newly labeled data to the annotated corpus.
5. Retrain the classifier and test on held-out data.
6. If there is improvement, go to step 2.
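A minimal sketch of this loop, assuming scikit-learn; `human_oracle`, the batch size, and the number of rounds are illustrative stand-ins, not details from the talk:

```python
# Uncertainty sampling: query the pool items the model is least sure about.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learn(X_labeled, y_labeled, X_pool, human_oracle,
                 batch_size=10, rounds=5):
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X_labeled, y_labeled)                  # 1. train on annotated data
        conf = clf.predict_proba(X_pool).max(axis=1)   # 2. apply to unlabeled pool
        query = np.argsort(conf)[:batch_size]          # 3. lowest-confidence items
        y_new = [human_oracle(x) for x in X_pool[query]]
        X_labeled = np.vstack([X_labeled, X_pool[query]])  # 4. grow the corpus
        y_labeled = np.concatenate([y_labeled, y_new])
        X_pool = np.delete(X_pool, query, axis=0)      # 5.-6. retrain next round
    return clf
```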

Page 9:

Active learning with multiple classifiers

Input: small annotated corpus + large un-annotated corpus.

1. Train multiple classifiers on the annotated data.
2. Apply the classifiers to the unlabeled examples.
3. Elicit human judgments for the examples on which the classifiers agree the least.
4. Add the newly labeled data to the annotated corpus.
5. Retrain the classifiers and test on held-out data.
6. If there is improvement, go to step 2.
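The committee version changes only the selection criterion. A sketch of disagreement-based selection via vote entropy, assuming a list of already-trained scikit-learn-style classifiers:

```python
# Query-by-committee selection: the examples with the highest vote entropy
# (least agreement among the classifiers) go to the human annotator.
import numpy as np

def most_disputed(classifiers, X_pool, batch_size=10):
    votes = np.stack([clf.predict(X_pool) for clf in classifiers])  # (C, N)
    entropy = np.empty(votes.shape[1])
    for i, col in enumerate(votes.T):            # label votes for one example
        _, counts = np.unique(col, return_counts=True)
        p = counts / counts.sum()
        entropy[i] = -(p * np.log(p)).sum()      # 0 = full agreement
    return np.argsort(entropy)[-batch_size:]     # least agreement
```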

Page 10:

Active learning helps [Banko and Brill, ACL-2001]

Page 11:

Active learning helps [Ngai and Yarowsky, ACL-2000]

Page 12:

Active learning worked in all cases that I know of.

Page 13:

Bootstrapping with one classifier

Input: small annotated corpus + large un-annotated corpus.

1. Train a classifier on the annotated data.
2. Apply the classifier to the unlabeled examples.
3. Add to the training corpus the examples that are labeled with high confidence.
4. Retrain the classifier (and test on held-out data).
5. If there is improvement, go to step 2.
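A sketch of this self-training loop, again assuming scikit-learn; the 0.95 confidence threshold is an illustrative choice, not a value from the talk:

```python
# Self-training: add the classifier's own most confident predictions
# to the training set and retrain.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_pool, threshold=0.95, rounds=5):
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X_labeled, y_labeled)                  # 1. train
        probs = clf.predict_proba(X_pool)              # 2. apply to the pool
        sure = probs.max(axis=1) >= threshold          # 3. high confidence only
        if not sure.any():
            break                                      # nothing confident enough
        y_self = clf.classes_[probs[sure].argmax(axis=1)]
        X_labeled = np.vstack([X_labeled, X_pool[sure]])
        y_labeled = np.concatenate([y_labeled, y_self])
        X_pool = X_pool[~sure]                         # 4.-5. retrain next round
    return clf
```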

Page 14:

Bootstrapping with multiple classifiers

Input: small annotated corpus + large un-annotated corpus.

1. Train classifiers on the annotated data.
2. Apply the classifiers to the unlabeled examples.
3. Add to the training corpus the examples that are given the same label by all (or most of) the classifiers.
4. Retrain the classifiers (and test on held-out data).
5. If there is improvement, go to step 2.
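Only the trust criterion changes here. A sketch of the agreement filter (a vote-margin test over the committee would implement the "most of" variant):

```python
# Committee bootstrapping filter: trust only the pool examples that every
# classifier labels identically.
import numpy as np

def unanimous(classifiers, X_pool):
    votes = np.stack([clf.predict(X_pool) for clf in classifiers])  # (C, N)
    agree = (votes == votes[0]).all(axis=0)      # same label from everyone
    return X_pool[agree], votes[0][agree]        # examples + agreed-on labels
```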

Page 15:

Bootstrapping example [Yarowsky, ACL-1995]

• Extract from a corpus all instances of a polysemous word (7538 instances of plant).

Sense | Training Examples
? | company said the plant is still operating
? | Although thousands of plant and animal species
? | zonal distribution of plant life
? | to strain microscopic plant life from the
? | Nissan car and truck plant in Japan
? | discovered at a St. Louis plant manufacturing
? | automated manufacturing plant in Fremont

Page 16:

Start with a simple classifier and create a seed corpus

Sense | Training Examples
A | zonal distribution of plant life
A | to strain microscopic plant life from the
A | …
? | Nissan car and truck plant in Japan
? | company said the plant is still operating
? | Although thousands of plant and animal species
? | …
B | discovered at a St. Louis plant manufacturing
B | automated manufacturing plant in Fremont
B | …

• Start with a simple seed classifier: contexts containing "life" → sense A; contexts containing "manufacturing" → sense B.
– 82 examples of living plants (1%)
– 106 examples of manufacturing plants (1%)
– 7360 residual examples

Page 17:

Seed corpus

Page 18:

1. Train supervised classifier on seed corpus

Collocation | Sense
plant life | A
manufacturing plant | B
life (within 2-10 words) | A
manufacturing (within 2-10 words) | B
animal (within 2-10 words) | A
equipment (within 2-10 words) | B

Page 19:

2. Apply classifier on entire data

Page 20:

Rest of the algorithm

3. Optionally use the one-sense-per-discourse filter and augment the labeled data.
4. Repeat steps 1-3 iteratively.

Evaluation:

• Baseline: 63.9%
• Supervised: 96.1%
• Bootstrapping: 96.5%
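A sketch of the decision-list idea behind the classifier: rank collocation features by a smoothed log-likelihood ratio and label each instance with the single strongest matching rule. The feature extraction (bag of context words) and the smoothing constant are simplifications relative to the paper:

```python
# Yarowsky-style decision list over context-word features.
import math
from collections import defaultdict

def train_decision_list(labeled, alpha=0.1):
    """labeled: list of (context_tokens, sense) pairs.
    Returns rules as (score, word, sense), strongest first."""
    counts = defaultdict(lambda: defaultdict(float))
    for tokens, sense in labeled:
        for w in set(tokens):
            counts[w][sense] += 1
    rules = []
    for w, by_sense in counts.items():
        for sense, c in by_sense.items():
            other = sum(by_sense.values()) - c
            score = math.log((c + alpha) / (other + alpha))  # smoothed LLR
            rules.append((score, w, sense))
    return sorted(rules, reverse=True)

def classify(rules, tokens):
    """Apply the single strongest matching rule; None = residual."""
    toks = set(tokens)
    for score, w, sense in rules:
        if w in toks and score > 0:
            return sense
    return None
```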

Page 21:

Bootstrapping does not work in all cases (than vs. then) [Banko and Brill, ACL-2001]

Training regime | Test accuracy | % of total training data
10^6 with labeled seed corpus | 0.9624 | 0.1
Seed + 5 x 10^6 unsupervised | 0.9588 | 0.6
Seed + 10^7 unsupervised | 0.9620 | 1.2
Seed + 10^8 unsupervised | 0.9715 | 12.2
Seed + 5 x 10^8 unsupervised | 0.9588 | 61.1
10^9 supervised | 0.9878 | 100.0

Page 22:

Co-training [Blum and Mitchell, COLT-1998]

[Figure: two redundant "views" of the same example. The page's own text: "Professor John Smith. I teach computer courses and advise students, including Mary Kae and Bill Blue. I work on the following projects: machine learning for web classification, active learning for NLP, software engineering." The anchor text on a student's page that links to it: "My advisor: Professor Smith."]

Page 23:

Co-training

• Input: L – set of labeled training examples; U – set of unlabeled examples.
• Loop:
– Learn hyperlink-based classifier H from L.
– Learn full-text classifier F from L.
– Allow H to label p positive and n negative examples from U (same class distribution as in L).
– Allow F to label p positive and n negative examples from U.
– Add these self-labeled examples to L.
• Why does this work?
– Examples that are easy to label for classifier X may be hard cases for classifier Y; classifier Y may learn something new from the examples labeled by X.
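A sketch of the loop, assuming two pre-computed feature views and binary labels; naive Bayes and the growth sizes p=1, n=3 follow the paper's setup, but the code itself is only illustrative:

```python
# Co-training: each view's classifier donates its most confident labels.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, U1, U2, p=1, n=3, rounds=10):
    """X1/X2: two views of the labeled data; U1/U2: views of the unlabeled pool."""
    for _ in range(rounds):
        h = MultinomialNB().fit(X1, y)             # hyperlink-view classifier
        f = MultinomialNB().fit(X2, y)             # full-text-view classifier
        if len(U1) == 0:
            break
        take, labels = [], []
        for clf, view in ((h, U1), (f, U2)):
            probs = clf.predict_proba(view)        # column 1 = positive class
            pos = np.argsort(probs[:, 1])[-p:]     # p most confident positives
            neg = np.argsort(probs[:, 0])[-n:]     # n most confident negatives
            for i in np.concatenate([pos, neg]):
                if i not in take:                  # each pool item added once
                    take.append(int(i))
                    labels.append(clf.predict(view[i:i + 1])[0])
        idx = np.array(take)
        X1 = np.vstack([X1, U1[idx]])              # both views grow together
        X2 = np.vstack([X2, U2[idx]])
        y = np.concatenate([y, labels])
        keep = np.setdiff1d(np.arange(len(U1)), idx)
        U1, U2 = U1[keep], U2[keep]
    return h, f
```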

Page 24:

Example: Error rates for a web classifier

Problem: classify web pages as academic course pages (yes or no).
Data: 16 labeled examples and 800 unlabeled examples taken from one department.

Error rate (%) | Page-based classifier | Hyperlink-based classifier | Combined classifier (equal votes)
Supervised training | 12.9 | 12.4 | 11.1
Co-training | 6.2 | 11.6 | 5.0

Page 25:

Co-training does not work in all cases [Pierce and Cardie, EMNLP-2001]

Task: identification of base nouns based on left and right context words.

Page 26:

Unsupervised learning

• Pattern discovery:
– Language modeling (text as a sequence of words)
– Unsupervised induction of syntactic structure
– Unsupervised induction of POS taggers and base noun identifiers for non-English languages
• Hidden variables: the EM algorithm.

Page 27:

Language modeling

• as soon as _______
• I would like ______
• P(w1, w2, w3, …, wn)
• Useful in:
– Speech recognition
– Machine translation
– Summarization/Generation
– Any application in which we produce text

Page 28:

N-gram models

• Chain rule:
P(w1, w2, …, wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) P(w4 | w1, w2, w3) … P(wn | w1, w2, …, wn-1)
• Trigram approximation:
P(w1, w2, …, wn) ≈ P(w1) P(w2 | w1) P(w3 | w1, w2) P(w4 | w2, w3) … P(wn | wn-2, wn-1)
• Estimation: P(c | a, b) = count(a, b, c) / count(a, b)
• Smoothing is needed when count(a, b) or count(a, b, c) is 0.
• Still the most popular language model: never underestimate the power of n-grams.
• Syntax-based language models: [Chelba and Jelinek, ACL-1998], [Roark, CL-2001], [Charniak, ACL-2001].
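To make the estimation formula concrete, a toy trigram model with add-one smoothing (the corpus and smoothing scheme are illustrative; real systems use better smoothing such as Katz back-off or Kneser-Ney):

```python
# Trigram language model with add-one smoothing.
from collections import Counter

def train_trigram(tokens):
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(set(tokens))
    def p(a, b, c):
        # P(c | a, b) = (count(a,b,c) + 1) / (count(a,b) + |V|)
        return (tri[(a, b, c)] + 1) / (bi[(a, b)] + vocab_size)
    return p

p = train_trigram("as soon as possible as soon as i can".split())
print(p("as", "soon", "as"))   # seen trigram: (2 + 1) / (2 + 5)
print(p("as", "soon", "can"))  # unseen: nonzero thanks to smoothing, 1 / 7
```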

Page 29:

Unsupervised induction of syntactic structures [van Zaanen, ICML-2000]

• [Harris, 1951] Methods in Structural Linguistics, University of Chicago Press: "Two constituents of the same type can be replaced."
• IDEA: Find in a corpus parts of sentences that can be replaced, and assume that these parts are syntactic constituents.
• Example:
– Show me (flights from Atlanta to Boston).
– Show me (the rates for flight 1943).
– (Book Delta 128) from Dallas to Boston.
– (Give me all flights) from Dallas to Boston.

Page 30:

Algorithm

1. Find overlapping segments in all sentence pairs (string edit distance).
• Dissimilar parts are considered possible constituents and are assigned unique types (labels: X1, X2, …).
2. When multiple overlaps occur, use various selection criteria:
• The first learned constituent is good.
• The constituent that occurs most often is good.
3. Apply steps 1 and 2 recursively.
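A sketch of step 1 for a single sentence pair, using Python's difflib as the string-edit-distance aligner; full ABL collects and scores these hypotheses over the whole corpus:

```python
# Alignment-based learning, step 1: shared material is context, the
# differing spans are candidate constituents.
from difflib import SequenceMatcher

def candidate_constituents(sent1, sent2):
    a, b = sent1.split(), sent2.split()
    hyps = []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
        if op != "equal":                    # dissimilar part = candidate
            hyps.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return hyps

print(candidate_constituents(
    "Show me flights from Atlanta to Boston",
    "Show me the rates for flight 1943"))
# [('flights from Atlanta to Boston', 'the rates for flight 1943')]
```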

Page 31:

Evaluation

• ATIS corpus: 716 sentences with 11,777 constituents.
• Examples:
– Corpus: What is (the name of (the airport in Boston)NP)NP
– Learned: What is the (name of the (airport in Boston)C)C
– Corpus: Explain classes ((QW)NP and (QX)NP and (Y)NP)NP
– Learned: Explain classes QW and (QX and (Y)C)C
• Non-crossing bracket precision: 86.47
• Non-crossing bracket recall: 86.78
• Lots of room for improvement:
– Weakening the exact match.
– Large-scale experiments.

Page 32:

Induction of POS taggers and base noun identifiers for non-English languages [Yarowsky and Ngai, NAACL-2001]

• For many languages, no NLP analyzers exist.

• Bottleneck: lack of labeled data.

• IDEA: use parallel corpora and existing statistical machine translation software/techniques to automatically label non-English texts.

Page 33:

Projecting POS tag and base noun-phrase structure across languages

Page 34:

Difficulties

• Statistical MT alignment programs yield relatively low-accuracy word alignments.

• Very often, translations are not literal.

• Mismatch between the annotation needs of two languages (e.g., gender in French vs. English).

Page 35:

POS tagger induction

• Run GIZA (www.isi.edu/natural-language/projects/rewrite) on a parallel corpus of 2M words.
• Run a POS tagger on the English text.
• Automatically induce tags for the French side.
• Train a probabilistic noisy-channel tagger on the automatically induced French tags.
– Downweight or exclude from the training data the segments that are likely to be aligned poorly.
– Train lexical priors P(t | w) and tag-sequence models P(t2 | t1) using aggressive generalization techniques (most words have one possible core tag).
• Test performance on held-out data and on out-of-domain manually annotated data (100K; U. Montreal).
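The core projection step can be sketched directly; the alignment format (English-index, French-index pairs) mirrors what GIZA-style aligners output, and the example sentence is invented:

```python
# Project each English token's POS tag to the French tokens aligned to it.
def project_tags(en_tags, fr_len, alignment, default="UNK"):
    fr_tags = [default] * fr_len
    for e_i, f_i in alignment:
        fr_tags[f_i] = en_tags[e_i]          # transfer the tag
    return fr_tags

en_tags = ["DET", "ADJ", "NOUN"]             # "the blue house"
# French "la maison bleue"; word links: 0-0, 1-2, 2-1
print(project_tags(en_tags, 3, [(0, 0), (1, 2), (2, 1)]))
# ['DET', 'NOUN', 'ADJ']
```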

Page 36:

Evaluation

• E-F aligned French:
– Direct transfer: 76%
– Standard noisy-channel: 86%
– Noise-robust noisy-channel: 96%
– Upper bound (trained on held-out gold standard): 97%
• Out-of-domain data:
– Standard noisy-channel: 82%
– Noise-robust noisy-channel: 94%
– Upper bound (trained on held-out gold standard): 98%

Page 37:

NP bracketer induction

• Tag and bracket English text [Brill, CL-1999; Ramshaw and Marcus, VLC-1999].

• Induce maximal brackets in French/Chinese.

• Train transformation-based learning (TBL) bracketer on French/Chinese data.

• Test performance on small corpus of held out sentences (no French or Chinese NP bracketer exists).

Page 38:

Evaluation on 50 French sentences

– Direct, F-measure: Exact: 45%; Acceptable: 59%
– TBL, F-measure: Exact: 81%; Acceptable: 91%

Page 39:

Hidden variables: the EM algorithm [Knight, AI Magazine, 1997]

1a. ok-voon ororok sprok .
1b. at-voon bichat dat .

2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .

3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .

4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .

5a. wiwok farok izok stok .
5b. totat jjat quat cat .

6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .

7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .

8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .

9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .

10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .

11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .

12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .

Page 40:

EM (Expectation Maximization) [Dempster, Laird, Rubin; JRSS, 1977]

• EM is good for solving "chicken and egg" problems.
– Translation:
• If we knew the word-level alignments in a corpus, we would know how to estimate t(f | e).
• If we knew t(f | e), we would be able to find the word-level alignments in a corpus.
• Problem to solve: find the word-level alignments and the translation probabilities given this corpus:

1e: b c
1f: x y
2e: b
2f: y

P(a, f | e) = ∏ j=1..m t(fj | e_aj)

Page 41:

The EM algorithm [Knight, 1999 SMT Tutorial Book]

Step 1: Set parameters uniformly.

t(x | b) = 1/2
t(y | b) = 1/2
t(x | c) = 1/2
t(y | c) = 1/2

Step 2: Compute P(a, f | e) for all alignments.

b c → x y (parallel): P(a, f | e) = 1/2 * 1/2 = 1/4
b c → x y (crossed):  P(a, f | e) = 1/2 * 1/2 = 1/4
b → y:                P(a, f | e) = 1/2

Page 42:

The EM algorithm

Step 3: Normalize P(a, f | e) values to yield P(a | e, f).

b c → x y (parallel): P(a | e, f) = (1/4) / (1/4 + 1/4) = 1/2
b c → x y (crossed):  P(a | e, f) = (1/4) / (1/4 + 1/4) = 1/2
b → y:                P(a | e, f) = (1/2) / (1/2) = 1

Step 4: Collect fractional counts.

tc(x | b) = 1/2
tc(y | b) = 1/2 + 1 = 3/2
tc(x | c) = 1/2
tc(y | c) = 1/2

Page 43:

The EM algorithm

Step 5: Normalize fractional counts to get revised parameter values.

t(x | b) = (1/2) / (3/2 + 1/2) = 1/4
t(y | b) = (3/2) / (3/2 + 1/2) = 3/4
t(x | c) = (1/2) / 1 = 1/2
t(y | c) = (1/2) / 1 = 1/2

Repeat step 2: Compute P(a, f | e) for all alignments.

b c → x y (parallel): P(a, f | e) = 1/4 * 1/2 = 1/8
b c → x y (crossed):  P(a, f | e) = 3/4 * 1/2 = 3/8
b → y:                P(a, f | e) = 3/4

Page 44:

The EM algorithm

Repeat step 3: Normalize P(a, f | e) values to yield P(a | e, f).

b c → x y (parallel): P(a | e, f) = (1/8) / (1/8 + 3/8) = 1/4
b c → x y (crossed):  P(a | e, f) = (3/8) / (1/8 + 3/8) = 3/4
b → y:                P(a | e, f) = 1

Repeat step 4: Collect fractional counts.

tc(x | b) = 1/4
tc(y | b) = 3/4 + 1 = 7/4
tc(x | c) = 3/4
tc(y | c) = 1/4

Page 45:

The EM algorithm

Repeat step 5: Normalize fractional counts to get revised parameter values.

t(x | b) = (1/4) / (1/4 + 7/4) = 1/8
t(y | b) = (7/4) / (1/4 + 7/4) = 7/8
t(x | c) = (3/4) / (3/4 + 1/4) = 3/4
t(y | c) = (1/4) / (3/4 + 1/4) = 1/4

Repeat steps 2-5 many times:

t(x | b) = 0.0001
t(y | b) = 0.9999
t(x | c) = 0.9999
t(y | c) = 0.0001

At each iteration, the EM algorithm improves P(f | e) for the whole corpus.
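The same computation as runnable code: a standard IBM Model 1 E-step/M-step on the toy corpus. Model 1 factorizes over French words instead of enumerating the two alignments above, so later iterations differ slightly in their intermediate values, but the first iteration reproduces the numbers on the slides and the limit is the same (t(y | b) → 1, t(x | c) → 1):

```python
# IBM Model 1 EM on the toy corpus {(b c, x y), (b, y)}.
from collections import defaultdict

corpus = [(["b", "c"], ["x", "y"]), (["b"], ["y"])]   # (e, f) sentence pairs
t = defaultdict(lambda: 0.5)                  # step 1: uniform t[(f, e)]

for _ in range(20):
    count = defaultdict(float)                # expected (fractional) counts
    total = defaultdict(float)
    for e, f in corpus:
        for fw in f:                          # E-step (slides' steps 2-4)
            z = sum(t[(fw, ew)] for ew in e)  # normalize over alignments
            for ew in e:
                c = t[(fw, ew)] / z
                count[(fw, ew)] += c
                total[ew] += c
    for fw, ew in count:                      # M-step (slides' step 5)
        t[(fw, ew)] = count[(fw, ew)] / total[ew]

for fw, ew in [("x", "b"), ("y", "b"), ("x", "c"), ("y", "c")]:
    print(f"t({fw} | {ew}) = {t[(fw, ew)]:.4f}")
```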

Page 46:

EM allows one to make MLEs under adverse circumstances [Pedersen, EMNLP-2001 EM panel]

• MLE (Maximum Likelihood Estimation)
– Parameters describe the characteristics of a population; their values are estimated from samples collected from that population.
– An MLE is a parameter estimate that is most consistent with the sampled data: it maximizes the likelihood of the data, P(X | Θ).

• ΘML = argmax_Θ L(X | Θ)

Page 47:

Trivial example: coin tossing

• 10 trials: h, t, t, t, h, t, t, h, t, t
• One parameter: Θ = P(h)
• The MLE is 3/10.
• Explanation:
– Given 10 tosses, how likely is it to get 3 heads?

L(Θ) = C(10, 3) Θ^3 (1 – Θ)^7

– Take the derivative of log L(Θ), set it to 0, and solve for Θ.
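Filling in the omitted derivative step:

```latex
\log L(\Theta) = \log\binom{10}{3} + 3\log\Theta + 7\log(1-\Theta)
\qquad
\frac{d}{d\Theta}\log L(\Theta) = \frac{3}{\Theta} - \frac{7}{1-\Theta} = 0
\;\Rightarrow\; 3(1-\Theta) = 7\Theta
\;\Rightarrow\; \Theta = \frac{3}{10}
```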

Page 48:

EM: a more complex example

• Most often, for multinomial distributions it is not possible to find the MLE using closed-form formulas.
• E-step: find the expected values of the complete data, given the incomplete data and the current parameter estimates (steps 2 and 3).
• M-step: compute the MLE as usual (steps 4 and 5).

1e: b c
1f: x y
2e: b
2f: y

Parameters: Θ = {t(x | b), t(x | c), t(y | b), t(y | c)}
L(X | Θ) = P(f | e) = Σ_a P(a, f | e)
Maximizing L(X | Θ) has no closed-form solution in this case.