Lecture 2: Bootstrapping
Steven Abney
Presenter: Zhihang Dong; Adv. Natural Lang. Proc.
April 25, 2018


Table of contents

1. Motivation

2. View Independence vs. Rule Independence

3. Assumption Relaxed

4. Greedy Agreement Algorithm and Yarowsky Algorithm


Motivation


Bootstrapping: Research Question

Given a small set of labeled data and a large set of unlabeled data,

the goal of bootstrapping is to induce a classifier. Such a classifier is necessary because of the

1. paucity of labeled data

2. plenitude of unlabeled data

... in NLP


Bootstrapping: Settings

Given the unlabeled natural language data X, our purpose is:

• given "trained" or "labeled" natural language data X′ and their corresponding labels L′,

• to find a function F that produces labels for the unlabeled NL data: F : X → L

• ... such a function maps instances to labels through a space S of classifiers... (a sketch of this loop follows below)
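To make the setting concrete, here is a minimal self-training sketch of this loop in Python. The `train` routine, the confidence threshold, and the round limit are all illustrative assumptions, not constructions from the paper.

from typing import Callable, List, Tuple

Classifier = Callable[[str], Tuple[str, float]]   # x -> (label, confidence)

def bootstrap(labeled: List[Tuple[str, str]],
              unlabeled: List[str],
              train: Callable[[List[Tuple[str, str]]], Classifier],
              threshold: float = 0.95,
              max_rounds: int = 10) -> Classifier:
    """Induce F : X -> L from a few labels plus many unlabeled instances."""
    pool = list(unlabeled)
    for _ in range(max_rounds):
        classify = train(labeled)                 # fit on the current labeled set
        scored = [(x,) + classify(x) for x in pool]
        confident = [(x, y) for x, y, p in scored if p >= threshold]
        if not confident:
            break                                 # nothing reliable left to add
        labeled = labeled + confident             # grow the labeled set
        added = {x for x, _ in confident}
        pool = [x for x in pool if x not in added]
    return train(labeled)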


Bootstrapping: Conditional Independence Assumptions (I)

What is conditional independence?

What are the conditional independence assumptions in the context of bootstrapping algorithms?


Bootstrapping: Conditional Independence Assumptions (II)

We can divide the Yarowsky algorithm into two smaller algorithms:

• A decision list algorithm that identifies other reliable collocations by calculating the probability Pr(Sense | Collocation); this produces a decision list ranked by the log-likelihood ratio:

log( Pr(Sense_A | Collocation_i) / Pr(Sense_B | Collocation_i) )

• it uses only the most reliable piece of evidence, rather than the whole matching collocation set

• Then, a smoothing algorithm to avoid 0 values (a sketch follows below)
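As a concrete sketch of the decision-list step, the helper below ranks collocations by the smoothed log-likelihood ratio above. Add-alpha smoothing is an illustrative stand-in for Yarowsky's actual smoothing procedure.

import math
from collections import Counter
from typing import Dict, List, Tuple

def rank_collocations(examples: List[Tuple[str, str]],
                      alpha: float = 0.1) -> List[Tuple[str, float]]:
    """Rank collocations c by |log( Pr(Sense_A|c) / Pr(Sense_B|c) )|, smoothed."""
    counts: Dict[str, Counter] = {}
    for collocation, sense in examples:           # sense is "A" or "B"
        counts.setdefault(collocation, Counter())[sense] += 1
    ranked = []
    for c, n in counts.items():
        p_a = (n["A"] + alpha) / (n["A"] + n["B"] + 2 * alpha)   # add-alpha smoothing
        ranked.append((c, abs(math.log(p_a / (1.0 - p_a)))))
    ranked.sort(key=lambda cw: cw[1], reverse=True)   # most reliable evidence first
    return ranked

At classification time only the top-ranked matching entry fires, which is what "use only the most reliable piece of evidence" amounts to.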


Bootstrapping: Conditional Independence Assumptions (III)

A related line of work is Blum and Mitchell (1998):

• B&M propose a conditional independence assumption to account for the efficacy of their algorithm, called co-training

• they suggest that the Yarowsky algorithm is a special case of the co-training algorithm

• the author identifies several flaws in B&M's statements, which we will discuss a bit later


View Independence vs. Rule Independence


View Independence (I)

In view independence, each instance x consists of two "views" x1, x2. We can take this as assuming functions X1 and X2 such that X1(x) = x1 and X2(x) = x2.

Definition 1 A pair of views x1, x2 satisfies view independence if:

• Pr[X1 = x1 | X2 = x2, Y = y] = Pr[X1 = x1 | Y = y]

• Pr[X2 = x2 | X1 = x1, Y = y] = Pr[X2 = x2 | Y = y]

• A classification problem instance satisfies view independence if all pairs x1, x2 satisfy view independence (a toy check follows below)
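A toy numeric check of Definition 1: we build a small joint distribution that factorizes as Pr[y]·Pr[x1|y]·Pr[x2|y], so view independence holds by construction, and verify the equations. The distribution values are made up for illustration.

from itertools import product

p_y = {"+": 0.6, "-": 0.4}
p_x1 = {"+": {"a": 0.9, "b": 0.1}, "-": {"a": 0.2, "b": 0.8}}
p_x2 = {"+": {"u": 0.7, "v": 0.3}, "-": {"u": 0.5, "v": 0.5}}
joint = {(x1, x2, y): p_y[y] * p_x1[y][x1] * p_x2[y][x2]
         for x1, x2, y in product("ab", "uv", "+-")}

def pr(pred):
    """Probability of an event over the joint table."""
    return sum(p for k, p in joint.items() if pred(*k))

for x1, x2, y in product("ab", "uv", "+-"):
    # Pr[X1 = x1 | X2 = x2, Y = y] should equal Pr[X1 = x1 | Y = y]
    lhs = pr(lambda a, b, c: a == x1 and b == x2 and c == y) / pr(lambda a, b, c: b == x2 and c == y)
    rhs = pr(lambda a, b, c: a == x1 and c == y) / pr(lambda a, b, c: c == y)
    assert abs(lhs - rhs) < 1e-12
print("view independence holds for this toy distribution")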


View Independence (II)

The first definition can be generalized as follows:

Definition 2 A pair of rules F ∈ H1, G ∈ H2 satisfies rule independence just in case, for all u, v, y:

• Pr[F = u | G = v, Y = y] = Pr[F = u | Y = y]

• A classification problem instance satisfies rule independence just in case all opposing-view rule pairs satisfy rule independence

• assuming a set of features F : {H1, H2}, rule independence reduces to the Naive Bayes independence assumption


View Independence (III)

Theorem 1:

View independence implies rule independence


Agreement Rates (i): Contrary Argument

Blum and Mitchell's paper suggests that rules that agree on unlabeled instances are useful in bootstrapping:

Definition 3 The agreement rate between rules F and G is:

Pr[F = G | F, G ≠ ⊥]

• ... yet their co-training algorithm does not explicitly search for rules with good agreement

• ... and agreement does not play any direct role in the learnability proof given in the B&M paper

• if view independence is satisfied, then the agreement rate between opposing-view rules F and G upper bounds the error of F and G (a helper to estimate it follows below)
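Definition 3 is easy to estimate from unlabeled data alone. A small helper, assuming rules return None to abstain (⊥); the representation is an illustrative choice.

from typing import Callable, Iterable, Optional

Rule = Callable[[str], Optional[str]]   # returns a label, or None for ⊥

def agreement_rate(f: Rule, g: Rule, xs: Iterable[str]) -> float:
    """Estimate Pr[F = G | F, G ≠ ⊥] on unlabeled instances xs."""
    pairs = [(f(x), g(x)) for x in xs]
    pairs = [(u, v) for u, v in pairs if u is not None and v is not None]
    if not pairs:
        return float("nan")             # no instance labeled by both rules
    return sum(u == v for u, v in pairs) / len(pairs)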


Agreement Rates (ii): Minimizer Inequality

Theorem 2

For all F ∈ H1, G ∈ H2 that satisfy rule independence and are nontrivial predictors in the sense that min_u Pr[F = u] > Pr[F ≠ G], one of the following inequalities holds:

• Pr[F ≠ Y] ≤ Pr[F ≠ G]

• Pr[F̂ ≠ Y] ≤ Pr[F ≠ G]

• If F agrees with G on all but ε unlabeled instances, then either F or F̂ predicts Y with error no greater than ε (a quick simulation follows below)
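A quick Monte Carlo sanity check of Theorem 2, under an artificial generative model in which F and G are conditionally independent given Y; the accuracies 0.8 and 0.7 are arbitrary illustrative choices.

import random
random.seed(0)

def flip(label: str) -> str:
    return "-" if label == "+" else "+"

def sample(n: int = 200_000, p_pos: float = 0.5,
           acc_f: float = 0.8, acc_g: float = 0.7):
    err_fy = err_fg = 0
    for _ in range(n):
        y = "+" if random.random() < p_pos else "-"
        f = y if random.random() < acc_f else flip(y)   # F, G independent given Y
        g = y if random.random() < acc_g else flip(y)
        err_fy += f != y
        err_fg += f != g
    return err_fy / n, err_fg / n

pr_f_neq_y, pr_f_neq_g = sample()
# Analytically: Pr[F ≠ G] = 0.8·0.3 + 0.2·0.7 = 0.38 and Pr[F ≠ Y] = 0.2,
# with min_u Pr[F = u] = 0.5 > 0.38, so the first inequality should hold.
print(f"Pr[F != Y] = {pr_f_neq_y:.3f} <= Pr[F != G] = {pr_f_neq_g:.3f}")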


Agreement Rate (iii): Theorem 2 w.r.t. Upper Bound

[figure]


Rule Independence: Problems

Rule independence is a very strong assumption:

• Let F and G be arbitrary rules based on independent views

• The precision of a rule F is defined to be Pr[Y = +|F = +]

• If rule independence holds, then given unlabeled data, knowledge of the size of the target concept, and the precision of one rule, we can compute the precision of every other rule

• this is easily proven given the knowledge from STAT 001... (Probability, lesson 1)


Empirical Exam: Data

The task is to classify names in text as person, location, or organization. There is an unlabeled training set containing 89,305 instances, and a labeled test set containing 289 persons, 186 locations, 402 organizations, and 123 "other", for a total of 1,000 instances.


Empirical Exam: Results

[figure]


Assumption Relaxed


How to relax the assumption

• There are two ways in which the data can diverge from conditional independence: the rules may be either positively or negatively correlated, given the class value

• If the rules are negatively correlated, then their disagreement (shaded in figure 3) is larger than if they were conditionally independent

• ... but the data is positively correlated


Positive Correlation vs. Negative Correlation

[figure]


A Weaker Assumption

Let us quantify the amount of deviation from conditional independence. We define the conditional dependence of F and G given Y = y to be

d_y = (1/2) ∑_{u,v} | Pr[G = v | Y = y, F = u] − Pr[G = v | Y = y] |

If F and G are conditionally independent, then d_y = 0.

Definition 4 Rules F and G satisfy weak rule dependence if, for y ∈ {+, −},

d_y ≤ (p_2 q_1 − p_1) / (2 p_1 q_1)
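The conditional dependence d_y can be computed directly from the definition. A sketch, assuming the joint conditional table Pr[F = u, G = v | Y = y] is given as a nested dict (an illustrative representation).

from typing import Dict

def conditional_dependence(p_fg_given_y: Dict[str, Dict[str, float]]) -> float:
    """d_y = (1/2) · Σ_{u,v} |Pr[G=v | Y=y, F=u] − Pr[G=v | Y=y]|."""
    p_f = {u: sum(row.values()) for u, row in p_fg_given_y.items()}   # Pr[F=u | Y=y]
    vs = {v for row in p_fg_given_y.values() for v in row}
    p_g = {v: sum(row.get(v, 0.0) for row in p_fg_given_y.values()) for v in vs}
    return 0.5 * sum(abs(row.get(v, 0.0) / p_f[u] - p_g[v])
                     for u, row in p_fg_given_y.items() for v in vs)

# A table that factorizes (0.8/0.2 for F times 0.9/0.1 for G) gives d_y = 0:
table = {"+": {"+": 0.72, "-": 0.08}, "-": {"+": 0.18, "-": 0.02}}
print(conditional_dependence(table))   # ~0.0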


Let’s rewrite Theorem 2!

Theorem 3

For all F ∈ H1, G ∈ H2 that satisfy weak rule dependence and are nontrivial predictors in the sense that min_u Pr[F = u] > Pr[F ≠ G], one of the following inequalities holds:

• Pr[F ≠ Y] ≤ Pr[F ≠ G]

• Pr[F̂ ≠ Y] ≤ Pr[F ≠ G]


A New Bound!

[figure]


Greedy Agreement Algorithm and Yarowsky Algorithm


Greedy Agreement Algorithm (i): Specifics

The algorithm can be written as:

Input: seed rules F, G
loop:
    for each atomic rule H:
        G' = G + H
        evaluate cost of (F, G')
    keep lowest-cost G'
    if G' is worse than G: quit
    swap F, G'
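A minimal Python rendering of the same loop, assuming a cost function cost(F, G, xs) that scores a classifier pair on unlabeled data xs (e.g., via the agreement-based bound of Theorem 4 below) and rules represented as lists of atomic rules; illustrative only.

def greedy_agreement(seed_f, seed_g, atomic_rules, cost, xs):
    """Alternately extend one rule of the pair with the lowest-cost atomic rule."""
    F, G = [seed_f], [seed_g]
    while True:
        candidates = [(cost(F, G + [h], xs), G + [h]) for h in atomic_rules]
        best_cost, best_g = min(candidates, key=lambda c: c[0])
        if best_cost >= cost(F, G, xs):   # no extension improves the pair: quit
            return F, G
        F, G = best_g, F                  # keep lowest-cost G', then swap F and G'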


GAA (ii): Another Generalization

In words, the algorithm:

• begins with two seed rules, one for each view

• at each iteration, considers and scores each possible extension to one of the rules

• keeps the best one, and attention shifts to the other rule

The cost of a classifier pair (F, G) provides the ground for a generalization of Theorem 2.


GAA (iii): Theorem

Theorem 4: If view independence is satisfied, and if F and G are rules based on different views, then one of the following holds:

• Pr[F ≠ Y | F ≠ ⊥] ≤ δ/(µ − δ)

• Pr[F̂ ≠ Y | F ≠ ⊥] ≤ δ/(µ − δ)

• where δ = Pr[F ≠ G | F, G ≠ ⊥] and µ = min_u Pr[F = u | F ≠ ⊥]

• In other words, for a given binary rule F, a pessimistic estimate of the number of errors made by F is δ/(µ − δ) times the number of instances labeled by F, plus the number of instances left unlabeled by F (see the sketch below)
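The last bullet translates directly into code; δ and µ are measured on unlabeled data as defined above.

def pessimistic_errors(n_labeled_by_f: int, n_unlabeled_by_f: int,
                       delta: float, mu: float) -> float:
    """Pessimistic error estimate for F: δ/(µ − δ) · #labeled + #unlabeled."""
    assert mu > delta, "the bound is vacuous unless µ > δ"
    return delta / (mu - delta) * n_labeled_by_f + n_unlabeled_by_f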


Greedy Agreement Algorithm (iv): Viz

[figure]


Yarowsky Algorithm (i): Justification

Yarowsky did not give a justification, but let’s give one...

Let F represent an atomic rule under consideration, and G represent the current classifier. Recall that Yℓ is the set of instances whose true label is ℓ, and Gℓ is the set of instances {x : G(x) = ℓ}. We write G∗ for the set of instances labeled by the current classifier, that is, {x : G(x) ≠ ⊥}.

So let's provide the grounds for the precision independence and balanced errors assumptions... (a sketch of the retraining loop follows below)
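For orientation, a minimal sketch of a Yarowsky-style retraining loop with threshold θ. Here fit_decision_list is a hypothetical trainer returning a classifier that maps x to a (label, score) pair or None; the convergence test and round cap are illustrative simplifications, not the exact algorithm analyzed below.

def yarowsky(seeds, unlabeled, fit_decision_list, theta=0.95, rounds=10):
    """Retrain, relabel everything scoring at least theta, and repeat."""
    labeled = dict(seeds)                  # x -> label
    for _ in range(rounds):
        g = fit_decision_list(list(labeled.items()))
        new = dict(seeds)                  # seed labels are never discarded
        for x in unlabeled:
            pred = g(x)                    # (label, score) or None
            if pred is not None and pred[1] >= theta:
                new[x] = pred[0]
        if new == labeled:
            break                          # labeling has stabilized
        labeled = new
    return fit_decision_list(list(labeled.items()))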


Yarowsky Algorithm (ii): Assumptions

Definition 5

Candidate rule Fℓ and classifier G satisfy precision independence just in case

P(Yℓ | Fℓ, G∗) = P(Yℓ | Fℓ)

A bootstrapping problem instance satisfies precision independence just in case all rules G and all atomic rules Fℓ that nontrivially overlap with G (both Fℓ ∩ G∗ and Fℓ − G∗ are nonempty) satisfy precision independence.

In fact, it is only "half" an independence assumption: for precision independence, it is not necessary that the conditional on G∗ be equal.

The second assumption is that classifiers make balanced errors. That is:

P(Yℓ, Gℓ̂ | Fℓ) = P(Yℓ̂, Gℓ | Fℓ)


Yarowsky Algorithm (iii): Assumptions

Theorem 5 If the assumptions of precision independence and balanced errors are satisfied, then the Yarowsky algorithm with threshold θ obtains a final classifier whose precision is at least µ. Moreover, recall is bounded below by Ntθ/Nℓ, a quantity which increases at each round.


Yarowsky Algorithm (iv)

[figure]


Questions?


References i