
1

Computational Learning Theory

How many training examples are sufficient to successfully learn the target function?

How much computational effort is needed to converge to a successful hypothesis?

How many mistakes will the learner make before succeeding?

2

Computational Learning Theory

What is “success”? Inductive learning:

1. Learner poses queries to teacher
2. Teacher chooses examples
3. Randomly generated instances, labeled by teacher

Probably approximately correct (PAC) learning
Vapnik-Chervonenkis (VC) Dimension
Mistake bounds

3

Computational Learning Theory

What general laws constrain inductive learning? We seek theory to relate:

1. Probability of successful learning
2. Number of training examples
3. Complexity of hypothesis space
4. Accuracy to which target concept is approximated
5. Manner in which training examples are presented

4

Prototypical Concept Learning Task

Given:
  Instances X: possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
  Target function c: EnjoySport : X → {0,1}
  Hypotheses H: conjunctions of literals, e.g. <?, Cold, High, ?, ?, ?>
  Training examples D: positive and negative examples of the target function <x1,c(x1)>, …, <xm,c(xm)>

Determine:
1. A hypothesis h in H such that h(x) = c(x) for all x in D?
2. A hypothesis h in H such that h(x) = c(x) for all x in X?

5

Sample Complexity

How many training examples are sufficient to learn the target concept?

1. If the learner proposes instances as queries to the teacher: learner proposes instance x, teacher provides c(x)

2. If the teacher (who knows c) provides training examples: teacher provides a sequence of examples of the form <x, c(x)>

3. If some random process (e.g., nature) proposes instances: instance x is generated randomly, teacher provides c(x)

6

Sample Complexity: Case 1

Learner provides instance x, teacher provides c(x)

Optimal query strategy: pick instance x such that half of the hypotheses in VS classify x positive, half classify x negative.

How many queries are needed to learn c?

log2 |H|

7

Sample Complexity: Case 2

Teacher provides training examples

Optimal teaching strategy: depends on H used by learner

Consider the case H = conjunctions of up to n boolean literals and their negations, e.g., (AirTemp = Warm) ∧ (Wind = Strong), where AirTemp, Wind, … each have 2 possible values.

How many examples are needed to learn c? Use Find-S.

8

Find-S Algorithm

1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x:
     For each attribute constraint ai in h:
       If the constraint ai in h is satisfied by x, then do nothing;
       else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
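A minimal Python sketch of Find-S for conjunctive hypotheses over discrete attributes. The representation below (a list with None for "no value allowed" and "?" for "any value") and the example data are illustrative choices, not something specified on the slide:

```python
def find_s(examples, n_attrs):
    """examples: list of (x, label) pairs, x a tuple of attribute values, label 0/1."""
    h = [None] * n_attrs                      # most specific hypothesis: accepts nothing
    for x, label in examples:
        if label != 1:
            continue                          # Find-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] is None:                  # first positive example: adopt its values
                h[i] = value
            elif h[i] != "?" and h[i] != value:
                h[i] = "?"                    # generalize: drop the constraint on attribute i
    return h

# Illustrative run on EnjoySport-style data (attribute values are made up):
data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), 1),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), 1),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), 0),
]
print(find_s(data, 6))   # -> ['Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same']
```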

9

Sample Complexity

Example:

Attributes A, B, C, D, E; target concept (A=0) ∧ (C=1)

+ : 01111   Find-S: (A=0) ∧ (B=1) ∧ (C=1) ∧ (D=1) ∧ (E=1)
+ : 00111   Find-S generalizes to (A=0) ∧ (C=1) ∧ (D=1) ∧ (E=1)
+ : 00101   Find-S generalizes to (A=0) ∧ (C=1) ∧ (E=1)
+ : 00100   Find-S generalizes to (A=0) ∧ (C=1)
− : 10100   Find-S does not generalize to (C=1)
− : 00000   Find-S does not generalize to (A=0)

How many examples are needed to learn c? n+1

10

Sample Complexity: Case 3

Given:
  set of instances X
  set of hypotheses H
  set of possible target concepts C
  training instances generated by a fixed, unknown probability distribution D over X

Learner observes a sequence D of training examples of the form <x, c(x)>, for some target concept c ∈ C:
  instances x are drawn from distribution D
  teacher provides target value c(x) for each

Learner must output a hypothesis h estimating c.
Note: probabilistic instances, noise-free classifications.

11

True Error of a Hypothesis

Definition: The true error errorD(h) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.
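In symbols, with x drawn at random according to D, the definition above reads:

  errorD(h) ≡ Pr x~D [ c(x) ≠ h(x) ]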

12

Two Notions of Error

Training error of hypothesis h with respect to target concept c: how often h(x) ≠ c(x) over the training instances

True error of hypothesis h with respect to c: how often h(x) ≠ c(x) over future random instances

Our concern: can we bound the true error of h given the training error of h?
First consider the case when the training error of h is zero (i.e., h ∈ VSH,D).

13

Exhausting the Version Space

Definition: The version space VSH,D is said to be ε-exhausted with respect to c and D, if every hypothesis h in VSH,D has error less than ε with respect to c and D.

(∀h ∈ VSH,D ) errorD(h) < ε

14

How many examples will ε-exhaust the VS?

Theorem [Haussler, 1988]: If H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that VSH,D is not ε-exhausted (with respect to c) is ≤ |H| e^(-εm).

This bounds the probability that any consistent learner will output a hypothesis h with error(h) ≥ ε.

We want this probability to be below δ:

  |H| e^(-εm) ≤ δ    (7.1)

15

Solving (7.1) for m:

  |H| e^(-εm) ≤ δ
  ln|H| − εm ≤ ln δ
  m ≥ (1/ε)(ln|H| + ln(1/δ))    (7.2)
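A small Python helper (the function name is illustrative) that evaluates the bound (7.2) for a finite hypothesis space:

```python
from math import ceil, log

def sample_bound(h_size, eps, delta):
    """Smallest integer m with m >= (1/eps) * (ln|H| + ln(1/delta)), i.e. Eq. (7.2)."""
    return ceil((log(h_size) + log(1.0 / delta)) / eps)

# e.g., conjunctions of up to 10 boolean literals (|H| = 3**10), eps = 0.1, delta = 0.05:
print(sample_bound(3 ** 10, 0.1, 0.05))   # -> 140
```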

16

Learning Conjunctions of Boolean Literals

How many examples are sufficient to assure with probability at least (1 - δ) that every h in VSH,D satisfies errorD(h) ≤ ε ?

Use our theorem (Eq. 7.2):

  m ≥ (1/ε)(ln|H| + ln(1/δ))

Suppose H contains conjunctions of constraints on up to n boolean attributes (i.e., n boolean literals). Then |H| = 3^n, and

  m ≥ (1/ε)(ln 3^n + ln(1/δ)),   or   m ≥ (1/ε)(n ln 3 + ln(1/δ))

17

Example: Enjoy Sport

If H is as given in EnjoySport, then |H| = 973, and

  m ≥ (1/ε)(ln 973 + ln(1/δ))

If we want to assure with probability 95% that the VS contains only hypotheses with errorD(h) ≤ 0.1 (i.e., δ = 0.05, ε = 0.1):

  m ≥ (1/0.1)(ln 973 + ln(1/0.05)) ≈ 98.8, so 99 examples suffice.
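The arithmetic above can be checked in a couple of lines of Python:

```python
from math import ceil, log

# EnjoySport: |H| = 973, delta = 0.05 (95% confidence), eps = 0.1
m = (log(973) + log(1 / 0.05)) / 0.1
print(round(m, 1), ceil(m))   # -> 98.8 99
```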

18

Probably Approximately Correct (PAC) Learning

Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H.

Definition: C is PAC-learnable by L using H if for all c ∈ C, distributions D over X, ε such that 0 ≤ ε ≤ 1/2, and δ such that 0 ≤ δ ≤ 1/2, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).

19

Agnostic Learning

So far, assumed c ∈ H

Agnostic learning setting: don't assume c ∈ H

What do we want then?

The hypothesis h that makes fewest errors on training data

What is sample complexity in this case?

20

Agnostic Learning

Hoeffding bounds: if the training error of h is measured over a set D containing m randomly drawn examples, then

  Pr[ (true error of h) > (training error of h) + ε ] ≤ e^(-2mε²)

Consider the probability that any one of the |H| hypotheses could have a large error:

  Pr[ (∃h ∈ H) (true error of h) > (training error of h) + ε ] ≤ |H| e^(-2mε²)

Requiring this probability to be at most δ and solving for m:

  m ≥ (1/(2ε²)) (ln|H| + ln(1/δ))    (7.3)
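A companion helper (again, the name is illustrative) for the agnostic bound (7.3):

```python
from math import ceil, log

def agnostic_sample_bound(h_size, eps, delta):
    """Smallest m with m >= (1 / (2*eps**2)) * (ln|H| + ln(1/delta)), i.e. Eq. (7.3)."""
    return ceil((log(h_size) + log(1.0 / delta)) / (2.0 * eps ** 2))

# Same numbers as the EnjoySport example (|H| = 973, eps = 0.1, delta = 0.05);
# the agnostic bound is larger by a factor of about 1/(2*eps):
print(agnostic_sample_bound(973, 0.1, 0.05))   # -> 494
```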

21

Shattering a Set of Instances

Definition: a dichotomy of a set S is a partition of S into two disjoint subsets.

Definition: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.
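For a finite pool of candidate hypotheses and a small instance set, the definition can be checked by brute force. A minimal Python sketch (hypotheses are represented as boolean predicates; the function name and the interval example are illustrative):

```python
from itertools import chain, combinations

def shatters(hypotheses, instances):
    """True iff every dichotomy of `instances` (every choice of a positive subset)
    is realized by some hypothesis in `hypotheses`."""
    instances = list(instances)
    dichotomies = chain.from_iterable(
        combinations(instances, r) for r in range(len(instances) + 1))
    for positive in dichotomies:
        positive = set(positive)
        if not any(all(h(x) == (x in positive) for x in instances)
                   for h in hypotheses):
            return False
    return True

# Interval hypotheses a < x < b on the real line (cf. the VC-dimension example below):
intervals = [(1, 2), (1, 4), (4, 7), (1, 7)]
H = [lambda x, a=a, b=b: a < x < b for a, b in intervals]
print(shatters(H, [3.1, 5.7]))        # True: these four intervals realize all 4 dichotomies
print(shatters(H, [3.1, 5.7, 6.5]))   # False for this particular set of four hypotheses
```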

22

Three Instances Shattered

[Figure: instance space X with three instances shattered by hypotheses in H]

23

The Vapnik-Chervonenkis Dimension

Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞.

24

VC-Dim of a Set of Real Numbers

X = ℝ, and each h ∈ H has the form a < x < b

VC(H) = ?

Consider S = {3.1, 5.7}. The 4 hypotheses (1 < x < 2), (1 < x < 4), (4 < x < 7), and (1 < x < 7) shatter S ⇒ VC(H) ≥ 2

No set S = {x0, x1, x2} can be shattered ⇒ VC(H) < 3 ⇒ VC(H) = 2

25

VC-Dim of Linear Decision Surfaces

X = instances corresponding to points (x, y)
H = set of linear decision surfaces
VC(H) = ?

2 points → can be shattered
3 points → can be shattered
4 points → cannot be shattered

⇒ VC(H) = 3

In general, for a k-dimensional space, VC(H) = k + 1

26

Sample Complexity from VC-Dimension

How many randomly drawn examples suffice to ε-exhaust VSH,D with probability at least (1 − δ)? [Blumer et al. 1989]

  m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))    (7.7)

Lower bound on sample complexity [Ehrenfeucht et al. 1989]: Consider any concept class C such that VC(C) ≥ 2, any learner L, and any 0 < ε < 1/8 and 0 < δ < 1/100. There exists a distribution D* and a target concept in C such that if L observes fewer examples than

  max[ (1/ε) log(1/δ),  (VC(C) − 1)/(32ε) ]

then with probability ≥ δ, L outputs a hypothesis h having errorD*(h) > ε.
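A small helper evaluating the upper bound (7.7) as stated above (function name and example values are illustrative):

```python
from math import ceil, log2

def vc_sample_bound(vc_dim, eps, delta):
    """Smallest m with m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return ceil((4 * log2(2 / delta) + 8 * vc_dim * log2(13 / eps)) / eps)

# e.g., linear decision surfaces in the plane (VC(H) = 3), eps = 0.1, delta = 0.05:
print(vc_sample_bound(3, 0.1, 0.05))   # -> 1899
```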

27

Mistake Bounds

So far: how many examples are needed to learn?
What about: how many mistakes will be made before convergence?

Let's consider a setting similar to PAC learning:
  Instances are drawn at random from X according to distribution D
  Learner must classify each instance before receiving the correct classification from the teacher
  Can we bound the number of mistakes the learner makes before converging?

28

Mistake Bounds: Find-S

Consider Find-S when H = conjunctions of n boolean literals

  Initialize h to the most specific hypothesis:
    l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ … ∧ ln ∧ ¬ln
  For each positive training instance x:
    remove from h any literal that is not satisfied by x
  Output hypothesis h

29

Mistake Bounds: Find-S

How many mistakes before converging to the correct h?

  Find-S can never mistakenly classify a negative instance as positive.
  1st positive instance: n out of the 2n literals are eliminated.
  Every subsequent positive example on which h makes a mistake: ≥ 1 of the remaining n literals must be eliminated from the hypothesis.

⇒ The total number of mistakes ≤ n+1 (when (∀x) c(x) = 1).

30

Mistake Bounds: Halving Algorithm

Consider the Halving Algorithm:
  Learn the concept using the version-space Candidate-Elimination algorithm
  Classify new instances by majority vote of version space members

How many mistakes before converging to the correct h?

… in the worst case:
  A mistake occurs when the majority of hypotheses in the current VS incorrectly classifies the new example.
  The VS is then reduced to at most half its size.
  ⇒ #mistakes ≤ log2 |H|

… in the best case: none
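A minimal Python sketch of the halving idea for a finite hypothesis space, keeping the version space as an explicit list of predicates (a brute-force stand-in for Candidate-Elimination; the function name and the threshold hypotheses in the usage example are illustrative):

```python
def halving_learn(hypotheses, examples):
    """Maintain the version space explicitly and predict by majority vote.
    Returns the surviving version space and the number of mistakes made."""
    vs = list(hypotheses)                       # current version space
    mistakes = 0
    for x, label in examples:
        positive_votes = sum(1 for h in vs if h(x))
        prediction = 1 if positive_votes > len(vs) - positive_votes else 0
        if prediction != label:
            mistakes += 1                       # the wrong majority is removed below,
                                                # so |VS| at least halves on a mistake
        vs = [h for h in vs if h(x) == bool(label)]   # keep only consistent hypotheses
    return vs, mistakes

# Usage with simple threshold hypotheses h_t(x) = (x > t):
H = [lambda x, t=t: x > t for t in range(10)]
vs, m = halving_learn(H, [(2, 0), (7, 1), (5, 1), (3, 0)])
print(len(vs), m)   # -> 2 0
```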

31

Candidate Elimination Algorithm

G ← maximally general hypotheses in H
S ← maximally specific hypotheses in H
For each training example d = <x, c(x)>:

If d is a positive example:
  Remove from G any hypothesis that is inconsistent with d
  For each hypothesis s in S that is not consistent with d:
    Remove s from S
    Add to S all minimal generalizations h of s such that
      h is consistent with d, and
      some member of G is more general than h
  Remove from S any hypothesis that is more general than another hypothesis in S

32

Candidate Elimination Algorithm

If d is a negative example:
  Remove from S any hypothesis that is inconsistent with d
  For each hypothesis g in G that is not consistent with d:
    Remove g from G
    Add to G all minimal specializations h of g such that
      h is consistent with d, and
      some member of S is more specific than h
  Remove from G any hypothesis that is less general than another hypothesis in G

33

Optimal Mistake Bounds

Let MA(C) be the maximum #mistakes made by algorithm A to learn concepts in C (the maximum over all possible c ∈ C and all possible training sequences).

Definition: Let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of MA(C).

34

Weighted Majority Algorithm

Predicts by taking a weighted vote among a pool of prediction algorithms A, and learns by altering the weight associated with each prediction algorithm.

Able to accommodate inconsistent training data.

The #mistakes of WEIGHTED-MAJORITY can be bounded by the #mistakes committed by the best algorithm in A.

35

Weighted Majority Algorithm

Pooling of multiple learning algorithms: ai is the i-th prediction algorithm in A, wi its weight.

Initialize wi = 1
For each training example <x, c(x)>:
  Initialize q0 and q1 to 0
  For each algorithm ai:
    If ai(x) = 0 then q0 = q0 + wi
    If ai(x) = 1 then q1 = q1 + wi
  If q1 > q0 predict c(x) = 1, else predict c(x) = 0
  For each algorithm ai in A:
    If ai(x) ≠ c(x) then wi = wi · β
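A minimal Python sketch of the procedure above (predictors are 0/1-valued callables; the toy predictors in the usage example are illustrative):

```python
def weighted_majority(predictors, examples, beta=0.5):
    """Weighted vote over a pool of predictors; multiply by beta the weight of every
    predictor that errs on an example. Returns final weights and #mistakes made."""
    weights = [1.0] * len(predictors)
    mistakes = 0
    for x, label in examples:
        q0 = sum(w for a, w in zip(predictors, weights) if a(x) == 0)
        q1 = sum(w for a, w in zip(predictors, weights) if a(x) == 1)
        prediction = 1 if q1 > q0 else 0
        if prediction != label:
            mistakes += 1
        weights = [w * beta if a(x) != label else w
                   for a, w in zip(predictors, weights)]
    return weights, mistakes

# Usage: a pool of three toy predictors; the parity predictor happens to be perfect here.
pool = [lambda x: 1, lambda x: 0, lambda x: x % 2]
w, m = weighted_majority(pool, [(1, 1), (2, 0), (3, 1), (4, 0)])
print(w, m)   # -> [0.25, 0.25, 1.0] 0
```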

36

Mistake Bound Weighted Majority Alg.

Let:
  D - any sequence of training examples
  A - any set of n prediction algorithms
  k - the minimum #mistakes made by any algorithm in A for the training sequence D

Then:

The #mistakes over D made by the Weighted-Majority algorithm using β = 1/2 is ≤ 2.4 (k + log2 n)

37

Mistake Bound Weighted Majority Alg.

Let aj = the algorithm in A with the optimal #mistakes k. ⇒ Its final weight is wj = (1/2)^k.

Consider W = Σi wi, the sum over all n algorithms in A. W is initially n.
For each mistake, W is reduced to at most (3/4)W, because the algorithms voting with the (incorrect) weighted majority must hold at least W/2 of the weight, and that portion of W is reduced by half.
Let M = the total #mistakes committed by Weighted-Majority for training sequence D; then the total final weight is ≤ n(3/4)^M.
The final weight wj of the optimal algorithm cannot be greater than the total weight:

  (1/2)^k ≤ n (3/4)^M

Taking logarithms and solving for M:

  M ≤ (k + log2 n) / (− log2(3/4)) ≤ 2.4 (k + log2 n)

38

Summary

Training error
Sample error
True error errorD(h)
Sample complexity
Mistake bound
Probably approximately correct (PAC) learning
Vapnik-Chervonenkis dimension
Weighted Majority Algorithm