Computational Learning Theory: An Analysis of a Conjunction Learner
Machine Learning, Spring 2018
The slides are mainly from Vivek Srikumar
This section
1. Analyze a simple algorithm for learning conjunctions
2. Define the PAC model of learning
3. Make formal connections to the principle of Occam’s razor
Learning Conjunctions

Training data:
– <(1,1,1,1,1,1,…,1,1), 1>
– <(1,1,1,0,0,0,…,0,0), 0>
– <(1,1,1,1,1,0,...0,1,1), 1>
– <(1,0,1,1,1,0,...0,1,1), 0>
– <(1,1,1,1,1,0,...0,0,1), 1>
– <(1,0,1,0,0,0,...0,1,1), 0>
– <(1,1,1,1,1,1,…,0,1), 1>
– <(0,1,0,1,0,0,...0,1,1), 0>

The true function: f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

A simple learning algorithm (Elimination):
• Discard all negative examples
• Build a conjunction using the features that are common to all positive examples

Positive examples eliminate irrelevant features. On this data, the algorithm outputs

h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

Clearly this algorithm produces a conjunction that is consistent with the data, that is, errS(h) = 0, if the target function is a monotone conjunction. Exercise: Why?
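Here is a minimal sketch of the Elimination algorithm, assuming examples are 0/1 tuples with 0/1 labels and a hypothesis is represented as the set of feature indices in the conjunction (the representation and the tiny data set are illustrative, not from the slides):

```python
def eliminate(examples):
    """Keep only the features that are 1 in every positive example."""
    positives = [x for x, y in examples if y == 1]   # discard all negative examples
    n = len(examples[0][0])
    h = set(range(n))                                # start from the full conjunction
    for x in positives:
        h &= {i for i in range(n) if x[i] == 1}      # positive examples eliminate features
    return h

def predict(h, x):
    """The conjunction fires iff every feature in h is on."""
    return int(all(x[i] == 1 for i in h))

# Toy run with n = 5 and target f = x2 AND x4 (0-indexed features 1 and 3)
data = [((1, 1, 1, 1, 1), 1), ((0, 1, 0, 1, 0), 1), ((1, 0, 1, 1, 1), 0)]
h = eliminate(data)
assert all(predict(h, x) == y for x, y in data)      # errS(h) = 0 on the training set
print(sorted(h))                                     # [1, 3]
```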
Learning Conjunctions: Analysis

Claim 1: h will only make mistakes on future positive examples. Why?

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

A mistake will occur only if some feature z (in our example, x1) is present in h but not in f. This mistake can cause a positive example to be predicted as negative by h. Specifically: x1 = 0, x2 = 1, x3 = 1, x4 = 1, x5 = 1, x100 = 1.

The reverse situation can never happen. For an example to be predicted as positive by h, every feature in h must be present; and since every relevant feature of f was present in all positive training examples, h contains all of f's features, so f must label that example positive too.

[Figure: the set of examples h labels positive sits inside the set f labels positive; mistakes can only occur in the region between the two.]
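To see Claim 1 concretely, here is a brute-force check over a hypothetical small instance (n = 6, with a target and hypothesis chosen purely for illustration): whenever h's feature set contains f's, h never fires on an example that f labels negative.

```python
from itertools import product

f_feats = {1, 2, 3}      # target conjunction (0-indexed feature positions)
h_feats = {0, 1, 2, 3}   # learned hypothesis: adds one irrelevant feature

def conj(feats, x):
    return int(all(x[i] == 1 for i in feats))

for x in product((0, 1), repeat=6):
    if conj(h_feats, x):          # if h says positive ...
        assert conj(f_feats, x)   # ... f says positive too: no false positives
print("h only errs on positive examples")
```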
Learning Conjunctions: Analysis

Theorem: Suppose we are learning a conjunctive concept with n-dimensional Boolean features using m training examples. If

m > (n/ε)(ln(n) + ln(1/δ))

then, with probability > 1 − δ, the error of the learned hypothesis errD(h) will be less than ε.

If we see this many training examples, then the algorithm will produce a conjunction that, with high probability, will make few errors. Note that m is polynomial in n, 1/ε, and 1/δ.

Let's prove this assertion.
Proof Intuition

What kinds of examples would drive the hypothesis to make a mistake?

Positive examples where x1 is absent: f would say true and h would say false.

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

None of these examples appeared during training; otherwise x1 would have been eliminated. If they never appeared during training, maybe their appearance in the future would also be rare!

Let's quantify our surprise at seeing such examples.
Learning Conjunctions: Analysis

Let p(z) be the probability that, in an example drawn from D, the feature z is 0 but the example has a positive label.
• That is, after training is done, p(z) is the probability that, in a randomly drawn example, the feature z causes a mistake.
• For any z in the target function, p(z) = 0.

Remember that there will only be mistakes on positive examples for this toy problem. For instance, with

f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

and the example <(0,1,1,1,1,0,...0,1,1), 1>, p(x1) is the probability that this situation occurs.

We know that

errD(h) ≤ Σ_{z ∈ h} p(z)

via direct application of the union bound: h errs exactly when at least one feature z in h is 0 on a positive example.

Union bound: for a set of events, the probability that at least one of them happens is at most the sum of the probabilities of the individual events.
Learning Conjunctions: Analysis
• Call a feature z bad if• Intuitively, a bad feature is one that has a significant probability
of not appearing with a positive example– (And, if it appears in all positive training examples, it can cause errors)
If there are no bad features, then errD(h) < !– Why? Because
28
n = dimensionality
Let us try to see when this will not happen
![Page 29: Supervised Learning: The Setup Computational Learning ...zhe/pdf/lec-12-pac-conjunctions-upload.pdf1 Computational Learning Theory: An Analysis of a Conjunction Learner Machine Learning](https://reader033.fdocuments.net/reader033/viewer/2022051905/5ff719d68c26112b767906aa/html5/thumbnails/29.jpg)
Learning Conjunctions: Analysis
• Call a feature z bad if• Intuitively, a bad feature is one that has a significant probability
of not appearing with a positive example– (And, if it appears in all positive training examples, it can cause errors)
If there are no bad features, then errD(h) < !– Why? Because
29
n = dimensionality
Let us try to see when this will not happen
![Page 30: Supervised Learning: The Setup Computational Learning ...zhe/pdf/lec-12-pac-conjunctions-upload.pdf1 Computational Learning Theory: An Analysis of a Conjunction Learner Machine Learning](https://reader033.fdocuments.net/reader033/viewer/2022051905/5ff719d68c26112b767906aa/html5/thumbnails/30.jpg)
Learning Conjunctions: Analysis
• Call a feature z bad if• Intuitively, a bad feature is one that has a significant probability
of not appearing with a positive example– (And, if it appears in all positive training examples, it can cause errors)
If there are no bad features, then errD(h) < !– Why? Because
30
n = dimensionality
Let us try to see when this will not happen
Learning Conjunctions: Analysis

What if there are bad features? Let z be a bad feature. What is the probability that it will not be eliminated by one training example?

A training example eliminates z exactly when it is a positive example with z = 0, which happens with probability p(z) ≥ ε/n. So

Pr(a bad feature is not eliminated by one example) = 1 − p(z) ≤ 1 − ε/n

For example, <(1,1,1,1,1,0,...0,1,1), 1>: there was one example of this kind in the training data.
Learning Conjunctions: Analysis

What we know so far (n = dimensionality):

Pr(a bad feature is not eliminated by one example) ≤ 1 − ε/n

But say we have m training examples. Then

Pr(a bad feature survives m examples) ≤ (1 − ε/n)^m

There are at most n bad features. So

Pr(any bad feature survives m examples) ≤ n(1 − ε/n)^m
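As a sanity check, here is a small Monte Carlo sketch (parameters chosen arbitrarily for illustration) of the middle inequality: a feature sitting exactly at the bad threshold, p(z) = ε/n, survives m independent examples with probability (1 − ε/n)^m.

```python
import random

n, eps, m, trials = 10, 0.5, 50, 100_000
p_z = eps / n          # a feature exactly at the "bad" threshold
survived = 0
for _ in range(trials):
    # each example independently eliminates z with probability p(z)
    if all(random.random() >= p_z for _ in range(m)):
        survived += 1
print(survived / trials)        # empirical estimate, about 0.077
print((1 - eps / n) ** m)       # the bound (1 - eps/n)^m, about 0.077
```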
Learning Conjunctions: Analysis

We want this probability to be small. Why? So that we can choose enough training examples that the probability that any bad feature z survives all of them is less than some δ.

That is, we want

n(1 − ε/n)^m < δ

We know that 1 − x < e^(−x). So it is sufficient to require

n · e^(−εm/n) < δ

Or equivalently,

m > (n/ε)(ln(n) + ln(1/δ))
Learning Conjunctions: Analysis

To guarantee a probability of failure (i.e., error > ε) that is less than δ, the number of examples we need is

m > (n/ε)(ln(n) + ln(1/δ))

which is polynomial in n, 1/ε, and 1/δ.

That is, if m has this property, then:
• With probability 1 − δ, there will be no bad features,
• Or equivalently, with probability 1 − δ, we will have errD(h) < ε.

What does this mean:
• If ε = 0.1 and δ = 0.1, then for n = 100, we need 6908 training examples
• If ε = 0.1 and δ = 0.1, then for n = 10, we need only 461 examples
• If ε = 0.1 and δ = 0.01, then for n = 10, we need 691 examples

What we have here is a PAC guarantee: our algorithm is Probably Approximately Correct. (The numbers above are checked in the sketch below.)
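These sample sizes follow directly from the bound; here is a quick check (a sketch, rounding up to a whole number of examples):

```python
from math import ceil, log

def sample_bound(n, eps, delta):
    """Smallest whole number of examples m with m > (n/eps)(ln n + ln(1/delta))."""
    return ceil((n / eps) * (log(n) + log(1.0 / delta)))

print(sample_bound(100, 0.1, 0.1))   # 6908
print(sample_bound(10, 0.1, 0.1))    # 461
print(sample_bound(10, 0.1, 0.01))   # 691
```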