
Transcript of Advanced Topics in Data Sciences, Lecture 09: Overview of Learning Theory

Page 1:

Advanced Topics in Data Sciences

Prof. Volkan Cevher (volkan.cevher@epfl.ch)

Lecture 09: Overview of Learning Theory
Laboratory for Information and Inference Systems (LIONS)

École Polytechnique Fédérale de Lausanne (EPFL)

EE-731 (Spring 2016)

Page 2:

Outline

This lecture:

1. The probably approximately correct (PAC) learning framework.
2. Empirical risk minimization (ERM).
3. Approximation and estimation errors.
4. Structural risk minimization (SRM).
5. Convex surrogate functions.
6. Stability and generalization.

Page 3:

Recommended reading materials

1. Chapters 2–4 in S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning, Cambridge Univ. Press, 2014.

2. S. Boucheron et al., “Theory of classification: A survey of some recent advances,” ESAIM: Probab. Stat., 2005.

Page 4:

The PAC Learning Framework

Page 5:

The standard statistical learning model

▶ Training Data: Dn := {Zi : 1 ≤ i ≤ n}, drawn i.i.d. from an unknown distribution P on Z

▶ Hypothesis Class: H, a set of hypotheses h

▶ Loss Function: f : H × Z → R

▶ Risk: F(h) := E_P f(h, Z), where Z ∼ P is independent of Dn

▶ Goal: Find a “good hypothesis” hn ∈ H based on Dn such that F(hn) is “small.”

Observation
Statistical learning corresponds to solving the optimization problem

h⋆ ∈ arg min_{h∈H} F(h).

However, the optimization problem is not explicitly formulated, because P is unknown.

Page 6:

Example: Binary classification

▶ Training Data: Dn = {Zi = (Xi, Yi) : 1 ≤ i ≤ n}
  ▶ Xi ∈ Rp are images.
  ▶ Each Yi ∈ {0, 1} labels whether there is a cat in the image Xi or not.

▶ Hypothesis Class: H, a set of classifiers h : X → {0, 1}
  ▶ H can be a set of linear classifiers, a reproducing kernel Hilbert space, or all possible realizations of a deep network.

▶ Loss Function: Binary loss f(h, Z) := 1{Y ≠ h(X)}, where Z = (X, Y) ∼ P is independent of Dn

▶ Risk: F(h) := E_P f(h, Z), which is the probability of misclassification

Observation
The classifier that minimizes the risk is the Bayes classifier,

h(x) = 1{P(Y=1|X=x) ≥ 1/2},

which unfortunately cannot be computed, since P is unknown.
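To make the Bayes classifier concrete, here is a minimal Python sketch (not from the slides) for a toy distribution where P is known, so the Bayes rule can be written down and its risk checked by Monte Carlo; the Gaussian-mixture model is an assumption made only for illustration.

```python
# A minimal sketch (not from the slides): when P is known, the Bayes classifier
# 1{P(Y=1|X=x) >= 1/2} is explicit and its risk can be estimated by Monte Carlo.
# Toy model (an assumption): Y ~ Bernoulli(1/2), X|Y=0 ~ N(-1, 1), X|Y=1 ~ N(+1, 1),
# so the Bayes rule is 1{x >= 0}.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y = rng.integers(0, 2, size=n)                 # labels in {0, 1}
x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)   # class-conditional Gaussians

bayes = (x >= 0.0).astype(int)                 # Bayes classifier for this P
shifted = (x >= 0.5).astype(int)               # a suboptimal threshold for comparison

print("risk of Bayes rule   :", np.mean(bayes != y))    # ~ 0.159 (= Phi(-1))
print("risk of shifted rule :", np.mean(shifted != y))  # strictly larger
```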

Page 7:

Standard statistics approach: Logistic regression

1. Consider the class of linear classifiers H := {1{〈·,θ〉≥0} : θ ∈ Θ} for some parameter space Θ ⊆ Rp.

2. Assume the statistical model (canonical generalized linear model [15])

P(Yi = 1|Xi = xi) = 1 − P(Yi = 0|Xi = xi) = 1 / (1 + exp(−〈xi, θ♮〉)),

for some θ♮ ∈ Θ. (See [9] for a Bayesian interpretation.)

3. Compute the maximum-likelihood estimator

θn ∈ arg min_{θ∈Θ} Ln(θ),

where Ln is the negative log-likelihood function.

4. Output the classifier hn(·) = 1{〈·,θn〉≥0}.
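A minimal Python sketch (not from the slides) of steps 2–4: minimize the negative log-likelihood by plain gradient descent and classify with the plug-in rule; the synthetic data, the number of iterations, and the step size are assumptions made only for illustration.

```python
# A minimal sketch (not from the slides) of the logistic-regression MLE.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 5
theta_true = rng.normal(size=p)
X = rng.normal(size=(n, p))
probs = 1.0 / (1.0 + np.exp(-X @ theta_true))
y = rng.binomial(1, probs)                      # labels in {0, 1}

def neg_log_lik_grad(theta):
    # gradient of L_n(theta) = -(1/n) * sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]
    p_hat = 1.0 / (1.0 + np.exp(-X @ theta))
    return X.T @ (p_hat - y) / n

theta = np.zeros(p)
for _ in range(2000):                           # gradient descent on L_n
    theta -= 0.5 * neg_log_lik_grad(theta)

h = lambda x: int(x @ theta >= 0.0)             # plug-in classifier 1{<x, theta_n> >= 0}
print("training error:", np.mean([h(xi) != yi for xi, yi in zip(X, y)]))
```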

Question
Why should we assume this specific statistical model?

Page 8:

Probably approximately correct (PAC) learnability

Definition (PAC learnability [20])
Assume that Yi = g(Xi) for some deterministic function g, and that g ∈ H. (Hence zero risk is possible.)

A hypothesis class H is PAC learnable if there exist an algorithm A_H : Z^n → H and a function nH(ε, δ), such that for every probability distribution P and every ε, δ ∈ (0, 1), if n ≥ nH(ε, δ), we have

F(A(Dn)) ≤ ε, (approximately correct)

with probability at least 1− δ (probably).

▶ The original definition also requires A_H to be polynomial time, which we omit here for simplicity.

▶ The quantity nH(ε, δ) is called the sample complexity.

Questions

1. What if g is not contained in H?
2. What if Yi is a general random variable?

Page 9:

Agnostic PAC learnability

Definition (Agnostic PAC learnability [7])
A hypothesis class H is agnostic PAC learnable if there exist an algorithm A_H : Z^n → H and a function nH(ε, δ), such that for every probability distribution P and every ε, δ ∈ (0, 1), if n ≥ nH(ε, δ), we have

F(A(Dn)) − inf_{h∈H} F(h) ≤ ε,

with probability at least 1− δ.

A distribution-dependent and localized formulation [2, 3, 10, 11]
Given an algorithm A_H : Z^n → H, show that for every probability distribution P and every δ ∈ (0, 1), we have

F(A(Dn)) − inf_{h∈H} F(h) ≤ εn(P, h⋆; H, δ) → 0,

with probability at least 1 − δ, where h⋆ = arg min_{h∈H} F(h) (assuming uniqueness).

▶ The quantity F(A(Dn)) − inf_{h∈H} F(h) is called the excess risk.

Page 10:

Other examples

Linear regression

▶ Training data: Zi = (Xi, Yi) ∈ Z = Rp × R
▶ Hypothesis class: H = {hθ(·) = 〈·, θ〉 : θ ∈ Θ} for some Θ ⊆ Rp
▶ Loss function: Squared error f(hθ, z) = (y − 〈x, θ〉)²

Density estimation

▶ Training data: Zi ∈ R
▶ Hypothesis class: A class of probability densities P
▶ Loss function: Negative log-likelihood f(p, z) = − log p(z)

K-means clustering / vector quantization

▶ Training data: Zi ∈ Rp
▶ Hypothesis class: A class of subsets of Rp of cardinality K
▶ Loss function: f(h, z) = min_{c∈h} ‖c − z‖₂²
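A minimal Python sketch (not from the slides): the three examples above differ only in the loss f(h, z), while the risk F(h) = E_P f(h, Z) and everything built on it are shared. The function names and array shapes below are assumptions made only for illustration.

```python
# A minimal sketch (not from the slides): three instances of the loss f(h, z).
import numpy as np

def squared_loss(theta, z):          # linear regression, z = (x, y)
    x, y = z
    return (y - x @ theta) ** 2

def neg_log_lik(density, z):         # density estimation, `density` is a pdf
    return -np.log(density(z))

def kmeans_loss(centers, z):         # K-means, `centers` is a (K, p) array
    return np.min(np.sum((centers - z) ** 2, axis=1))

def empirical_risk(loss, h, data):   # F_n(h) = (1/n) * sum_i f(h, Z_i)
    return np.mean([loss(h, z) for z in data])
```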

Page 11:

Empirical Risk Minimization

Page 12:

Empirical risk minimization (ERM)

Recall that since P is assumed unknown, we cannot directly solve the risk minimization problem

h⋆ ∈ arg min_{h∈H} F(h) := E f(h, Z).

However, we can instead consider the empirical risk minimization problem as an approximation,

hn ∈ arg min_{h∈H} Fn(h) := (1/n) ∑_{i≤n} f(h, Zi).

This is called the ERM principle, due to Vapnik and Chervonenkis.
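A minimal Python sketch (not from the slides) of the ERM principle for a finite class of 1-D threshold classifiers; the data-generating model (a true threshold with 10% label noise) is an assumption made only for illustration.

```python
# A minimal sketch (not from the slides): ERM over a finite class of thresholds.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = (x >= 0.3).astype(int)                      # labels from a true threshold
noise = rng.random(n) < 0.1                     # flip 10% of the labels
y = np.where(noise, 1 - y, y)

thresholds = np.linspace(-1.0, 1.0, 41)         # the finite hypothesis class H
def empirical_risk(t):                          # F_n(h_t) with the 0-1 loss
    return np.mean((x >= t).astype(int) != y)

t_erm = min(thresholds, key=empirical_risk)     # the ERM hypothesis h_n
print("ERM threshold:", round(float(t_erm), 2), "empirical risk:", empirical_risk(t_erm))
```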

Observation
By the strong law of large numbers (LLN), we know that Fn(h) → F(h) almost surely for every h ∈ H.

Question
Is the strong LLN argument enough to conclude that the ERM principle allows learnability?

Page 13:

Two notions of convergence

Definition (Convergence implied by the strong LLN)
For every h ∈ H and every probability distribution P, there exists a function nH(ε, δ; h, P), such that for every ε, δ ∈ (0, 1), if n ≥ nH(ε, δ; h, P), we have

|Fn(h)− F(h)| ≤ ε,

with probability at least 1− δ.

Definition (Uniform convergence)
A hypothesis class H has the uniform convergence property if there exists a function nH(ε, δ), such that for every ε, δ ∈ (0, 1) and any probability distribution P, if n ≥ nH(ε, δ), we have

sup_{h∈H} |Fn(h) − F(h)| ≤ ε,

with probability at least 1 − δ.

▶ Such an H with the uniform convergence property is called a uniformly Glivenko–Cantelli class.

Page 14:

Uniform convergence implies learnability

Proposition
For any ε > 0, if

sup_{h∈H} |Fn(h) − F(h)| ≤ ε,

then for any h⋆ ∈ arg min_{h∈H} F(h), we have

F(hn) − F(h⋆) ≤ 2ε.

Proof.

F(hn) − F(h⋆) = F(hn) − Fn(hn) + Fn(hn) − Fn(h⋆) + Fn(h⋆) − F(h⋆)
             ≤ 2 sup_{h∈H} |Fn(h) − F(h)|,

since Fn(hn) − Fn(h⋆) ≤ 0 by the definition of the empirical risk minimizer hn.

Observation
The uniform convergence property is sufficient for learnability.

Page 15:

⋆ Learnability, ERM, and uniform convergence

Theorem (See, e.g., [17])
Assume that the hypothesis class H consists of only {0, 1}-valued functions, and f is the 0–1 loss. The following statements are equivalent.

1. The hypothesis class is agnostic PAC learnable.
2. The ERM is a good PAC learner.
3. The hypothesis class has the uniform convergence property.

Fact
Unfortunately, computing the corresponding ERM is in general NP-hard [8].

Fact
In general, uniform convergence may not be necessary for learnability [18].

Page 16:

Uniform convergence property of a finite bounded hypothesis class

Proposition
Assume that the hypothesis class H consists of a finite number of functions taking values in [0, 1]. Then H satisfies the uniform convergence property with

nH(ε, δ) = log(2|H|/δ) / (2ε²).

The proposition is a simple consequence of Hoeffding’s inequality and the union bound.

Theorem (Hoeffding’s inequality (see, e.g., [14]))
Let (ξi)_{1≤i≤n} be a sequence of independent [0, 1]-valued random variables. Let Sn := (1/n) ∑_{1≤i≤n} (ξi − E ξi). Then for any t > 0, P(|Sn| ≥ t) ≤ 2 exp(−2nt²).
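A minimal numerical sketch (not from the slides) of the sample-complexity expression in the proposition; the values of |H|, ε, and δ below are assumptions chosen only for illustration.

```python
# A minimal sketch (not from the slides): evaluating n_H(eps, delta) = log(2|H|/delta) / (2 eps^2).
import math

def n_H(card_H, eps, delta):
    return math.ceil(math.log(2 * card_H / delta) / (2 * eps ** 2))

print(n_H(card_H=1_000, eps=0.05, delta=0.01))      # ~ 2442 samples
print(n_H(card_H=1_000_000, eps=0.05, delta=0.01))  # only ~ 3823: grows like log|H|
```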

Page 17:

Proof

Proof of the proposition.
Define ξi(h) = f(h, Zi), and define Sn(h) := (1/n) ∑_{1≤i≤n} (ξi(h) − E ξi(h)) for every h ∈ H. Notice that then

sup_{h∈H} |Sn(h)| = sup_{h∈H} |Fn(h) − F(h)|.

By the union bound and Hoeffding’s inequality, we have for any t > 0,

P(sup_{h∈H} |Sn(h)| ≥ t) ≤ ∑_{h∈H} P(|Sn(h)| ≥ t) ≤ 2|H| exp(−2nt²).

Hence it suffices to choose

nH(ε, δ) = log(2|H|/δ) / (2ε²).

Page 18:

Necessity of choosing a not-too-big hypothesis class

We may write the proposition in another way:
For every probability distribution P and every δ ∈ (0, 1), the ERM satisfies

sup_{h∈H} |Fn(h) − F(h)| ≤ εn := √(log(2|H|/δ) / (2n)),

with probability at least 1− δ.

Observation
If |H| is large, we need a number of training samples of the order O(log |H|) to achieve a small excess risk εn.
Otherwise, if εn is large, the values of Fn and F can be very different on certain hypotheses, and overfitting occurs.

Question
What if H is too small?

Page 19:

⋆ What if |H| is not finite?

Consider the binary classification problem, in which H is a set of {0, 1}-valued functions, and f is the 0–1 loss.

Definition (Shattering coefficient)
The shattering coefficient of a hypothesis class H is defined as

Sn(H) := sup_{x1,...,xn∈X} |{(h(xi))_{1≤i≤n} : h ∈ H}|.

Definition (Vapnik–Chervonenkis (VC) dimension)
The VC dimension of a hypothesis class H, denoted by VC(H), is defined as the largest integer k such that Sk(H) = 2^k. If Sk(H) = 2^k for all k, then VC(H) := ∞.
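A minimal Python sketch (not from the slides) that brute-forces the shattering coefficients of the 1-D threshold classifiers h_t(x) = 1{x ≥ t}, whose VC dimension is 1; the finite grid of thresholds stands in for the whole class and is an assumption.

```python
# A minimal sketch (not from the slides): shattering coefficients by brute force.

def labelings(points, thresholds):
    # the set {(h_t(x_1), ..., h_t(x_n)) : t in thresholds}
    return {tuple(int(x >= t) for x in points) for t in thresholds}

thresholds = [k / 100 for k in range(101)]

print(len(labelings([0.5], thresholds)))            # 2 = 2^1: one point is shattered
print(len(labelings([0.1, 0.4, 0.7], thresholds)))  # 4 < 2^3: three points are not
```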

Theorem ([21])
Let H be a hypothesis class with VC dimension d. Then

sup_{h∈H} |Fn(h) − F(h)| ≤ 2√(2d log(2en/d) / n) + √(log(2/δ) / (2n)),

with probability at least 1− δ.

Page 20:

Model Selection and Structural Risk Minimization

Page 21:

Approximation and estimation errors

Let hopt be a global minimizer of the risk F(·), which is not necessarily in H.
Let h⋆ be a minimizer of the risk F(·) on H.
Then we can write

F(hn) − F(hopt) = F(hn) − F(h⋆) + F(h⋆) − F(hopt).

Definition (Approximation error)
The approximation error is defined as Eapp = F(h⋆) − F(hopt).

▶ The approximation error is fixed given a hypothesis class H.
▶ The ERM can yield small risk only if H contains a “good enough” hypothesis.

Definition (Estimation error)
The estimation error is defined as Eest = F(hn) − F(h⋆).

▶ The estimation error decreases with the training data size.

Observation
If we shrink the hypothesis class, the estimation error Eest can become smaller, but doing so can only increase the approximation error Eapp.

Page 22:

Model selection

Model selection seeks a balance between approximation and estimation errors.

The model selection problem
Let H be a hypothesis class. Consider a countable family of sub-classes {Hk : k ∈ K} such that ⋃_{k∈K} Hk = H. Denote by hn,k an empirical risk minimizer chosen based on Dn in Hk, for all k ∈ K.

The model selection problem is to choose kn ∈ K based on Dn, such that

F(hn,kn) − F(h⋆) ≤ C inf_{k∈K} ( inf_{h∈Hk} F(h) − F(h⋆) + πn(k) ),

with high probability, for some constant C > 0 and πn(k) > 0.

▶ Such an inequality on F(hn,kn) − F(h⋆) is called an oracle inequality.

▶ If C = 1, the oracle inequality is called sharp.

Page 23:

Structural risk minimization (SRM)

The idea of structural risk minimization is to minimize a risk estimate.

Structural risk minimization (see, e.g., [22])
1. Choose

kn ∈ arg min_{k∈K} (Fn(hn,k) + πn(k)),

where πn(k) is some good estimate of F(hn,k) − Fn(hn,k).

2. Output hn = hn,kn.

▶ Computational complexity is completely ignored here.
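A minimal Python sketch (not from the slides) of the selection rule in step 1. The penalty used here (a complexity term growing with k) is only an illustrative placeholder, not one of the risk estimates discussed on the next slide.

```python
# A minimal sketch (not from the slides): pick k minimizing Fn(h_{n,k}) + pi_n(k).
import math

def srm_select(empirical_risks, n, c=1.0):
    # empirical_risks: dict k -> Fn(h_{n,k}); placeholder penalty pi_n(k) = c * sqrt(k / n)
    def penalized(k):
        return empirical_risks[k] + c * math.sqrt(k / n)
    return min(empirical_risks, key=penalized)

# Example: richer sub-classes fit the data better but pay a larger penalty.
Fn = {1: 0.30, 2: 0.22, 3: 0.18, 4: 0.17, 5: 0.165}
print(srm_select(Fn, n=100))   # selects a moderate k rather than the largest one
```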

Page 24:

Structural risk minimization (SRM) contd.

Theorem ([1])
Suppose there exists a double sequence (Rn,k)_{n∈N, k∈K}, such that for every n ∈ N, k ∈ K, and ε > 0,

P(F(hn,k) > Rn,k + ε) ≤ αn exp(−2βn ε²),

for some constants αn, βn > 0. Set πn(k) := Rn,k − Fn(hn,k) + √(βn⁻¹ log k). Then we have

F(hn) < inf_{k} ( inf_{h∈Hk} F(h) + πn(k) + √(log k / n) ) + ε,

with probability at least 1 − 2αn exp(−βn ε²/2) − 2 exp(−nε²/2).

Observation
▶ The risk bound based on the VC dimension may be used [13, 22], but it can be loose since the bound is for the worst case.

▶ Hence it is important to find sharp, data-dependent risk estimates. See [12] for some recent advances.

Page 25:

Convex Surrogate Functions

Page 26:

Logistic regression as a learning algorithm

Logistic regression as a learning algorithm
Given training data (xi, yi) ∈ Rp × {±1}, 1 ≤ i ≤ n.
Given a hypothesis class {sign(〈x, θ〉) : θ ∈ Θ} (linear classifiers) for some Θ ⊂ Rp.
Solve the empirical risk minimization (?) problem:

θn ∈ arg min_{θ∈Θ} (1/n) ∑_{1≤i≤n} log[1 + exp(−yi〈xi, θ〉)].

Output the classifier hn(x) = sign(〈x, θn〉).

Observation
Unlike the empirical risk minimization problem with the 0–1 loss, the logistic regression approach yields a convex optimization problem that can be efficiently solved (when Θ is also convex).

Question
Why does logistic regression work for binary classification?

Page 27:

Intuition

[Figure: the logistic loss (properly shifted) and the 0–1 loss, plotted as functions of the margin t ∈ [−10, 10].]

▶ Logistic loss: φ(t) = log(1 + exp(−t))
▶ 0–1 loss: φ(t) = 1{t≤0}

(For logistic regression, t corresponds to y〈x, θ〉.)
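A minimal plotting sketch (not from the slides) that reproduces the comparison in the figure; the exact normalization of the “properly shifted” logistic loss is an assumption (here it is scaled by 1/log 2 so that φ(0) = 1 and the logistic curve upper-bounds the 0–1 loss).

```python
# A minimal sketch (not from the slides) of the logistic vs. 0-1 loss comparison.
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(-10, 10, 400)
logistic = np.log(1 + np.exp(-t)) / np.log(2)   # logistic loss, scaled (an assumption)
zero_one = (t <= 0).astype(float)               # 0-1 loss in the margin t

plt.plot(t, logistic, label="logistic loss (scaled)")
plt.step(t, zero_one, where="post", label="0-1 loss")
plt.xlabel("margin t = y<x, theta>")
plt.ylabel("loss")
plt.legend()
plt.show()
```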

Page 28:

Soft classification

Consider a cost function g : H×Z → R.

Definition (Margin-based cost [6])
A cost function g is margin-based if it can be written as g(h, z) = φ(yh(x)) for some function φ.

Example
In logistic regression, φ(t) = log(1 + exp(−t)), and h ∈ H = {〈·, θ〉 : θ ∈ Θ}.

The corresponding empirical cost minimization problem is given by

hn ∈ arg min_{h∈H} (1/n) ∑_{1≤i≤n} φ(yi h(xi)).

The corresponding classifier is given by

ĥn(x) = sign(hn(x)).

Page 29:

Zhang’s lemma

Define

Hφ(η, α) = ηφ(α) + (1 − η)φ(−α),   α⋆φ(η) = arg min_α Hφ(η, α).

Let F(h) = P(sign(h(X)) ≠ Y) denote the risk function, and G(h) = E φ(Y h(X)) be the expected cost function.

Zhang’s lemma ([23])
Assume that φ is convex, and that α⋆φ(η) > 0 when η > 1/2. If there exist c > 0 and s ≥ 1 such that for all η ∈ [0, 1],

|1/2 − η|^s ≤ c^s [Hφ(η, 0) − Hφ(η, α⋆φ(η))],

then for any hypothesis h,

F(h) − min_h F(h) ≤ 2c [G(h) − min_h G(h)]^{1/s}.

Page 30:

Risk bound for ℓ1-regularized logistic regression

▶ For the logistic regression, c = 1/√2 and s = 2.

Theorem
Consider the ℓ1-regularized logistic regression with Θ = {θ : ‖θ‖1 ≤ ν} for some ν > 0. Assume that ‖x‖∞ ≤ 1 for all x ∈ X. Then there exists a constant C > 0 depending only on p, such that with probability at least 1 − δ,

F(hn) − inf_h F(h) ≤ 4 (√(C/n) + √(2 log(1/δ) / n))^{1/2} + √2 [(inf_{h∈H} G(h)) − (inf_h G(h))]^{1/2}.

Proof.
Similar to Theorem 4.4 in [4]. □

▶ The right-hand side may be viewed as the sum of the estimation error and the approximation error (w.r.t. the cost).

Page 31:

Other examples

Recall the risk bound

F(h) − min_h F(h) ≤ 2c [G(h) − min_h G(h)]^{1/s}.

AdaBoost (see, e.g., [16])
AdaBoost is equivalent to solving an empirical cost minimization problem with φ(t) = exp(−t), for which s = 2 and c = 1/√2.

▶ Notice that in practice AdaBoost may not be implemented by directly solving the empirical cost minimization problem.

Support vector machine (see, e.g., [19])
The hinge cost function used by the support vector machine (SVM) corresponds to φ(t) = max(0, 1 − t), for which s = 1 and c = 1/2.
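As a sanity check on the constants quoted for the hinge cost, here is a short worked derivation (not on the slides) of s = 1 and c = 1/2, using the quantities Hφ and α⋆φ from Zhang’s lemma.

```latex
% A worked check (not on the slides) of s = 1, c = 1/2 for the hinge cost
% \varphi(t) = \max(0, 1 - t), using the quantities from Zhang's lemma.
\begin{align*}
H_\varphi(\eta, \alpha) &= \eta \max(0, 1-\alpha) + (1-\eta)\max(0, 1+\alpha), \\
H_\varphi(\eta, 0) &= 1, \qquad
\inf_{\alpha} H_\varphi(\eta, \alpha) = H_\varphi\bigl(\eta, \mathrm{sign}(2\eta - 1)\bigr) = 2\min(\eta, 1-\eta), \\
H_\varphi(\eta, 0) - H_\varphi\bigl(\eta, \alpha^\star_\varphi(\eta)\bigr) &= 1 - 2\min(\eta, 1-\eta) = |2\eta - 1| = 2\,|1/2 - \eta|, \\
\text{so } |1/2 - \eta|^1 &\le \left(\tfrac{1}{2}\right)^{1} \bigl[H_\varphi(\eta, 0) - H_\varphi\bigl(\eta, \alpha^\star_\varphi(\eta)\bigr)\bigr]
\quad\Rightarrow\quad s = 1,\ c = \tfrac{1}{2}.
\end{align*}
```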

Page 32:

Stability

Page 33:

Analysis of SVM

A linear SVM is given by

θn ∈ arg min_{θ∈Θ} (1/n) ∑_{1≤i≤n} φ(yi〈xi, θ〉) + λ‖θ‖₂²,

for some λ > 0, where φ(t) = max(0, 1 − t) is the hinge loss.

The output classifier is given by hn(·) = sign(〈·, θn〉).
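A minimal Python sketch (not from the slides) of this objective, minimized by plain subgradient descent; the synthetic data, the value of λ, and the Pegasos-style step-size schedule are assumptions made only for illustration.

```python
# A minimal sketch (not from the slides): regularized hinge loss via subgradient descent.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 300, 4, 0.1
theta_true = rng.normal(size=p)
X = rng.normal(size=(n, p))
y = np.sign(X @ theta_true + 0.3 * rng.normal(size=n))    # labels in {-1, +1}

def subgradient(theta):
    margins = y * (X @ theta)
    active = (margins < 1.0).astype(float)                 # where the hinge is active
    return -(X * (active * y)[:, None]).mean(axis=0) + 2.0 * lam * theta

theta = np.zeros(p)
for t in range(1, 2001):
    theta -= (1.0 / (lam * t)) * subgradient(theta)        # step size 1/(lambda * t)

h = np.sign(X @ theta)                                     # output classifier sign(<x, theta_n>)
print("training error:", np.mean(h != y))
```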

Question
How do we analyze SVM, which is not exactly empirical cost minimization?

Idea
Instead of considering a class of algorithms, we may do an algorithm-wise analysis.

Page 34:

Stability implies generalization

Definition (Classification stability [5])
Consider the soft classification setting, where H is a class of real-valued soft classifiers hn, and the induced classifier is ĥn = sign(hn). For any Dn = {z1, . . . , zn}, define Dn\i as Dn with the i-th element zi removed. An algorithm A has classification stability with parameter β > 0 if, for all Dn ⊂ Z and for all 1 ≤ i ≤ n,

‖A(Dn) − A(Dn\i)‖_{L∞} ≤ β.

Observation
Then by the triangle inequality,

‖A(Dn) − A(Dn\i ∪ {z})‖_{L∞} ≤ 2β, for all z ∈ Z and all 1 ≤ i ≤ n,

meaning that the algorithm is robust to small changes of the training data.

Page 35:

Stability implies generalization contd.

Consider the 0–1 loss f(h, z) = 1{sign(h(x)) ≠ y}. Then the risk F(h) = E f(h, Z) is the probability of classification error.

Define the margin-based loss

f^γ(h, z) =
  1,              for yh(x) ≤ 0,
  1 − yh(x)/γ,    for 0 ≤ yh(x) ≤ γ,
  0,              for yh(x) ≥ γ,

and the corresponding margin-based empirical risk F^γ_n(h) = (1/n) ∑_{1≤i≤n} f^γ(h, zi).

Theorem ([5])
Let A be a soft classification algorithm that possesses classification stability with parameter βn > 0. Then for any γ > 0, n ∈ N, and any δ ∈ (0, 1),

F(A(Dn)) ≤ F^γ_n(A(Dn)) + 2βn/γ + (1 + 4nβn/γ) √(log(1/δ) / (2n)),

with probability at least 1 − δ.

Observation
Notice that the uniform convergence property is not required.

Page 36:

Risk bound for the linear SVM

Recall that the linear SVM defines the algorithm ASVM(Dn) = 〈·, θn〉, where

θn ∈ arg min_{θ∈Θ} (1/n) ∑_{1≤i≤n} φ(yi〈xi, θ〉) + λ‖θ‖₂².

Theorem
Assume that ‖x‖₂ ≤ κ for all x ∈ X, for some κ > 0. Then ASVM has classification stability with parameter βn = κ²/(2λn). Hence for any n ∈ N and δ ∈ (0, 1),

F(ASVM(Dn)) ≤ F^1_n(ASVM(Dn)) + κ²/(λn) + (1 + 2κ²/λ) √(log(1/δ) / (2n)),

with probability at least 1− δ.

Proof.
Similar to Example 2 in [5]. □
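A minimal numerical sketch (not from the slides) that plugs illustrative values of κ, λ, and δ into the excess terms of this bound; the chosen values are assumptions, and the point is only the roughly 1/√n decay.

```python
# A minimal sketch (not from the slides): evaluating the excess terms of the SVM bound.
import math

def svm_bound_excess(n, kappa=1.0, lam=0.1, delta=0.05):
    # kappa^2/(lam*n) + (1 + 2*kappa^2/lam) * sqrt(log(1/delta) / (2n))
    return kappa**2 / (lam * n) + (1 + 2 * kappa**2 / lam) * math.sqrt(math.log(1 / delta) / (2 * n))

for n in (10**3, 10**4, 10**5):
    print(n, round(svm_bound_excess(n), 4))   # shrinks roughly like 1/sqrt(n)
```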

Page 37:

Review

Page 38:

Review

▶ Learning theory is concerned with developing learning algorithms that have distribution-free guarantees.

▶ The ERM principle provides a principled approach, if the uniform convergence property holds.

▶ SRM is an extension of the ERM principle that seeks a balance between the estimation error and the approximation error.

▶ In practice, we may replace the loss by a convex surrogate to yield an efficiently solvable empirical cost minimization problem.

▶ Stability provides another, algorithm-wise analysis framework.

Page 39:

References I

[1] Peter L. Bartlett, Stéphane Boucheron, and Gábor Lugosi. Model selection and error estimation. Mach. Learn., 48:85–113, 2002.

[2] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Ann. Stat., 33(4):1497–1537, 2005.

[3] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3, 2002.

[4] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probab. Stat., 9:323–375, November 2005.

[5] Olivier Bousquet and André Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2:499–526, 2002.

[6] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Mach. Learn., 20:273–297, 1995.

Page 40:

References II

[7] David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inf. Comput., 100:78–150, 1992.

[8] Klaus-U. Höffgen and Hans-U. Simon. Robust trainability of single neurons. J. Comput. Syst. Sci., 50:114–125, 1995.

[9] Michael I. Jordan. Why the logistic function? A tutorial discussion on probabilities and neural networks. MIT Computational Cognitive Science report 9503, 1995.

[10] V. Koltchinskii and D. Panchenko. Rademacher processes and bounding the risk of function learning. 2004. arXiv:math/0405338v1 [math.PR].

[11] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat., 34(6):2593–2656, 2006.

Page 41:

References III

[12] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer-Verl., Berlin, 2011.

[13] Gábor Lugosi and Kenneth Zeger. Concept learning using complexity regularization. IEEE Trans. Inf. Theory, 42(1):48–54, 1996.

[14] Pascal Massart. Concentration Inequalities and Model Selection. Springer-Verl., Berlin, 2007.

[15] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, London, second edition, 1989.

[16] Robert E. Schapire and Yoav Freund. Boosting. MIT Press, Cambridge, MA, 2012.

[17] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning. Cambridge Univ. Press, Cambridge, UK, 2014.

Page 42:

References IV

[18] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. J. Mach. Learn. Res., 11:2635–2670, 2010.

[19] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, New York, NY, 2008.

[20] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, November 1984.

[21] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., XVI(2):264–280, 1971.

[22] Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, NY, 1998.

[23] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat., 32(1):56–134, 2004.
