An introduction to statistical learning theory
Fabrice Rossi
TELECOM ParisTech
November 2008
About this lecture
Main goal: show how statistics enables a rigorous analysis of machine learning methods, which leads to a better understanding, and to possible improvements, of those methods.
Outline
1. introduction to the formal model of statistical learning
2. empirical risk and capacity measures
3. capacity control
4. beyond empirical risk minimization
2 / 154 Fabrice Rossi Outline
http://apiacoa.org/
Part I
Introduction and formalization
Outline
Machine learning in a nutshell
Formalization
  Data and algorithms
  Consistency
  Risk/Loss optimization
Statistical Learning
Statistical Learning = Machine learning + Statistics
Machine learning in three steps:
1. record observations about a phenomenon
2. build a model of this phenomenon
3. predict the future behavior of the phenomenon
... all of this is done by the computer (no human intervention)
Statistics gives:
- a formal definition of machine learning
- some guarantees on its expected results
- some suggestions for new or improved modelling tools
"Nothing is more practical than a good theory" (Vladimir Vapnik, 1998)
Machine learning
- a phenomenon is recorded via observations ⇒ (zi)1≤i≤n with zi ∈ Z
- two generic situations:
  1. unsupervised learning:
     - no predefined structure on Z
     - the goal is then to find some structure: clusters, association rules, distributions, etc.
  2. supervised learning:
     - two non-symmetric components in Z = X × Y: z = (x, y) ∈ X × Y
     - modelling: finding how x and y are related
     - the goal is to make predictions: given x, find a reasonable value for y such that z = (x, y) is compatible with the phenomenon
- this course focuses on supervised learning
Supervised learning
- several types of Y can be considered:
  - Y = {1, ..., q}: classification into q classes, e.g., tell whether a patient has hyper-, hypo- or normal thyroidal function based on some clinical measures
  - Y = R^q: regression, e.g., predict the price of a house y based on some characteristics x
  - Y = "something complex but structured", e.g., construct a parse tree for a natural language sentence
- modelling difficulty increases with the complexity of Y
- Vapnik's message: "don't build a regression model to do classification"
Model quality
- given a dataset (xi, yi)1≤i≤n, a machine learning method builds a model g from X to Y
- a "good" model should be such that
    ∀ 1 ≤ i ≤ n, g(xi) ≈ yi
  or not? Perfect interpolation is always possible, but is that what we are looking for?
- No! We want generalization:
  - a good model must "learn" something "smart" from the data, not simply memorize it
  - on new data, it should be such that g(xi) ≈ yi (for i > n)
Machine learning objectives
- main goal: from a dataset (xi, yi)1≤i≤n, build a model g which generalizes in a satisfactory way on new data
- but also:
  - request only as many data as needed
  - do this efficiently (quickly, with a low memory usage)
  - gain knowledge on the underlying phenomenon (e.g., discard useless parts of x)
  - etc.
- the statistical learning framework addresses some of those goals:
  - it provides asymptotic guarantees about learnability
  - it gives non-asymptotic confidence bounds around performance estimates
  - it suggests/validates best practices (hold-out estimates, regularization, etc.)
Outline
Machine learning in a nutshell
Formalization
  Data and algorithms
  Consistency
  Risk/Loss optimization
Data
- the phenomenon is fully described by an unknown probability distribution P on X × Y
- we are given n observations:
  - Dn = (Xi, Yi)i=1..n
  - each pair (Xi, Yi) is distributed according to P
  - pairs are independent
- remarks:
  - the phenomenon is stationary, i.e., P does not change: new data will be distributed as the learning data
  - extensions to non-stationary situations exist, see e.g. covariate shift and concept drift
  - the independence assumption says that each observation brings new information
  - extensions to dependent observations exist, e.g., Markovian models for time series analysis
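As a small illustration, here is how such a dataset Dn of i.i.d. pairs can be simulated. The distribution P below is an assumption made for the example, not part of the formal model:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed distribution P on X x Y: X uniform on [0, 1],
# and Y | X = x is Bernoulli with P{Y = 1 | X = x} = x.
def sample_dataset(n):
    x = rng.uniform(0.0, 1.0, n)
    y = (rng.uniform(0.0, 1.0, n) < x).astype(int)  # i.i.d. pairs from P
    return x, y

x, y = sample_dataset(100_000)

# Sanity check: here P{Y = 1} = E[X] = 1/2, so the empirical
# frequency of the label 1 should be close to 0.5.
print(y.mean())
```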
Model quality (risk)
- quality is measured via a cost function c: Y × Y → R+ (a low cost is good!)
- the risk of a model g: X → Y is
    L(g) = E{c(g(X), Y)}
  This is the expected value of the cost on an observation generated by the phenomenon (a low risk is good!)
- interpretation:
  - the average value of c(g(x), y) over a lot of observations generated by the phenomenon is L(g)
  - this is a measure of generalization capabilities (if g has been built from a dataset)
- remark: we need a probability space; c and g have to be measurable functions, etc.
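The interpretation above suggests a Monte Carlo estimate of L(g): average the cost over many observations drawn from the phenomenon. A sketch, for an assumed toy phenomenon where the risk is known exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy phenomenon (not from the lecture): Y = X^2 + noise.
n = 200_000
x = rng.uniform(-1.0, 1.0, n)
y = x ** 2 + 0.1 * rng.normal(size=n)

# Model g(x) = x^2 with the quadratic cost c(g(x), y) = (g(x) - y)^2.
cost = (x ** 2 - y) ** 2

# The empirical mean of the cost estimates L(g) = E{c(g(X), Y)};
# here L(g) is exactly the noise variance, 0.1^2 = 0.01.
risk_estimate = cost.mean()
print(risk_estimate)
```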
We are frequentists
Cost functions
- Regression (Y = R^q):
  - the quadratic cost is the standard one: c(g(x), y) = ‖g(x) − y‖²
  - other costs (Y = R):
    - L1 (Laplacian) cost: c(g(x), y) = |g(x) − y|
    - ε-insensitive cost: c_ε(g(x), y) = max(0, |g(x) − y| − ε)
    - Huber's loss: c_δ(g(x), y) = (g(x) − y)² when |g(x) − y| < δ, and c_δ(g(x), y) = 2δ(|g(x) − y| − δ/2) otherwise
- Classification (Y = {1, ..., q}):
  - 0/1 cost: c(g(x), y) = I{g(x) ≠ y}
  - then the risk is the probability of misclassification
- Remark:
  - the cost is used to measure the quality of the model
  - the machine learning algorithm may use a loss ≠ cost to build the model
  - e.g.: the hinge loss for support vector machines
  - more on this in part IV
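The costs listed above transcribe directly into code. This sketch follows the slide's formulas, including its variant of Huber's loss (quadratic term r² rather than the usual r²/2; both variants are continuous at r = δ):

```python
import numpy as np

# Vectorized transcriptions of the cost functions above.
def quadratic(g_x, y):
    return (g_x - y) ** 2

def l1(g_x, y):
    return np.abs(g_x - y)

def eps_insensitive(g_x, y, eps=0.1):
    # Zero cost inside the eps-tube around the target, linear outside.
    return np.maximum(0.0, np.abs(g_x - y) - eps)

def huber(g_x, y, delta=1.0):
    # Quadratic for small residuals, linear for large ones.
    r = np.abs(g_x - y)
    return np.where(r < delta, r ** 2, 2 * delta * (r - delta / 2))

def zero_one(g_x, y):
    return (g_x != y).astype(float)

# With delta = 1: residual 0.5 -> 0.25 (quadratic regime),
# residual 3.0 -> 2*1*(3 - 0.5) = 5.0 (linear regime).
print(huber(np.array([0.5, 3.0]), 0.0))
```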
Machine learning algorithm
- given Dn, a machine learning algorithm/method builds a model gDn (written gn for simplicity)
- gn is a random variable with values in the set of measurable functions from X to Y
- the associated (random) risk is
    L(gn) = E{c(gn(X), Y) | Dn}
- statistical learning studies the behavior of L(gn) and E{L(gn)} when n increases:
  - keep in mind that L(gn) is a random variable
  - if many datasets are generated from the phenomenon, the mean risk of the classifiers produced by the algorithm for those datasets is E{L(gn)}
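The randomness of L(gn) can be seen by simulation. In the following sketch, both the phenomenon and the learning algorithm are toy assumptions made for the illustration: many datasets Dn are drawn, and L(gn) is computed for each, giving an empirical picture of its distribution and of E{L(gn)}:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed phenomenon: X uniform on [0, 1], Y = 1 iff X > 1/2 (so L* = 0).
# Assumed algorithm: place the decision threshold halfway between the
# largest class-0 point and the smallest class-1 point.
def fit_threshold(x, y):
    return (x[y == 0].max() + x[y == 1].min()) / 2

def risk(t):
    # For the 0/1 cost and X uniform, L(g_n) = P{g_n(X) != Y} = |t - 1/2|.
    return abs(t - 0.5)

n, n_datasets = 50, 1000
risks = []
for _ in range(n_datasets):
    x = rng.uniform(0.0, 1.0, n)          # a fresh dataset D_n from P
    y = (x > 0.5).astype(int)
    risks.append(risk(fit_threshold(x, y)))
risks = np.array(risks)

# L(g_n) varies from dataset to dataset; averaging over datasets
# estimates E{L(g_n)}, which is small but positive for finite n.
print(risks.mean(), risks.max())
```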
Consistency
- best risk:
    L* = inf_g L(g)
- the infimum is taken over all measurable functions from X to Y
- a machine learning method is:
  - strongly consistent: if L(gn) → L* almost surely
  - consistent: if E{L(gn)} → L*
  - universally (strongly) consistent: if the convergence holds for any distribution P of the data
- remark: if c is bounded, then limn→∞ E{L(gn)} = L* ⇔ L(gn) → L* in probability
Interpretation
- universality: no assumption on the data distribution (realistic)
- consistency: "perfect" generalization when given an infinite learning set
- strong case:
  - for (almost) any sequence of observations (xi, yi)i≥1 and for any ε > 0, there is N such that for n ≥ N,
      L(gn) ≤ L* + ε
    where gn is built on (xi, yi)1≤i≤n
  - N depends on ε and on the dataset (xi, yi)i≥1
- "weak" case:
  - for any ε > 0, there is N such that for n ≥ N,
      E{L(gn)} ≤ L* + ε
    where gn is built on (xi, yi)1≤i≤n
Interpretation (continued)
- We can be unlucky:
  - strong case:
    - L(gn) = E{c(gn(X), Y) | Dn} ≤ L* + ε
    - but what about the empirical cost on the next p observations,
        (1/p) ∑i=n+1..n+p c(gn(xi), yi)?
  - weak case:
    - E{L(gn)} ≤ L* + ε
    - but what about L(gn) for a particular dataset?
    - see also the strong case!
- Stronger result, Probably Approximately Optimal (PAO) algorithm: ∀ε, δ, ∃n(ε, δ) such that for all n ≥ n(ε, δ),
    P{L(gn) > L* + ε} < δ
- this gives some convergence speed: especially interesting when n(ε, δ) is independent of P (distribution free)
From PAO to consistency
- if convergence happens fast enough, then the method is (strongly) consistent
- "weak" consistency is easy (for c bounded):
  - if the method is PAO, then for any fixed ε,
      P{L(gn) > L* + ε} → 0 as n → ∞
  - by definition (and also because L(gn) ≥ L*), this means L(gn) → L* in probability
  - then, if c is bounded, E{L(gn)} → L* as n → ∞
- if convergence happens faster, then the method might be strongly consistent
- we need a very sharp decrease of P{L(gn) > L* + ε} with n, i.e., a slow increase of n(ε, δ) when δ decreases
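The step from convergence in probability to convergence of the expectation can be written out explicitly; assuming the cost is bounded by some constant M (a generic bound, so that L* ≤ L(gn) ≤ M):

```latex
% Split E{L(g_n)} - L^* according to the event {L(g_n) > L^* + \epsilon}:
\mathbb{E}\{L(g_n)\} - L^*
  \le \epsilon \,\mathbb{P}\{L(g_n) \le L^* + \epsilon\}
    + (M - L^*)\,\mathbb{P}\{L(g_n) > L^* + \epsilon\}
  \le \epsilon + M\,\mathbb{P}\{L(g_n) > L^* + \epsilon\}.
% The PAO property drives the second term to 0 as n grows; since
% \epsilon > 0 is arbitrary, \mathbb{E}\{L(g_n)\} \to L^*.
```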
From PAO to strong consistency
- this is based on the Borel-Cantelli lemma:
  - let (Ai)i≥1 be a sequence of events
  - define [Ai i.o.] = ∩∞i=1 ∪∞j=i Aj ("infinitely often")
  - the lemma states that if ∑i P{Ai} < ∞, then P{[Ai i.o.]} = 0
- apply it to An = {L(gn) > L* + ε}: if the PAO convergence is fast enough to give ∑n P{L(gn) > L* + ε} < ∞ for every ε > 0, then P{[L(gn) > L* + ε i.o.]} = 0
- we have {L(gn) → L*} = ∩∞j=1 ∪∞i=1 ∩∞n=i {L(gn) ≤ L* + 1/j}
- and therefore, L(gn) → L* almost surely
Summary
- from n i.i.d. observations Dn = (Xi, Yi)i=1..n, a ML algorithm builds a model gn
- its risk is L(gn) = E{c(gn(X), Y) | Dn}
- statistical learning studies L(gn) − L* and E{L(gn)} − L*
- the preferred approach is to show a fast decrease of P{L(gn) > L* + ε} with n
- no hypothesis on the data (i.e., on P), a.k.a. distribution-free results
Classical statistics
Even though we are frequentists...
- this program is quite different from classical statistics:
  - no optimal parameters, no identifiability: intrinsically non-parametric
  - no optimal model: only optimal performances
  - no asymptotic distribution: focus on finite-distance inequalities
  - no (or minimal) assumptions on the data distribution
- minimal hypotheses:
  - are justified by our lack of knowledge of the studied phenomenon
  - have strong and annoying consequences, linked to the worst-case scenario they imply
Optimal model
- there is nevertheless an optimal model
- regression:
  - for the quadratic cost c(g(x), y) = ‖g(x) − y‖²
  - the minimum risk L* is reached by g(x) = E{Y | X = x}
- classification:
  - we need the posterior probabilities P{Y = k | X = x}
  - the minimum risk L* (the Bayes risk) is reached by the Bayes classifier: assign x to the most probable class, arg maxk P{Y = k | X = x}
- we only need to mimic the predictions of the optimal model, not the model itself:
  - for optimal classification, we don't need to compute the posterior probabilities
  - in the binary case (Y = {−1, 1}), the decision is based on the sign of P{Y = −1 | X = x} − P{Y = 1 | X = x}
  - we only need to agree with the sign of E{Y | X = x}
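A numeric sketch of the classification case, for a hypothetical distribution chosen so that the posterior is known in closed form: the Bayes classifier thresholds η(x) = P{Y = 1 | X = x} at 1/2, its risk matches the Bayes risk L* = E{min(η(X), 1 − η(X))}, and a classifier with any other threshold does worse:

```python
import numpy as np

rng = np.random.default_rng(7)

# Assumed binary problem: X uniform on [0, 1], eta(x) = P{Y = 1 | X = x} = x.
def eta(x):
    return x

n = 200_000
x = rng.uniform(0.0, 1.0, n)
y = (rng.uniform(0.0, 1.0, n) < eta(x)).astype(int)

# Bayes classifier: predict the most probable class, i.e. 1 iff eta(x) > 1/2.
g_bayes = (eta(x) > 0.5).astype(int)
# For eta(x) = x, the Bayes risk is L* = E{min(X, 1 - X)} = 1/4.
err_bayes = (g_bayes != y).mean()

# A suboptimal classifier (shifted threshold) has a strictly larger risk.
g_other = (x > 0.3).astype(int)
err_other = (g_other != y).mean()

print(err_bayes, err_other)
```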
No free lunch
- unfortunately, making no hypothesis on P costs a lot
- no-free-lunch results hold for binary classification with the misclassification cost
- there is always a bad distribution:
  - given a fixed machine learning method, for all ε > 0 and all n, there is (X, Y) such that L* = 0 and
      E{L(gn)} ≥ 1/2 − ε
- arbitrarily slow convergence always happens:
  - given a fixed machine learning method and a decreasing sequence (an) with limit 0 (and a1 ≤ 1/16), there is (X, Y) such that L* = 0 and
      E{L(gn)} ≥ an
Estimation difficulties
- in fact, estimating the Bayes risk L* (in binary classification) is difficult:
  - given a fixed estimation algorithm for L*, denoted L̂n
  - for all ε > 0 and all n, there is (X, Y) such that
      E{|L̂n − L*|} ≥ 1/4 − ε
- regression is more difficult than classification:
  - let ηn be an estimator of η = E{Y | X}, consistent in a weak sense:
      limn→∞ E{(ηn(X) − η(X))²} = 0
  - then for gn(x) = I{ηn(x) > 1/2}, we have
      limn→∞ (E{L(gn)} − L*) / √(E{(ηn(X) − η(X))²}) = 0
    i.e., the classification error of the plug-in classifier converges faster than the regression error
Consequences

- no free lunch results show that we cannot get uniform convergence speeds:
  - they don't rule out universal (strong) consistency!
  - they don't rule out PAO bounds under hypotheses on P
- intuitive explanation:
  - stationarity and independence are not sufficient
  - if P is arbitrarily complex, then no generalization can occur from a finite learning set
  - e.g., Y = f(X) + ε: interpolation needs regularity assumptions on f, extrapolation needs even stronger hypotheses
- but we don't want hypotheses on P...
Estimation and approximation

- solution: split the problem in two parts, estimation and approximation
- estimation:
  - restrict the class of acceptable models to G, a set of measurable functions from X to Y
  - study L(gn) − inf_{g∈G} L(g) when the ML algorithm must choose gn in G
  - statistics needed!
  - hypotheses on G replace hypotheses on P and allow uniform results!
- approximation:
  - compare G to the set of all possible models
  - in other words, study inf_{g∈G} L(g) − L∗
  - function approximation results (density), no statistics!
Back to machine learning algorithms

- many ad hoc algorithms:
  - k nearest neighbors
  - tree based algorithms
  - AdaBoost
- but most algorithms are based on the following scheme:
  - fix a model class G
  - choose gn in G by optimizing a quality measure defined via Dn
- examples:
  - linear regression (X is a Hilbert space, Y = R):
    - G = {x ↦ ⟨w, x⟩}
    - find wn by minimizing the mean squared error
      wn = arg min_w (1/n) Σ_{i=1}^n (yi − ⟨w, xi⟩)²
  - linear classification (X = R^p, Y = {1, ..., c}):
    - G = {x ↦ arg max_k (Wx)_k}
    - find W, e.g., by minimizing a mean squared error for a disjunctive coding of the classes
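The linear regression case above fits this scheme in a few lines. A minimal sketch, assuming toy synthetic data (the data, `w_true`, and the noise level are illustrative, not from the slides):

```python
import numpy as np

# Toy data for illustration: y = <w_true, x> + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

# Empirical mean squared error: L_n(w) = (1/n) sum_i (y_i - <w, x_i>)^2
def empirical_risk(w, X, y):
    return np.mean((y - X @ w) ** 2)

# w_n = arg min_w L_n(w): the ordinary least-squares solution
w_n, *_ = np.linalg.lstsq(X, y, rcond=None)

# The minimizer does at least as well as any fixed parameter, including w_true
print(empirical_risk(w_n, X, y) <= empirical_risk(w_true, X, y))  # True
```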
Criterion

- the best solution would be to choose gn by minimizing L(gn) over G, but:
  - we cannot compute L(gn) (P is unknown!)
  - even if we knew L(gn), optimization might be intractable (L has no reason to be convex, for instance)
- partial solution, the empirical risk:
  Ln(g) = (1/n) Σ_{i=1}^n c(g(Xi), Yi)
- naive justification by the strong law of large numbers (SLLN): when g is fixed, lim_{n→∞} Ln(g) = L(g)
- the algorithmic issue is handled by replacing the empirical risk by an empirical loss (see part IV)
Empirical risk minimization

- generic machine learning method:
  - fix a model class G
  - choose g∗n in G by
    g∗n = arg min_{g∈G} Ln(g)
- denote L∗G = inf_{g∈G} L(g)
- does that work?
  - E{L(g∗n)} → L∗G?
  - E{L(g∗n)} → L∗?
  - the SLLN does not help as g∗n depends on n
- we are looking for bounds:
  - L(g∗n) < Ln(g∗n) + B (bound the risk using the empirical risk)
  - L(g∗n) < L∗G + C (guarantee that the risk is close to the best possible one in the class)
  - L∗G − L∗ is handled differently (approximation vs estimation)
Complexity control
The overfitting issue

- controlling Ln(g∗n) − L(g∗n) is not possible in arbitrary classes G
- a simple example:
  - binary classification
  - G: all piecewise constant functions on arbitrary partitions of X
  - obviously Ln(g∗n) = 0 if PX has a density (via the classifier defined by the 1-nn rule)
  - but L(g∗n) > 0 when L∗ > 0!
  - the 1-nn rule is overfitting
- the only solution is to reduce the "size" of G (see part II)
- this implies L∗G > L∗ (for arbitrary P)
- how to reach L∗? (will be studied in part III)
- in other words, how to balance estimation error and approximation error
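The 1-nn overfitting phenomenon is easy to reproduce numerically. A sketch under illustrative assumptions: X uniform on [0, 1] and P(Y = 1 | X) = 0.8 everywhere, so that L∗ = 0.2 while the 1-nn rule interpolates the training labels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy labels: P(Y = 1 | X) = 0.8 for every x, hence L* = 0.2
def sample(n):
    X = rng.uniform(size=n)
    Y = (rng.uniform(size=n) < 0.8).astype(int)
    return X, Y

# 1-nn rule: predict the label of the closest training point
def one_nn(Xtr, Ytr, x):
    return Ytr[np.argmin(np.abs(Xtr - x))]

Xtr, Ytr = sample(500)
train_err = np.mean([one_nn(Xtr, Ytr, x) != y for x, y in zip(Xtr, Ytr)])
Xte, Yte = sample(5000)
test_err = np.mean([one_nn(Xtr, Ytr, x) != y for x, y in zip(Xte, Yte)])

print(train_err)         # 0.0: the empirical risk of the interpolating rule is zero
print(test_err > 0.25)   # True: the estimated true risk stays well above L* = 0.2
```

(The asymptotic 1-nn risk here is 2·0.8·0.2 = 0.32, noticeably above the Bayes risk 0.2.)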
Increasing complexity

- key idea: choose gn in Gn, i.e., in a class that depends on the data size
- estimation part:
  - statistics
  - make sure that Ln(g∗n) − L(g∗n) is under control when g∗n = arg min_{g∈Gn} Ln(g)
- approximation part:
  - functional analysis
  - make sure that lim_{n→∞} L∗_{Gn} = L∗
  - related to density arguments (universal approximation)
- this is the simplest approach but also the least realistic: Gn depends only on the data size
Structural risk minimization

- key idea: again use more and more powerful classes Gd, but also compare models between classes
- more precisely:
  - compute g∗_{n,d} by empirical risk minimization in Gd
  - choose g∗n by minimizing over d
    Ln(g∗_{n,d}) + r(n, d)
- r(n, d) is a penalty term:
  - it decreases with n and increases with the "complexity" of Gd, to favor models for which the risk is correctly estimated by the empirical risk
  - it corresponds to a complexity measure for Gd
- more interesting than the increasing complexity solution, but rather unrealistic from a practical point of view because of the tremendous computational cost
Regularization

- key idea: use a large class but optimize a penalized version of the empirical risk (see part IV)
- more precisely:
  - choose G such that L∗G = L∗
  - define g∗n by
    g∗n = arg min_{g∈G} (Ln(g) + λn R(g))
- R(g) measures the complexity of the model
- the trade-off between the risk and the complexity, λn, must be carefully chosen
- this is one of the best solutions
Summary

- from n i.i.d. observations Dn = (Xi, Yi)_{i=1}^n, a ML algorithm builds a model gn
- a good ML algorithm builds a universally strongly consistent model, i.e., for all P,
  L(gn) → L∗ almost surely
- a very general class of ML algorithms is based on empirical risk minimization, i.e.,
  g∗n = arg min_{g∈G} (1/n) Σ_{i=1}^n c(g(Xi), Yi)
- consistency is obtained by controlling the estimation error and the approximation error
Outline

- part II: empirical risk and capacity measures
- part III: capacity control
- part IV: empirical loss and regularization
- not in this lecture (see the references):
  - ad hoc methods such as k nearest neighbours and trees
  - advanced topics such as noise conditions and data dependent bounds
  - clustering
  - Bayesian point of view
  - and many other things...
Part II
Empirical risk and capacity measures
Outline

Concentration
  Hoeffding inequality
  Uniform bounds
Vapnik-Chervonenkis Dimension
  Definition
  Application to classification
  Proof
Covering numbers
  Definition and results
  Computing covering numbers
Summary
Empirical risk

- empirical risk minimisation (ERM) relies on the link between Ln(g) and L(g)
- the law of large numbers is too limited:
  - asymptotic only
  - g must be fixed
- we need:
  1. quantitative finite distance results, i.e., PAO like bounds on
     P{|Ln(g) − L(g)| > ε}
  2. uniform results, i.e., bounds on
     P{sup_{g∈G} |Ln(g) − L(g)| > ε}
Concentration inequalities

- in essence, the goal is to evaluate the concentration of the empirical mean of c(g(X), Y) around its expectation:
  Ln(g) − L(g) = (1/n) Σ_{i=1}^n c(g(Xi), Yi) − E{c(g(X), Y)}
- some standard bounds:
  - Markov: P{|X| ≥ t} ≤ E{|X|} / t
  - Bienaymé-Chebyshev: P{|X − E{X}| ≥ t} ≤ Var(X) / t²
  - BC gives, for n independent real valued random variables Xi, denoting Sn = Σ_{i=1}^n Xi:
    P{|Sn − E{Sn}| ≥ t} ≤ (Σ_{i=1}^n Var(Xi)) / t²
Hoeffding's inequality
Hoeffding, 1963

hypothesis
- X1, ..., Xn, n independent r.v.
- Xi ∈ [ai, bi]
- Sn = Σ_{i=1}^n Xi

result
  P{Sn − E{Sn} ≥ ε} ≤ exp(−2ε² / Σ_{i=1}^n (bi − ai)²)
  P{Sn − E{Sn} ≤ −ε} ≤ exp(−2ε² / Σ_{i=1}^n (bi − ai)²)

A quantitative, distribution free, finite distance law of large numbers!
Direct application

- with a bounded cost c(u, v) ∈ [a, b] (for all (u, v)), such as:
  - a bounded Y in regression
  - or the misclassification cost in binary classification ([a, b] = [0, 1])
- set Ui = (1/n) c(g(Xi), Yi), so Ui ∈ [a/n, b/n]
- Warning: g cannot depend on the (Xi, Yi) (or the Ui are no longer independent)
- Σ_{i=1}^n (bi − ai)² = (b − a)² / n
- and therefore
  P{|Ln(g) − L(g)| ≥ ε} ≤ 2 exp(−2nε² / (b − a)²)
- then Ln(g) → L(g) in probability and, by Borel-Cantelli, almost surely
Limitations

- g cannot depend on the (Xi, Yi) (independent test set)
- intuitively:
  - set δ = 2 exp(−2nε² / (b − a)²)
  - for each g, the probability of drawing from P a Dn on which
    |Ln(g) − L(g)| ≤ (b − a) √(log(2/δ) / (2n))
    is at least 1 − δ
  - but the sets of "correct" Dn differ for each g
  - conversely, for a fixed Dn, the bound is valid only for some of the models
- this cannot be used to study g∗n = arg min_{g∈G} Ln(g)
The test set

- Hoeffding's inequality justifies the test set approach (hold out estimate)
- given a dataset Dm:
  - split the dataset into two disjoint sets: a learning set Dn and a test set D'p (with p + n = m)
  - use a ML algorithm to build gn using only Dn
  - claim that L(gn) ≈ L'p(gn) = (1/p) Σ_{i=n+1}^m c(gn(Xi), Yi)
- in fact, with probability at least 1 − δ,
  L(gn) ≤ L'p(gn) + (b − a) √(log(1/δ) / (2p))
- example:
  - classification (a = 0, b = 1), with a "good" classifier: L'p(gn) = 0.05
  - to be 99% sure that L(gn) ≤ 0.06, we need p = 23000 (!!!)
Proof of Hoeffding's inequality

- Chernoff's bounding technique (an application of Markov's inequality):
  P{X ≥ t} = P{e^{sX} ≥ e^{st}} ≤ E{e^{sX}} / e^{st}
  the bound is controlled via s
- lemma: if E{X} = 0 and X ∈ [a, b], then for all s,
  E{e^{sX}} ≤ e^{s²(b−a)²/8}
  - exp is convex ⇒ E{e^{sX}} ≤ e^{φ(s(b−a))}
  - then we bound φ
- this is used for X = Sn − E{Sn}
Proof of Hoeffding's inequality

P{Sn − E{Sn} ≥ ε} ≤ e^{−sε} E{e^{s Σ_{i=1}^n (Xi − E{Xi})}}
                  = e^{−sε} Π_{i=1}^n E{e^{s(Xi − E{Xi})}}   (independence)
                  ≤ e^{−sε} Π_{i=1}^n e^{s²(bi−ai)²/8}       (lemma)
                  = e^{−sε} e^{s² Σ_{i=1}^n (bi−ai)²/8}
                  = e^{−2ε² / Σ_{i=1}^n (bi−ai)²}            (minimization)

with s = 4ε / Σ_{i=1}^n (bi − ai)²
Remarks

- Hoeffding's inequality is useful in many other contexts:
  - large-scale machine learning (e.g., [Domingos and Hulten, 2001]):
    - enormous datasets (e.g., that cannot be loaded in memory)
    - process growing subsets until the model quality is known with sufficient confidence
    - monitor drifting via confidence bounds
  - regret in game theory (e.g., [Cesa-Bianchi and Lugosi, 2006]):
    - define strategies by minimizing a regret measure
    - get bounds on the regret estimates (e.g., trade-off between exploration and exploitation in multi-armed bandits)
- many improvements/complements are available, such as:
  - Bernstein's inequality
  - McDiarmid's bounded difference inequality
Uniform bounds

- we deal with the estimation error:
  - given the ERM choice g∗n = arg min_{g∈G} Ln(g)
  - what about L(g∗n) − inf_{g∈G} L(g)?
- we need uniform bounds:
  |Ln(g∗n) − L(g∗n)| ≤ sup_{g∈G} |Ln(g) − L(g)|
  L(g∗n) − inf_{g∈G} L(g) = L(g∗n) − Ln(g∗n) + Ln(g∗n) − inf_{g∈G} L(g)
                          ≤ L(g∗n) − Ln(g∗n) + sup_{g∈G} |Ln(g) − L(g)|
                          ≤ 2 sup_{g∈G} |Ln(g) − L(g)|
- therefore uniform bounds answer both questions: the quality of the empirical risk as an estimate of the risk, and the quality of the model selected by ERM
Finite model class

- union bound: P{A or B} ≤ P{A} + P{B}, and therefore
  P{max_{1≤i≤m} Ui ≥ ε} ≤ Σ_{i=1}^m P{Ui ≥ ε}
- then, when G is finite,
  P{sup_{g∈G} |Ln(g) − L(g)| ≥ ε} ≤ 2|G| e^{−2nε²/(b−a)²}
  so L(g∗n) → inf_{g∈G} L(g) almost surely (but in general inf_{g∈G} L(g) > L∗)
- a quantitative, distribution free, uniform, finite distance law of large numbers!
Bound versus size

with probability at least 1 − δ:

  L(g∗n) ≤ inf_{g∈G} L(g) + 2(b − a) √( (log |G| + log(2/δ)) / (2n) )

I inf_{g∈G} L(g) decreases with |G| (more models)
I but the bound increases with |G|: trade-off between flexibility and estimation quality
I remark: log |G| is the number of bits needed to encode the model choice; this is a capacity measure
50 / 154 Fabrice Rossi Concentration
http://apiacoa.org/
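The trade-off is easy to quantify. This sketch (my illustration, not from the lecture; `estimation_term` is a name I chose) evaluates the estimation part of the bound, 2(b − a)√((log |G| + log(2/δ))/(2n)), and checks how it scales with |G| and n.

```python
import math

# estimation-error term of the bound
# L(g*_n) <= inf_G L + 2(b-a) sqrt((log|G| + log(2/delta)) / (2n))
def estimation_term(card_G, n, delta=0.05, b_minus_a=1.0):
    return 2.0 * b_minus_a * math.sqrt(
        (math.log(card_G) + math.log(2.0 / delta)) / (2.0 * n)
    )

n = 1000
terms = [estimation_term(m, n) for m in (10, 10 ** 3, 10 ** 6)]

# the term grows with |G| (more models means weaker guarantees)...
assert terms[0] < terms[1] < terms[2]

# ...but only logarithmically: going from 10^3 to 10^6 models merely
# doubles log|G|, while the term shrinks like 1/sqrt(n): quadrupling
# the sample size halves it exactly
assert math.isclose(
    estimation_term(10 ** 3, 4 * n),
    estimation_term(10 ** 3, n) / 2,
    rel_tol=1e-12,
)
```

This is the flexibility/estimation trade-off of the slide in numbers: enlarging G lowers inf_{g∈G} L(g) but inflates the √(log |G|) penalty, and only more data shrinks the latter.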
Outline

Concentration
  Hoeffding inequality
  Uniform bounds
Vapnik-Chervonenkis Dimension
  Definition
  Application to classification
  Proof
Covering numbers
  Definition and results
  Computing covering numbers
Summary
51 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
http://apiacoa.org/
Infinite model class
I so far we have uniform bounds only when G is finite:
  I even simple classes such as Glin = {x 7→ arg max_k (Wx)_k} are infinite
  I in addition inf_{g∈Glin} L(g) > 0 in general, even when L∗ = 0 (nonlinear problems)
I solution: evaluate the capacity of a class rather than its size
I first step:
  I binary classification (the simplest case)
  I a model is then a measurable set A ⊂ X
  I the capacity of G measures the variability of the available shapes
  I it is evaluated with respect to the data: if A ∩ Dn = B ∩ Dn, then A and B are considered identical
52 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
http://apiacoa.org/
Growth function
I abstract setting: F is a set of measurable functions from R^d to {0,1} (in other words, a set of (indicator functions of) measurable subsets of R^d)
I given z1 ∈ R^d, . . . , zn ∈ R^d, we define

  F_{z1,...,zn} = {u ∈ {0,1}^n | ∃f ∈ F, u = (f(z1), . . . , f(zn))}

I interpretation: each u describes a binary partition of z1, . . . , zn and F_{z1,...,zn} is then the set of partitions implemented by F
I growth function

  S_F(n) = sup_{z1∈R^d,...,zn∈R^d} |F_{z1,...,zn}|

I interpretation: maximal number of binary partitions implementable via F (distribution free ⇒ worst case analysis)
53 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
http://apiacoa.org/
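These definitions become concrete for the simplest class, 1-D thresholds F = {z 7→ 1{z ≥ t}}. The sketch below (my illustration, not from the lecture; `dichotomies` is a name I chose) enumerates F_{z1,...,zn} exactly: a threshold only matters through its position relative to the sorted points, so finitely many candidate thresholds realize every dichotomy.

```python
# enumerate F_{z1,...,zn} for the class F = { z -> 1{z >= t} : t in R }
def dichotomies(points):
    zs = sorted(points)
    # one threshold below every point, one between each consecutive pair,
    # one above every point: these realize all achievable label patterns
    cands = [zs[0] - 1.0]
    cands += [(a + b) / 2.0 for a, b in zip(zs, zs[1:])]
    cands += [zs[-1] + 1.0]
    return {tuple(int(z >= t) for z in points) for t in cands}

# n distinct points always yield exactly n + 1 dichotomies, so S_F(n) = n + 1
pts = [0.3, 1.7, 2.2, 5.0]
assert len(dichotomies(pts)) == len(pts) + 1

assert len(dichotomies([0.0])) == 2        # S_F(1) = 2^1: one point is shattered
assert len(dichotomies([0.0, 1.0])) == 3   # S_F(2) = 3 < 2^2
```

Since S_F(1) = 2 but S_F(2) < 4, the VC dimension of this class (defined on the next slide) is 1.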
Vapnik-Chervonenkis dimension
I S_F(n) ≤ 2^n
I vocabulary:
  I S_F(n) is the n-th shatter coefficient of F
  I if |F_{z1,...,zn}| = 2^n, F shatters z1, . . . , zn
I Vapnik-Chervonenkis dimension

  VCdim(F) = sup{n ∈ N+ | S_F(n) = 2^n}

I interpretation:
  I if S_F(n) < 2^n:
    I for all z1, . . . , zn, there is a partition u ∈ {0,1}^n such that u ∉ F_{z1,...,zn}
    I for any dataset, F can fail to reach L∗ = 0
  I if S_F(n) = 2^n: there is at least one set z1, . . . , zn shattered by F
54 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
http://apiacoa.org/
Example
I points from R²
I F = {I_{ax+by+c≥0}}
I S_F(3) = 2³, even if we sometimes have |F_{z1,z2,z3}| < 2³ (e.g., collinear points)
I S_F(4) < 2⁴
I VCdim(F) = 3
55 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
http://apiacoa.org/
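The shatter coefficients of the half-plane example can be counted by machine. The sketch below (my illustration, not from the lecture; `linear_dichotomies` is a name I chose) relies on the fact that, for points in general position, every linear dichotomy is realized by some direction (a, b) = (cos t, sin t) with a threshold strictly between two consecutive projected values, so sweeping a fine grid of directions recovers all of them for well-separated points.

```python
import math

# count the dichotomies of a point set realized by F = { I_{ax+by+c>=0} }
def linear_dichotomies(points, n_angles=2000):
    found = set()
    for k in range(n_angles):
        t = math.pi * (2 * k + 1) / n_angles  # directions covering the full circle
        a, b = math.cos(t), math.sin(t)
        proj = sorted(a * x + b * y for x, y in points)
        # thresholds below, between and above the projected values
        cands = [proj[0] - 1.0]
        cands += [(u + v) / 2.0 for u, v in zip(proj, proj[1:])]
        cands += [proj[-1] + 1.0]
        for c in cands:
            found.add(tuple(int(a * x + b * y - c >= 0) for x, y in points))
    return found

triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

assert len(linear_dichotomies(triangle)) == 8  # 2^3: three points are shattered
assert len(linear_dichotomies(square)) == 14   # < 2^4: the two diagonal splits are missing
```

This matches the slide: S_F(3) = 2³, S_F(4) < 2⁴, hence VCdim(F) = 3. The two missing patterns for the square are exactly the XOR-style diagonal labelings, which no half-plane can realize.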
Linear models

result
if G is a p-dimensional vector space of real valued functions defined on R^d then

  VCdim({f : R^d → {0,1} | ∃g ∈ G, f(z) = I_{g(z)≥0}}) ≤ p

proof
I given z1, . . . , zp+1, let F from G to R^{p+1} be F(g) = (g(z1), . . . , g(zp+1))
I as dim(F(G)) ≤ p, there is a non zero γ = (γ1, . . . , γp+1) such that ∑_{i=1}^{p+1} γi g(zi) = 0 for all g ∈ G
I if z1, . . . , zp+1 were shattered, there would also be, for each j, a gj such that gj(zi) = δij, and therefore γj = 0 for all j, a contradiction
I consequently no z1, . . . , zp+1 can be shattered
56 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
http://apiacoa.org/
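The pivotal vector γ of the proof can be exhibited explicitly. In this sketch (my illustration, not from the lecture) G is the space of polynomials of degree ≤ 2 on R, so p = 3, and on the p + 1 = 4 points z = 0, 1, 2, 3 the third finite difference γ = (−1, 3, −3, 1) is a nonzero vector with ∑_i γ_i g(z_i) = 0 for every quadratic g.

```python
zs = [0.0, 1.0, 2.0, 3.0]
gamma = [-1.0, 3.0, -3.0, 1.0]  # third finite difference: annihilates degree <= 2

# gamma annihilates the basis 1, z, z^2, hence all of G by linearity
for k in range(3):
    assert sum(g * z ** k for g, z in zip(gamma, zs)) == 0.0

# spot check on an arbitrary quadratic
def g(z):
    return 2.5 - 1.3 * z + 0.7 * z * z

assert abs(sum(gi * g(zi) for gi, zi in zip(gamma, zs))) < 1e-12

# if these 4 points were shattered by f(z) = 1{g(z) >= 0}, the proof's
# functions g_j with g_j(z_i) = delta_ij would force gamma_j = 0 for
# every j, contradicting gamma != 0: the 4 points cannot be shattered,
# so VCdim <= p = 3 for this class, matching the result above
```

The same argument works for any z1, . . . , zp+1: a p-dimensional space of functions always leaves a nonzero linear relation among its values on p + 1 points.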
VC dimension ≠ parameter number
I in the linear case F = {f(z) = I_{∑_{i=1}^{p} wi φi(z) ≥ 0}}, VCdim(F) = p
I but in general:
  I VCdim({f(z) = I_{sin(tz)≥0}}) = ∞
  I one hidden layer perceptron:

    G = { g(z) = T(β0 + ∑_{k=1}^{h} βk T(αk0 + ∑_{j=1}^{d} αkj zj)) }

    with T(a) = I_{a≥0}
  I VCdim(G) ≥ W log2(h/4)/32 where W = dh + 2h + 1 is the number of parameters
57 / 154 Fabrice Rossi Vapnik-Chervonenkis Dimension
http://apiacoa.org/
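The sine example deserves a demonstration: a single real parameter t suffices to shatter arbitrarily many points. The sketch below (my illustration, not from the lecture) uses the dyadic points 1, 2, 4, 8, for which the signs of sin(t), sin(2t), sin(4t), sin(8t) behave like independent Rademacher functions of t, so all 2⁴ label patterns occur; a fine grid over t ∈ (0, 2π) finds each of them.

```python
import math

points = [1.0, 2.0, 4.0, 8.0]  # dyadic points shattered by f_t(z) = 1{sin(tz) >= 0}

patterns = set()
steps = 4096
for k in range(steps):
    # offset grid avoids hitting exact zeros of any sin(t * z)
    t = 2.0 * math.pi * (k + 0.5) / steps
    patterns.add(tuple(int(math.sin(t * z) >= 0) for z in points))

# every one of the 2^4 labelings is realized by some t: the points are shattered
assert len(patterns) == 2 ** len(points)
```

The construction extends to any n dyadic points, so this one-parameter class has infinite VC dimension, as the slide claims: capacity is about the richness of the realizable sign patterns, not about counting parameters.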
Snake oil warning
I you might read here and there things like
  "the VC-dimension of the class of separating hyperplanes with margin larger than 1/δ is bounded by bla bla bla"
I this is plain wrong!
  I the class G is data independent
  I shattering does not take margin into account
  I asking for the "VC-dimension of the class of separating hyperplanes with margin larger than 1/δ" is nonsense
I there is an extension of the VC dimension