High-dimensional statistics
Lecturer: Dalalyan A., Scribe: Thomas F.-X.
First lecture
1 Introduction
1.1 Classical statistics
Parametric statistics: $Z_1, \dots, Z_n$ are iid with common distribution $P_\theta$.
We assume that $\theta \in \Theta \subset \mathbb{R}^d$.
• Known: $Z_1, \dots, Z_n$ and $\Theta$
• Unknown: $\theta$ or $P_\theta$
Important assumption: $d$ is fixed and $n \to +\infty$.
In this case, the maximum likelihood (ML) estimator is known to be consistent and asymptotically efficient: as $n \to +\infty$, $\hat\theta^{\mathrm{ML}}$ satisfies, for some constant $C$,
\[ \mathbb{E}_P\big[\|\hat\theta^{\mathrm{ML}} - \theta\|^2\big] = \frac{C}{n}\,(1 + o(1)). \]
Hence $\theta$ is estimated at the rate $1/\sqrt{n}$ (the parametric rate).
Observation. If $d = d_n$ with $\lim_{n\to+\infty} d_n = +\infty$, then the whole parametric theory no longer applies. Moreover, the ML estimator is no longer the best estimator!
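The parametric rate can be checked numerically. The following is a minimal sketch under an assumed model that is not part of the lecture: $Z_i \sim N(\theta, I_d)$, for which the ML estimator is the sample mean and the risk equals $d/n$ exactly (so $C = d$). The function name `ml_risk` and all numerical values are illustrative choices.

```python
import random

def ml_risk(n, d, reps=2000, seed=0):
    """Monte Carlo estimate of E||theta_hat - theta||^2 for the sample mean.

    Assumed model (illustration only): Z_i ~ N(theta, I_d) with theta = 0,
    so the ML estimator is the coordinate-wise sample mean.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sq_err = 0.0
        for _ in range(d):
            # Sample mean of n standard Gaussian draws for one coordinate.
            mean_j = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
            sq_err += mean_j ** 2  # the true theta_j is 0
        total += sq_err
    return total / reps

d = 3
risks = {n: ml_risk(n, d) for n in (25, 100)}
for n, r in risks.items():
    # n * risk should stay close to C = d for every n.
    print(f"n={n:4d}  risk={r:.4f}  n*risk={n * r:.3f}")
```

Multiplying the sample size by 4 divides the risk by roughly 4, i.e. divides the estimation error by 2, which is the $1/\sqrt{n}$ rate. Note that the constant grows with $d$, which hints at why the theory breaks down when $d = d_n \to +\infty$.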
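1.2 Nonparametric statistics
We observe $Z_1, \dots, Z_n$ iid with an unknown distribution $P$ such that $P \in \{P_\theta, \theta \in \Theta\}$, where $\Theta$ is either infinite-dimensional or of finite dimension $d = d_n$ that tends to $+\infty$ with the sample size.
Examples:
\[ \Theta = \big\{ f : [0,1] \to \mathbb{R},\ f \text{ Lipschitz with constant } L \big\} \tag{1} \]
\[ \phantom{\Theta} = \big\{ f : [0,1] \to \mathbb{R},\ \forall x, y,\ |f(x) - f(y)| \le L\,|x - y| \big\} \tag{2} \]
\[ \Theta = \Big\{ \theta = (\theta_1, \theta_2, \dots),\ \sum_{j=1}^{\infty} \theta_j^2 < +\infty \Big\} = \ell_2 \tag{3} \]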
General approach: We approximate $\Theta$ by an increasing sequence $\{\Theta_k\}$ of subsets of $\Theta$ such that $\Theta_k$ has dimension $d_k$. Proceeding as if $\theta$ belonged to $\Theta_k$ (which is not necessarily the case), we use a parametric method to define an estimator $\hat\theta_k$ of $\theta$. This yields a family of estimators $\{\hat\theta_k\}$.
Main question. How should $k$ be chosen to minimize the risk of $\hat\theta_k$?
• If $k$ is small, we face underfitting.
• Conversely, if $k$ is large, we face overfitting.
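The trade-off in $k$ can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the notes: a regression setting $Y_i = f(X_i) + \varepsilon_i$ on $[0,1]$, with $\Theta_k$ taken to be the functions that are constant on each of $k$ equal-width bins (so $d_k = k$) and the estimator in $\Theta_k$ being the least-squares fit, i.e. the bin averages. The true function, the noise level and the function names are hypothetical.

```python
import math
import random

def fit_piecewise_constant(xs, ys, k):
    """Least-squares fit in Theta_k: functions constant on k equal bins of [0, 1]."""
    sums, counts = [0.0] * k, [0] * k
    for x, y in zip(xs, ys):
        j = min(int(x * k), k - 1)
        sums[j] += y
        counts[j] += 1
    # Empty bins fall back to 0 (an arbitrary convention for this sketch).
    return [s / c if c > 0 else 0.0 for s, c in zip(sums, counts)]

def integrated_sq_error(levels, f, grid_size=2000):
    """Approximate the integral over [0, 1] of (f_hat - f)^2 on a midpoint grid."""
    k = len(levels)
    err = 0.0
    for i in range(grid_size):
        x = (i + 0.5) / grid_size
        err += (levels[min(int(x * k), k - 1)] - f(x)) ** 2
    return err / grid_size

def f(x):
    """Assumed true regression function (illustration only)."""
    return math.sin(2 * math.pi * x)

rng = random.Random(0)
n = 200
xs = [rng.random() for _ in range(n)]
ys = [f(x) + rng.gauss(0.0, 0.5) for x in xs]

errors = {k: integrated_sq_error(fit_piecewise_constant(xs, ys, k), f)
          for k in (1, 10, 100)}
for k, e in errors.items():
    print(f"k={k:3d}  integrated squared error={e:.4f}")
```

With $k = 1$ the fit is a single constant and misses the shape of $f$ (underfitting). With $k = 100$ each bin contains about two observations, so the fit mostly reproduces the noise (overfitting). An intermediate $k$ gives the smallest error.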
1.3 Principal models in non-parametric statistics
Density model. We have $X_1, \dots, X_n$ iid with a density $f$ defined on $\mathbb{R}^p$, that is,
\[ P(X_1 \in A) = \int_A f(x)\,dx. \]
The assumptions imposed on $f$ are very weak compared with the parametric setting. For instance, a typical assumption in the parametric setting is that $f$ is the Gaussian density
\[ f(x) = \frac{\det(\Sigma^{-1})^{1/2}}{(2\pi)^{p/2}} \exp\Big[ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \Big], \]
whereas a common assumption on $f$ in the nonparametric framework is that $f$ is smooth, say, twice continuously differentiable with a bounded second derivative.
Regression model. We observe $Z_i = (X_i, Y_i)$, with input $X_i$, output $Y_i$ and error $\varepsilon_i$:
\[ Y_i = f(X_i) + \varepsilon_i. \]
The function $f$ is called the regression function. Here, the goal is to estimate $f$ without assuming any parametric structure on it.
Practical examples.
Marketing.
• Each i represents a consumer
• $X_i$ are the features of the consumer
A typical question is "how do I identify the different relevant groups of consumers?". A typical answer is to use clustering algorithms. We assume that $X_1, \dots, X_n$ are iid with density $f$. Then, we estimate $f$ in a nonparametric manner by $\hat f$. The clusters are defined as regions around the local maxima of the function $\hat f$.
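As a rough illustration of this clustering idea, the sketch below estimates a one-dimensional density with a Gaussian kernel estimator and looks for its local maxima on a grid. Everything specific here is an assumption made for the example and does not come from the notes: the kernel estimator itself, the bandwidth `h=1.0`, the simulated two-group data, and the nearest-mode assignment used as a crude stand-in for "regions around the local maxima".

```python
import math
import random

def kde(x, sample, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    norm = len(sample) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample) / norm

def local_maxima(grid, values):
    """Grid points whose value is strictly larger than both neighbours."""
    return [grid[i] for i in range(1, len(grid) - 1)
            if values[i] > values[i - 1] and values[i] > values[i + 1]]

rng = random.Random(0)
# Assumed data: two consumer groups, one feature, centred at -3 and 3.
sample = ([rng.gauss(-3.0, 1.0) for _ in range(300)]
          + [rng.gauss(3.0, 1.0) for _ in range(300)])

grid = [-8.0 + 0.05 * i for i in range(321)]  # grid on [-8, 8]
values = [kde(x, sample, h=1.0) for x in grid]
modes = local_maxima(grid, values)

# Assign each observation to the nearest estimated mode.
labels = [min(range(len(modes)), key=lambda j: abs(x - modes[j]))
          for x in sample]
print("estimated modes:", [round(m, 2) for m in modes])
```

The bandwidth plays the same role as the dimension of the approximating model: too small a value creates spurious local maxima, too large a value merges genuine groups.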
1.4 Machine Learning
• Essentially the same as non-parametric statistics
• The main focus here is on the algorithms (rather than on the models), their statistical performance and their computational complexity.
2 Main concepts and notations
Observations: $Z_1, \dots, Z_n$ iid with distribution $P$
• Unsupervised learning: $Z_i = X_i$
• Supervised learning: $Z_i = (X_i, Y_i)$, where $X_i$ is an example or a feature, and $Y_i$ a label.
Aim. To learn the distribution $P$ or some properties of it.
Prediction. We assume that a new feature $X$ (from the same probability distribution as $X_1, \dots, X_n$) is observed. The aim is to predict the label associated with $X$.
To measure the quality of a prediction, we need a loss function $\ell(y, \hat y)$ ($y$ is the true label, $\hat y$ is the predicted label). In practice, both $y$ and $\hat y$ are random variables; furthermore, $y$ and its distribution are unknown, so $\ell(y, \hat y)$ is hard to compute!
Risk function. This is the expectation of the loss.
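Definition 1 Assume that $Z_i = (X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$ and that $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a loss function. A predictor, or prediction algorithm, is any mapping
\[ \hat g : (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{Y}^{\mathcal{X}}. \]
The risk of a prediction function $g : \mathcal{X} \to \mathcal{Y}$ is
\[ \mathcal{R}_P[g] = \mathbb{E}_P[\ell(Y, g(X))]. \]
The risk of a predictor $\hat g$ is $\mathcal{R}_P[\hat g]$, which is random since $\hat g$ depends on the data:
\[ \mathcal{R}_P[\hat g] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, \hat g(x))\,dP(x, y). \]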
Examples:
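1. Binary classification: $\mathcal{Y} = \{0, 1\}$, with any $\mathcal{X}$
\[ \ell(y, \hat y) = \begin{cases} 0, & \text{if } y = \hat y, \\ 1, & \text{otherwise} \end{cases} \;=\; \mathbf{1}(y \ne \hat y) = (y - \hat y)^2. \]
2. Least-squares regression: $\mathcal{Y} \subset \mathbb{R}$, with any $\mathcal{X}$
\[ \ell(y, \hat y) = (y - \hat y)^2. \]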
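To make the definition of the risk concrete, here is a minimal sketch that approximates the risk of a fixed prediction function by averaging the loss over simulated pairs $(X, Y)$. The distribution used (X uniform on $[0,1]$ and $Y$ given $X = x$ Bernoulli with parameter $x$), the prediction function `g` and the helper names are assumptions chosen for the example. Under this distribution the exact 0-1 risk of `g` is $\int_0^{1/2} x\,dx + \int_{1/2}^{1} (1 - x)\,dx = 1/4$.

```python
import random

def zero_one_loss(y, y_pred):
    return 1 if y != y_pred else 0

def squared_loss(y, y_pred):
    return (y - y_pred) ** 2

# On binary labels the two losses coincide, as stated in Example 1.
assert all(zero_one_loss(y, p) == squared_loss(y, p)
           for y in (0, 1) for p in (0, 1))

def g(x):
    """A fixed (non-random) prediction function."""
    return 1 if x > 0.5 else 0

def monte_carlo_risk(g, loss, n_draws=20000, seed=0):
    """Average loss of g over draws from an assumed distribution P:
    X ~ Uniform[0, 1] and Y | X = x ~ Bernoulli(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = rng.random()
        y = 1 if rng.random() < x else 0
        total += loss(y, g(x))
    return total / n_draws

risk_estimate = monte_carlo_risk(g, zero_one_loss)
print(f"estimated risk: {risk_estimate:.3f} (exact value under this P: 0.25)")
```

In a real problem $P$ is unknown, so this average cannot be computed by simulation; it can only be approximated from the observed sample.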
3 Excess risk and Bayes predictor
We have $Z_i = (X_i, Y_i)$ and
\[ \mathcal{R}_P[g] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, g(x))\,P(dx, dy), \qquad P(dx, dy) = P_{Y|X}(dy \mid X = x)\,P_X(dx). \]
Definition 2 Given a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, the Bayes predictor, or "oracle", is the prediction function minimizing the risk:
\[ g^* \in \arg\min_{g \in \mathcal{Y}^{\mathcal{X}}} \mathcal{R}_P[g]. \]
Remark 1 In practice, $g^*$ is unavailable, since it depends on $P$, which is unknown. The ultimate goal is to do almost as well as the oracle.
A predictor $\hat g_n$ will be considered a good one if
\[ \lim_{n \to +\infty} \underbrace{\mathcal{R}_P[\hat g_n] - \mathcal{R}_P[g^*]}_{\text{excess risk}} = 0. \]
Definition 3 We say that the predictor $\hat g_n$ is consistent (universally consistent) if, for every $P$, we have
\[ \lim_{n \to +\infty} \mathbb{E}_P\big[\mathcal{R}_P[\hat g_n]\big] - \mathcal{R}_P[g^*] = 0. \]
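Theorem 1
1. Suppose that, for every $x \in \mathcal{X}$, the infimum of $\hat y \mapsto \mathbb{E}_P[\ell(Y, \hat y) \mid X = x]$ is attained. Then the function $g^*$ defined by
\[ g^*(x) \in \arg\min_{\hat y \in \mathcal{Y}} \mathbb{E}_P[\ell(Y, \hat y) \mid X = x] \]
is a Bayes predictor.
2. In the case of binary classification, $\mathcal{Y} = \{0, 1\}$ and $\ell(y, \hat y) = \mathbf{1}(y \ne \hat y)$,
\[ g^*(x) = \mathbf{1}\big(\eta^*(x) > \tfrac{1}{2}\big), \quad \text{where } \eta^*(x) = P[Y = 1 \mid X = x]. \]
Furthermore, the excess risk can be computed by
\[ \mathcal{R}_P[g] - \mathcal{R}_P[g^*] = \mathbb{E}_P\big[(g(X) - g^*(X))\,(1 - 2\eta^*(X))\big]. \tag{4} \]
3. In the case of least-squares regression,
\[ g^*(x) = \eta^*(x), \quad \text{where } \eta^*(x) = \mathbb{E}_P[Y \mid X = x]. \]
Furthermore, for any $\eta : \mathcal{X} \to \mathcal{Y}$, we have
\[ \mathcal{R}_P[\eta] - \mathcal{R}_P[\eta^*] = \mathbb{E}_P\big[(\eta(X) - \eta^*(X))^2\big]. \]
Proof
1. Let $g \in \mathcal{Y}^{\mathcal{X}}$ and let
\[ g^*(x) \in \arg\min_{\hat y \in \mathcal{Y}} \mathbb{E}_P[\ell(Y, \hat y) \mid X = x]. \]
We have
\[ \begin{aligned}
\mathcal{R}_P[g] &= \mathbb{E}_P[\ell(Y, g(X))] \\
&= \int \mathbb{E}_P[\ell(Y, g(X)) \mid X = x]\,P_X(dx) \\
&\ge \int \mathbb{E}_P[\ell(Y, g^*(x)) \mid X = x]\,P_X(dx) \\
&= \mathcal{R}_P[g^*],
\end{aligned} \]
where the inequality holds because, for each fixed $x$, $g^*(x)$ minimizes the conditional expected loss.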
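2. Using the first assertion,
\[ \begin{aligned}
g^*(x) &\in \arg\min_{\hat y \in \{0,1\}} \mathbb{E}_P[\mathbf{1}(Y \ne \hat y) \mid X = x] \\
&= \arg\min_{\hat y \in \{0,1\}} P(Y \ne \hat y \mid X = x) \\
&= \arg\max_{\hat y \in \{0,1\}} P(Y = \hat y \mid X = x) \\
&= \arg\max_{\hat y \in \{0,1\}} \big\{ \eta^*(x)\,\mathbf{1}(\hat y = 1) + (1 - \eta^*(x))\,\mathbf{1}(\hat y = 0) \big\}.
\end{aligned} \]
Therefore,
\[ g^*(x) = \begin{cases} 0, & \text{if } P(Y = 1 \mid X = x) \le \tfrac{1}{2}, \\ 1, & \text{otherwise.} \end{cases} \]
To check (4), it suffices to remark that, since $g(X)$ and $Y$ take values in $\{0, 1\}$ (so that $g(X)^2 = g(X)$ and $Y^2 = Y$),
\[ \begin{aligned}
\mathcal{R}_P[g] &= \mathbb{E}_P[(g(X) - Y)^2] = \mathbb{E}_P[g(X)^2] + \mathbb{E}_P[Y^2] - 2\,\mathbb{E}_P[Y g(X)] \\
&= \mathbb{E}_P[g(X)] + \mathbb{E}_P[Y] - 2\,\mathbb{E}_P[\mathbb{E}_P(Y g(X) \mid X)] \\
&= \mathbb{E}_P[g(X)] + \mathbb{E}_P[Y] - 2\,\mathbb{E}_P[g(X)\,\mathbb{E}_P(Y \mid X)] \\
&= \mathbb{E}_P[g(X)] + \mathbb{E}_P[Y] - 2\,\mathbb{E}_P[g(X)\,\eta^*(X)] \\
&= \mathbb{E}_P[g(X)\,(1 - 2\eta^*(X))] + \mathbb{E}_P[Y].
\end{aligned} \]
Writing the same identity for $g^*$ and taking the difference of the two identities, we get the desired result.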
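3. In view of the first assertion of the theorem, we have
\[ g^*(x) \in \arg\min_{y \in \mathbb{R}} \mathbb{E}_P\big[(Y - y)^2 \mid X = x\big] = \arg\min_{y \in \mathbb{R}} \varphi(y), \]
where $\varphi(y) = \mathbb{E}_P[Y^2 \mid X = x] - 2y\,\mathbb{E}_P[Y \mid X = x] + y^2$ is a second-order polynomial in $y$. Minimizing such a polynomial is straightforward and leads to
\[ \arg\min_{y \in \mathbb{R}} \varphi(y) = \mathbb{E}_P[Y \mid X = x]. \]
This shows that the Bayes predictor is equal to the regression function $\eta^*(x)$. For an arbitrary $\eta$, writing $Y - \eta(X) = (Y - \eta^*(X)) + (\eta^* - \eta)(X)$, the risk is
\[ \begin{aligned}
\mathcal{R}_P[\eta] &= \mathbb{E}_P\big[(Y - \eta(X))^2\big] = \mathbb{E}_P\Big( \mathbb{E}_P\big[(Y - \eta(X))^2 \mid X\big] \Big) \\
&= \mathbb{E}_P\Big( \mathbb{E}_P\big[(Y - \eta^*(X))^2 \mid X\big] + 2\,\mathbb{E}_P\big[(Y - \eta^*(X))\,(\eta^* - \eta)(X) \mid X\big] + (\eta^* - \eta)^2(X) \Big) \\
&= \mathcal{R}_P[\eta^*] + 0 + \mathbb{E}_P\big[(\eta^* - \eta)^2(X)\big],
\end{aligned} \]
where the cross-product term vanishes since
\[ \mathbb{E}_P\big[(Y - \eta^*(X))\,(\eta^* - \eta)(X) \mid X\big] = (\eta^* - \eta)(X)\,\mathbb{E}_P\big[(Y - \eta^*(X)) \mid X\big] = 0. \]
This completes the proof of the theorem. □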
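Both excess-risk identities of Theorem 1 can be verified exactly on a small discrete example. The sketch below assumes a toy distribution that is not in the notes: $X$ takes three values with probabilities `p_x`, and $Y$ given $X = x$ is Bernoulli with parameter `eta_star[x]`. All names and numbers are illustrative.

```python
from itertools import product

# Assumed toy distribution P (illustration only): X takes three values,
# and Y | X = x is Bernoulli(eta_star[x]).
p_x = [0.2, 0.5, 0.3]
eta_star = [0.1, 0.6, 0.8]

def risk_class(g):
    """Exact 0-1 risk: sum_x P(X = x) * P(Y != g(x) | X = x)."""
    return sum(p * (eta if g_x == 0 else 1 - eta)
               for p, eta, g_x in zip(p_x, eta_star, g))

def risk_reg(eta):
    """Exact squared-loss risk: sum_x P(X = x) * E[(Y - eta(x))^2 | X = x]."""
    return sum(p * (e_star * (1 - e) ** 2 + (1 - e_star) * e ** 2)
               for p, e_star, e in zip(p_x, eta_star, eta))

g_star = [1 if e > 0.5 else 0 for e in eta_star]

# Formula (4), checked for every one of the 2^3 classifiers on this X.
for g in product((0, 1), repeat=3):
    lhs = risk_class(g) - risk_class(g_star)
    rhs = sum(p * (g_x - gs_x) * (1 - 2 * e)
              for p, g_x, gs_x, e in zip(p_x, g, g_star, eta_star))
    assert abs(lhs - rhs) < 1e-12
    assert lhs >= -1e-12  # g_star is never beaten

# Excess-risk identity of item 3, checked for an arbitrary eta.
eta = [0.4, 0.9, -0.2]
excess_reg = risk_reg(eta) - risk_reg(eta_star)
l2_distance = sum(p * (e - e_star) ** 2
                  for p, e, e_star in zip(p_x, eta, eta_star))
assert abs(excess_reg - l2_distance) < 1e-12
print("formula (4) and the regression identity hold on this example")
```

Because $X$ has finitely many values, every expectation is a finite sum, so the check is exact up to floating-point rounding rather than a Monte Carlo approximation.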
3.1 Link between Binary Classification & Regression
Plug-in rule
• We start by estimating $\eta^*(x)$ by $\hat\eta_n(x)$,
• We define $\hat g_n(x) = \mathbf{1}\big(\hat\eta_n(x) > \tfrac{1}{2}\big)$.
Question: How good is the plug-in rule $\hat g_n$?
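Proposition 1 Let $\hat\eta$ be an estimator of the regression function $\eta^*$, and let $\hat g(x) = \mathbf{1}\big(\hat\eta(x) > \tfrac{1}{2}\big)$. Then, we have
\[ \mathcal{R}_{\mathrm{class}}[\hat g] - \mathcal{R}_{\mathrm{class}}[g^*] \le 2 \sqrt{\mathcal{R}_{\mathrm{reg}}[\hat\eta] - \mathcal{R}_{\mathrm{reg}}[\eta^*]}. \]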
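Proof Let $\eta : \mathcal{X} \to \mathcal{Y} \subset \mathbb{R}$ and $g(x) = \mathbf{1}\big(\eta(x) > \tfrac{1}{2}\big)$, and let us compute the excess risk of $g$. By (4), we have
\[ \mathcal{R}_{\mathrm{class}}[g] - \mathcal{R}_{\mathrm{class}}[g^*] = \mathbb{E}_P\big[(g(X) - g^*(X))\,(1 - 2\eta^*(X))\big]. \]
Since $g$ and $g^*$ are both indicator functions and, therefore, take only the values 0 and 1, their difference is nonzero if and only if one of them is equal to 1 and the other one is equal to 0. This leads to
\[ \begin{aligned}
\mathcal{R}_{\mathrm{class}}[g] - \mathcal{R}_{\mathrm{class}}[g^*] &= \mathbb{E}_P\big[\mathbf{1}\big(\eta(X) \le 1/2 < \eta^*(X)\big)\,|2\eta^*(X) - 1|\big] \\
&\quad + \mathbb{E}_P\big[\mathbf{1}\big(\eta^*(X) \le 1/2 < \eta(X)\big)\,|2\eta^*(X) - 1|\big] \\
&\le 2\,\mathbb{E}_P\big[\mathbf{1}\big(1/2 \in [\eta^*(X), \eta(X)]\big)\,|\eta^*(X) - 1/2|\big],
\end{aligned} \]
where $[\eta^*(X), \eta(X)]$ denotes the segment between $\eta^*(X)$ and $\eta(X)$, whatever their order.
If $\eta(X) \le 1/2$ and $\eta^*(X) > 1/2$, then $|\eta^*(X) - 1/2| \le |\eta^*(X) - \eta(X)|$; the same bound holds in the symmetric case. Thus, using the Cauchy-Schwarz inequality for the third step and item 3 of Theorem 1 for the last one,
\[ \begin{aligned}
\mathcal{R}_{\mathrm{class}}[g] - \mathcal{R}_{\mathrm{class}}[g^*] &\le 2\,\mathbb{E}_P\big[\mathbf{1}\big(1/2 \in [\eta(X), \eta^*(X)]\big)\,|\eta(X) - \eta^*(X)|\big] \\
&\le 2\,\mathbb{E}_P\big[|\eta(X) - \eta^*(X)|\big] \\
&\le 2 \sqrt{\mathbb{E}_P\big[(\eta(X) - \eta^*(X))^2\big]} \\
&= 2 \sqrt{\mathcal{R}_{\mathrm{reg}}(\eta) - \mathcal{R}_{\mathrm{reg}}(\eta^*)}.
\end{aligned} \]
Since this inequality is true for every deterministic $\eta$, we get the desired property. □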
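The bound of Proposition 1 can be checked numerically. The sketch below assumes a toy setting that is not in the notes: $X$ is uniform on `m` points, $Y$ given $X = x$ is Bernoulli with parameter `eta_star[x]`, and the "estimates" are hypothetical Gaussian perturbations of the true regression function. Both excess risks are computed exactly from the formulas of Theorem 1.

```python
import math
import random

# Assumed toy distribution P (illustration only): X uniform on m points,
# Y | X = x is Bernoulli(eta_star[x]).
rng = random.Random(0)
m = 50
eta_star = [rng.random() for _ in range(m)]

def excess_class(eta):
    """Exact excess 0-1 risk of the plug-in rule 1(eta > 1/2), via formula (4)."""
    total = 0.0
    for e, e_star in zip(eta, eta_star):
        g = 1 if e > 0.5 else 0
        g_star = 1 if e_star > 0.5 else 0
        total += (g - g_star) * (1 - 2 * e_star)
    return total / m

def excess_reg(eta):
    """Exact excess squared-loss risk E[(eta(X) - eta_star(X))^2]."""
    return sum((e - e_star) ** 2 for e, e_star in zip(eta, eta_star)) / m

worst_ratio = 0.0
for noise in (0.05, 0.2, 0.5):
    for _ in range(200):
        # A hypothetical estimate: eta_star perturbed by Gaussian noise.
        eta = [e_star + rng.gauss(0.0, noise) for e_star in eta_star]
        lhs, rhs = excess_class(eta), 2 * math.sqrt(excess_reg(eta))
        assert lhs <= rhs + 1e-12
        worst_ratio = max(worst_ratio, lhs / rhs)
print(f"largest observed ratio lhs/rhs: {worst_ratio:.3f}")
```

The observed ratio staying below 1 is consistent with the proposition. On examples like this one it is typically well below 1, since the proof discards the indicator and then applies Cauchy-Schwarz, both of which can be loose.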