
High-Dimensional Statistics (Statistique en grande dimension)

Lecturer: Dalalyan A., Scribe: Thomas F.-X.

First lecture

1 Introduction

1.1 Classical statistics

Parametric statistics: $Z_1, \dots, Z_n$ are iid with common distribution $P_\theta$.

We assume that $\theta \in \Theta \subset \mathbb{R}^d$.

• Known: $Z_1, \dots, Z_n$ and $\Theta$

• Unknown: $\theta$, or equivalently $P_\theta$

Key assumption: $d$ is fixed and $n \to +\infty$.

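In this setting, the maximum likelihood estimator (MLE) is known to be consistent and asymptotically efficient: as $n \to +\infty$, $\hat\theta_{\mathrm{ML}}$ satisfies

$$\mathbb{E}_P\big[\|\hat\theta_{\mathrm{ML}} - \theta\|^2\big] = \frac{C}{n}\,(1 + o(1)),$$

for a constant $C$ that does not depend on $n$. In other words, $\theta$ is estimated at the rate $1/\sqrt{n}$ (the parametric rate).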
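As a quick sanity check of the parametric rate, the sketch below simulates the simplest case, assumed here purely for illustration: $Z_i \sim N(\theta, I_d)$ with $\theta = 0$, where the maximum likelihood estimator is the sample mean and the risk equals $d/n$ (so $C = d$). The function name `mle_risk` and the simulation sizes are hypothetical choices, not part of the lecture.

```python
import random


def mle_risk(n, d, n_rep=300, seed=0):
    """Monte Carlo estimate of E||theta_hat - theta||^2 for the Gaussian mean.

    Assumption (illustration only): Z_i ~ N(theta, I_d) with theta = 0,
    so the maximum likelihood estimator is the coordinate-wise sample mean.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        sq_err = 0.0
        for _ in range(d):
            # The true coordinate is 0, so the sample mean is the estimation error.
            mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
            sq_err += mean * mean
        total += sq_err
    return total / n_rep


if __name__ == "__main__":
    d = 3
    for n in (10, 100, 1000):
        # Simulated risk next to the theoretical value d / n.
        print(n, mle_risk(n, d), d / n)
```

Multiplying the simulated risk by $n$ should give values close to $d$ for every $n$, which is the $C/n$ behaviour stated above.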

Observation. If $d = d_n$ with $\lim_{n\to+\infty} d_n = +\infty$, then the whole parametric theory becomes unusable. Moreover, the maximum likelihood estimator is no longer the best estimator!

1.2 Non-parametric statistics

We observe $Z_1, \dots, Z_n$ iid with unknown distribution $P$ such that $P \in \{P_\theta,\ \theta \in \Theta\}$, where $\Theta$ is either infinite-dimensional, or of finite dimension $d = d_n$ that tends to $+\infty$ with the sample size.

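Examples:

$$\Theta = \big\{ f : [0,1] \to \mathbb{R},\ f \text{ Lipschitz with constant } L \big\} \tag{1}$$

$$= \big\{ f : [0,1] \to \mathbb{R},\ \forall x, y,\ |f(x) - f(y)| \le L\,|x - y| \big\} \tag{2}$$

$$\Theta = \Big\{ \theta = (\theta_1, \theta_2, \dots),\ \sum_{j=1}^{\infty} \theta_j^2 < +\infty \Big\} = \ell^2 \tag{3}$$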

General approach: We approximate $\Theta$ by an increasing sequence $\{\Theta_k\}$ of subsets of $\Theta$ such that $\Theta_k$ has dimension $d_k$. Proceeding as if $\theta$ belonged to $\Theta_k$ (which is not necessarily the case), we use a parametric method to define an estimator $\hat\theta_k$ of $\theta$. This yields a family of estimators $\{\hat\theta_k\}$.

Main question. How should $k$ be chosen to minimize the risk of $\hat\theta_k$?

• If $k$ is small, we face underfitting.

• Conversely, if $k$ is large, we face overfitting.
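The trade-off in the choice of $k$ can be made concrete with a small sketch. It assumes a Gaussian sequence model, which is not introduced in the lecture and is used here only for illustration: we observe $y_j = \theta_j + \varepsilon_j/\sqrt{n}$ with $\varepsilon_j$ iid $N(0,1)$, take $\Theta = \ell^2$ as in example (3), and let $\Theta_k$ be the sequences whose coordinates vanish after index $k$. The estimator keeps the first $k$ observations and sets the rest to zero, so its risk is $k/n + \sum_{j>k}\theta_j^2$. The names `projection_risk` and `best_k` are hypothetical.

```python
def projection_risk(theta, n, k):
    """Exact risk of the estimator (y_1, ..., y_k, 0, 0, ...).

    Assumed model (illustration only): y_j = theta_j + eps_j / sqrt(n),
    with eps_j iid N(0, 1).
    """
    variance = k / n                          # stochastic error, grows with k
    bias_sq = sum(t * t for t in theta[k:])   # approximation error, shrinks with k
    return variance + bias_sq


def best_k(theta, n):
    """Oracle choice of k: minimizes the exact risk (requires the unknown theta)."""
    return min(range(len(theta) + 1), key=lambda k: projection_risk(theta, n, k))


if __name__ == "__main__":
    # A square-summable sequence, truncated to 200 coordinates.
    theta = [1.0 / j ** 2 for j in range(1, 201)]
    for n in (10, 100, 1000):
        k = best_k(theta, n)
        print(n, k, projection_risk(theta, n, k))
```

A small $k$ leaves a large squared bias (underfitting), while a large $k$ inflates the variance term $k/n$ (overfitting). The oracle $k$ grows with $n$, but it depends on the unknown $\theta$, which is why choosing $k$ from the data is the central question.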

1.3 Principal models in non-parametric statistics

Density model. We have $X_1, \dots, X_n$ iid with a density $f$ defined on $\mathbb{R}^p$, and:

$$P(X_1 \in A) = \int_A f(x)\,dx.$$

The assumptions imposed on $f$ are very weak, as opposed to the parametric setting. For instance, a typical assumption in the parametric setting is that $f$ is the Gaussian density with mean $\mu$ and covariance matrix $\Sigma$:

$$f(x) = \frac{\det(\Sigma^{-1})^{1/2}}{(2\pi)^{p/2}} \exp\Big[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Big],$$

whereas a common assumption on $f$ in the nonparametric framework is that $f$ is smooth, say, twice continuously differentiable with a bounded second derivative.

Regression model. We observe $Z_i = (X_i, Y_i)$, with input $X_i$, output $Y_i$ and error $\varepsilon_i$:

$$Y_i = f(X_i) + \varepsilon_i.$$

The function $f$ is called the regression function. Here, the goal is to estimate $f$ without assuming any parametric structure on it.

Practical examples.

Marketing.

• Each i represents a consumer

• Xi are the features of the consumer

A typical question is “how do I identify the relevant groups of consumers?”. A typical answer is to use clustering algorithms. We assume that $X_1, \dots, X_n$ are iid with density $f$. Then, we estimate $f$ in a non-parametric manner by $\hat f$. The clusters are defined as regions around the local maxima of the function $\hat f$.
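The clustering recipe above can be sketched as follows. The lecture does not say which density estimator is used, so the Gaussian kernel estimator, the bandwidth, the grid search for local maxima and the nearest-mode assignment rule are all assumptions made for this illustration, as are the function names and the one-dimensional simulated data.

```python
import math
import random


def kde(points, x, bandwidth):
    """Gaussian kernel density estimate at x (one-dimensional data)."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm


def local_maxima(points, bandwidth, grid_size=200):
    """Grid locations where the estimated density has a local maximum."""
    lo, hi = min(points), max(points)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    values = [kde(points, x, bandwidth) for x in grid]
    return [grid[i] for i in range(1, grid_size - 1)
            if values[i - 1] < values[i] >= values[i + 1]]


def assign_cluster(x, modes):
    """Simplified rule: attach x to the nearest estimated mode."""
    return min(range(len(modes)), key=lambda j: abs(x - modes[j]))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical consumer feature: two well-separated groups.
    data = ([rng.gauss(-3.0, 1.0) for _ in range(150)]
            + [rng.gauss(3.0, 1.0) for _ in range(150)])
    modes = local_maxima(data, bandwidth=1.0)
    print(modes, assign_cluster(-2.5, modes), assign_cluster(2.0, modes))
```

The bandwidth plays the role of the tuning parameter: too small a value produces many spurious modes, too large a value merges genuine groups.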

1.4 Machine Learning

• Essentially the same as non-parametric statistics

• The main focus here is on the algorithms (rather than on the models), their statistical performance and their computational complexity.


2 Main concepts and notations

Observations: $Z_1, \dots, Z_n$ iid with distribution $P$.

• Unsupervised learning: $Z_i = X_i$

• Supervised learning: $Z_i = (X_i, Y_i)$, where $X_i$ is an example or a feature, and $Y_i$ a label.

Aim. To learn the distribution $P$ or some properties of it.

Prediction. We assume that a new feature $X$ (drawn from the same probability distribution as $X_1, \dots, X_n$) is observed. The aim is to predict the label associated with $X$.

To measure the quality of a prediction, we need a loss function $\ell(y, \hat y)$, where $y$ is the true label and $\hat y$ is the predicted label. In practice, both $y$ and $\hat y$ are random variables. Furthermore, $y$ and its distribution are unknown, so the loss $\ell(y, \hat y)$ cannot be computed directly!

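Risk function. This is the expectation of the loss.

Definition 1 Assume that $Z_i = (X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$ and $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a loss function. A predictor, or prediction algorithm, is any mapping:

$$\hat g : (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{Y}^{\mathcal{X}}.$$

The risk of a prediction function $g : \mathcal{X} \to \mathcal{Y}$ is:

$$\mathcal{R}_P[g] = \mathbb{E}_P[\ell(Y, g(X))].$$

The risk of a predictor $\hat g$ is $\mathcal{R}_P[\hat g]$, which is random since $\hat g$ depends on the data:

$$\mathcal{R}_P[\hat g] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, \hat g(x))\,dP(x, y).$$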

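Examples:

1. Binary classification: $\mathcal{Y} = \{0, 1\}$, with any $\mathcal{X}$:

$$\ell(y, \hat y) = \begin{cases} 0, & \text{if } y = \hat y \\ 1, & \text{otherwise} \end{cases} \;=\; \mathbf{1}(y \ne \hat y) = (y - \hat y)^2.$$

2. Least-squares regression: $\mathcal{Y} \subset \mathbb{R}$, with any $\mathcal{X}$:

$$\ell(y, \hat y) = (y - \hat y)^2.$$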
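The two losses, and the way a risk is approximated when only a sample is available, can be sketched in a few lines. The helper `empirical_risk` is an illustrative name introduced here: it replaces the expectation in the definition of the risk by an average over a finite sample of $(x, y)$ pairs, so it approximates the risk rather than computing it.

```python
def zero_one_loss(y, y_pred):
    """Binary classification loss 1(y != y_pred)."""
    return 0 if y == y_pred else 1


def squared_loss(y, y_pred):
    """Least-squares loss (y - y_pred)^2."""
    return (y - y_pred) ** 2


def empirical_risk(loss, g, sample):
    """Average loss of the prediction function g over a list of (x, y) pairs."""
    return sum(loss(y, g(x)) for x, y in sample) / len(sample)


if __name__ == "__main__":
    # On {0, 1} the two losses coincide, as stated in the first example.
    for y in (0, 1):
        for y_pred in (0, 1):
            print(y, y_pred, zero_one_loss(y, y_pred), squared_loss(y, y_pred))

    # Hypothetical sample and a threshold classifier, for illustration only.
    sample = [(0.2, 0), (0.7, 1), (0.9, 0), (0.4, 1)]
    print(empirical_risk(zero_one_loss, lambda x: 1 if x > 0.5 else 0, sample))
```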

3 Excess risk and Bayes predictor

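We have $Z_i = (X_i, Y_i)$ and

$$\mathcal{R}_P[g] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, g(x))\,P(dx, dy), \qquad P(dx, dy) = P_{Y|X}(dy \mid X = x)\,P_X(dx).$$

Definition 2 Given a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, the Bayes predictor, or “oracle”, is the prediction function minimizing the risk:

$$g^* \in \arg\min_{g \in \mathcal{Y}^{\mathcal{X}}} \mathcal{R}_P[g].$$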


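Remark 1 In practice, $g^*$ is unavailable, since it depends on $P$, which is unknown. The ultimate goal is to do almost as well as the oracle.

A predictor $\hat g_n$ will be considered a good one if:

$$\lim_{n \to +\infty} \underbrace{\big(\mathcal{R}_P[\hat g_n] - \mathcal{R}_P[g^*]\big)}_{\text{excess risk}} = 0.$$

Definition 3 We say that the predictor $\hat g_n$ is consistent (universally consistent) if, for every distribution $P$, we have:

$$\lim_{n \to +\infty} \mathbb{E}_P\big[\mathcal{R}_P[\hat g_n]\big] - \mathcal{R}_P[g^*] = 0.$$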

Theorem 1

1. Suppose that for every $x \in \mathcal{X}$, the infimum of $y \mapsto \mathbb{E}_P[\ell(Y, y) \mid X = x]$ is attained. Then the function $g^*$ defined by

$$g^*(x) \in \arg\min_{y \in \mathcal{Y}} \mathbb{E}_P[\ell(Y, y) \mid X = x]$$

is a Bayes predictor.

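2. In the case of binary classification, $\mathcal{Y} = \{0, 1\}$ and $\ell(y, \hat y) = \mathbf{1}(y \ne \hat y)$,

$$g^*(x) = \mathbf{1}\Big(\eta^*(x) > \frac{1}{2}\Big) \quad \text{where } \eta^*(x) = P[Y = 1 \mid X = x].$$

Furthermore, the excess risk of any $g : \mathcal{X} \to \{0, 1\}$ can be computed by

$$\mathcal{R}_P[g] - \mathcal{R}_P[g^*] = \mathbb{E}_P\big[(g(X) - g^*(X))(1 - 2\eta^*(X))\big]. \tag{4}$$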

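3. In the case of least-squares regression,

$$g^*(x) = \eta^*(x) \quad \text{where } \eta^*(x) = \mathbb{E}_P[Y \mid X = x].$$

Furthermore, for any $\eta : \mathcal{X} \to \mathcal{Y}$, we have:

$$\mathcal{R}_P[\eta] - \mathcal{R}_P[\eta^*] = \mathbb{E}_P\big[(\eta(X) - \eta^*(X))^2\big].$$

Proof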

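1. Let $g \in \mathcal{Y}^{\mathcal{X}}$ and let

$$g^*(x) \in \arg\min_{y \in \mathcal{Y}} \mathbb{E}_P[\ell(Y, y) \mid X = x].$$

We have:

$$\begin{aligned}
\mathcal{R}_P[g] &= \mathbb{E}_P[\ell(Y, g(X))] \\
&= \int \mathbb{E}_P[\ell(Y, g(X)) \mid X = x]\,P_X(dx) \\
&\ge \int \mathbb{E}_P[\ell(Y, g^*(x)) \mid X = x]\,P_X(dx) \\
&= \mathcal{R}_P[g^*].
\end{aligned}$$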


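2. Using the first assertion,

$$\begin{aligned}
g^*(x) &\in \arg\min_{y \in \{0,1\}} \mathbb{E}_P[\mathbf{1}(Y \ne y) \mid X = x] \\
&= \arg\min_{y \in \{0,1\}} P(Y \ne y \mid X = x) \\
&= \arg\max_{y \in \{0,1\}} P(Y = y \mid X = x) \\
&= \arg\max_{y \in \{0,1\}} \big\{\eta^*(x)\,\mathbf{1}(y = 1) + (1 - \eta^*(x))\,\mathbf{1}(y = 0)\big\}.
\end{aligned}$$

Therefore,

$$g^*(x) = \begin{cases} 0, & \text{if } P(Y = 1 \mid X = x) \le \frac{1}{2} \\ 1, & \text{otherwise.} \end{cases}$$

To check (4), it suffices to remark that, since $g(X)$ and $Y$ take values in $\{0, 1\}$ (so that $g(X)^2 = g(X)$ and $Y^2 = Y$),

$$\begin{aligned}
\mathcal{R}_P[g] &= \mathbb{E}_P[(g(X) - Y)^2] = \mathbb{E}_P[g(X)^2] + \mathbb{E}_P[Y^2] - 2\,\mathbb{E}_P[Y g(X)] \\
&= \mathbb{E}_P[g(X)] + \mathbb{E}_P[Y] - 2\,\mathbb{E}_P[\mathbb{E}_P(Y g(X) \mid X)] \\
&= \mathbb{E}_P[g(X)] + \mathbb{E}_P[Y] - 2\,\mathbb{E}_P[g(X)\,\mathbb{E}_P(Y \mid X)] \\
&= \mathbb{E}_P[g(X)] + \mathbb{E}_P[Y] - 2\,\mathbb{E}_P[g(X)\,\eta^*(X)] \\
&= \mathbb{E}_P[g(X)(1 - 2\eta^*(X))] + \mathbb{E}_P[Y].
\end{aligned}$$

Writing the same identity for $g^*$ and taking the difference of the two identities, we get the desired result.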
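Identity (4) can be checked exactly on a toy example. The sketch below assumes a distribution in which $X$ takes four values with made-up probabilities and made-up values of $\eta^*$. It enumerates all sixteen classifiers and compares the excess risk computed from the definition with the right-hand side of (4). All names and numbers are illustrative.

```python
from itertools import product

# Toy distribution (an assumption for illustration): X takes four values.
P_X = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}   # marginal P(X = x)
ETA = {0: 0.9, 1: 0.4, 2: 0.5, 3: 0.2}   # eta*(x) = P(Y = 1 | X = x)

# Bayes classifier g*(x) = 1(eta*(x) > 1/2), stored as a dict x -> label.
BAYES = {x: 1 if ETA[x] > 0.5 else 0 for x in P_X}


def risk(g):
    """Exact misclassification risk P(Y != g(X)) of a classifier g (dict x -> label)."""
    return sum(p * (ETA[x] if g[x] == 0 else 1.0 - ETA[x]) for x, p in P_X.items())


def excess_risk_formula(g):
    """Right-hand side of (4): E[(g(X) - g*(X)) (1 - 2 eta*(X))]."""
    return sum(p * (g[x] - BAYES[x]) * (1.0 - 2.0 * ETA[x]) for x, p in P_X.items())


def all_classifiers():
    """Every map from the four values of X to {0, 1}."""
    xs = sorted(P_X)
    return [dict(zip(xs, labels)) for labels in product((0, 1), repeat=len(xs))]


if __name__ == "__main__":
    for g in all_classifiers():
        print(g, risk(g) - risk(BAYES), excess_risk_formula(g))
```

The two printed columns agree for every classifier, and the excess risk is never negative, which is the optimality of $g^*$ on this example.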

3. In view of the first assertion of the theorem, we have:

$$g^*(x) \in \arg\min_{y \in \mathbb{R}} \mathbb{E}_P\big[(Y - y)^2 \mid X = x\big] = \arg\min_{y \in \mathbb{R}} \varphi(y),$$

where $\varphi(y) = \mathbb{E}_P[Y^2 \mid X = x] - 2y\,\mathbb{E}_P[Y \mid X = x] + y^2$ is a second-order polynomial. The minimization of such a polynomial is straightforward and leads to:

$$\arg\min_{y \in \mathbb{R}} \varphi(y) = \mathbb{E}_P[Y \mid X = x].$$
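The minimization step can be made explicit by completing the square. The shorthand $m(x) = \mathbb{E}_P[Y \mid X = x]$ is introduced here for readability and is not part of the lecture. The computation assumes that $\mathbb{E}_P[Y^2 \mid X = x]$ is finite.

```latex
\begin{aligned}
\varphi(y) &= y^2 - 2y\,m(x) + \mathbb{E}_P[Y^2 \mid X = x] \\
           &= \big(y - m(x)\big)^2 + \mathbb{E}_P[Y^2 \mid X = x] - m(x)^2 \\
           &= \big(y - m(x)\big)^2 + \operatorname{Var}_P(Y \mid X = x).
\end{aligned}
```

The first term is nonnegative and vanishes only at $y = m(x)$, so the minimizer is unique and the minimal value of $\varphi$ is the conditional variance of $Y$ given $X = x$.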

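This shows that the Bayes predictor is equal to the regression function $\eta^*$. For an arbitrary prediction function $\eta$, the risk decomposes as:

$$\begin{aligned}
\mathcal{R}_P[\eta] &= \mathbb{E}_P\big[(Y - \eta(X))^2\big] = \mathbb{E}_P\Big(\mathbb{E}_P\big[(Y - \eta(X))^2 \mid X\big]\Big) \\
&= \mathbb{E}_P\Big(\mathbb{E}_P\big[(Y - \eta^*(X))^2 \mid X\big] + 2\,\mathbb{E}_P\big[(Y - \eta^*(X))(\eta^* - \eta)(X) \mid X\big] + (\eta^* - \eta)^2(X)\Big) \\
&= \mathcal{R}_P[\eta^*] + 0 + \mathbb{E}_P\big[(\eta^* - \eta)^2(X)\big],
\end{aligned}$$

where the cross-product term vanishes since

$$\mathbb{E}_P\big[(Y - \eta^*(X))(\eta^* - \eta)(X) \mid X\big] = (\eta^* - \eta)(X)\,\mathbb{E}_P\big[(Y - \eta^*(X)) \mid X\big] = 0.$$

This completes the proof of the theorem. □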
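The regression identity of the theorem can also be verified exactly on a small discrete example. The joint distribution below is made up for illustration, and the function names are hypothetical. The risk is computed directly from its definition and compared with the squared distance to the regression function.

```python
# Toy joint distribution P(X = x, Y = y) (an assumption for illustration).
JOINT = {(0, 1.0): 0.2, (0, 3.0): 0.2, (1, 0.0): 0.3, (1, 2.0): 0.2, (1, 4.0): 0.1}


def marginal_x():
    """Marginal distribution of X as a dict x -> P(X = x)."""
    px = {}
    for (x, _), p in JOINT.items():
        px[x] = px.get(x, 0.0) + p
    return px


def regression_function(x):
    """eta*(x) = E[Y | X = x]."""
    return sum(p * y for (xx, y), p in JOINT.items() if xx == x) / marginal_x()[x]


def risk(eta):
    """Least-squares risk E[(Y - eta(X))^2], computed from the joint distribution."""
    return sum(p * (y - eta(x)) ** 2 for (x, y), p in JOINT.items())


def squared_distance(eta):
    """E[(eta(X) - eta*(X))^2], the claimed value of the excess risk."""
    return sum(p * (eta(x) - regression_function(x)) ** 2
               for x, p in marginal_x().items())


if __name__ == "__main__":
    eta = lambda x: 1.0 + x   # an arbitrary competitor
    print(risk(eta) - risk(regression_function), squared_distance(eta))
```

Both printed numbers coincide up to rounding, whatever competitor `eta` is plugged in.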


3.1 Link between Binary Classification & Regression

Plug-in rule

• We start by estimating $\eta^*(x)$ by $\hat\eta_n(x)$,

• We define $\hat g_n(x) = \mathbf{1}\big(\hat\eta_n(x) > \frac{1}{2}\big)$.

Question: How good is the plug-in rule $\hat g_n$?

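Proposition 1 Let $\hat\eta$ be an estimator of the regression function $\eta^*$, and let $\hat g(x) = \mathbf{1}\big(\hat\eta(x) > \frac{1}{2}\big)$. Then, we have:

$$\mathcal{R}_{\text{class}}[\hat g] - \mathcal{R}_{\text{class}}[g^*] \le 2\sqrt{\mathcal{R}_{\text{reg}}[\hat\eta] - \mathcal{R}_{\text{reg}}[\eta^*]}.$$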

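Proof Let $\hat\eta : \mathcal{X} \to \mathcal{Y} \subset \mathbb{R}$ and $\hat g(x) = \mathbf{1}\big(\hat\eta(x) > \frac{1}{2}\big)$, and let us compute the excess risk of $\hat g$. By (4), we have

$$\mathcal{R}_{\text{class}}[\hat g] - \mathcal{R}_{\text{class}}[g^*] = \mathbb{E}_P\big[(\hat g(X) - g^*(X))(1 - 2\eta^*(X))\big].$$

Since $\hat g$ and $g^*$ are both indicator functions and, therefore, take only the values 0 and 1, their difference is nonzero if and only if one of them is equal to 1 and the other one is equal to 0. This leads to

$$\begin{aligned}
\mathcal{R}_{\text{class}}[\hat g] - \mathcal{R}_{\text{class}}[g^*] &\le \mathbb{E}_P\big[\mathbf{1}\big(\hat\eta(X) \le 1/2 < \eta^*(X)\big)\,|2\eta^*(X) - 1|\big] \\
&\quad + \mathbb{E}_P\big[\mathbf{1}\big(\eta^*(X) \le 1/2 < \hat\eta(X)\big)\,|2\eta^*(X) - 1|\big] \\
&\le 2\,\mathbb{E}_P\big[\mathbf{1}\big(1/2 \in [\hat\eta(X), \eta^*(X)]\big)\,|\eta^*(X) - 1/2|\big],
\end{aligned}$$

where $[\hat\eta(X), \eta^*(X)]$ denotes the closed interval with endpoints $\hat\eta(X)$ and $\eta^*(X)$, in either order.

If $\hat\eta(X) \le 1/2$ and $\eta^*(X) > 1/2$, then $|\eta^*(X) - 1/2| \le |\eta^*(X) - \hat\eta(X)|$, and the same bound holds in the symmetric case. Thus:

$$\begin{aligned}
\mathcal{R}_{\text{class}}[\hat g] - \mathcal{R}_{\text{class}}[g^*] &\le 2\,\mathbb{E}_P\big[\mathbf{1}\big(1/2 \in [\hat\eta(X), \eta^*(X)]\big)\,|\hat\eta(X) - \eta^*(X)|\big] \\
&\le 2\,\mathbb{E}_P\big[|\hat\eta(X) - \eta^*(X)|\big] \\
&\le 2\sqrt{\mathbb{E}_P\big[(\hat\eta(X) - \eta^*(X))^2\big]} \\
&= 2\sqrt{\mathcal{R}_{\text{reg}}[\hat\eta] - \mathcal{R}_{\text{reg}}[\eta^*]},
\end{aligned}$$

where the third inequality is the Cauchy-Schwarz inequality and the last equality is the third assertion of Theorem 1. Since this inequality is true for every deterministic $\hat\eta$, we get the desired property. □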
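The bound of Proposition 1 can be illustrated numerically. The sketch below assumes a toy setting in which $X$ is uniform on four points with made-up values of $\eta^*$. It evaluates the classification excess risk through identity (4) and the regression excess risk through the third assertion of Theorem 1, then checks the inequality for randomly drawn candidates $\hat\eta$. All names and numbers are illustrative.

```python
import math
import random

# Toy setting (an assumption for illustration): X uniform on four points.
P_X = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
ETA_STAR = {0: 0.9, 1: 0.6, 2: 0.45, 3: 0.1}   # eta*(x) = P(Y = 1 | X = x)


def plug_in(eta):
    """Plug-in classifier x -> 1(eta(x) > 1/2), with eta given as a dict."""
    return {x: 1 if eta[x] > 0.5 else 0 for x in eta}


def classification_excess_risk(eta_hat):
    """E[(g(X) - g*(X)) (1 - 2 eta*(X))], i.e. identity (4) applied to the plug-in rule."""
    g, g_star = plug_in(eta_hat), plug_in(ETA_STAR)
    return sum(P_X[x] * (g[x] - g_star[x]) * (1.0 - 2.0 * ETA_STAR[x]) for x in P_X)


def regression_excess_risk(eta_hat):
    """E[(eta_hat(X) - eta*(X))^2], the least-squares excess risk."""
    return sum(P_X[x] * (eta_hat[x] - ETA_STAR[x]) ** 2 for x in P_X)


def bound_holds(eta_hat):
    """Check the inequality of Proposition 1, with a small tolerance for rounding."""
    lhs = classification_excess_risk(eta_hat)
    rhs = 2.0 * math.sqrt(regression_excess_risk(eta_hat))
    return lhs <= rhs + 1e-12


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        eta_hat = {x: rng.random() for x in P_X}
        print(classification_excess_risk(eta_hat),
              2.0 * math.sqrt(regression_excess_risk(eta_hat)),
              bound_holds(eta_hat))
```

On such examples the left-hand side is typically much smaller than the bound, which suggests the inequality is far from tight in general; the sketch makes no claim about when it is sharp.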
