Statistical Machine Learning a.k.a. High Dimensional Inference

Larry Wasserman
Carnegie Mellon University

Transcript of the slides.


Outline

1 High Dimensional Regression and Classification

2 Graphical Models

3 Nonparametric Methods

4 Clustering

Throughout: Strong assumptions versus weak assumptions

Dimension = p or d. Sample size = n.

References: The Elements of Statistical Learning, Hastie, Tibshirani and Friedman; Statistical Machine Learning, Liu and Wasserman (coming soon).

Introduction

• Machine learning is statistics with a focus on prediction, scalability and high dimensional problems.

• Regression: predict Y ∈ R from X.

• Classification: predict Y ∈ {0, 1} from X.

  - Example: Predict whether an email X is real (Y = 1) or spam (Y = 0).

• Finding structure. Examples:

  - Clustering: find groups.

  - Graphical Models: find conditional independence structure.

Three Main Themes

Convexity

Convex problems can be solved quickly. If necessary, approximate the problem with a convex problem.

Sparsity

Many interesting problems are high dimensional. But often, the relevant information is effectively low dimensional.

Weak Assumptions

Make the weakest possible assumptions.

Preview: Sparse Graphical Models

Finding relations between stocks in the S&P 500:

[Figure: estimated graph over the S&P 500 stocks.]

Regression

We observe pairs (X1, Y1), . . . , (Xn, Yn).

D = {(X1, Y1), . . . , (Xn, Yn)} is called the training data.

Yi ∈ R is the response. Xi ∈ Rp is the feature (or covariate).

For example, suppose we have n subjects. Yi is the blood pressure of subject i. Xi = (Xi1, . . . , Xip) is a vector of p = 5,000 gene expression levels for subject i.

Remember: Yi ∈ R and Xi ∈ Rp.

Given a new pair (X, Y), we want to predict Y from X.

Regression

Let Ŷ be a prediction of Y. The prediction error or risk is

R = E(Y − Ŷ)²

where E is the expected value (mean).

The best predictor is the regression function

m(x) = E(Y | X = x) = ∫ y f(y | x) dy.

However, the true regression function m(x) is not known. We need to estimate m(x).

Regression

Given the training data D = {(X1, Y1), . . . , (Xn, Yn)} we want to construct m̂ to make

prediction risk = R(m̂) = E(Y − m̂(X))²

small. Here, (X, Y) is a new pair.

Key fact: bias-variance decomposition:

R(m̂) = ∫ bias²(x) p(x) dx + ∫ var(x) p(x) dx + σ²

where

bias(x) = E(m̂(x)) − m(x)
var(x) = Variance(m̂(x))
σ² = E(Y − m(X))²

Bias-Variance Tradeoff

Prediction Risk = Bias² + Variance

Prediction methods with low bias tend to have high variance.

Prediction methods with low variance tend to have high bias.

For example, the predictor m̂(x) ≡ 0 has 0 variance but will be terribly biased.

To predict well, we need to balance the bias and the variance. We begin with linear methods.

Linear Regression

Try to find the best linear predictor, that is, a predictor of the form:

m(x) = β0 + β1x1 + · · · + βpxp.

Important: We do not assume that the true regression function is linear.

We can always define x1 = 1. Then the intercept is β1 and we can write

m(x) = β1x1 + · · · + βpxp = βᵀx

where β = (β1, . . . , βp) and x = (x1, . . . , xp).

Low Dimensional Linear Regression

Assume for now that p (= the length of each Xi) is small. To find a good linear predictor we choose β to minimize the training error:

training error = (1/n) ∑_{i=1}^n (Yi − βᵀXi)²

The minimizer β̂ = (β̂1, . . . , β̂p) is called the least squares estimator.

Low Dimensional Linear Regression

The least squares estimator is:

β̂ = (XᵀX)⁻¹XᵀY

where X is the n × d matrix

X = ( X11  X12  · · ·  X1d
      X21  X22  · · ·  X2d
      ...              ...
      Xn1  Xn2  · · ·  Xnd )

and

Y = (Y1, . . . , Yn)ᵀ.

In R: lm(y ~ x)
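To make this concrete, here is a minimal R sketch (not from the slides; simulated data and illustrative names) that forms the least squares estimator directly and checks it against lm:

    # Simulated data: n = 100 observations, 3 covariates plus an intercept column.
    set.seed(1)
    n <- 100
    X <- cbind(1, matrix(rnorm(n * 3), n, 3))   # first column of 1's plays the role of x1 = 1
    beta <- c(2, 1, -1, 0.5)
    Y <- as.numeric(X %*% beta + rnorm(n))

    # Closed form: beta-hat = (X^T X)^{-1} X^T Y.
    beta_hat <- solve(t(X) %*% X, t(X) %*% Y)

    # Same fit via lm (drop the column of 1's since lm adds its own intercept).
    fit <- lm(Y ~ X[, -1])
    cbind(beta_hat, coef(fit))   # the two columns agree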

Low Dimensional Linear Regression

Summary: the least squares estimator is m̂(x) = β̂ᵀx = ∑j β̂j xj where

β̂ = (XᵀX)⁻¹XᵀY.

When we observe a new X, we predict Y to be

Ŷ = m̂(X) = β̂ᵀX.

Our goals are to improve this by:

(i) dealing with high dimensions
(ii) using something more flexible than linear predictors.

Example

Y = HIV resistance

Xj = amino acid in position j of the virus.

Y = β0 + β1X1 + · · · + β100X100 + ε

[Figure: four panels from the HIV example. Top left: β̂ versus position. Top right: marginal regression coefficients (one-at-a-time) versus position. Bottom left: residuals Yi − Ŷi versus fitted values Ŷi. Bottom right: a sparse regression (coming up soon).]

High Dimensional Linear Regression

Now suppose p is large. We might even have p > n (more covariates than data points).

The least squares estimator is not defined since XᵀX is not invertible. The variance of the least squares prediction is huge.

Recall the bias-variance tradeoff:

Prediction Error = Bias² + Variance

We need to increase the bias so that we can decrease the variance.

Ridge Regression

Recall that the least squares estimator minimizes the training error (1/n) ∑_{i=1}^n (Yi − βᵀXi)².

Instead, we can minimize the penalized training error:

(1/n) ∑_{i=1}^n (Yi − βᵀXi)² + λ‖β‖₂²

where ‖β‖₂ = √(∑j βj²).

The solution is:

β̂ = (XᵀX + λI)⁻¹XᵀY
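A minimal sketch of the closed form in R (again illustrative; λ is just a placeholder to be chosen below):

    # Ridge: beta-hat = (X^T X + lambda I)^{-1} X^T Y.
    ridge <- function(X, Y, lambda) {
      solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% Y)
    }
    # lambda = 0 recovers least squares; large lambda shrinks beta-hat toward 0.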

Ridge Regression

The tuning parameter λ controls the bias-variance tradeoff:

λ = 0 =⇒ least squares.
λ = ∞ =⇒ β̂ = 0.

We choose λ to minimize R̂(λ), where R̂(λ) is an estimate of the prediction risk.

Ridge Regression

To estimate the prediction risk, do not use the training error:

R̂training = (1/n) ∑_{i=1}^n (Yi − Ŷi)²,  Ŷi = Xiᵀβ̂

because it is biased: E(R̂training) < R(β̂).

Instead, we use leave-one-out cross-validation:

1. leave out (Xi, Yi)
2. find β̂(−i)
3. predict Yi: Ŷ(−i) = β̂(−i)ᵀXi
4. repeat for each i

Leave-one-out cross-validation

R̂(λ) = (1/n) ∑_{i=1}^n (Yi − Ŷ(i))² = (1/n) ∑_{i=1}^n (Yi − Ŷi)² / (1 − Hii)²

     ≈ R̂training / (1 − k/n)²

     ≈ R̂training + 2kσ²/n

where

H = X(XᵀX + λI)⁻¹Xᵀ
k = trace(H)

k is the effective degrees of freedom.
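A small R sketch of this shortcut (not from the slides), computing the leave-one-out risk for ridge over a grid of λ values via the Hii formula:

    # Leave-one-out CV for ridge using H = X (X^T X + lambda I)^{-1} X^T.
    loocv_ridge <- function(X, Y, lambdas) {
      sapply(lambdas, function(lam) {
        H <- X %*% solve(t(X) %*% X + lam * diag(ncol(X)), t(X))
        mean(((Y - H %*% Y) / (1 - diag(H)))^2)
      })
    }

    # Usage: risks <- loocv_ridge(X, Y, 10^seq(-2, 3, length = 30));
    # choose the lambda with the smallest estimated risk.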

Example

Y = 3X1 + · · · + 3X5 + 0X6 + · · · + 0X1000 + ε

n = 100, p = 1,000.

So there are 1000 covariates but only 5 are relevant.

What does ridge regression do in this case?

Ridge Regularization Paths

[Figure: ridge coefficient paths — the estimates β̂j(λ) versus λ; coefficients shrink smoothly toward 0 as λ grows.]

Sparse Linear Regression

Ridge regression does not take advantage of sparsity.

Maybe only a small number of covariates are important predictors. How do we find them?

We could fit many submodels (with a small number of covariates) and choose the best one. This is called model selection.

Now the inaccuracy is

prediction error = bias² + variance

The bias is the error due to omitting important variables. The variance is the error due to having to estimate many parameters.

The Bias-Variance Tradeoff

[Figure: variance and bias as a function of the number of variables.]

The Bias-Variance Tradeoff

This is a Goldilocks problem. Can't use too few or too many variables.

Have to choose just the right variables.

Have to try all models with one variable, two variables, ...

If there are p variables then there are 2^p models.

Suppose we have 50,000 genes. We have to search through 2^50,000 models. But 2^50,000 > number of atoms in the universe.

This problem is NP-hard. This was a major bottleneck in statistics for many years.

You are Here

[Figure: cartoon of the complexity classes P, NP, NP-Complete, NP-Hard, with an X marking variable selection in the NP-Hard region.]

Two Things that Save Us

Two key ideas to make this feasible are sparsity and convex relaxation.

Sparsity: probably only a few genes are needed to predict some disease Y. In other words, of β1, . . . , β50,000, most βj ≈ 0.

But which ones?? (Needle in a haystack.)

Convex Relaxation: Replace model search with something easier.

It is the marriage of these two concepts that makes it all work.

Sparsity

Look at this:

β = (5, 5, 5, 0, 0, 0, . . . , 0).

This vector is high-dimensional but it is sparse.

Here is a less obvious example:

β = (50, 12, 6, 3, 2, 1.4, 1, 0.8, 0.6, 0.5, . . .)

It turns out that, if the βj's die off fairly quickly, then β behaves like a sparse vector.

Sparsity

We measure the (lack of) sparsity of β = (β1, . . . , βp) with the q-norm

‖β‖q = (|β1|^q + · · · + |βp|^q)^{1/q} = (∑j |βj|^q)^{1/q}.

Which values of q measure (lack of) sparsity?

sparse:     a = (1, 0, 0, 0, · · · , 0)
not sparse: b = (.001, .001, .001, .001, · · · , .001)

         q = 0   q = 1   q = 2
         √       √       ×
‖a‖q     1       1       1
‖b‖q     p       √p      1

Lesson: Need to use q ≤ 1 to measure sparsity. (Actually, q < 2 is OK.)

Sparsity

So we estimate β = (β1, . . . , βp) by minimizing

∑_{i=1}^n (Yi − [β0 + β1Xi1 + · · · + βpXip])²

subject to the constraint that β is sparse, i.e. ‖β‖q ≤ small.

Can we do this minimization?

If we use q = 0 this turns out to be the same as searching through all 2^p models. Ouch!

What about other values of q?

What does the set {β : ‖β‖q ≤ small} look like?

The set ‖β‖q ≤ 1 when p = 2

[Figure: the unit balls {β : ‖β‖q ≤ 1} for q = 1/4, q = 1/2, q = 1, and q = 3/2.]

Sparsity Meets Convexity

We need these sets to have a nice shape (convex). If so, the minimization is no longer NP-hard. In fact, it is easy.

Sensitivity to sparsity: q ≤ 1 (actually, q < 2 suffices)
Convexity (niceness): q ≥ 1

This means we should use q = 1.

Where Sparsity and Convexity Meet

[Figure: on the q axis, the sparsity and convexity regions meet at q = 1.]

Sparsity Meets Convexity: The LASSO

So we estimate β = (β1, . . . , βp) by minimizing

∑_{i=1}^n (Yi − [β0 + β1Xi1 + · · · + βpXip])²

subject to the constraint that β is sparse, i.e. ‖β‖₁ = ∑j |βj| ≤ small.

This is called the LASSO. Invented by Rob Tibshirani in 1996.
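A minimal glmnet sketch in R (simulated data mirroring the earlier example with 5 relevant covariates; names illustrative):

    library(glmnet)
    set.seed(1)
    n <- 100; p <- 1000
    X <- matrix(rnorm(n * p), n, p)
    Y <- X[, 1:5] %*% rep(3, 5) + rnorm(n)

    fit <- glmnet(X, Y)    # alpha = 1 (the default) gives the lasso penalty
    plot(fit)              # coefficient paths
    coef(fit, s = 0.5)     # sparse coefficients at lambda = 0.5 (most are exactly 0)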

Lasso

The result is an estimated vector

β̂ = (β̂1, . . . , β̂p).

Most are 0!

Magically, we have done model selection without searching (thanks to sparsity plus convexity).

The next picture explains why some β̂j = 0.

Sparsity: How Corners Create Sparse Estimators

[Figure: the corners of the ℓ1 ball meet the residual-sum-of-squares contours on the axes, producing exact zeros.]

The Lasso: HIV Example Again

• Y is resistance to an HIV drug.
• Xj = amino acid in position j of the virus.
• p = 99, n ≈ 100.

The Lasso: An Example

[Figure: LASSO coefficient paths — standardized coefficients versus |β̂|/max|β̂|.]

Selecting λ

We choose the sparsity level by estimating prediction error. (Ten-fold cross-validation.)

[Figure: estimated risk versus the number of selected variables.]

The Lasso: An Example

[Figure: the lasso estimate β̂ versus position for the HIV example.]

The Lasso: Another Example (using glmnet)

[Figure: glmnet coefficient paths — coefficients versus the L1 norm; the top axis shows the number of nonzero coefficients.]

The Lasso: Another Example (using glmnet)

[Figure: cross-validated mean-squared error versus log(λ); the top axis shows the number of nonzero coefficients.]

Sparvexity

To summarize: we penalize the sums of squares with

‖β‖q = (∑j |βj|^q)^{1/q}.

To get a sparse answer: q < 2.

To get a convex problem: q ≥ 1.

So q = 1 works.

Sparvexity (the marriage of sparsity and convexity) is one of the biggest developments in statistics and machine learning.

The Lasso

• β̂(λ) is called the lasso estimator. Then define

Ŝ(λ) = {j : β̂j(λ) ≠ 0}.

R: lars(x, y) or glmnet(x, y)

• (Optional) After you find Ŝ(λ), you can re-fit the model by doing least squares on the sub-model Ŝ(λ). (Less bias but higher variance.)

The Lasso

Choose λ by risk estimation. (For example, 10-fold cross-validation.)

If you use the "re-fit" approach, then you can apply leave-one-out cross-validation:

R̂(λ) = (1/n) ∑_{i=1}^n (Yi − Ŷ(i))² = (1/n) ∑_{i=1}^n (Yi − Ŷi)² / (1 − Hii)² ≈ (1/n) RSS / (1 − s/n)²

where H is the hat matrix and s = #{j : β̂j ≠ 0}.

Choose λ̂ to minimize R̂(λ).
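In R, ten-fold cross-validation over the glmnet path is essentially one line (continuing the sketch above; cv.glmnet is the standard helper):

    cvfit <- cv.glmnet(X, Y, nfolds = 10)          # estimates risk for each lambda
    cvfit$lambda.min                               # lambda minimizing estimated risk
    b <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]
    which(b != 0)                                  # the selected variables S(lambda)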

The Lasso

The complete steps are:

1 Find β̂(λ) and Ŝ(λ) for each λ.
2 Choose λ̂ to minimize estimated risk.
3 Prediction: Ŷ = Xᵀβ̂.

Some Convexity Theory for the Lasso

Consider a simpler model than regression: Suppose Y ∼ N(µ, 1). Let µ̂ minimize

A(µ) = (1/2)(Y − µ)² + λ|µ|.

How do we minimize A(µ)?

• Since A is convex, we set the subderivative to 0. Recall that c is a subderivative of f at x0 if

f(x) − f(x0) ≥ c(x − x0) for all x.

• The subdifferential ∂f(x0) is the set of subderivatives. Also, x0 minimizes f if and only if 0 ∈ ∂f(x0).

ℓ1 and Soft Thresholding

• If f(µ) = |µ| then

∂f(µ) = { −1        µ < 0
          [−1, 1]   µ = 0
          +1        µ > 0.

• Hence,

∂A(µ) = { µ − Y − λ                   µ < 0
          {µ − Y + λz : −1 ≤ z ≤ 1}   µ = 0
          µ − Y + λ                   µ > 0.

ℓ1 and Soft Thresholding

• µ̂ minimizes A(µ) if and only if 0 ∈ ∂A(µ̂).

• So

µ̂ = { Y + λ   Y < −λ
      0       −λ ≤ Y ≤ λ
      Y − λ   Y > λ.

• This can be written as

µ̂ = soft(Y, λ) ≡ sign(Y)(|Y| − λ)₊.
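The soft-thresholding function is two lines of R; a quick check of the piecewise formula above:

    soft <- function(y, lambda) sign(y) * pmax(abs(y) - lambda, 0)
    soft(c(-3, -0.5, 0.5, 3), lambda = 1)   # gives -2, 0, 0, 2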

ℓ1 and Soft Thresholding

[Figure: the soft-thresholding function soft(x, λ) — identically 0 on [−λ, λ], the identity shifted toward 0 outside.]

The Lasso: Computing β̂

• Minimize ∑i (Yi − βᵀXi)² + λ‖β‖₁.

  - use lars (least angle regression), or
  - coordinate descent (see the sketch below): set β = (0, . . . , 0), then iterate the following:
    • for j = 1, . . . , d:
      • set Ri = Yi − ∑_{s≠j} βs Xis
      • β̂j,LS = least squares fit of the Ri's on Xj
      • βj ← soft(β̂j,LS, λ/∑i Xij²)

R: glmnet
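Here is a minimal coordinate-descent sketch in R, written for the objective (1/2) ∑i (Yi − βᵀXi)² + λ‖β‖₁ (constant factors in the threshold differ across references); it reuses soft() from above:

    lasso_cd <- function(X, Y, lambda, iters = 100) {
      beta <- rep(0, ncol(X))
      for (it in 1:iters) {
        for (j in 1:ncol(X)) {
          R <- Y - X[, -j, drop = FALSE] %*% beta[-j]    # partial residuals
          bls <- sum(X[, j] * R) / sum(X[, j]^2)         # one-variable least squares fit
          beta[j] <- soft(bls, lambda / sum(X[, j]^2))   # soft-threshold it
        }
      }
      beta
    }

In practice glmnet does this with many refinements (warm starts along a lambda path, active sets, standardization).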

Variations on the Lasso

• Sparse logistic regression: later.

• Elastic Net: minimize

∑_{i=1}^n (Yi − βᵀXi)² + λ1‖β‖₁ + λ2‖β‖₂²

• Group Lasso: partition the coefficients into groups,

β = (β1, . . . , βk | · · · | βt, . . . , βp) = (v1, . . . , vm),

and minimize:

∑_{i=1}^n (Yi − βᵀXi)² + λ ∑_{j=1}^m ‖vj‖

Summary

• For low dimensional (linear) prediction, we can use least squares.

• For high dimensional linear regression, we face a bias-variance tradeoff: omitting too many variables causes bias while including too many variables causes high variance.

• The key is to select a good subset of variables.

• The lasso (L1-regularized least squares) is a fast way to select variables.

• If there are good, sparse, linear predictors, the lasso will work well.

Weak Assumptions Versus Strong Assumptions

With virtually no assumptions, the following is true. Let β∗ minimize

E(Y − βᵀX)² subject to ‖β‖₁ ≤ L.

So β∗ᵀx is the best sparse linear predictor. Let β̂ be the lasso estimator. Then

R(β̂) − R(β∗) = OP( √((L²/n) log p) ).

The lasso is risk consistent as long as p = o(e^n). We do not assume that the true model is linear. We make no assumptions on the design matrix.

Strong Assumptions

To recover the true support of β or make inferences about β, many authors assume:

1 Incoherence (covariates are almost independent).
2 True model is linear.
3 Constant variance.
4 Donut (no small effects).

In signal processing, these assumptions are realistic. In data analysis, I can't imagine why we would assume these things.

Inference

Approach I: The HARNESS (joint work with Ryan Tibshirani)

Split the data into two halves: D1 and D2.

Use D1 to choose the variables Ŝ.

Use D2 to get confidence intervals for the βj's, the prediction risk R, and the prediction risk Rj when dropping βj. For example,

R̂(Ŝ) = (1/n) ∑_{i∈D2} |Yi − β̂ᵀXi|

estimates

R(Ŝ) = E|Y − β̂ᵀX|.

Inference From D2

We selected Ŝ ⊂ {1, . . . , p} using D1. Using D2, do the usual least-squares linear regression.

We can get confidence intervals for:

1 R = E|Y − β̂ᵀX|.
2 Rj = E|Y − β̂(−j)ᵀX| − E|Y − β̂ᵀX|.
3 βS = (βj : j ∈ S).

But notice that βS is the projection of m(x) onto {xj : j ∈ S}. It depends on what variables are selected.

Example (Wine Data): Confidence Intervals

[Figure: confidence intervals. Left: the Rj. Right: the βj's.]

Selected variables: Alcohol, Volatile-Acidity, Sulphates, Total-Sulfur-Dioxide and pH.

Sequential Approach

Approach II: Sequential. Test each variable as it is added in by the algorithm.

See: Lockhart, Taylor, Tibshirani and Tibshirani (2013). (arxiv.org/abs/1301.7161. This will appear in the Annals of Statistics with discussion.)

Other approaches: Javanmard and Montanari (2013), Buhlmann and van de Geer (2013).

Warning: these approaches make strong assumptions.

Classification

Same as regression except that Yi ∈ {−1, +1}.

Classification risk:

R(h) = P(Y ≠ h(X)).

Best classifier (Bayes' classifier):

h(x) = { +1 if m(x) ≥ 1/2
         −1 if m(x) < 1/2

where

m(x) = P(Y = +1 | X = x).

Linear Classifiers

h(x) = { +1 if β0 + βᵀx ≥ 0
         −1 if β0 + βᵀx < 0

i.e. h(x) = sign(β0 + βᵀx).

Minimizing the training error

(1/n) ∑_{i=1}^n I(Yi ≠ h(Xi))

over β is not feasible (non-convex).

(Sparse) Logistic Regression

m(x) = P(Y = 1 | X = x) = e^{xᵀβ} / (1 + e^{xᵀβ})

and, in the high-dimensional case, we minimize

−log L(β) + λ ∑j |βj|

where the likelihood is

L(β) = ∏_{Yi=+1} m(Xi) ∏_{Yi=−1} (1 − m(Xi)).

R: glmnet
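A minimal glmnet sketch (here Y01 denotes the response coded 0/1, glmnet's convention, and Xnew a matrix of new feature vectors; both are illustrative placeholders):

    library(glmnet)
    cvfit <- cv.glmnet(X, Y01, family = "binomial", type.measure = "class")
    predict(cvfit, newx = Xnew, s = "lambda.min", type = "class")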

Support Vector Machine (SVM)

Choose (β0, β) to minimize

(1/n) ∑_{i=1}^n [1 − Yi m(Xi)]₊ + λ ∑j βj²

where [a]₊ = max{a, 0}.

R: libsvm, e1071, ...
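A minimal e1071 sketch (e1071 wraps libsvm; X, Y, Xnew are illustrative placeholders, and cost plays the role of 1/λ):

    library(e1071)
    fit <- svm(x = X, y = factor(Y), kernel = "linear", cost = 1)
    predict(fit, Xnew)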

SVM: Geometric View

[Figure: separating hyperplane with margin between the +1 and −1 classes.]

SVM: Geometric View

The optimization problem can be written as:

min_{β0,β} (1/2)‖β‖₂² + (1/λ′) ∑_{i=1}^n ξi

subject to ξi ≥ 1 − Yi(β0 + Xiᵀβ) and ξi ≥ 0 for all i = 1, . . . , n. Or, in dual form:

α̂ = argmax_{α∈Rⁿ} ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{k=1}^n αi αk Yi Yk ⟨Xi, Xk⟩

subject to ∑_{i=1}^n αi Yi = 0 and 0 ≤ α1, . . . , αn ≤ 1/λ′, where β̂ = ∑_{i=1}^n α̂i Yi Xi.

- quadratic program
- changing ⟨Xi, Xk⟩ to K(Xi, Xk) (kernelization) makes it nonlinear.

Surrogate Losses

h(x) = sign(m(x)) where m(x) = β0 + βᵀx.

The empirical classification risk is

R̂ = (1/n) ∑_{i=1}^n I(Yi ≠ h(Xi)) = (1/n) ∑_{i=1}^n I(Yi m(Xi) ≤ 0).

Minimizing R̂ is hard because it is non-convex. The idea is to replace the non-convex function I(u ≤ 0) with a convex surrogate.

surrogate   form              classifier
hinge       [1 − u]₊          support vector machine
logistic    log(1 + e^{−u})   logistic regression

Surrogate Losses

[Figure: surrogate losses L(u) plotted against u = y H(x).]

Example

South African Heart Disease. The outcome is Y = 1, 0 (disease/non-disease). There are 9 predictors X1, . . . , X9. I added 100 extra (irrelevant) predictors.

n = 462, d = 109.

Use sparse logistic regression.

Example

[Figure: cross-validated misclassification error versus log(λ) (left) and coefficient paths versus log(λ) (right); the top axes show the number of nonzero coefficients.]

Regression is Harder Than Classification

ĥ(x) = { +1 if m̂(x) > 1/2
         −1 otherwise.

The risk of the plug-in classifier rule satisfies

R(ĥ) − R∗ ≤ √( ∫ (m̂(x) − m∗(x))² dPX(x) ),

where R∗ = inf_h R(h).

Regression is Harder Than Classification

[Figure: two regression functions m(x), both crossing 1/2 at x = 0 — they yield the same classifier even though the functions differ.]

Example: Political Blogs

Features: word counts and links.

Example: Political Blogs

[Figure: test errors versus log(λ) for four classifiers — Sparse LR, SVM, Sparse LR (+ links), and SVM (+ links).]

UNDIRECTED GRAPHS

Undirected Graphs

An approach to representing multivariate data.

Unsupervised: no response variables.

Two main types:

1 Correlation Graphs
2 Partial Correlation Graphs

Correlation Graphs

X = (X(1), . . . , X(d)). Put an edge between j and k if θjk = Cov(X(j), X(k)) ≠ 0.

Bootstrap method (Wasserman, Kolar, Rinaldo, arXiv:1309.6933):

Step 1: Estimate Σ:

S = (1/n) ∑_{i=1}^n (Xi − X̄)(Xi − X̄)ᵀ.

Step 2: Bootstrap: draw B bootstrap samples, each of size n, giving S∗1, . . . , S∗B. Find the upper quantile zα:

(1/B) ∑_{j=1}^B I( max_{r,s} √n |S∗j(r, s) − S(r, s)| > zα ) = α.

Correlation Graphs

Step 3: Put an edge between (j, k) if

0 ∉ S(j, k) ± zα/√n.

Then

P(Ĝ = G) ≥ 1 − α − C log d / n^{1/8}.

This follows from results in Wasserman, Kolar, Rinaldo (2013) arXiv:1309.6933 and Chernozhukov, Chetverikov and Kato (2012) arXiv:1212.6906.
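A minimal R sketch of Steps 1–3 (not from the paper's code; it bootstraps the max statistic and thresholds the sample covariance; names illustrative):

    correlation_graph <- function(X, B = 200, alpha = 0.05) {
      n <- nrow(X)
      S <- cov(X) * (n - 1) / n                 # the 1/n covariance S above
      maxstat <- replicate(B, {
        Sb <- cov(X[sample(n, replace = TRUE), ]) * (n - 1) / n
        sqrt(n) * max(abs(Sb - S))
      })
      z <- quantile(maxstat, 1 - alpha)         # upper quantile z_alpha
      abs(S) > z / sqrt(n)                      # edge (j,k) iff 0 is outside S(j,k) +/- z/sqrt(n)
    }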

Example: n = 100 and d = 100.

[Figure: estimated correlation graph.]

Markov (Conditional Independence) Graphs

Let X = (X(1), . . . , X(d)). A graph G = (V, E) has vertices V and edges E. The independence graph has one vertex for each X(j).

X — Y — Z

means that

X ⫫ Z | Y.

Here V = {X, Y, Z} and E = {(X, Y), (Y, Z)}.

Markov Property

A probability distribution P satisfies the global Markov property with respect to a graph G if:

for any disjoint vertex subsets A, B, and C such that C separates A and B,

XA ⫫ XB | XC.

Example

[Figure: a graph on vertices 1–8.]

C = {3, 7} separates A = {1, 2} and B = {4, 8}. Hence,

{X(1), X(2)} ⫫ {X(4), X(8)} | {X(3), X(7)}.

Example: Protein networks (Maslov 2002)

[Figure: protein interaction network.]

Distributions Encoded by a Graph

• I(G) = all independence statements implied by the graph G.

• I(P) = all independence statements implied by P.

• P(G) = {P : I(G) ⊆ I(P)}.

• If P ∈ P(G) we say that P is Markov to G.

• The graph G represents the class of distributions P(G).

• Goal: Given random vectors X1, . . . , Xn ∼ P, estimate G.

Gaussian Case

• If X ∼ N(µ, Σ) then there is no edge between Xi and Xj if and only if

Ωij = 0

where Ω = Σ⁻¹.

• Given X1, . . . , Xn ∼ N(µ, Σ).

• For n > p, let

Ω̂ = Σ̂⁻¹

and test

H0: Ωij = 0 versus H1: Ωij ≠ 0.

Gaussian Case: p > n

Two approaches:

• parallel lasso (Meinshausen and Buhlmann)

• graphical lasso (glasso; Banerjee et al., Hastie et al.)

Parallel Lasso:

1 For each j = 1, . . . , d (in parallel): regress Xj on all other variables using the lasso.

2 Put an edge between Xi and Xj if each appears in the regression of the other.

Glasso (Graphical Lasso)

The glasso minimizes:

−ℓ(Ω) + λ ∑_{j≠k} |Ωjk|

where

ℓ(Ω) = (1/2)(log |Ω| − tr(ΩS))

is the Gaussian log-likelihood (maximized over µ).

There is a simple blockwise gradient descent algorithm for minimizing this function. It is very similar to the previous algorithm.

R packages: glasso and huge
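A minimal sketch assuming the huge package's interface (X is the n × d data matrix):

    library(huge)
    est <- huge(X, method = "glasso")   # glasso fit over a path of lambda values
    sel <- huge.select(est)             # choose lambda (default selection criterion)
    sel$refit                           # adjacency matrix of the estimated graph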

Graphs on the S&P 500

• Data from Yahoo! Finance (finance.yahoo.com).

• Daily closing prices for 452 stocks in the S&P 500 between 2003 and 2008 (before the onset of the "financial crisis").

• Log returns Xtj = log(St,j / St−1,j).

• Winsorized to trim outliers.

• In the following graphs, each node is a stock, and color indicates GICS industry:

Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Industrials, Information Technology, Materials, Telecommunications Services, Utilities.

S&P 500: Graphical Lasso

[Figure: estimated graph.]

S&P 500: Parallel Lasso

[Figure: estimated graph.]

Choosing λ

Can use:

1 Cross-validation
2 BIC = log-likelihood − (p/2) log n
3 AIC = log-likelihood − p

where p = number of parameters.

Remarks

1 Versions exist for discrete random variables.
2 Assumption-free approach (Wasserman, Kolar and Rinaldo 2013). (Includes confidence intervals.)
3 Very active area, especially in bio-informatics.
4 Nonparametric approaches (coming up).

NONPARAMETRIC MACHINE LEARNING

Nonparametric Regression

Given (X1, Y1), . . . , (Xn, Yn), predict Y from X.

Assume only that Yi = m(Xi) + εi where m(x) is a smooth function of x.

The most popular methods are kernel methods. However, there are two types of kernels:

1 Smoothing kernels
2 Mercer kernels

Smoothing kernels involve local averaging. Mercer kernels involve regularization.

Smoothing Kernels

• Smoothing kernel estimator:

m̂h(x) = ∑_{i=1}^n Yi Kh(Xi, x) / ∑_{i=1}^n Kh(Xi, x) = ∑i wi(x) Yi

where Kh(x, z) is a kernel such as

Kh(x, z) = exp( −‖x − z‖² / (2h²) )

and h > 0 is called the bandwidth.

• m̂h(x) is just a local average of the Yi's near x.

• The bandwidth h controls the bias-variance tradeoff: small h = large variance while large h = large bias.
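A minimal one-dimensional sketch of this estimator in R (Gaussian kernel; X, Y, and h are illustrative):

    # Smoothing kernel (local average) estimate at a point x0.
    m_hat <- function(x0, X, Y, h) {
      w <- exp(-(X - x0)^2 / (2 * h^2))   # kernel weights K_h(X_i, x0)
      sum(w * Y) / sum(w)                 # local weighted average of the Y_i
    }

    # Evaluate the fit on a grid:
    # grid <- seq(min(X), max(X), length = 200)
    # fit  <- sapply(grid, m_hat, X = X, Y = Y, h = 0.1)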

Example: Some Data — Plot of Yi versus Xi

[Figure: scatterplot of the data.]

Example: m(x)

[Figure: the regression function m(x) drawn through the data.]

m̂(x) is a local average

[Figure: the estimate at a point averages the Yi's with Xi near x.]

Effect of the bandwidth h

[Figure: four fits — very small, small, medium, and large bandwidth.]

Smoothing Kernels

Risk = E(Y − m̂h(X))² = bias² + variance + σ².

bias² ≈ h⁴

variance ≈ 1/(n h^d), where d = dimension of X.

σ² = E(Y − m(X))² is the unavoidable prediction error.

small h: low bias, high variance (undersmoothing)
large h: high bias, low variance (oversmoothing)

Risk Versus Bandwidth

[Figure: risk versus h — variance falls and bias rises as h grows; the optimal h balances the two.]

Estimating the Risk: Cross-Validation

To choose h we need to estimate the risk R(h). We can estimate the risk by using cross-validation.

1 Omit (Xi, Yi) to get m̂h,(i), then predict: Ŷ(i) = m̂h,(i)(Xi).
2 Repeat this for all observations.
3 The cross-validation estimate of risk is:

R̂(h) = (1/n) ∑_{i=1}^n (Yi − Ŷ(i))².

Shortcut formula:

R̂(h) = (1/n) ∑_{i=1}^n ( (Yi − Ŷi) / (1 − Lii) )²

where Lii = Kh(Xi, Xi) / ∑t Kh(Xi, Xt).

Summary so far

1 Compute m̂h for each h.
2 Estimate the risk R̂(h).
3 Choose the bandwidth ĥ to minimize R̂(h).
4 Let m̂(x) = m̂ĥ(x).

Example

[Figure: the data, the estimated risk versus h, and the fit at the chosen bandwidth.]

Mercer Kernels (RKHS)

Choose m to minimize

∑i (Yi − m(Xi))² + λ penalty(m)

where penalty(m) is a roughness penalty.

λ is a smoothing parameter that controls the amount of smoothing.

How do we construct a penalty that measures roughness? One approach is: Mercer kernels and RKHS = reproducing kernel Hilbert spaces.

What is a Mercer Kernel?

A Mercer kernel K(x, y) is symmetric and positive definite:

∫∫ f(x) f(y) K(x, y) dx dy ≥ 0 for all f.

Example: K(x, y) = e^{−‖x−y‖²/2}.

Think of K(x, y) as the similarity between x and y. We will create a set of basis functions based on K.

Fix z and think of K(z, x) as a function of x. That is,

Kz(x) = K(z, x)

is a function of the second argument, with the first argument fixed.

Mercer Kernels

Let

F = { f(·) = ∑_{j=1}^k βj K(zj, ·) }.

Define a norm by ‖f‖²K = ∑j ∑k βj βk K(zj, zk). ‖f‖K small means f is smooth.

If f = ∑r αr K(zr, ·) and g = ∑s βs K(ws, ·), the inner product is

⟨f, g⟩K = ∑r ∑s αr βs K(zr, ws).

F is a reproducing kernel Hilbert space (RKHS) because

⟨f, K(x, ·)⟩K = f(x).

Nonparametric Regression: Mercer Kernels

Representer Theorem: Let m̂ minimize

J = ∑_{i=1}^n (Yi − m(Xi))² + λ‖m‖²K.

Then

m̂(x) = ∑_{i=1}^n α̂i K(Xi, x)

for some α̂1, . . . , α̂n.

So, we only need to find the coefficients

α̂ = (α̂1, . . . , α̂n).

Nonparametric Regression: Mercer Kernels

Plug m(x) = ∑_{i=1}^n αi K(Xi, x) into J:

J = ‖Y − Kα‖² + λ αᵀKα

where Kjk = K(Xj, Xk).

Now we find α̂ to minimize J. We get: α̂ = (K + λI)⁻¹Y and m̂(x) = ∑i α̂i K(Xi, x).

The estimator depends on the amount of regularization λ. Again, there is a bias-variance tradeoff. We choose λ by cross-validation. This is like the bandwidth in smoothing kernel regression.
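A minimal R sketch of this RKHS estimator with a Gaussian Mercer kernel (X is an n × d matrix; h and λ are illustrative):

    krr <- function(X, Y, lambda, h = 1) {
      K <- exp(-as.matrix(dist(X))^2 / (2 * h^2))    # Gram matrix K_jk = K(X_j, X_k)
      alpha <- solve(K + lambda * diag(nrow(K)), Y)  # alpha-hat = (K + lambda I)^{-1} Y
      function(x0) {                                 # returns the function m-hat(.)
        k0 <- exp(-colSums((t(X) - x0)^2) / (2 * h^2))
        sum(alpha * k0)
      }
    }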

Smoothing Kernels Versus Mercer Kernels

Smoothing kernels: the bandwidth h controls the amount of smoothing.

Mercer kernels: the norm ‖f‖K controls the amount of smoothing.

In practice these two methods give answers that are very similar.

Mercer Kernels: Examples

[Figure: four fits — very small, small, medium, and large λ.]

Multiple Regression

Both methods extend easily to the case where X has dimension d > 1. For example, just use

K(x, y) = e^{−‖x−y‖²/2}.

However, this is hard to interpret and is subject to the curse of dimensionality. This means that the statistical performance and the computational complexity degrade as the dimension d increases.

An alternative is to use something less nonparametric, such as an additive model, where we restrict m(x1, . . . , xd) to be of the form:

m(x1, . . . , xd) = β0 + ∑j mj(xj).

Additive Models

Model: m(x) = β0 + ∑_{j=1}^d mj(xj).

We can take β̂0 = Ȳ and we will ignore β0 from now on.

We want to minimize

∑_{i=1}^n ( Yi − (m1(Xi1) + · · · + md(Xid)) )²

subject to the mj being smooth.

Additive Models

The backfitting algorithm (a short R sketch follows below):

• Set m̂j = 0.
• Iterate until convergence:
  • Iterate over j:
    • Ri = Yi − ∑_{k≠j} m̂k(Xik)
    • m̂j ←− smooth(Xj, R)

Here, smooth(Xj, R) is any one-dimensional nonparametric regression function.

R: gam

But what if d is large?
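Before turning to large d, here is a minimal backfitting sketch in R (assumes Y has been centered, i.e. β̂0 = Ȳ removed; smooth.spline serves as the one-dimensional smoother):

    backfit <- function(X, Y, iters = 20) {
      n <- nrow(X); d <- ncol(X)
      fits <- matrix(0, n, d)                          # fits[, j] holds m_j evaluated at X_ij
      for (it in 1:iters) {
        for (j in 1:d) {
          R <- Y - rowSums(fits[, -j, drop = FALSE])   # partial residuals
          fits[, j] <- predict(smooth.spline(X[, j], R), X[, j])$y
        }
      }
      fits
    }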

Sparse Additive Models (SpAM)

Ravikumar, Lafferty, Liu and Wasserman (JRSS, 2009).

Minimize

J = ∑i ( Yi − ∑j mj(Xij) )² + λ ∑j ‖mj‖

subject to the mj being smooth. This is basically a nonparametric version of the lasso.

Solution: backfitting + soft thresholding. Iterate:

• Ri = Yi − ∑_{k≠j} m̂k(Xik)
• m̂j = smooth(R, Xj)
• m̂j ← m̂j [1 − λ/ŝj]₊   (soft thresholding)
• where ŝj² = (1/n) ∑i m̂j(Xij)²

Example: Boston Housing Data

Predict house value Y from 10 covariates.

We added 20 irrelevant (random) covariates to test the method.

Y = house value; n = 506, d = 30.

Y = β0 + m1(crime) + m2(tax) + · · · + m30(X30) + ε.

Note that m11 = · · · = m30 = 0.

We choose λ by minimizing the estimated risk.

SpAM yields 6 nonzero functions. It correctly reports that m̂11 = · · · = m̂30 = 0.

L2 norms of fitted functions versus 1/λ

[Figure: component norms ‖m̂j‖ versus 1/λ.]

Estimated Risk Versus λ

[Figure: Cp estimate of the risk versus λ.]

Example Fits

[Figure: fitted function m̂1 versus x1.]

Example Fits

[Figure: fitted function m̂8 versus x8.]

Nonparametric Classification

There are many nonparametric classification techniques. Because they are similar to regression, I will not go into great detail.

Two common methods:

Plug-in:

h(x) = { +1 if m̂(x) ≥ 1/2
         −1 if m̂(x) < 1/2

where m̂(x) is from nonparametric regression.

Kernelization:

Often, replacing inner products ⟨x, y⟩ with a kernel K(x, y) converts a linear classifier into a nonparametric classifier. For example, the kernelized support vector machine.

knn Classification

Given x, find the k nearest neighbors (X(1), Y(1)), . . . , (X(k), Y(k)).

Set h(x) = 1 if the majority of the Y(i) are 1's (0 otherwise).

Choose k by cross-validation.

This IS a plug-in classifier. It corresponds to a kernel regression:

m̂(x) = ∑i Yi I(‖x − Xi‖ ≤ h(x)) / ∑i I(‖x − Xi‖ ≤ h(x))

where h(x) is the distance to the k-th nearest neighbor.
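In R this is essentially a one-liner with the class package (Xtrain, Ytrain, Xtest are illustrative placeholders; choose k by cross-validation):

    library(class)
    yhat <- knn(train = Xtrain, test = Xtest, cl = factor(Ytrain), k = 5)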

knn Classification

[Figure: two-class data in two dimensions with the knn decision boundary.]

knn Classification

[Figure: the knn decision boundary for a different value of k.]

Kernelization

Recall the SVM:

α̂ = argmax_{α∈Rⁿ} ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{k=1}^n αi αk Yi Yk ⟨Xi, Xk⟩

subject to ∑_{i=1}^n αi Yi = 0 and 0 ≤ α1, . . . , αn ≤ 1/λ′, where β̂ = ∑_{i=1}^n α̂i Yi Xi.

Replacing ⟨Xi, Xk⟩ with K(Xi, Xk) (kernelization) makes it nonparametric. Why?

K(x, y) = ⟨φ(x), φ(y)⟩ where φ maps x into a very high dimensional space.

The kernelized (nonlinear) SVM is equivalent to a linear SVM in a high-dimensional space. Like adding interactions in regression.

Can kernelize any procedure that depends on inner products.

NONPARAMETRIC GRAPHS

Regression vs. Graphical Models

assumptions      regression               graphical models
parametric       lasso                    graphical lasso
nonparametric    sparse additive model    nonparanormal

The Nonparanormal (Liu, Lafferty, Wasserman, 2009)

A random vector X = (X1, . . . , Xp)ᵀ has a nonparanormal distribution,

X ∼ NPN(µ, Σ, f),

in case

Z ≡ f(X) ∼ N(µ, Σ)

where f(X) = (f1(X1), . . . , fp(Xp)).

Joint density:

pX(x) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp{ −(1/2) (f(x) − µ)ᵀ Σ⁻¹ (f(x) − µ) } ∏_{j=1}^p |f′j(xj)|

• Semiparametric Gaussian copula

Examples

[Figure: examples of nonparanormal distributions.]

The Nonparanormal

• Define hj(x) = Φ⁻¹(Fj(x)) where Fj(x) = P(Xj ≤ x).

• Let Λ be the covariance matrix of Z = h(X). Then

Xj ⫫ Xk | Xrest

if and only if Λ⁻¹jk = 0.

• Hence we need to:

1 Estimate hj(x) = Φ⁻¹(Fj(x)).
2 Estimate the covariance matrix of Z = h(X) using the glasso.

Properties

• LLW (2009) show that the resulting procedure has the same theoretical properties as the glasso, even with the dimension d increasing with n.

• If the nonparanormal is used when the data are actually Normal, little efficiency is lost.

• Implemented in R: huge
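A minimal sketch assuming the huge package's interface: transform each column, then run the glasso on the transformed data:

    library(huge)
    Z <- huge.npn(X)                     # nonparanormal transformation of each column
    est <- huge(Z, method = "glasso")    # then proceed exactly as in the Gaussian case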

Gene-Gene Interactions for Arabidopsis thaliana

[Image: Arabidopsis thaliana; source: wikipedia.org]

Dataset from Affymetrix microarrays, sample size n = 118, p = 40 genes (isoprenoid pathway).

Page 132: Statistical Machine Learning a.k.a. High Dimensional Inferencelarry/Heinz.pdfBias-Variance Tradeoff Prediction Risk = Bias2 + Variance Prediction methods withlow biastend to havehigh

Example Results

[Figure: estimated graphs on the 40 genes — nonparanormal (NPN), glasso, and their difference.]

Page 133:

Transformations for 3 Genes

[Figure: estimated marginal transformations f̂j for genes x5, x8, and x18.]

• These genes have highly non-Normal marginal distributions.

• The graphs are different at these genes.

Page 134:

S&P Data (2003–2008): Graphical Lasso

Page 135:

S&P Data: Nonparanormal

Page 136:

S&P Data: Nonparanormal vs. Glasso

[Figure: common edges and differences between the nonparanormal and glasso graphs.]

Page 137:

Another approach: Forests

• Nonparanormal: Unrestricted graphs, semiparametric.

• We’ll now trade off structural flexibility for greater nonparametricity.

• Forests are graphs with no cycles.

• Force the graph to be a forest but make no distributional assumptions.

Page 138:

Forest Densities (Gupta, Lafferty, Liu, Wasserman, Xu, 2011)

A distribution is supported by a forest F with edge set E(F) if

p(x) = ∏_{(i,j)∈E(F)} [ p(xi, xj) / ( p(xi) p(xj) ) ] · ∏_{k∈V} p(xk)

• For known marginal densities p(xi, xj), the best tree is obtained by minimum weight spanning tree algorithms.

• In high dimensions, a spanning tree will overfit.

• We prune back to a forest.

Page 139:

Step 1: Constructing a Full Tree

• Compute kernel density estimates:

f̂n1(xi, xj) = (1/n1) ∑_{s∈D1} (1/h2²) K( (Xi^(s) − xi)/h2 ) K( (Xj^(s) − xj)/h2 )

• Estimate the mutual informations:

Î n1(Xi, Xj) = (1/m²) ∑_{k=1}^m ∑_{ℓ=1}^m f̂n1(xki, xℓj) log[ f̂n1(xki, xℓj) / ( f̂n1(xki) f̂n1(xℓj) ) ]

• Run Kruskal’s algorithm (Chow–Liu) on these edge weights.
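A simplified sketch of Step 1 (hedged: 2-d histograms stand in for the kernel density estimates, and scipy's minimum spanning tree on negated weights implements the maximum-weight Kruskal/Chow–Liu tree):

```python
# Sketch: plug-in mutual information per pair, then a maximum-weight spanning tree.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mutual_information(x, y, bins=10):
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / np.outer(px, py)[mask])))

def chow_liu_tree(X):
    n, p = X.shape
    W = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            W[i, j] = mutual_information(X[:, i], X[:, j])
    # MST of the negated weights = maximum-weight spanning tree (Chow-Liu).
    T = minimum_spanning_tree(-W)
    return [(int(i), int(j)) for i, j in zip(*T.nonzero())]

X = np.random.default_rng(2).normal(size=(500, 6))
print(chow_liu_tree(X))
```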

Page 140:

Step 2: Pruning the Tree

• Held-out risk:

R̂n2(f̂F) = − ∑_{(i,j)∈E} ∫ f̂n2(xi, xj) log[ f̂(xi, xj) / ( f̂(xi) f̂(xj) ) ] dxi dxj

• The selected forest is given by

k̂ = argmin_{k ∈ {0, . . . , p−1}} R̂n2( f̂_{T^(k)_{n1}} )

where T^(k)_{n1} is the forest obtained after k steps of Kruskal’s algorithm.
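Continuing the Step 1 sketch (this reuses mutual_information and chow_liu_tree from above, so it is not standalone), a held-out estimate of the mutual information stands in for the integral in R̂n2:

```python
# Sketch of Step 2: keep the first k Kruskal edges, with k chosen on held-out data.
import numpy as np

def prune_forest(X_train, X_heldout):
    edges = chow_liu_tree(X_train)   # from the Step 1 sketch above
    # Kruskal inserts the surviving tree edges in order of decreasing weight.
    edges.sort(key=lambda e: -mutual_information(X_train[:, e[0]], X_train[:, e[1]]))
    heldout = [mutual_information(X_heldout[:, i], X_heldout[:, j]) for i, j in edges]
    # Held-out risk of the forest with the first k edges (k = 0: empty forest).
    risks = [-sum(heldout[:k]) for k in range(len(edges) + 1)]
    return edges[: int(np.argmin(risks))]

X = np.random.default_rng(3).normal(size=(1000, 6))
print(prune_forest(X[:500], X[500:]))
```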

Page 141:

S&P Data: Forest Graph—Oops! Outliers.

Page 142:

S&P Data: Forest Graph

Page 143:

S&P Data: Forest vs. Nonparanormal

[Figure: common edges and differences between the forest and nonparanormal graphs.]

Page 144:

Graph-Valued Regression

• (X1,Y1), . . . , (Xn,Yn) where Yi is high-dimensional

• We’ll discuss one particular version: graph-valued regression (Chen, Lafferty, Liu, Wasserman, 2010)

• Let G(x) be the graph for Y based on p(y |x)

• This defines a partition X1, . . . , Xk where G(x) is constant over each partition element.

• Three methods to find G(x):

I Parametric

I Kernel graph-valued regression

I GO-CART (Graph-Optimized CART)

Page 145:

Graph-Valued Regression

multivariate regression (supervised): µ(x) = E(Y | x), with Y ∈ R^p and x ∈ R^q
graphical model (unsupervised): Graph(Y) = (V, E), where (j, k) ∉ E ⟺ Yj ⊥⊥ Yk | Yrest

Both feed into graph-valued regression: Graph(Y | x).

• Gene associations from phenotype (or vice versa)

• Voting patterns from covariates on bills

• Stock interactions given market conditions, news items

Page 146:

Method I: Parametric

• Assume that Z = (X ,Y ) is jointly multivariate Gaussian.

• Σ = ( ΣX ΣXY ; ΣYX ΣY ).

• Estimate Σ̂X, Σ̂Y, and Σ̂XY.

• Get Ω̂X by the glasso.

• Σ̂Y|X = Σ̂Y − Σ̂YX Ω̂X Σ̂XY.

• But, the estimated graph does not vary with different values of X .
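A numpy sketch of Method I (hedged: a plain matrix inverse stands in for the glasso to keep the example short):

```python
# Sketch of Method I: one conditional covariance, hence one graph for all x.
import numpy as np

def conditional_graph(X, Y, tol=1e-8):
    q = X.shape[1]
    S = np.cov(np.hstack([X, Y]), rowvar=False)   # joint covariance of Z = (X, Y)
    S_X, S_Y = S[:q, :q], S[q:, q:]
    S_XY, S_YX = S[:q, q:], S[q:, :q]
    Omega_X = np.linalg.inv(S_X)                  # in practice: glasso estimate
    S_Y_given_X = S_Y - S_YX @ Omega_X @ S_XY     # Sigma_{Y|X}
    Omega = np.linalg.inv(S_Y_given_X)            # again: glasso in practice
    return np.abs(Omega) > tol                    # same graph for every x

rng = np.random.default_rng(4)
X, Y = rng.normal(size=(300, 2)), rng.normal(size=(300, 5))
print(conditional_graph(X, Y))
```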

Page 147:

Method II: Kernel Smoothing

• Y |X = x ∼ N(µ(x),Σ(x)).

µ̂(x) = ∑_{i=1}^n K(‖x − xi‖/h) yi / ∑_{i=1}^n K(‖x − xi‖/h),

Σ̂(x) = ∑_{i=1}^n K(‖x − xi‖/h) (yi − µ̂(x)) (yi − µ̂(x))^T / ∑_{i=1}^n K(‖x − xi‖/h).

• Apply the glasso to Σ̂(x).

• Easy to do, but recovering the partition X1, . . . , Xk requires difficult post-processing.
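A sketch of Method II (assumptions: a Gaussian kernel, and scikit-learn's graphical_lasso applied to the kernel-weighted covariance at a query point x0):

```python
# Sketch: kernel-weighted mean/covariance at x0, then the glasso on Sigma_hat(x0).
import numpy as np
from sklearn.covariance import graphical_lasso

def local_graph(X, Y, x0, h=0.5, alpha=0.1):
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2 * h ** 2))  # K(||x - xi|| / h)
    w = w / w.sum()
    mu = w @ Y                          # mu_hat(x0)
    R = Y - mu
    Sigma = (w[:, None] * R).T @ R      # Sigma_hat(x0)
    _, Omega = graphical_lasso(Sigma, alpha=alpha)
    return np.abs(Omega) > 1e-8         # graph at this particular x0

rng = np.random.default_rng(5)
X, Y = rng.uniform(size=(300, 1)), rng.normal(size=(300, 4))
print(local_graph(X, Y, x0=np.array([0.5])))
```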

Page 148:

Method III: Partition Estimator

• Run CART, but use the Gaussian log-likelihood (on held-out data) to determine the splits

• This yields a partition X1, . . . , Xk (and a corresponding tree)

• Run the glasso within each partition element
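A toy sketch of the splitting step (assumptions: a single axis-aligned split scored by held-out Gaussian log-likelihood; the full method recurses and then runs the glasso in each cell):

```python
# Sketch: choose the covariate/threshold whose split maximizes held-out log-likelihood.
import numpy as np
from scipy.stats import multivariate_normal

def heldout_loglik(Y_train, Y_heldout):
    mu = Y_train.mean(axis=0)
    Sigma = np.cov(Y_train, rowvar=False) + 1e-3 * np.eye(Y_train.shape[1])
    return multivariate_normal(mu, Sigma).logpdf(Y_heldout).sum()

def best_split(X, Y, Xh, Yh, n_grid=20):
    best = (None, None, -np.inf)   # (covariate j, threshold t, score)
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.1, 0.9, n_grid)):
            L, Lh = X[:, j] <= t, Xh[:, j] <= t
            if min(L.sum(), (~L).sum(), Lh.sum(), (~Lh).sum()) < 10:
                continue            # keep both cells populated
            score = heldout_loglik(Y[L], Yh[Lh]) + heldout_loglik(Y[~L], Yh[~Lh])
            if score > best[2]:
                best = (j, float(t), score)
    return best

rng = np.random.default_rng(6)
X, Y = rng.uniform(size=(400, 2)), rng.normal(size=(400, 3))
Xh, Yh = rng.uniform(size=(400, 2)), rng.normal(size=(400, 3))
print(best_split(X, Y, Xh, Yh))
```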

Page 149:

Simulated Data

[Figure: simulated data — the partition tree with splits on X1 and X2, the estimated 20-node graphs within each partition element, and held-out risk versus splitting sequence number (panels a–c).]

Page 150:

Climate Data

[Figure: climate data — partitions and estimated graphs over the variables CO2, CH4, CO, H2, WET, CLD, VAP, PRE, FRS, DTR, TMN, TMP, TMX, GLO, ETR, ETRN, DIR, UV (panels a–c).]

Page 151:

CLUSTERING AND MIXTURES

Page 152:

Clustering (Unsupervised Learning)

1 Parametric Clustering (mixtures):

p(x) = ∑_{j=1}^k πj φ(x; µj, Σj)

2 Nonparametric Clustering: Estimate the density p(x) nonparametrically, then either:

- construct a level set tree (cluster tree), or
- find the modes and their basins of attraction.

Others: k-means, hierarchical, ...

Page 153:

Mixtures

X1, . . . , Xn ∼ p(x; θ) = ∑_{j=1}^k πj φ(x; µj, Σj)

Find θ̂ to maximize

L(θ) = ∏_{i=1}^n p(Xi; θ),

usually by the EM algorithm.

The likelihood is very non-convex and EM is not good. There are now many papers on using spectral methods and moment methods to estimate high dimensional mixtures.

Page 154:

Mixtures

Latent cluster variable: Zi = (0, 0, 1, 0, 0, 0, 0, 0) means Xi is from cluster 3.

P(Xi in cluster j) = π̂j φ(Xi; µ̂j, Σ̂j) / ∑_k π̂k φ(Xi; µ̂k, Σ̂k).

k is also a parameter, usually chosen by penalized maximum likelihood; for example, BIC: maximize

log L(θ̂) − (dk / 2) log n

where dk is the number of parameters. (The theory is a bit dodgy, but see papers by Watanabe and also Drton.)
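A minimal sketch of this selection rule: scikit-learn's GaussianMixture fits by EM, and its .bic method computes −2 log L + dk log n, the same criterion up to sign and a factor of 2.

```python
# Fit Gaussian mixtures for several k by EM and pick k minimizing BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(-3, 1, size=(200, 2)),   # two well-separated clusters
               rng.normal(3, 1, size=(200, 2))])

bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
k_hat = min(bics, key=bics.get)
print("BIC per k:", bics, "-> selected k:", k_hat)
```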

Page 155:

Nonparametric: Level Set Trees

Step 1: Estimate the density:

p̂(x) = (1/n) ∑_{i=1}^n (1/h^d) K( ‖x − Xi‖ / h )

where K is a kernel (a smooth symmetric density) and h > 0 is a bandwidth.

Step 2: Find the level sets: Lλ = {x : p̂(x) > λ}.

Step 3: Find the connected components of Lλ, i.e. Lλ = C1 ∪ · · · ∪ Cr.
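A sketch of Steps 1–3 (assumptions: a Gaussian KDE, and the connected components of the level set approximated by graph components of the sample points above the level, joining points within distance eps):

```python
# Sketch: KDE, threshold at level lam, connected components of nearby points.
import numpy as np
from scipy.stats import gaussian_kde
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

def level_set_clusters(X, lam, eps=0.5):
    p_hat = gaussian_kde(X.T)                 # Step 1: density estimate
    above = p_hat(X.T) > lam                  # Step 2: sample points with p_hat > lam
    pts = X[above]
    A = csr_matrix(cdist(pts, pts) < eps)     # Step 3: connect points within eps
    n_comp, labels = connected_components(A, directed=False)
    return pts, labels, n_comp

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(2, 0.5, (100, 2))])
pts, labels, r = level_set_clusters(X, lam=0.01)
print("number of connected components:", r)
```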

Page 156:

Nonparametric: Level Set Trees

Page 157:

Nonparametric: Level Set Trees

As we vary λ, the connected components form a tree.

Software: in R, denpro; in Python, DeBaCl, written by Brian Kent:

https://github.com/CoAxLab/DeBaCl

The following examples were done by Brian Kent.

Thanks Brian! (He has done this successfully in d = 11,000 dimensions.)

Page 158:

Example

Page 159:

Example

Page 160:

Example

Page 161:

Density Clustering II: Mode Clustering

Step I: Estimate the density p̂.

Step II: Find the modes m1, . . . , mk of p̂. Mean-shift algorithm: repeat

x ← ∑_i Xi K( ‖x − Xi‖ / h ) / ∑_i K( ‖x − Xi‖ / h )

Step III: Find the basins of attraction Cj (Morse complex) of each mode mj. Specifically, x ∈ Cj if the gradient ascent path starting at x leads to mj.
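A minimal sketch of the mean-shift update (one could equally use sklearn.cluster.MeanShift); each point ascends toward a mode, and points sharing a mode share a basin of attraction. The bandwidth, step count, and rounding tolerance are assumptions for the example.

```python
# Sketch: iterate the mean-shift update from every data point; cluster by mode.
import numpy as np

def mean_shift(X, h=0.5, steps=100):
    modes = X.copy()
    for _ in range(steps):
        for i in range(len(modes)):
            w = np.exp(-np.sum((X - modes[i]) ** 2, axis=1) / (2 * h ** 2))
            modes[i] = w @ X / w.sum()     # x <- kernel-weighted average of the data
    # Basin label = which (rounded) mode the point converged to.
    _, labels = np.unique(np.round(modes, 2), axis=0, return_inverse=True)
    return modes, labels

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(-2, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
modes, labels = mean_shift(X)
print("basins found:", np.unique(labels))
```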

Page 162:

Example

Page 163:

Example

[Figure: mode clustering example — two modes, labeled a and b.]

Page 164:

Example

Page 165:

Example

[Figure: example data plotted in the coordinates u1 and u2.]

Page 166:

Example

Page 167:

Dimension Reduction

Recall: linear dimension reduction.

MDS (multidimensional scaling).

Y1, . . . ,Yn ∈ Rd .

Let φ be a linear projection onto R2 and let Zi = φ(Yi).

Minimize

∑_{i,j} ( ‖Yi − Yj‖² − ‖Zi − Zj‖² )

over all linear projections.

Solution: project onto the first two principal components u1, u2.
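A two-line sketch of this solution: the optimal rank-2 linear projection is exactly the projection onto the first two principal components, which scikit-learn's PCA computes.

```python
# Classical (linear) MDS via PCA: Zi = projection of Yi onto (u1, u2).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(10)
Y = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))  # correlated features

Z = PCA(n_components=2).fit_transform(Y)
print(Z.shape)   # (200, 2)
```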

Page 168:

Nonlinear Dimension Reduction

Many methods such as:

-Isomap

-LLE

-diffusion maps

-kernel PCA

+ many others

I will briefly explain diffusion maps.

(Coifman, Lafon, Lee, 2005)

Page 169:

Nonlinear Dimension Reduction

Imagine a Markov chain starting at Xi and jumping to another point Xj with probability

p(i, j) ∝ Kh(Xi, Xj).

The transition matrix is M = D^{−1} K, where K(i, j) = Kh(Xi, Xj) and D is diagonal with Dii = ∑_j Kh(Xi, Xj).

The diffusion distance measures how easy it is to get from Xi to Xj. This is essentially Euclidean distance in a new space.

We can visualize this by plotting the eigenvectors of M.
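A minimal numpy sketch of this construction (assumptions: a Gaussian kernel and an embedding from the top nontrivial eigenvectors of M; the leading eigenvector is constant since M is row-stochastic):

```python
# Sketch of a diffusion map: build M = D^{-1} K, embed with eigenvectors of M.
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_map(X, h=0.5, n_coords=2):
    K = np.exp(-cdist(X, X) ** 2 / (2 * h ** 2))     # K(i, j) = K_h(Xi, Xj)
    M = K / K.sum(axis=1, keepdims=True)             # M = D^{-1} K (row-stochastic)
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)                   # eigenvalue 1 comes first
    return vecs[:, order[1:n_coords + 1]].real       # skip the constant eigenvector

theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)]              # points on a circle
print(diffusion_map(X)[:3])
```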

Page 170:

Nonlinear Dimension Reduction

Page 171:

Nonlinear Dimension Reduction

Page 172:

Requests

Causal Inference: let’s talk.

Counterexamples to Bayes: let’s talk (after a few drinks).

But see:

http://normaldeviate.wordpress.com/2012/08/28/robins-and-wasserman-respond-to-a-nobel-prize-winner/

http://normaldeviate.wordpress.com/2012/12/08/flat-priors-in-flatland-stones-paradox/

Also:

Owhadi, Scovel and Sullivan (2013). arXiv:1304.6772

Page 173:

What I Didn’t Cover

1 Bayes
2 PCA, sparse PCA
3 Nonlinear dimension reduction
4 Directed graphs
5 Convex optimization
6 Semi-supervised methods
7 Active learning
8 Sequential learning
9 ...

Page 174:

THE END
