
1

Model Assessment and Selection

Lecture Notes for Comp540 Chapter7

Jian Li

Mar.2007

2

Goal

• Model Selection

• Model Assessment

3

A Regression Problem

• y = f(x) + noise

• Can we learn f from this data?

• Let’s consider three methods...

4

Linear Regression

5

Quadratic Regression

6

Joining the dots

7

Which is best?

• Why not choose the method with the best fit to the data?

"How well are you going to predict future data drawn from the same distribution?"

8

Model Selection and Assessment

• Model Selection: estimating the performance of different models in order to choose the best one (the one with the minimum test error)

• Model Assessment: Having chosen a model, estimating the prediction error on new data

9

Why Errors

• Why do we want to study errors?

• In a data-rich situation, split the data into three parts: Train (fit the models), Validation (model selection), Test (model assessment)

• But that is not usually the case: data are often too scarce to split three ways

10

Overall Motivation

• Errors
  - Measurement of errors (loss functions)
  - Decomposing test error into bias & variance

• Estimating the true error
  - Estimating in-sample error (analytically): AIC, BIC, MDL, SRM with VC
  - Estimating extra-sample error (efficient sample reuse): cross-validation & bootstrapping

11

Measuring Errors: Loss Functions

• Typical regression loss functions:

  Squared error: $L(Y, \hat f(X)) = (Y - \hat f(X))^2$

  Absolute error: $L(Y, \hat f(X)) = |Y - \hat f(X)|$

12

Measuring Errors: Loss Functions

• Typical classification loss functions:

  0-1 loss: $L(G, \hat G(X)) = I\big(G \neq \hat G(X)\big)$

  Log-likelihood (cross-entropy loss / deviance): $L(G, \hat p(X)) = -2\sum_{k=1}^{K} I(G = k)\,\log \hat p_k(X) = -2\,\log \hat p_G(X)$
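As a concrete illustration (not from the original notes), here is a minimal sketch of these loss functions in Python/NumPy; the function names are my own:

```python
import numpy as np

def squared_error(y, y_hat):
    """Squared-error loss for regression: (y - f_hat(x))^2, element-wise."""
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    """Absolute-error loss for regression: |y - f_hat(x)|, element-wise."""
    return np.abs(y - y_hat)

def zero_one_loss(g, g_hat):
    """0-1 loss for classification: 1 if the predicted class is wrong, else 0."""
    return (g != g_hat).astype(float)

def deviance(g, p_hat):
    """Cross-entropy loss / deviance: -2 * log-probability assigned to the
    true class g.  p_hat has shape (n_samples, n_classes)."""
    return -2.0 * np.log(p_hat[np.arange(len(g)), g])

# Example: average training loss under the two regression losses
y, y_hat = np.array([1.0, 2.0]), np.array([1.5, 1.0])
print(squared_error(y, y_hat).mean(), absolute_error(y, y_hat).mean())
```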

13

The Goal: Low Test Error

• We want to minimize the generalization error, or test error:
  $$\mathrm{Err} = E\big[L(Y, \hat f(X))\big]$$

• But all we really know is the training error:
  $$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f(x_i)\big)$$

• And this is a bad estimate of the test error.

14

Bias, Variance & Complexity

Training error can always be reduced by increasing model complexity, but doing so risks over-fitting.

Typically, training error decreases steadily with model complexity, while test error first decreases and then increases again once the model starts to over-fit.

15

Decomposing Test Error

Model: $Y = f(X) + \varepsilon$, with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$.

For squared-error loss and this additive-noise model:
$$\mathrm{Err}(x_0) = E\big[(Y - \hat f(x_0))^2 \mid X = x_0\big]
= \sigma_\varepsilon^2 + \big[E\hat f(x_0) - f(x_0)\big]^2 + E\big[\hat f(x_0) - E\hat f(x_0)\big]^2
= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}$$

• Irreducible error: the variance of the target $Y$ around its true mean $f(x_0)$; it cannot be avoided no matter how well we estimate $f$.

• Bias$^2$: the squared deviation of the average estimate from the true function.

• Variance: the expected squared deviation of our estimate around its own mean.

16

Further Bias Decomposition

• For linear models (e.g. ridge regression), the average squared bias can be further decomposed into an average model bias and an average estimation bias:
$$E_{x_0}\big[f(x_0) - E\hat f_\alpha(x_0)\big]^2
= E_{x_0}\big[f(x_0) - x_0^T\beta_*\big]^2 + E_{x_0}\big[x_0^T\beta_* - E\,x_0^T\hat\beta_\alpha\big]^2
= \text{Ave[Model Bias]}^2 + \text{Ave[Estimation Bias]}^2$$

where $\beta_* = \arg\min_\beta E\big(f(X) - X^T\beta\big)^2$ is the best-fitting linear approximation to $f$.

• For standard (least-squares) linear regression, the estimation bias is 0; for restricted fits such as ridge regression it is positive, traded off against a reduction in variance.

17

Model Fitting

Graphical representation of bias & variance:

[Schematic: the truth lies outside the hypothesis (model) space. The closest fit in population (if $\varepsilon = 0$) differs from the truth by the model bias. A particular realization of the data gives the closest fit given our observations, which scatters around the population fit with some estimation variance. A regularized model space (ridge regression) yields a shrunken fit, adding estimation bias but reducing estimation variance.]

18

Bias & Variance Decomposition Examples

• kNN regression (k-nearest-neighbour fit at $x_0$):
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2 + \frac{\sigma_\varepsilon^2}{k}$$

• Linear regression with $p$ inputs/basis functions: the fit puts linear weights on $y$, $\hat f_p(x_i) = h(x_i)^T y$. Averaging over the training set:
$$\frac{1}{N}\sum_{i=1}^{N} \mathrm{Err}(x_i)
= \sigma_\varepsilon^2 + \frac{1}{N}\sum_{i=1}^{N}\big[f(x_i) - E\hat f(x_i)\big]^2 + \frac{p}{N}\,\sigma_\varepsilon^2$$
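Below is a minimal simulation sketch (my own addition, not from the slides) that checks the kNN decomposition numerically; the target function np.sin, the noise level, and k are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                      # true function (arbitrary choice)
sigma = 0.3                     # noise standard deviation
N, k, x0 = 50, 5, 1.0           # training size, neighbours, test point
R = 20000                       # number of simulated noise realizations

x = rng.uniform(0, 3, N)                  # training inputs, held fixed
nn = np.argsort(np.abs(x - x0))[:k]       # k nearest neighbours of x0

preds = np.empty(R)
for r in range(R):
    y = f(x) + rng.normal(0, sigma, N)    # new noisy responses each time
    preds[r] = y[nn].mean()               # kNN regression estimate at x0

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
print("bias^2  :", bias2, " (theory:", (f(x0) - f(x[nn]).mean()) ** 2, ")")
print("variance:", variance, " (theory: sigma^2/k =", sigma**2 / k, ")")
print("Err(x0) :", sigma**2 + bias2 + variance)
```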

19

Simulated Example of Bias-Variance Decomposition

[Figure: prediction error, squared bias and variance, for regression with squared-error loss (left) and classification with 0-1 loss (right).]

• Regression with squared-error loss: prediction error = Bias$^2$ + Variance; the additive decomposition holds.

• Classification with 0-1 loss: prediction error $\neq$ Bias$^2$ + Variance; bias and variance interact differently for 0-1 loss than for squared-error loss.

• Estimation errors on the correct side of the decision boundary don't hurt!

20

Optimism of The Training Error Rate

• Typically the training error rate is less than the true error, because the same data are used both to fit the method and to assess its error:
$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f(x_i)\big) \;<\; \mathrm{Err} = E\big[L(Y, \hat f(X))\big]$$

so the training error is an overly optimistic estimate of the generalization error.

21

Estimating Test Error

• Can we estimate the discrepancy between $\overline{\mathrm{err}}$ and $\mathrm{Err}$, i.e. adjust the training error for its optimism?

• $\mathrm{Err}$ as defined above is the extra-sample error: the test input $X$ need not coincide with any training input.

• $\mathrm{Err}_{in}$, the in-sample error, keeps the training inputs fixed and takes the expectation over $N$ new responses at each $x_i$:
$$\mathrm{Err}_{in} = \frac{1}{N}\sum_{i=1}^{N} E_{Y^{new}}\big[L\big(Y_i^{new}, \hat f(x_i)\big)\big]$$

22

Optimism

Optimism: $op \equiv \mathrm{Err}_{in} - E_y[\overline{\mathrm{err}}]$

For squared error, 0-1 and other loss functions:
$$op = \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i)$$

Summary:
$$\mathrm{Err}_{in} = E_y[\overline{\mathrm{err}}] + \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i)$$

• For a linear fit with $d$ independent inputs/basis functions:
$$\mathrm{Err}_{in} = E_y[\overline{\mathrm{err}}] + 2\,\frac{d}{N}\,\sigma_\varepsilon^2$$

so optimism increases linearly with the number of inputs $d$, and decreases as the training sample size $N$ increases.
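For the linear-fit case, a short derivation sketch (standard, though not spelled out on the slide): writing the fit as $\hat y = S y$ with $S$ a projection onto $d$ basis functions and $\mathrm{Cov}(y) = \sigma_\varepsilon^2 I$,

$$\sum_{i=1}^{N}\mathrm{Cov}(\hat y_i, y_i)
= \mathrm{trace}\big(\mathrm{Cov}(Sy, y)\big)
= \mathrm{trace}\big(S\big)\,\sigma_\varepsilon^2
= d\,\sigma_\varepsilon^2,
\qquad\text{hence}\qquad
op = \frac{2}{N}\,d\,\sigma_\varepsilon^2 .$$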

23

Ways to Estimate Prediction Error

• In-sample error estimates: AIC, BIC, MDL, SRM

• Extra-sample error estimates:
  - Cross-validation (leave-one-out, K-fold)
  - Bootstrap

24

Estimates of In-Sample Prediction Error

• General form of the in-sample estimate:
$$\widehat{\mathrm{Err}}_{in} = \overline{\mathrm{err}} + \hat{op}$$

• For a linear fit with $d$ parameters under squared-error loss:
$$C_p = \overline{\mathrm{err}} + 2\,\frac{d}{N}\,\hat\sigma_\varepsilon^2, \quad\text{the so-called } C_p \text{ statistic,}$$

where $\hat\sigma_\varepsilon^2$ is an estimate of the noise variance.

25

AIC & BIC

Similarly, using the log-likelihood as the measure of fit:

Akaike Information Criterion (AIC):
$$\mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2\,\frac{d}{N}$$

Bayesian Information Criterion (BIC):
$$\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\, d$$

26

AIC & BIC

Equivalently, written as scores to be maximized:

$$\mathrm{AIC\ score} = LL(\mathrm{Data} \mid \mathrm{MLE\ params}) - (\#\ \mathrm{of\ parameters})$$

$$\mathrm{BIC\ score} = LL(\mathrm{Data} \mid \mathrm{MLE\ params}) - \frac{\log N}{2}\,(\#\ \mathrm{of\ parameters})$$
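A minimal sketch (my own, not from the notes) that computes these scores for a Gaussian linear-regression fit in NumPy; the noise variance is estimated by maximum likelihood and counted as one of the parameters:

```python
import numpy as np

def gaussian_aic_bic(X, y):
    """Return (loglik, AIC score, BIC score) for an OLS fit with Gaussian noise."""
    n, d = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n                     # MLE of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    n_params = d + 1                               # coefficients + noise variance
    aic = loglik - n_params                        # higher is better
    bic = loglik - 0.5 * np.log(n) * n_params
    return loglik, aic, bic

# Example: compare a linear and a quadratic fit on toy data
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 40)
y = 1.0 + 0.5 * x + rng.normal(0, 0.4, 40)
X1 = np.column_stack([np.ones_like(x), x])            # linear model
X2 = np.column_stack([np.ones_like(x), x, x ** 2])    # quadratic model
print(gaussian_aic_bic(X1, y))
print(gaussian_aic_bic(X2, y))
```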

27

MDL(Minimum Description Length)

• Regularity ~ Compressibility

• Learning ~ Finding regularities

[Diagram: input samples in $R^n$ are passed through both the learned model and the real model; the learned model's predictions in $R^1$ are compared with the real class in $R^1$, and the discrepancy is the error.]

28

MDL(Minimum Description Length)

• Regularity ~ Compressibility

• Learning ~ Finding regularities

A model can be viewed as a code for transmitting the data. The total description length is
$$\mathrm{length} = -\log \Pr(y \mid \theta, M, X) \;-\; \log \Pr(\theta \mid M)$$

• the first term is the length of transmitting the discrepancy between the model's predictions and the data, under optimal coding given the model;

• the second term is the length of describing the model itself, under optimal coding.

MDL principle: choose the model with the minimum description length.

This is equivalent to maximizing the posterior: $\Pr(y \mid \theta, M, X)\,\Pr(\theta \mid M)$.
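Spelling out the equivalence (a one-line check, not on the original slide):

$$-\log \Pr(y \mid \theta, M, X) - \log \Pr(\theta \mid M) = -\log\big[\Pr(y \mid \theta, M, X)\,\Pr(\theta \mid M)\big],$$

and for a fixed model class $M$ the product $\Pr(y \mid \theta, M, X)\,\Pr(\theta \mid M)$ is proportional to the posterior $\Pr(\theta \mid y, M, X)$, so the minimum-length parameters are exactly the maximum-a-posteriori parameters.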

29

SRM with the VC (Vapnik-Chervonenkis) Dimension

• $h$ = VC dimension of the class of functions (a measure of the class's power/capacity).

• Vapnik showed that, with probability $1 - \eta$ over the training sample,
$$\mathrm{Err}_{true} \;\le\; \mathrm{Err}_{train} + \frac{\epsilon}{2}\Big(1 + \sqrt{1 + \frac{4\,\mathrm{Err}_{train}}{\epsilon}}\Big),
\qquad \epsilon = a_1\,\frac{h\big(\log(a_2 N / h) + 1\big) - \log(\eta/4)}{N}$$

• As $h$ increases the bound loosens: richer classes pay a larger complexity penalty.

• Structural Risk Minimization (SRM): a method of selecting a class $F$ from a family of nested classes by fitting each class and choosing the one with the smallest bound on the true error.
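A small sketch (my own) that evaluates the bound numerically; the constants a1 and a2 are left as parameters, with the worst-case values a1 = 4, a2 = 2 used as illustrative defaults:

```python
import math

def vc_bound(err_train, n, h, eta=0.05, a1=4.0, a2=2.0):
    """Upper bound on the true error, holding with probability 1 - eta,
    for a hypothesis class of VC dimension h trained on n examples."""
    eps = a1 * (h * (math.log(a2 * n / h) + 1) - math.log(eta / 4)) / n
    return err_train + eps / 2 * (1 + math.sqrt(1 + 4 * err_train / eps))

# The bound grows with the VC dimension h: richer classes pay a larger penalty.
for h in (5, 20, 100):
    print(h, round(vc_bound(err_train=0.10, n=1000, h=h), 3))
```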

30

Errin Estimation

• A trade-off between the fit to the data and the model complexity

$$\mathrm{AIC} = \overline{\mathrm{err}} + 2\,\frac{d}{N}\,\hat\sigma_\varepsilon^2
\qquad\qquad
\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\, d$$

$$\mathrm{MDL:}\quad \mathrm{length} = -\log \Pr(y \mid \theta, M, X) - \log \Pr(\theta \mid M)$$

$$\mathrm{VC:}\quad \mathrm{Err}_{true} \le \mathrm{Err}_{train} + \frac{\epsilon}{2}\Big(1 + \sqrt{1 + \frac{4\,\mathrm{Err}_{train}}{\epsilon}}\Big)$$

31

Estimation of Extra-Sample Err

• Cross Validation

• Bootstrap

32

K-fold Cross-Validation

• Split the data into $K$ roughly equal parts; for each part, fit the model on the other $K - 1$ parts (train) and compute the prediction error on the held-out part (test); average over the $K$ folds.

• With $\kappa: \{1,\ldots,N\} \to \{1,\ldots,K\}$ the fold assignment and $\hat f^{-k}(x, \alpha)$ the fit computed with the $k$-th part removed (for tuning parameter $\alpha$):
$$\mathrm{CV}(\alpha) = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i,\, \hat f^{-\kappa(i)}(x_i, \alpha)\big)$$
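A minimal K-fold cross-validation sketch in NumPy (my own illustration; the `fit` and `predict` arguments stand in for whatever learning method is being assessed):

```python
import numpy as np

def kfold_cv_error(X, y, fit, predict, K=5, loss=lambda y, p: (y - p) ** 2, seed=0):
    """Estimate prediction error by K-fold cross-validation.

    fit(X_train, y_train) -> model;  predict(model, X_test) -> predictions.
    """
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    errors = np.empty(n)
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)        # everything outside the fold
        model = fit(X[train_idx], y[train_idx])
        errors[test_idx] = loss(y[test_idx], predict(model, X[test_idx]))
    return errors.mean()

# Example: least-squares fit with an intercept on toy data
fit = lambda X, y: np.linalg.lstsq(np.column_stack([np.ones(len(X)), X]), y, rcond=None)[0]
predict = lambda beta, X: np.column_stack([np.ones(len(X)), X]) @ beta
rng = np.random.default_rng(1)
x = rng.uniform(0, 3, 60).reshape(-1, 1)
y = np.sin(x[:, 0]) + rng.normal(0, 0.3, 60)
print(kfold_cv_error(x, y, fit, predict, K=5))
```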

33

How many folds?

There is a trade-off along the spectrum from small-$K$ K-fold to leave-one-out ($K = N$):

• as $K$ increases, the bias of the error estimate decreases (each training fold is nearly the full data set) and the computation increases;

• as $K$ decreases, the variance of the estimate decreases, at the cost of more bias.

34

Cross-Validation: Choosing K

Popular choices for $K$: 5, 10, and $N$ (leave-one-out).

35

Generalized Cross-Validation

• Leave-one-out CV can be computationally expensive for linear fitting with large $N$.

• GCV provides a computationally cheaper approximation.

• For a linear fitting method, $\hat y = S y$, where $S$ is the smoother matrix.

• For many linear fitting methods under squared-error loss, the leave-one-out CV error has the closed form
$$\frac{1}{N}\sum_{i=1}^{N}\big[y_i - \hat f^{(-i)}(x_i)\big]^2
= \frac{1}{N}\sum_{i=1}^{N}\Big[\frac{y_i - \hat f(x_i)}{1 - S_{ii}}\Big]^2$$
where $S_{ii}$ is the $i$-th diagonal element of $S$.

• GCV replaces each $S_{ii}$ by the average $\mathrm{trace}(S)/N$:
$$\mathrm{GCV} = \frac{1}{N}\sum_{i=1}^{N}\Big[\frac{y_i - \hat f(x_i)}{1 - \mathrm{trace}(S)/N}\Big]^2$$
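A small sketch (my own, assuming ridge regression as the linear fitting method) that forms the smoother matrix and computes the GCV score:

```python
import numpy as np

def ridge_gcv(X, y, lam):
    """GCV score for a ridge-regression fit with penalty lam."""
    n, p = X.shape
    # Smoother matrix: y_hat = S y with S = X (X'X + lam I)^{-1} X'
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    y_hat = S @ y
    edf = np.trace(S)                      # effective degrees of freedom
    return np.mean(((y - y_hat) / (1 - edf / n)) ** 2)

# Example: pick the ridge penalty by minimizing GCV
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1.0, 80)
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
print(min(lams, key=lambda lam: ridge_gcv(X, y, lam)))
```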

36

Bootstrap: Main Concept

“The bootstrap is a computer-based method of statistical inference that can answer many real statistical questions without formulas”

(An Introduction to the Bootstrap, Efron and Tibshirani, 1993)

Step 1: Draw bootstrap samples from the data ($N$ points sampled with replacement), $B$ times.

Step 2: Calculate the statistic of interest on each bootstrap sample; its spread across the $B$ samples estimates the sampling variability.

37

Why does the bootstrap work?

• We want the sampling distribution of a statistic, e.g. the sample mean $\bar x$.

• In practice we cannot afford to draw a large number of independent random samples from the population, and theory gives the sampling distribution only in special cases.

• Bootstrap idea: the sample stands for the population, and the distribution of $\bar x$ across many resamples stands for the sampling distribution.

38

Bootstrap: Error Estimation with Errboot

The true variance $\mathrm{Var}_F[S(Z)]$ of a statistic $S(Z)$ depends on the unknown true distribution $F$; the bootstrap estimates it from $B$ bootstrap samples $Z^{*1}, \ldots, Z^{*B}$:
$$\widehat{\mathrm{Var}}[S(Z)] = \frac{1}{B-1}\sum_{b=1}^{B}\big(S(Z^{*b}) - \bar S^{*}\big)^2,
\qquad \bar S^{*} = \frac{1}{B}\sum_{b=1}^{B} S(Z^{*b})$$

A straightforward application of the bootstrap to error prediction: fit the model $\hat f^{*b}$ on each bootstrap sample and evaluate it on the original data,
$$\widehat{\mathrm{Err}}_{boot} = \frac{1}{B}\,\frac{1}{N}\sum_{b=1}^{B}\sum_{i=1}^{N} L\big(y_i, \hat f^{*b}(x_i)\big)$$
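A minimal sketch of $\widehat{\mathrm{Err}}_{boot}$ (my own illustration; 1-nearest-neighbour regression is used as the learner purely so that no extra libraries are needed):

```python
import numpy as np

def one_nn_predict(x_train, y_train, x_test):
    """1-nearest-neighbour regression prediction for 1-D inputs."""
    nearest = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[nearest]

rng = np.random.default_rng(3)
N, B = 50, 200
x = rng.uniform(0, 3, N)
y = np.sin(x) + rng.normal(0, 0.3, N)

losses = np.empty((B, N))
for b in range(B):
    idx = rng.integers(0, N, N)                 # bootstrap sample: N draws with replacement
    y_hat = one_nn_predict(x[idx], y[idx], x)   # fit on Z*b, evaluate on the original data
    losses[b] = (y - y_hat) ** 2

err_boot = losses.mean()
print("Err_boot =", err_boot)   # optimistic: each Z*b shares ~63.2% of points with Z
```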

39

Bootstrap: Error Estimation with Err(1)

A CV-inspired improvement on $\widehat{\mathrm{Err}}_{boot}$: each observation is predicted only by models fit on bootstrap samples that do not contain it.

$$\widehat{\mathrm{Err}}^{(1)} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|C^{-i}|}\sum_{b \in C^{-i}} L\big(y_i, \hat f^{*b}(x_i)\big)$$

where $C^{-i}$ is the set of indices of bootstrap samples that do not contain observation $i$, and $|C^{-i}|$ is its size. This is the leave-one-out bootstrap estimate.

40

Bootstrap: Error Estimation with Err(.632)

An improvement on $\widehat{\mathrm{Err}}^{(1)}$ in light-fitting cases: $\widehat{\mathrm{Err}}^{(1)}$ is biased upward, because each bootstrap fit sees on average only about 63.2% of the distinct training points.

$$\widehat{\mathrm{Err}}^{(.632)} = .368\,\overline{\mathrm{err}} + .632\,\widehat{\mathrm{Err}}^{(1)}$$

Where does .632 come from? With $N$ data points $Z = (z_1, \ldots, z_N)$:

• Probability of $z_i$ NOT being chosen when 1 point is uniformly sampled from $Z$: $1 - \frac{1}{N}$

• Probability of $z_i$ NOT being chosen when $Z$ is sampled $N$ times: $\big(1 - \frac{1}{N}\big)^N$

• Probability of $z_i$ being chosen AT LEAST once when $Z$ is sampled $N$ times: $1 - \big(1 - \frac{1}{N}\big)^N \approx 1 - e^{-1} \approx 0.632$

The .632 estimate pulls $\widehat{\mathrm{Err}}^{(1)}$ back toward the training error:
$$\widehat{\mathrm{Err}}^{(.632)} = \overline{\mathrm{err}} + .632\,\big(\widehat{\mathrm{Err}}^{(1)} - \overline{\mathrm{err}}\big) = .368\,\overline{\mathrm{err}} + .632\,\widehat{\mathrm{Err}}^{(1)}$$
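A sketch (my own, with the same 1-nearest-neighbour learner as above) computing $\overline{\mathrm{err}}$, $\widehat{\mathrm{Err}}^{(1)}$ and $\widehat{\mathrm{Err}}^{(.632)}$ from one set of bootstrap fits:

```python
import numpy as np

def one_nn_predict(x_train, y_train, x_test):
    nearest = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[nearest]

rng = np.random.default_rng(4)
N, B = 50, 500
x = rng.uniform(0, 3, N)
y = np.sin(x) + rng.normal(0, 0.3, N)

err_bar = np.mean((y - one_nn_predict(x, y, x)) ** 2)   # training error (0 for 1-NN)

loss = np.full((B, N), np.nan)
for b in range(B):
    idx = rng.integers(0, N, N)                          # bootstrap sample
    out = np.setdiff1d(np.arange(N), idx)                # points NOT in this sample
    if out.size:
        loss[b, out] = (y[out] - one_nn_predict(x[idx], y[idx], x[out])) ** 2

# Leave-one-out bootstrap: average each point's loss over the samples omitting it
err1 = np.nanmean(np.nanmean(loss, axis=0))
err632 = 0.368 * err_bar + 0.632 * err1
print(err_bar, err1, err632)
```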

41

Bootstrap: Error Estimation with Err(.632+)

An improvement on Err(.632) by adaptively accounting for overfitting

• Depending on the amount of overfitting, the best error estimate can be as little as $\widehat{\mathrm{Err}}^{(.632)}$, as much as $\widehat{\mathrm{Err}}^{(1)}$, or something in between

• $\widehat{\mathrm{Err}}^{(.632+)}$ is like $\widehat{\mathrm{Err}}^{(.632)}$ with adaptive weights, with $\widehat{\mathrm{Err}}^{(1)}$ weighted at least .632

• $\widehat{\mathrm{Err}}^{(.632+)}$ adaptively mixes the training error and the leave-one-out bootstrap error using the relative overfitting rate $\hat R$

42

Bootstrap: Error Estimation with Err(.632+)

$\widehat{\mathrm{Err}}^{(.632+)}$ ranges from $\widehat{\mathrm{Err}}^{(.632)}$ if there is no overfitting ($\hat R = 0$) to $\widehat{\mathrm{Err}}^{(1)}$ if there is maximal overfitting ($\hat R = 1$); the weights are sketched below.
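For completeness, the standard weights (from Efron & Tibshirani, not spelled out on the slide), where $\hat\gamma$ is the no-information error rate obtained by predicting with inputs and outputs matched at random:

$$\hat R = \frac{\widehat{\mathrm{Err}}^{(1)} - \overline{\mathrm{err}}}{\hat\gamma - \overline{\mathrm{err}}},
\qquad
\hat w = \frac{.632}{1 - .368\,\hat R},
\qquad
\widehat{\mathrm{Err}}^{(.632+)} = (1 - \hat w)\,\overline{\mathrm{err}} + \hat w\,\widehat{\mathrm{Err}}^{(1)} .$$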

43

Cross Validation & Bootstrap

• Why bother with cross-validation and bootstrap when analytical estimates are known?

1) AIC, BIC, MDL and SRM all require knowledge of the effective number of parameters $d$ (or the VC dimension $h$), which is difficult to obtain in most situations.

2) Bootstrap and cross-validation give similar results to the above, but are also applicable in more complex situations.

3) Estimating the noise variance requires a roughly correct working model; cross-validation and bootstrap work well even if the model is far from correct.

44

Conclusion

• Test error plays a crucial role in model selection.

• AIC, BIC and SRM/VC have the advantage that you only need the training error.

• If the VC dimension is known, then SRM is a good method for model selection; it requires much less computation than CV and bootstrap, but is wildly conservative.

• Methods like CV and bootstrap give tighter error bounds, but might have more variance.

• Asymptotically, AIC and leave-one-out CV should be the same.

• Asymptotically, BIC and a carefully chosen K-fold should be the same.

• BIC is what you want if you want the best structure instead of the best predictor.

• The bootstrap has much wider applicability than just estimating prediction error.