The Elements of Statistical Learning
Transcript of The Elements of Statistical Learning
The Elements of Statistical Learning
Thomas Lengauer, Christian Merkwirth
based on the book by Hastie, Tibshirani, and Friedman
Prerequisites:
- Vordiplom in mathematics or computer science, or equivalent
- Linear algebra
- Basic knowledge of statistics

Credits: Übungsschein, based on
- at least 50% of the points in the homework
- a final exam, probably oral

Good for: Bioinformatics, CS Theory, or Applications

Time:
- Lecture: Wed 11–13, HS024, Building 46 (MPI)
- Tutorial: Fri 14–16, Rm. 15, Building 45 (CS Dept.), biweekly, starting Oct. 31
1. Introduction
Applications of Statistical Learning
- Medical: predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. Data: demographic, diet, and clinical measurements.
- Business/Economics: predict the price of a stock 6 months from now. Data: company performance, economic data.
- Vision: identify hand-written ZIP codes. Data: models of hand-written digits.
- Medical: estimate the amount of glucose in the blood of a diabetic. Data: infrared absorption spectrum of a blood sample.
- Medical: identify risk factors for prostate cancer. Data: clinical, demographic.
Types of Data
Two basically different types of data:
- Quantitative (numerical), e.g. stock price
- Categorical (discrete, often binary), e.g. cancer / no cancer

Outputs are predicted on the basis of a set of features (e.g. diet or clinical measurements), learned from a set of (observed) training data on these features for a set of objects (e.g. people).
- Inputs are also called predictors or independent variables.
- Outputs are also called responses or dependent variables.

The prediction model is called a learner or estimator (Schätzer).
- Supervised learning: outcomes are available for the observed features.
- Unsupervised learning: no outcome values are available.
Example 1: Email Spam
Data: 4601 email messages, each labeled email (+) or spam (−).
Features: the relative frequencies of the 57 most commonly occurring words and punctuation marks in the message.
Prediction goal: label future messages email (+) or spam (−).
This is a supervised learning problem on categorical data: a classification problem.
| word   | spam | email |
|--------|------|-------|
| george | 0.00 | 1.27  |
| you    | 2.26 | 1.27  |
| your   | 1.38 | 0.44  |
| hp     | 0.02 | 0.90  |
| free   | 0.52 | 0.07  |
| hpl    | 0.01 | 0.43  |
| !      | 0.51 | 0.11  |
| our    | 0.51 | 0.18  |
| re     | 0.13 | 0.42  |
| edu    | 0.01 | 0.29  |
| remove | 0.28 | 0.01  |

Words with the largest difference between spam and email shown.
Examples of rules for prediction:
- If (%george < 0.6) and (%you > 1.5), then spam, else email.
- If (0.2·%you − 0.3·%george) > 0, then spam, else email.

Tolerance to errors:
- Tolerant of letting some spam through (false positive).
- No tolerance for throwing out genuine email (false negative).
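The second rule is linear in the word frequencies. A minimal sketch of it in Python (the helper `classify` and the test inputs are my own, using the average frequencies from the table):

```python
# Hypothetical sketch of the slide's second rule:
# classify as spam when 0.2*%you - 0.3*%george > 0.
def classify(pct_you: float, pct_george: float) -> str:
    """Linear decision rule on word frequencies (illustrative only)."""
    return "spam" if 0.2 * pct_you - 0.3 * pct_george > 0 else "email"

# Average frequencies from the table above:
print(classify(pct_you=2.26, pct_george=0.00))  # typical spam profile -> "spam"
print(classify(pct_you=1.27, pct_george=1.27))  # typical email profile -> "email"
```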
Example 2: Prostate Cancer
Data (by Stamey et al. 1989). Given:
- lcavol: log cancer volume
- lweight: log prostate weight
- age
- lbph: log benign prostatic hyperplasia amount
- svi: seminal vesicle invasion
- lcp: log capsular penetration
- gleason: Gleason score
- pgg45: percent of Gleason scores 4 or 5

Predict: PSA (prostate specific antigen) level.

This is a supervised learning problem on quantitative data: a regression problem.
The figure shows pairwise scatter plots of the input variables. The first row shows the outcome plotted against each input variable. The variables svi and gleason are categorical.
Example 3: Recognition of Handwritten Digits
- Data: images of single digits, 16×16 8-bit gray-scale, normalized for size and orientation
- Task: classify newly written digits
- Non-binary (multi-class) classification problem
- Low tolerance to misclassifications
Example 4: DNA Expression Microarrays
Data: color intensities signifying the abundance levels of mRNA for a number of genes (6830) in several (64) different cell states (samples).
- Red: over-expressed gene
- Green: under-expressed gene
- Black: normally expressed gene
(relative to some predefined background)

Predict:
- Which genes show similar expression over the samples, and which samples show similar expression over the genes (unsupervised learning problem)
- Which genes are highly over- or under-expressed in certain cancers (supervised learning problem)
2. Overview of Supervised Learning
2.2 Notation
- Inputs: X; Xj is the j-th element of the vector X; p = #inputs, N = #observations.
- The matrix **X** is written in bold. Vectors are written in bold (**x**i) if they have N components and thus summarize all observations on variable Xi.
- Vectors are assumed to be column vectors.
- Discrete inputs are often described by characteristic vectors (dummy variables).
- Outputs: quantitative Y, qualitative G (for group).
- Observed variables are written in lower case; the i-th observed value of X is xi and can be a scalar or a vector.

Main question of this lecture: given the value of an input vector X, make a good prediction Ŷ of the output Y.
- The prediction should be of the same kind as the sought output (categorical vs. quantitative).
- Exception: binary outputs can be approximated by values in [0,1], which can be interpreted as probabilities. This generalizes to K-level outputs.
2.3.1 Simple Approach 1: Least-Squares
Given inputs $X = (X_1, X_2, \ldots, X_p)$, predict the output Y via the model

$$\hat Y = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j$$

- $\hat\beta_0$ is the bias (intercept).
- Including the constant variable 1 in X, we can write $\hat Y = X^T\hat\beta$.
- Here Y is a scalar (if Y is a K-vector, then β is a p×K matrix of coefficients).
- In the (p+1)-dimensional input–output space, $(X, \hat Y)$ represents a hyperplane. If the constant is included in X, the hyperplane goes through the origin.
- $f(X) = X^T\beta$ is a linear function; its gradient $f'(X) = \beta$ is a vector that points in the steepest uphill direction.
Training procedure: the method of least squares (N = #observations). Minimize the residual sum of squares

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N}\left(y_i - x_i^T\beta\right)^2$$

or, equivalently, in matrix notation

$$\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta)$$

- This quadratic function always has a global minimum, but it may not be unique.
- Differentiating w.r.t. β yields the normal equations: $\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0$
- If $\mathbf{X}^T\mathbf{X}$ is nonsingular, the unique solution is $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
- The fitted value at input x is $\hat y(x) = x^T\hat\beta$.
- The entire fitted surface is characterized by $\hat\beta$.
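The normal-equations solution can be checked numerically. A minimal sketch with NumPy (the data are invented for illustration; the constant 1 is included as the first column of X):

```python
import numpy as np

# Least-squares fit via the normal equations:
# beta_hat = (X^T X)^{-1} X^T y.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])          # first column: the constant 1
y = np.array([1.0, 3.0, 5.0, 7.0])  # exactly y = 1 + 2x

# Solve X^T X beta = X^T y rather than forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [1. 2.]
```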
Example: data on two inputs X1 and X2.
- The output variable has values GREEN (coded 0) and RED (coded 1); 100 points per class.
- The regression line (decision boundary) is defined by $x^T\hat\beta = 0.5$: classify RED if $x^T\hat\beta > 0.5$, GREEN if $x^T\hat\beta < 0.5$.
- Easy, but many misclassifications if the problem is not linear.
2.3.2 Simple Approach 2: Nearest Neighbors
Uses those observations in the training sample T that are closest to the given input:

$$\hat Y(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i$$

- $N_k(x)$ is the set of the k points in T closest to x.
- Average the outcomes of the k closest training sample points; classify RED if $\hat Y > 0.5$, GREEN if $\hat Y < 0.5$.
- Figure: 15-nearest-neighbor averaging; fewer misclassifications than the linear fit.
With k = 1 (1-nearest-neighbor averaging), no training point is misclassified: the fit overtrains.
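A minimal sketch of this averaging classifier (the data points are invented for illustration):

```python
import numpy as np

# k-NN classifier as on the slides: average the 0/1 labels of the
# k nearest training points and classify RED when the average > 0.5.
def knn_predict(x, X_train, y_train, k):
    """y_train holds GREEN=0 / RED=1 labels."""
    dist = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dist)[:k]               # indices of N_k(x)
    y_hat = y_train[nearest].mean()              # \hat{Y}(x)
    return "RED" if y_hat > 0.5 else "GREEN"

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([0.05, 0.0]), X_train, y_train, k=3))  # GREEN
```

With k = 1 the query point always inherits the label of its single nearest training point, which is why every training point is classified correctly.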
2.3.3 Comparison of the Two Approaches
Least squares:
- p parameters (p = #features)
- Low variance (robust)
- High bias (rests on strong assumptions)
- Good for Scenario 1: training data in each class generated from a two-dimensional Gaussian; the two Gaussians are independent and have different means

K-nearest neighbors:
- Apparently one parameter, k; in fact effectively N/k parameters (N = #observations)
- High variance (not robust)
- Low bias (rests only on weak assumptions)
- Good for Scenario 2: training data in each class from a mixture of 10 low-variance Gaussians, with means themselves distributed as a Gaussian (first choose the Gaussian, then choose the point according to that Gaussian)
2.3.3 Origin of the Data

(Figure: result of 10 000 classifications.)

Mixture of the two scenarios:
- Step 1: Generate 10 means $m_k$ from the bivariate Gaussian distribution $N((1,0)^T, I)$ and label this class GREEN.
- Step 2: Similarly, generate 10 means from the bivariate Gaussian distribution $N((0,1)^T, I)$ and label this class RED.
- Step 3: For each class, generate 100 observations as follows: for each observation, pick an $m_k$ at random with probability 1/10, then generate a point according to $N(m_k, I/5)$.

This setup is similar to Scenario 2.
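The three steps above can be sketched directly with NumPy (the function name and seed are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mixture_class(center, n_means=10, n_obs=100):
    """Draw n_means cluster centers from N(center, I), then n_obs points,
    each from N(m_k, I/5) for a center m_k picked with probability 1/10."""
    means = rng.multivariate_normal(center, np.eye(2), size=n_means)
    picks = rng.integers(0, n_means, size=n_obs)
    noise = rng.multivariate_normal([0.0, 0.0], np.eye(2) / 5, size=n_obs)
    return means[picks] + noise

green = make_mixture_class([1.0, 0.0])  # class GREEN
red = make_mixture_class([0.0, 1.0])    # class RED
print(green.shape, red.shape)           # (100, 2) (100, 2)
```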
2.3.3 Variants of These Simple Methods
- Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the 0/1 cutoff used in nearest-neighbor methods.
- In high-dimensional spaces, some variables are emphasized more than others.
- Local regression fits linear models (by least squares) locally, rather than fitting constants locally.
- Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
- Projection pursuit and neural network models are sums of nonlinearly transformed linear models.
2.4 Statistical Decision Theory
- Random input vector $X \in \mathbb{R}^p$; random output variable $Y \in \mathbb{R}$; joint distribution $\Pr(X, Y)$.
- We look for a function f(x) for predicting Y given the values of the input X.
- The loss function L(Y, f(X)) penalizes prediction errors. Squared error loss: $L(Y, f(X)) = (Y - f(X))^2$.
- Expected prediction error (EPE):

$$\mathrm{EPE}(f) = \mathrm{E}(Y - f(X))^2 = \int (y - f(x))^2 \Pr(dx, dy)$$

- Since $\Pr(X, Y) = \Pr(Y \mid X)\Pr(X)$, EPE can also be written as

$$\mathrm{EPE}(f) = \mathrm{E}_X \mathrm{E}_{Y|X}\!\left([Y - f(X)]^2 \mid X\right)$$

- Thus it suffices to minimize EPE pointwise:

$$f(x) = \mathrm{argmin}_c\, \mathrm{E}_{Y|X}\!\left([Y - c]^2 \mid X = x\right)$$

- The solution is the regression function: $f(x) = \mathrm{E}(Y \mid X = x)$.
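The pointwise minimization can be made explicit with a standard one-line decomposition (not on the slide):

```latex
\mathrm{E}\!\left[(Y-c)^2 \mid X=x\right]
  = \mathrm{E}\!\left[(Y-\mathrm{E}[Y\mid X=x])^2 \mid X=x\right]
  + \bigl(\mathrm{E}[Y\mid X=x]-c\bigr)^2
```

The cross term vanishes, and the first term does not depend on c, so the minimum is attained at $c = \mathrm{E}[Y \mid X=x]$, the regression function.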
Nearest-neighbor methods try to implement this recipe directly, with several approximations:
- Since there are usually no duplicate observations at x, the expectation is taken over a neighborhood:

$$\hat f(x) = \mathrm{Ave}\left(y_i \mid x_i \in N_k(x)\right)$$

- The expectation is approximated by averaging over observations.
- With increasing k and number of observations, the average gets (provably) more stable.
- But often we do not have large samples. By making assumptions (linearity), we can greatly reduce the number of required observations.
- With increasing dimension, the neighborhood grows exponentially, so the rate of convergence to the true estimator (with increasing k) decreases.
Linear regression:
- Assumes that the regression function is approximately linear: $f(x) \approx x^T\beta$. This is a model-based approach.
- Plugging this expression into EPE and differentiating w.r.t. β, we can solve theoretically:

$$\mathrm{EPE}(f) = \mathrm{E}(Y - X^T\beta)^2 \quad\Rightarrow\quad \beta = \left[\mathrm{E}(XX^T)\right]^{-1}\mathrm{E}(XY)$$

- Again, linear regression replaces the theoretical expectation by averaging over the observed data:

$$\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

Summary:
- Least squares assumes that f(x) is well approximated by a globally linear function.
- Nearest neighbors assumes that f(x) is well approximated by a locally constant function.
Additional methods in this book are often model-based but more flexible than the linear model. Example: additive models

$$f(X) = \sum_{j=1}^{p} f_j(X_j),$$

where each $f_j$ is arbitrary.

What happens if we use another loss function, e.g. $L_1(Y, f(X)) = |Y - f(X)|$? In this case the solution is the conditional median,

$$\hat f(x) = \mathrm{median}(Y \mid X = x)$$

- More robust than the conditional mean.
- But the L1 criterion is not differentiable.
- Squared error loss remains the most popular.
Procedure for a categorical output variable G with values from $\mathcal{G}$:
- The loss function is a K×K matrix L, where K = card($\mathcal{G}$).
- L is zero on the diagonal; L(k, ℓ) is the price paid for misclassifying an element of class $\mathcal{G}_k$ as belonging to class $\mathcal{G}_\ell$.
- Frequently the 0–1 loss function is used: $L(k, \ell) = 1 - \delta_{k\ell}$.

Expected prediction error (EPE), with the expectation taken w.r.t. the joint distribution Pr(G, X):

$$\mathrm{EPE} = \mathrm{E}\left[L(G, \hat G(X))\right]$$

Conditioning yields

$$\mathrm{EPE} = \mathrm{E}_X \sum_{k=1}^{K} L(\mathcal{G}_k, \hat G(X)) \Pr(\mathcal{G}_k \mid X)$$

Again, pointwise minimization suffices:

$$\hat G(x) = \mathrm{argmin}_{g \in \mathcal{G}} \sum_{k=1}^{K} L(\mathcal{G}_k, g) \Pr(\mathcal{G}_k \mid X = x)$$

With 0–1 loss this is simply the Bayes classifier:

$$\hat G(x) = \mathcal{G}_k \ \text{ if } \ \Pr(\mathcal{G}_k \mid X = x) = \max_{g \in \mathcal{G}} \Pr(g \mid X = x)$$
(Figure: the Bayes classifier induces the Bayes-optimal decision boundary.)
2.5 Local Methods in High Dimensions
Curse of dimensionality: local neighborhoods become increasingly global as the number of dimensions increases.

Example: points uniformly distributed in a p-dimensional unit hypercube. A hypercubical neighborhood in p dimensions that captures a fraction r of the data has edge length $e_p(r) = r^{1/p}$:
- $e_{10}(0.01) \approx 0.63$
- $e_{10}(0.1) \approx 0.80$

To capture 1% of the data we must cover 63% of the range of each input variable.
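The edge length is easy to tabulate; a two-line sketch:

```python
# Edge length of a hypercube neighborhood capturing a fraction r of
# uniformly distributed data in p dimensions: e_p(r) = r**(1/p).
def edge_length(r: float, p: int) -> float:
    return r ** (1.0 / p)

print(round(edge_length(0.01, 10), 2))  # 0.63
print(round(edge_length(0.1, 10), 2))   # 0.79
```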
Reducing r reduces the number of observations in the neighborhood and thus the stability of the estimate.
In high dimensions, all sample points are close to the edge of the sample.
- Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin.
- The median distance from the origin to the closest data point is (homework):

$$d(p, N) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p}$$

- d(10, 500) ≈ 0.52: more than half the way to the boundary.

The sampling density is proportional to $N^{1/p}$: if $N_1 = 100$ is a dense sample for one input, then $N_{10} = 100^{10}$ is an equally dense sample for 10 inputs.
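The quoted value d(10, 500) follows directly from the formula; a one-function sketch:

```python
# Median distance from the origin to the closest of N uniform points
# in the unit p-ball: d(p, N) = (1 - (1/2)**(1/N))**(1/p).
def median_closest_distance(p: int, n: int) -> float:
    return (1.0 - 0.5 ** (1.0 / n)) ** (1.0 / p)

print(round(median_closest_distance(10, 500), 2))  # 0.52
```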
Another example:
- Training set T: points $x_i$ generated uniformly in $[-1, 1]^p$ (red).
- Functional relationship between X and Y (green): $Y = f(X) = e^{-8\|X\|^2}$, with no measurement error.
- We study the error of a 1-nearest-neighbor rule in estimating f(0) (blue): the prediction is the value of y at the closest training point.
- (Figure: 10 training points, closest training point, prediction.)
Since the problem is deterministic, the prediction error at $x_0 = 0$ is the mean squared error for estimating f(0). Telescoping,

$$\begin{aligned}
\mathrm{MSE}(x_0) &= \mathrm{E}_T\!\left[(f(x_0) - \hat y_0)^2\right] \\
&= \mathrm{E}_T\!\left[(\hat y_0 - \mathrm{E}_T(\hat y_0))^2\right] + \left[\mathrm{E}_T(\hat y_0) - f(x_0)\right]^2 \\
&= \mathrm{Var}_T(\hat y_0) + \mathrm{Bias}^2(\hat y_0)
\end{aligned}$$

(Figure: 10 training points, closest training point, prediction; the 1-NN estimate of f(0) has a downward bias, since f peaks at $x_0 = 0$.)
Side Calculation: Bias-Variance Decomposition
$$\begin{aligned}
\mathrm{MSE}(x_0) &= \mathrm{E}_T\!\left[(f(x_0) - \hat y_0)^2\right]
 = \mathrm{E}_T\!\left[\bigl(f(x_0) - \mathrm{E}_T(\hat y_0) + \mathrm{E}_T(\hat y_0) - \hat y_0\bigr)^2\right] \\
&= \left[\mathrm{E}_T(\hat y_0) - f(x_0)\right]^2
 + \mathrm{E}_T\!\left[(\hat y_0 - \mathrm{E}_T(\hat y_0))^2\right]
 + 2\,\mathrm{E}_T\!\left[\bigl(f(x_0) - \mathrm{E}_T(\hat y_0)\bigr)\bigl(\mathrm{E}_T(\hat y_0) - \hat y_0\bigr)\right] \\
&= \mathrm{Bias}^2(\hat y_0) + \mathrm{Var}_T(\hat y_0)
\end{aligned}$$

The cross term vanishes: $f(x_0) - \mathrm{E}_T(\hat y_0)$ is a constant (the negative bias), and $\mathrm{E}_T\!\left[\mathrm{E}_T(\hat y_0) - \hat y_0\right] = 0$.
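The decomposition can be checked by simulation in the slides' setup (1-d, $f(x) = e^{-8x^2}$, 10 uniform training points); the Monte-Carlo loop and seed are my own scaffolding:

```python
import numpy as np

# Numeric check of MSE = Var + Bias^2 for the 1-NN estimate of f(0).
rng = np.random.default_rng(1)
f = lambda x: np.exp(-8.0 * x ** 2)   # f(0) = 1

estimates = []
for _ in range(20000):                    # repeat over training sets T
    x = rng.uniform(-1.0, 1.0, size=10)   # 10 uniform training points
    nearest = x[np.argmin(np.abs(x))]     # closest point to x0 = 0
    estimates.append(f(nearest))          # 1-NN prediction of f(0)

y_hat = np.array(estimates)
mse = np.mean((1.0 - y_hat) ** 2)
var, bias2 = y_hat.var(), (y_hat.mean() - 1.0) ** 2
print(abs(mse - (var + bias2)) < 1e-10)   # decomposition holds
```

With sample moments the identity is exact algebra, so the check holds up to floating-point rounding; the simulation also shows the downward bias, since every prediction $f(x_{\text{nearest}})$ is below f(0) = 1.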
Another example: 1-d vs. 2-d. In the MSE decomposition above, the bias term increases with dimension, since the nearest neighbor lies farther from the target point. (Figure: 1-d bias vs. 2-d bias.)
The case of N = 1000 training points (averaged):
- The variance does not increase with dimension, since the function is symmetric around 0.
- The bias increases, since the distance to the nearest neighbour increases.
Yet another example: $Y = f(X) = \tfrac{1}{2}(X_1 + 1)^3$.
- The variance increases, since the function is not symmetric around 0.
- The bias increases only moderately, since the function is monotonic.
Assume now a linear relationship with measurement error:

$$Y = X^T\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

We fit the model by least squares. For an arbitrary test point $x_0$:

$$\hat y_0 = x_0^T\hat\beta = x_0^T\beta + \sum_{i=1}^{N} \ell_i(x_0)\,\varepsilon_i,$$

where $\ell_i(x_0)$ is the i-th element of $\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0$.

The expected prediction error at $x_0$ (expectation over $y_0$ and the training set T):

$$\begin{aligned}
\mathrm{EPE}(x_0) &= \mathrm{E}_{y_0|x_0}\mathrm{E}_T\,(y_0 - \hat y_0)^2 \\
&= \mathrm{Var}(y_0 \mid x_0) + \mathrm{E}_T\!\left[\hat y_0 - \mathrm{E}_T\hat y_0\right]^2 + \left[\mathrm{E}_T\hat y_0 - x_0^T\beta\right]^2 \\
&= \mathrm{Var}(y_0 \mid x_0) + \mathrm{Var}_T(\hat y_0) + \mathrm{Bias}^2(\hat y_0) \\
&= \sigma^2 + \mathrm{E}_T\!\left[x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0\right]\sigma^2 + 0^2
\end{aligned}$$

- There is additional variance $\sigma^2$ since the output is nondeterministic; the variance term depends on $x_0$.
- If N is large, we get, averaged over $x_0$:

$$\mathrm{E}_{x_0}\,\mathrm{EPE}(x_0) \approx \sigma^2\,(p/N) + \sigma^2$$

- The variance term is negligible for large N or small $\sigma^2$, and there is no bias: the curse of dimensionality is controlled.
More generally, let $Y = f(X) + \varepsilon$, X uniform, $\varepsilon \sim N(0, 1)$, sample size N = 500, and compare the EPE of 1-NN and least squares at $x_0 = 0$ for $f(x) = x_1$ (linear) and $f(x) = \tfrac{1}{2}(x_1 + 1)^3$ (cubic).

Linear case:
- EPE(least squares) is slightly above 1 (the irreducible error); there is no bias.
- EPE(1-NN) is always above 2 and grows slowly with dimension, as the nearest training point strays from the target.

(Figure: EPE ratio 1-NN / least squares.)
Cubic case, same setup: least squares is now biased, so the EPE ratio 1-NN / least squares is smaller. (Figure: EPE ratio 1-NN / least squares.)
2.6 Statistical Models
NN methods are the direct implementation of the regression function $f(x) = \mathrm{E}(Y \mid X = x)$, but they can fail in two ways:
- With high dimensions, the nearest neighbors need not be close to the target point.
- If special structure exists in the problem, this can be used to reduce variance and bias.
2.6.1 Additive Error Model
Assume the additive error model

$$Y = f(X) + \varepsilon, \qquad \mathrm{E}(\varepsilon) = 0, \quad \varepsilon \text{ often independent of } X$$

- Then Pr(Y|X) depends on X only through the conditional mean f(x).
- This model is a good approximation in many cases.
- In many cases, f(x) is deterministic and error enters through uncertainty in the input. This can often be mapped onto uncertainty in the output with deterministic input.
2.6.2 Supervised Learning
In supervised learning, the learning algorithm modifies its input/output relationship in response to the observed errors $y_i - \hat f(x_i)$. This can be a continuous process.
2.6.3 Function Approximation
- Data: pairs $(x_i, y_i)$, viewed as points in (p+1)-space: $x_i \in \mathbb{R}^p$, $y_i = f(x_i) + \varepsilon_i$. More general input spaces are possible.
- We want a good approximation of f(x) in some region of input space, given the training set T.
- Many models have parameters θ; e.g. for the linear model $f(x) = x^T\beta$, θ = β.
- Linear basis expansions have the more general form

$$f_\theta(x) = \sum_{k=1}^{K} \theta_k h_k(x)$$

Examples:
- Polynomial expansions: $h_k(x) = x_1 x_2^2$
- Trigonometric expansions: $h_k(x) = \cos(x_1)$
- Sigmoid expansion: $h_k(x) = \dfrac{1}{1 + \exp(-x^T\beta_k)}$
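A sketch of evaluating such an expansion, using the slide's example basis functions; the coefficient vector θ and the sigmoid parameter $\beta_k = (1, 1)$ are my own illustrative choices:

```python
import numpy as np

# Linear basis expansion f_theta(x) = sum_k theta_k * h_k(x)
# with one basis function of each kind from the slide.
def f_theta(x, theta):
    h = np.array([
        x[0] * x[1] ** 2,                      # polynomial: h1(x) = x1 * x2^2
        np.cos(x[0]),                          # trigonometric: h2(x) = cos(x1)
        1.0 / (1.0 + np.exp(-(x[0] + x[1]))),  # sigmoid with beta_k = (1, 1)
    ])
    return float(theta @ h)                    # linear in theta

print(f_theta(np.array([0.0, 2.0]), np.array([1.0, 1.0, 0.0])))  # 0*4 + cos(0) = 1.0
```

Note that however nonlinear the $h_k$ are in x, the model stays linear in θ, which is what makes least-squares fitting tractable.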
Approximating f by minimizing the residual sum of squares over θ:

$$\mathrm{RSS}(\theta) = \sum_{i=1}^{N}\left(y_i - f_\theta(x_i)\right)^2$$
Intuition:
- f is a surface in (p+1)-space, of which we observe noisy realizations.
- We want the fitted surface to be as close to the observed points as possible, with distance measured by RSS.
- Methods: closed form if the basis functions have no hidden parameters; iterative otherwise.
Alternative: approximate f by maximizing the likelihood. Assume an independently drawn random sample $y_i$, $i = 1, \ldots, N$, from a probability density $\Pr_\theta(y)$. The log-probability of observing the sample is

$$L(\theta) = \sum_{i=1}^{N} \log \Pr_\theta(y_i)$$

Set θ to maximize L(θ).
Least squares with the additive error model

$$Y = f_\theta(X) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

is equivalent to maximum likelihood with the likelihood function

$$\Pr(Y \mid X, \theta) = N(f_\theta(X), \sigma^2),$$

because in this case the log-likelihood is

$$L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - f_\theta(x_i)\right)^2,$$

whose θ-dependent part is proportional to −RSS(θ): maximizing L(θ) is minimizing RSS(θ).
For a qualitative output G, approximate the regression function Pr(G|X) by maximizing the likelihood. Let the conditional probability of each class given X be

$$\Pr(G = \mathcal{G}_k \mid X = x) = p_{k,\theta}(x), \qquad k = 1, \ldots, K.$$

Then the log-likelihood, also called the cross-entropy, is

$$L(\theta) = \sum_{i=1}^{N} \log p_{g_i,\theta}(x_i).$$
2.7 Structured Regression Models
Problem with regression: minimizing

$$\mathrm{RSS}(f) = \sum_{i=1}^{N}\left(y_i - f(x_i)\right)^2$$

has infinitely many (interpolating) solutions.
- If we have repeated outcomes at each point, we can use them to decrease the variance by better estimating the average.
- Otherwise we must restrict the set of admissible functions to "smooth" functions. The choice of this set is the model choice, a major topic of this course.

Restricting function spaces:
- Choose a function space of low complexity: close to constant, linear, or low-order polynomial in small neighborhoods. The VC dimension is a relevant complexity measure in this context.
- The estimator does averaging or local polynomial fitting.
- The larger the neighborhood, the stronger the constraint.
- The metric used is important, whether explicitly or implicitly defined.
- All such methods run into problems in high dimensions and therefore need metrics that allow neighborhoods to be large in at least some dimensions.
2.8 Classes of Restricted Estimators
Bayesian methods. Formula for joint probabilities:

$$\Pr(X, Y) = \Pr(Y \mid X)\Pr(X) = \Pr(X \mid Y)\Pr(Y)$$

Bayes' formula:

$$\Pr(Y \mid X) = \frac{\Pr(X \mid Y)\Pr(Y)}{\Pr(X)},$$

where Pr(Y) is the prior (probability) for Y and Pr(Y|X) is the posterior (probability) for Y.

Penalized least squares: RSS is penalized with a roughness penalty,

$$\mathrm{PRSS}(f; \lambda) = \mathrm{RSS}(f) + \lambda J(f)$$

- J(f) is large for ragged functions.
- E.g. the cubic smoothing spline is the solution of the least-squares problem

$$\mathrm{PRSS}(f; \lambda) = \sum_{i=1}^{N}\left(y_i - f(x_i)\right)^2 + \lambda \int \left[f''(x)\right]^2 dx$$

- A large second derivative is penalized.
Introducing penalty functions is a type of regularization:
- It works against overfitting.
- It implements beliefs about unseen parts of the problem.

In Bayesian terms, the penalty J is the log-prior (probability distribution) and PRSS is the log-posterior (probability distribution).
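A numeric sketch of the penalized criterion, with the roughness integral approximated by finite differences on a uniform grid; this only evaluates PRSS for a candidate fit (the function name, grid, and data are my own; it is not a smoothing-spline solver):

```python
import numpy as np

# PRSS(f; lam) = RSS + lam * integral of f''(x)^2, with the integral
# approximated by second differences on a uniform grid of spacing dx.
def prss(y, f_at_x, f_grid, dx, lam):
    rss = np.sum((y - f_at_x) ** 2)
    f2 = np.diff(f_grid, n=2) / dx ** 2   # second derivative on the grid
    roughness = np.sum(f2 ** 2) * dx      # approximates ∫ f''(x)^2 dx
    return rss + lam * roughness

grid = np.linspace(0.0, 1.0, 101)
f_grid = grid ** 2                        # candidate fit f(x) = x^2, f'' = 2
y = np.array([0.0, 0.25, 1.0])            # observations at x = 0, 0.5, 1
f_at_x = np.array([0.0, 0.25, 1.0])       # f evaluated at the data points
print(round(prss(y, f_at_x, f_grid, dx=0.01, lam=1.0), 2))  # RSS = 0, penalty ≈ 4
```

Here the interpolating fit has RSS = 0, so the whole criterion comes from the curvature penalty; increasing λ favors flatter candidates.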
2.8.2 Kernel Methods and Local Regression

Kernel functions model the local neighborhoods in NN methods and define the function space used for approximation. The Gaussian kernel

$$K_\lambda(x_0, x) = \frac{1}{\lambda}\exp\!\left(-\frac{\|x - x_0\|^2}{2\lambda}\right)$$

assigns weights to points that die exponentially with the square of the distance from the point $x_0$; λ controls the width (variance).

Simplest kernel estimate: the Nadaraya–Watson weighted average

$$\hat f(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\,y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$

The general local regression estimate of $f(x_0)$ is $f_{\hat\theta}(x_0)$, where $\hat\theta$ minimizes

$$\mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\left(y_i - f_\theta(x_i)\right)^2$$
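A one-dimensional sketch of the Nadaraya–Watson estimate with the Gaussian kernel above (the training data and bandwidth are invented for illustration):

```python
import numpy as np

# Nadaraya-Watson kernel-weighted average in one dimension.
def gaussian_kernel(x0, x, lam):
    return np.exp(-((x - x0) ** 2) / (2.0 * lam)) / lam

def nadaraya_watson(x0, x_train, y_train, lam=0.1):
    w = gaussian_kernel(x0, x_train, lam)          # K_lambda(x0, x_i)
    return np.sum(w * y_train) / np.sum(w)         # weighted average

x_train = np.array([0.0, 0.5, 1.0])
y_train = np.array([0.0, 1.0, 0.0])
print(round(nadaraya_watson(0.5, x_train, y_train), 3))  # ≈ 0.636
```

The estimate at 0.5 is pulled toward the middle observation, with the outer points contributing smoothly decaying weights rather than the hard 0/1 cutoff of k-NN.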
In local regression, $f_\theta$ is a simple function such as a low-order polynomial:
- $f_\theta(x) = \theta_0$: the Nadaraya–Watson estimate
- $f_\theta(x) = \theta_0 + \theta_1 x$: the local linear regression model

NN methods can be regarded as kernel methods with a special metric:

$$K_k(x_0, x) = I\!\left(\|x - x_0\| \le \|x_{(k)} - x_0\|\right),$$

where $x_{(k)}$ is the training sample point ranked k-th in distance from $x_0$ and $I(\cdot)$ is the indicator function.
2.8.3 Basis Functions and Dictionary Methods
These methods include linear and polynomial expansions and more. General form:

$$f_\theta(x) = \sum_{m=1}^{M} \theta_m h_m(x),$$

which is linear in θ. Examples:
- Splines: the parameters are the points of attachment of the polynomials (knots).
- Radial basis functions:
  $$f_\theta(x) = \sum_{m=1}^{M} \theta_m K_{\lambda_m}(\mu_m, x),$$
  with parameters: centroids $\mu_m$ and scales $\lambda_m$.
- Neural networks:
  $$f_\theta(x) = \sum_{m=1}^{M} \beta_m\,\sigma(\alpha_m^T x + b_m), \qquad \sigma(x) = \frac{1}{1 + e^{-x}},$$
  where $\beta_m$ is the neuron weight and σ the neuron output (activation).
2.9 Model Selection
Smoothing and complexity parameters:
- the coefficient λ of the penalty term
- the width of the kernel
- the number of basis functions

The setting of these parameters implements a tradeoff between bias and variance.

Example: k-NN methods. Let $Y = f(X) + \varepsilon$ with $\mathrm{E}(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and assume that the values of the $x_i$ are fixed in advance. Then

$$\begin{aligned}
\mathrm{EPE}_k(x_0) &= \mathrm{E}\!\left[(Y - \hat f_k(x_0))^2 \mid X = x_0\right] \\
&= \sigma^2 + \left[\mathrm{Bias}^2(\hat f_k(x_0)) + \mathrm{Var}_T(\hat f_k(x_0))\right] \\
&= \sigma^2 + \left[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\right]^2 + \frac{\sigma^2}{k}
\end{aligned}$$

- $\sigma^2$ is the irreducible error; the bracketed terms form the mean squared estimation error.
- Small k: low bias, high variance (overfit region); large k: low variance, high bias (underfit region). The generalization error is minimized at an intermediate complexity.