Supervised Learning: Overview
-
1
Supervised Learning: Overview
-
2
Outline
• Supervised Learning
• Regression
• Classification
• Least Squares and N-N procedures
• Terminology
• Decision Theory Framework
• Local Methods in High Dimensions
• Curse of dimensionality
• Bias/Variance tradeoff
-
3
Outline-2
• Statistical Models
• Function Approximation
• Basis for function expansion
• Need for Structured Regression Models
• How to restrict function class
• Model Selection and Bias/Variance Tradeoff
-
4
Terminology
• Input(s): measured or preset (X)
  • Predictor variable(s)
  • Independent variable(s)
• Output(s) (Y) (G)
  • Response
  • Dependent variable
  • Target
• Types of variables
  • Quantitative {infinite set}
  • Categorical {finite set}
    • Group labels
    • Codes (dummy variables)
  • Ordered (no metric)
    • Small, medium, large
-
5
Regression and Classification
• Both tasks are similar
  • Given the value of an input vector X, make a good prediction of the response Y.
  • Function approximation: y ~ f(x)
• Given
  • A set of examples (training set): $(Y_i, X_i),\ i = 1, \dots, n$
  • A performance evaluation criterion, e.g.,
    • Least squares error
    • Classification error
• Find an optimal prediction procedure
  • An algorithm
  • Black box
  • Analytic expression
-
6
Out of Sample Performance
• Given x, the prediction (forecast) for the response y is $\hat{Y} = f(x)$
  • If $Y \in \mathbb{R}$ ($G \in \mathcal{G}$), then $\hat{Y}$ ($\hat{G}$) should be in $\mathbb{R}$ ($\mathcal{G}$) too
• Out-of-(training)-sample performance is crucial
  • A small training set error may imply good overall performance.
  • However, over-fitting (interpolation) to the training sample may lead to really bad overall performance.
• Treat classification as a regression problem, e.g.,
  • Denote the binary coded target as Y
  • Treat it as a quantitative output
  • If the predictions $\hat{Y}$ are in [0, 1], then: if $\hat{y} > 0.5$, $\hat{G} = 1$; otherwise $\hat{G} = 0$.
-
7
Supervised Learning - Classification
• Discriminant Analysis (DA)
  • Linear, Quadratic, Flexible, Penalized, Mixture
• Logistic Regression
• Support Vector Machines (SVM)
• K-Nearest Neighbors (NN)
• Adaptive k-NN
• Bayesian Classification
• Genetic Algorithms
-
8
Supervised Learning – Classification and Regression
• Linear Models, GLM, Kernel methods
• Generalized Additive Models (Hastie & Tibshirani, 1990)
• Decision Trees
  • CART (Classification and Regression Trees) (Breiman et al., 1984)
  • MARS (Multivariate Adaptive Regression Splines) (Friedman, 1990)
  • QUEST (Quick, Unbiased, Efficient Statistical Tree) (Loh, 1997)
• Decision Forests
  • Bagging (Breiman, 1996)
  • Boosting (Freund and Schapire, 1997)
  • MART (Multiple Additive Regression Trees) (Friedman, 1999)
• Neural Networks (Adaptive Non-linear Models)
-
9
Least Squares and Nearest Neighbors
• Linear model fit by least squares
  • Makes a huge structural assumption: a linear relationship
  • Yields stable but possibly inaccurate predictions
• Method of k-nearest neighbors
  • Makes very mild structural assumptions: points in close proximity in the feature space have similar responses (needs a distance metric)
  • Its predictions are often accurate, but can be unstable
-
10
Least Squares
• Linear model: $f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$
  • Intercept $\beta_0$: called the "bias" in machine learning
  • Include a constant variable in X.
  • In matrix notation, $\hat{Y} = X^T \hat{\beta}$, an inner product of x and $\hat{\beta}$.
  • In the (p+1)-dimensional input-output space, $(X, \hat{Y})$ is a hyperplane, which includes the origin when the constant is included in X.
-
11
Least Squares
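• Choose the coefficient vector to minimize the residual sum of squares RSS(β):
  $\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - f(x_i))^2 = \sum_{i=1}^{N} \Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2$
• Differentiating with respect to β: $X^T (y - X\hat{\beta}) = 0$
• If $X^T X$ is non-singular, $\hat{\beta} = (X^T X)^{-1} X^T y$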
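The normal equations above can be sketched in a few lines of NumPy. This is only an illustration on hypothetical noise-free data; the variable names (`X`, `beta_hat`, `rss`) are assumptions, not part of the slides.

```python
import numpy as np

# Hypothetical toy data: 5 observations, one input, generated exactly by y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Include the constant variable 1 in X so the intercept is part of beta.
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations X^T X beta = X^T y (assumes X^T X is non-singular).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residual sum of squares at the fitted coefficients.
rss = float(np.sum((y - X @ beta_hat) ** 2))

print(beta_hat, rss)
```

Solving the linear system directly, rather than forming the inverse of X^T X, is usually the numerically safer choice.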
-
12
Least Squares-Geometrical Insight
-
13
Linear model applied to Classification
• Figure 2.1: Classification example.
• The classes are coded as a binary variable (GREEN = 0, RED = 1) and then fit by linear regression.
• The line is the decision boundary defined by $x^T\hat{\beta} = 0.5$.
• The red shaded region denotes the part of input space classified as RED, while the green region is classified as GREEN.
-
14
Nearest Neighbors
• Nearest neighbor methods use those observations in the training set closest in the input space to x.
• k-NN fit for $\hat{Y}$: $\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$
• k-NN requires a parameter k and a distance metric.
• For k = 1, the training error is zero, but the test error could be large (saturated model).
• As k ↑, the training error tends to increase, while the test error tends to decrease at first and then increase.
• For a reasonable k, both the training and test errors could be smaller than those of the linear decision boundary.
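The k-NN average can be sketched as follows, assuming Euclidean distance as the metric. The function name `knn_predict` and the four training points are hypothetical.

```python
import numpy as np

def knn_predict(x0, X_train, y_train, k):
    """Average the responses of the k training points closest to x0 (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbors
    return float(np.mean(y_train[nearest]))

# Hypothetical training set with a binary coded response (0/1).
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y_train = np.array([0.0, 0.0, 1.0, 1.0])

y_hat = knn_predict(np.array([0.2, 0.1]), X_train, y_train, k=3)
g_hat = 1 if y_hat > 0.5 else 0              # threshold the fitted value at 0.5
print(y_hat, g_hat)
```

With k = 1, each training point is its own nearest neighbor, which is why the training error is zero in that case.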
-
15
NN Example
-
16
K vs misclassification error
• How to choose k? Cross-validation
• Bayes error: the lowest expected loss, attainable if the underlying joint distribution were known
• Test error could be larger or smaller than the training error.
• Popular methods are variants of one of these two methods (least squares and nearest neighbors)
-
17
Decision Theory Framework
• Inputs and output are random variables
  • INPUTS (FEATURES) vector: $X \in \mathbb{R}^p$
  • Real-valued OUTPUT: Y
  • Joint probability distribution Pr(Y, X) is unknown.
• Seek a function $f(X)$ for predicting Y, given values of the input X.
• Loss (cost) function L(Y, f(X)): penalizes errors in prediction
  • Most common and convenient: squared error loss (L2 loss), $L(Y, f(X)) = (Y - f(X))^2$
• Criterion for choosing f: minimize the expected loss
  $E\,L(Y, f(X)) = E\,(Y - f(X))^2 = \int (y - f(x))^2 \Pr(dx, dy) = \mathrm{ESPE}(f)$
-
18
Decision Theory Framework-2
• Iterated expectation rule: $\mathrm{ESPE}(f) = E_X\bigl[E_{Y|X}\{(Y - f(X))^2 \mid X\}\bigr]$
• Suffices to minimize ESPE pointwise for each X = x: $f(x) = \arg\min_c E_{Y|X}\{(Y - c)^2 \mid X = x\}$
• Optimal solution: $f(x) = E[Y \mid X = x]$
  • Conditional mean of Y given X = x
  • Regression function
• Best predictor of Y for any point x: the conditional mean, if one wishes to minimize the average squared prediction error.
• The optimal solution is criterion dependent
  • Average absolute prediction error (L1 loss) leads to the conditional median instead
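A small numerical check of the pointwise minimization, assuming a handful of hypothetical responses observed at a single x: the constant that minimizes average squared error is the mean, while the one that minimizes average absolute error is the median.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # hypothetical responses at one x
candidates = np.arange(0.0, 101.0, 1.0)      # grid of constant predictions c

# Average loss of each candidate c over the observed responses.
sq_loss = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)
abs_loss = np.abs(y[None, :] - candidates[:, None]).mean(axis=1)

best_l2 = candidates[np.argmin(sq_loss)]     # minimizer under squared error (L2) loss
best_l1 = candidates[np.argmin(abs_loss)]    # minimizer under absolute error (L1) loss

print(best_l2, y.mean())
print(best_l1, np.median(y))
```

The outlier at 100 pulls the L2 solution far more than the L1 solution, which illustrates why the optimal predictor depends on the criterion.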
-
19
Linear Regression versus NN
• Linear regression: assumed model $f(x) \approx x^T\beta$
  • Then $\beta = [E(XX^T)]^{-1} E(XY)$
  • The corresponding solution may not be the conditional mean if our assumption is wrong!
  • Not conditioning on x.
  • Estimates are based on pooling over all x's, assuming a parametric model for $f(X)$.
• NN methods attempt to estimate the regression function, assuming only that the responses for all x's in a small neighborhood are close.
  • Typically, we have at most one observation at any particular point, so $\hat{f}(x) = \mathrm{Ave}[\,y_i \mid x_i \in N_k(x)\,]$
  • Conditioning at a point is relaxed to conditioning on a region close to the target point x.
-
20
Linear Regression and NN
• In both approaches, the conditional expectation over the population of x-values has been substituted by the average over the training sample.
  • Empirical Risk Minimization (ERM) principle.
• Least squares assumes f(x) is well approximated by a global linear function [low variance (stable estimates), high bias].
• k-NN only assumes f(x) is well approximated by a locally constant function, adaptable to any situation [high variance (decision boundaries change from sample to sample), low bias].
-
21
Popular Variations & Enhancements
• Kernel methods use weights that decrease smoothly to zero with the distance from the target point, rather than the 0/1 weights used by k-NN methods.
• In high-dimensional spaces, kernels are modified to emphasize some features more than others
  • [variable (feature) selection]
  • Kernel design: possibly a kernel with compact support
• Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
• Linear models fit to a basis expansion of the measured inputs allow arbitrarily complex models.
• Projection pursuit and artificial neural network models consist of sums of non-linearly transformed linear models.
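To illustrate smoothly decaying weights, here is a minimal sketch of a kernel-weighted average in one dimension. The Gaussian kernel, the `bandwidth` parameter, and the data are assumptions chosen for illustration; the slides do not prescribe a particular kernel.

```python
import numpy as np

def kernel_average(x0, x_train, y_train, bandwidth):
    """Gaussian-kernel weighted average of y_train at the target point x0."""
    weights = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
    return float(np.sum(weights * y_train) / np.sum(weights))

# Hypothetical one-dimensional training data (y = x^2).
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 1.0, 4.0, 9.0, 16.0])

fit_narrow = kernel_average(2.0, x_train, y_train, bandwidth=0.1)   # very local fit
fit_wide = kernel_average(2.0, x_train, y_train, bandwidth=100.0)   # nearly a global mean
print(fit_narrow, fit_wide)
```

A narrow bandwidth behaves like 1-NN at the target point, while a very wide one approaches the overall sample mean; the bandwidth plays the role that k plays in k-NN.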
-
22
Categorical Output G
• y - f(x) is not a meaningful error; need a different loss function.
• When G has K categories, the loss function can be expressed as a K x K matrix with 0 on the diagonal and non-negative entries elsewhere.
  • L(k, j) is the cost paid for erroneously classifying an object in class k as belonging to class j.
  • 0-1 loss is used most often: all misclassifications cost the same unit amount.
• Expected prediction error: $\mathrm{EPE} = E_{(G,X)}[L(G, \hat{G}(X))]$
• As before, it suffices to minimize EPE pointwise:
  $\hat{G}(x) = \arg\min_{g \in \mathcal{G}} \sum_{k=1}^{K} L(g_k, g) \Pr(g_k \mid X = x)$
• For 0-1 loss, the Bayes classifier is the same as choosing the most probable class, using the conditional distribution Pr(G|X). Its error rate is called the Bayes rate. (Not the same as Bayesian inference.)
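The pointwise rule can be sketched for a single x, assuming the conditional class probabilities are known. The posterior vector and the asymmetric loss matrix below are hypothetical numbers, used only to show how the decision changes with the loss.

```python
import numpy as np

def bayes_decision(posterior, loss):
    """Return the class index g minimizing sum_k loss[k, g] * posterior[k]."""
    expected_loss = posterior @ loss         # entry g = expected loss of predicting class g
    return int(np.argmin(expected_loss))

posterior = np.array([0.5, 0.3, 0.2])        # hypothetical Pr(g_k | X = x), K = 3

zero_one = 1.0 - np.eye(3)                   # 0 on the diagonal, unit cost elsewhere
asymmetric = np.array([[0.0, 1.0, 1.0],      # loss[k, j]: cost of classifying class k as j
                       [1.0, 0.0, 1.0],
                       [10.0, 10.0, 0.0]])   # missing class 2 is assumed 10x as costly

print(bayes_decision(posterior, zero_one))    # most probable class under 0-1 loss
print(bayes_decision(posterior, asymmetric))  # shifts toward the costly-to-miss class
```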
-
23
Bayes Classifier - Example
• Knowing the true joint distribution in the simulated example, we can get the Bayes optimal classifier.
• The k-NN classifier approximates the Bayes solution: a majority vote in a nearest neighborhood amounts to replacing the conditional probability at a point with the conditional probability in a neighborhood, and estimating this probability from the training sample proportion.
-
24
Classification via regression with dummy response variable
• For the two-class problem, code g by a binary Y (Y = 1 if in group 1, 0 otherwise), followed by squared error loss estimation.
• For the K-class problem, use K dummy variables.
  • Classify to the class j with the largest fitted value.
• Exact representation, but with linear regression the fitted function may not be positive, and thus may not be an estimate of the class probability for a given x.
• Modeling Pr(G|X) will be discussed in Chapter 4.
-
25
Local Methods in High Dimensions
• With a reasonably large set of training data, intuitively we should be able to find a fairly large neighborhood of observations close to any x
  • Could estimate the optimal conditional expectation by averaging k-nearest neighbors.
• In high dimensions, this intuition breaks down. Points are spread sparsely even for very large N
  • This phenomenon is called the "curse of dimensionality."
• Inputs uniformly distributed on a unit hypercube in p dimensions
  • The volume of a hypercube in p dimensions with edge size a is $\propto a^p$
  • For a hypercubical neighborhood about a target point chosen at random to capture a fraction r of the observations, the expected edge length will be $e_p(r) = r^{1/p}$.
-
26
Curse of dimensionality
-
27
Curse of dimensionality-2
• As p increases, even for a very small r, $e_p(r)$ approaches 1 fast.
  • To capture 1% of the data for local averaging: for 10 (50) dimensions, 63% (91%) of the range of each variable needs to be used.
  • Such neighborhoods are no longer local.
  • Using a very small r leads to a very small k and a high-variance estimate.
• Consequences of sampling points in high dimensions
  • Sampling uniformly within a unit hypersphere: most points are close to the boundary of the sample space.
  • Prediction is much more difficult near the edges of the training sample: extrapolation rather than interpolation.
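The quoted percentages follow directly from the edge-length formula e_p(r) = r^(1/p). A quick check, with the helper name `edge_length` chosen here for illustration:

```python
def edge_length(r, p):
    """Expected edge length of a hypercube capturing a fraction r of uniform data in p dimensions."""
    return r ** (1.0 / p)

e10 = edge_length(0.01, 10)   # 1% of the data, 10 dimensions
e50 = edge_length(0.01, 50)   # 1% of the data, 50 dimensions

print(round(e10, 2), round(e50, 2))
```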
-
28
Curse of dimensionality-3
• Sampling density is proportional to $N^{1/p}$
  • Thus if 100 observations in one dimension are dense, the sample size required for the same denseness in 10 dimensions is $100^{10}$ (infeasible!)
  • In high dimensions, all feasible training samples sparsely populate the sample space.
• The bias-variance trade-off for NN methods depends on the complexity of the function, which can grow exponentially with the dimension.
• Consider the examples for two models in Figures 2.7 and 2.8
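The sample-size claim can be checked with a couple of lines, taking the per-dimension density N^(1/p) as given; `required_sample_size` is a hypothetical helper name.

```python
def required_sample_size(n_one_dim, p):
    """Sample size whose per-dimension density N**(1/p) matches n_one_dim points in one dimension."""
    return n_one_dim ** p

n10 = required_sample_size(100, 10)   # 100**10 observations needed in 10 dimensions
density = n10 ** (1 / 10)             # back to the per-dimension density of about 100

print(n10, density)
```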
-
29
Bias-variance of 1-NN True model complex
-
30
Bias-variance of 1-NN Small complexity model
-
31
Linear Model vs 1-NN
-
32
Summary-NN versus model based prediction
• By relying on rigid model assumptions, the linear model has no bias at all and small variance when those assumptions hold, while the error in 1-NN is substantially larger.
• If the assumptions are wrong, all bets are off and 1-NN may dominate
• There is a whole spectrum of models between rigid linear models and flexible 1-NN models, each with its own assumptions and biases; they avoid the exponential growth in complexity of functions in high dimensions by drawing heavily on these assumptions.
-
33
Statistical Models
• Assume the response follows the model
  • Y = signal + noise
  • Signal = $f(x)$
• In some problems the noise may be zero. Many problems in ML deal with learning a deterministic function, but randomness enters through the x-location of the training points
• Requiring the noise terms to be i.i.d. is not necessary, but is useful for examining the ESPE criterion
• In general, the conditional distribution Pr(Y|X) can depend on X in a complicated manner, but the additive error model precludes this.
• Additive error models are not meaningful for a qualitative output G.
  • Directly model the target function Pr(G|X), e.g., Bernoulli trials for the two-class problem
-
34
Supervised Learning
• Function fitting paradigm in ML
  • Error additive; model $y = f(x) + \varepsilon$
• Supervised learning (learning f by example) through a teacher.
  • Observe the system under study, both the inputs and outputs
  • Assemble a training set $T = \{(x_i, y_i),\ i = 1, \dots, N\}$
  • Feed the observed input $x_i$ into a learning algorithm, which produces $\hat{f}(x_i)$
  • The learning algorithm can modify its input/output relationship in response to the differences between the output and the fitted output.
  • Upon completion of the process, hopefully the artificial and real outputs will be close enough to be useful for all sets of inputs likely to be encountered in practice.
-
35
Function Approximation
• In statistics and applied math, the training set is considered as N points in (p+1)-dimensional Euclidean space
• The function f has the p-dimensional input space as its domain, and is related to the data via the model $y = f(x) + \varepsilon$
  • The domain is $\mathbb{R}^p$
• Goal: obtain a useful approximation to f for all x in some region of $\mathbb{R}^p$
  • Assume that f is a linear function of the x's
  • Expressed in terms of a linear basis expansion: $f_\theta(x) = \sum_{k=1}^{K} h_k(x)\,\theta_k$
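A linear basis expansion is still linear in θ, so it can be fit by ordinary least squares on the transformed inputs. The sketch below assumes a polynomial basis h_k(x) = x^k and hypothetical noise-free data; `poly_basis` is an illustrative name.

```python
import numpy as np

def poly_basis(x, degree):
    """Matrix with columns h_k(x) = x**k for k = 0..degree (one possible choice of basis)."""
    return np.vander(x, degree + 1, increasing=True)

# Hypothetical data from a noise-free quadratic, for illustration only.
x = np.linspace(-1.0, 1.0, 9)
y = 1.0 - 2.0 * x + 3.0 * x ** 2

H = poly_basis(x, degree=2)                          # N x K matrix of basis evaluations
theta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)    # least squares fit of theta

print(theta_hat)
```

Swapping `poly_basis` for a trigonometric or spline basis changes only the construction of `H`; the least squares step stays the same.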
-
36
Bases and Criteria for function estimation
• The basis functions h(.) could be
  • Polynomial (Taylor series expansion)
  • Trigonometric (Fourier expansion)
  • Any other basis (wavelets)
  • Non-linear functions, such as the sigmoid function in neural network models
• Minimize the residual SS (least squares error)
  • Closed-form solution
    • Linear model
    • If the basis functions do not involve any hidden parameters
  • Otherwise, need
    • Iterative methods
    • Numerical (stochastic) optimization
-
37
Criteria for function estimation
• More general estimation method
  • Maximum likelihood estimation: estimate the parameter so as to maximize the probability of the observed sample
    $L(\theta) = \sum_{i=1}^{N} \ln \Pr_\theta(y_i)$
  • Least squares for the additive error model with Gaussian noise is the MLE using the conditional likelihood
  • Multinomial likelihood for the regression function Pr(G|X)
    • L is also called the cross-entropy
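The link between least squares and maximum likelihood can be made explicit. Assuming the additive error model with Gaussian noise of constant variance σ² (the symbols f_θ and σ are notation adopted here), the conditional log-likelihood is:

```latex
\begin{aligned}
Y &= f_\theta(X) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2) \\
L(\theta) &= \sum_{i=1}^{N} \ln \Pr\nolimits_\theta(y_i)
  = -\frac{N}{2}\ln(2\pi) - N\ln\sigma
    - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\bigl(y_i - f_\theta(x_i)\bigr)^2
\end{aligned}
```

Only the last term involves θ, and it equals RSS(θ) up to a negative constant factor, so maximizing L(θ) is the same as minimizing the residual sum of squares.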
-
38
Need for Structured Regression Models
• RSS for an arbitrary function f
  • Infinitely many solutions: any function passing through the observed points is a solution (interpolation) [over-fitting]
  • Any particular solution chosen might be a poor approximation at test points different from the training set.
  • With replications at each value of x, the solution interpolates the weighted mean response at each point.
  • If N were sufficiently large, so that repeats were guaranteed and densely arranged, these solutions might tend to the conditional expectations.
• In order to obtain useful results for finite N, the class of functions should be restricted
-
39
How to restrict the class of estimators?
• The restrictions may be encoded via a parametric representation of f.
• Built into the learning algorithm
• Different restrictions lead to different unique optimal solutions
  • Infinitely many possible restrictions, so the ambiguity is transferred to the choice of restrictions.
• Generally, most learning methods impose complexity restrictions of some kind
  • Regularity of $\hat{f}(x)$ in small neighborhoods of x in some metric, such as special structure
    • Nearly constant
    • Linear or low-order polynomial behavior
  • Estimate obtained by averaging or fitting in that neighborhood.
-
40
Restrictions on function class
• Neighborhood size dictates the strength of the constraints
  • The larger the neighborhood, the stronger the constraint and the more sensitive the solution is to the particular choice of constraint
• Nature of the constraint depends on the metric
  • Directly specified metric and size of neighborhood: kernel and local regression, and tree-based methods
  • Splines, neural networks and basis-function methods implicitly define neighborhoods of local behavior
-
41
Nature of neighborhoods
• Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions (curse of dimensionality).
• All methods that overcome the dimensionality problems have an associated (implicit and adaptive) metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions.
-
42
Classes of Restricted Estimators
• Roughness penalty and Bayesian methods
  • Penalized RSS: $\mathrm{RSS}(f) + \lambda J(f)$
  • The user-selected functional J(f) is large for functions that vary too rapidly over small regions of input space, e.g., cubic smoothing splines
    • J(f) = integral of the squared second derivative
  • $\lambda$ controls the amount of penalty
• Kernel methods and local regression provide estimates of the regression function or conditional expectation by specifying the nature of the local neighborhood
  • Gaussian kernel
  • k-NN metric
  • Could also minimize kernel-weighted RSS
• These methods need to be modified in high dimensions
-
43
Basis functions and dictionary methods
• Wide variety of flexible models
  • Model for f is a linear expansion of basis functions: $f_\theta(x) = \sum_{k=1}^{K} h_k(x)\,\theta_k$
• Prescribed basis functions, e.g., polynomial basis of degree M.
  • For one-dimensional x, a polynomial spline of degree K can be represented as an appropriate sequence of M spline basis functions, determined by M-K knots.
  • Piecewise polynomials of degree K between the knots, joined with continuity of degree K-1 at the knots.
-
44
Radial Basis functions and Neural Nets
• Symmetric p-dimensional kernels located at particular centroids
  • Gaussian kernel is popular
  • The centroids and scales have to be determined
  • Splines have knots
• Data should dictate these quantities
  • The regression problem becomes a combinatorially hard non-linear problem. Need greedy algorithms or two-stage procedures
• A single-layer feed-forward neural net model with linear output weights is an adaptive basis function method
  • Such methods are known as dictionary methods, where one has available a possibly infinite set or dictionary D of candidate basis functions to choose from, and the model is built using a search mechanism.
-
45
Model Selection and Bias-Variance Tradeoff
• Smoothing or complexity parameter to be determined:
  • Multiplier of the penalty term
  • The width of the kernel
  • Number of basis functions
• Can't minimize the residual SS without any restriction, since that will produce an overfitted (interpolated) model that will not work for the test set.
• The k-NN fitted regression $\hat{f}_k(x_0)$, with the training inputs fixed in advance (non-random), illustrates the competing forces in model selection.
-
46
Model Selection and Bias-Variance Tradeoff
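• Model $Y = f(X) + \varepsilon$ with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$
• ESPE at $x_0$ (test or generalization error):
  $\mathrm{ESPE}_k(x_0) = \sigma^2 + \bigl[\mathrm{Bias}^2(\hat{f}_k(x_0)) + \mathrm{Var}_T(\hat{f}_k(x_0))\bigr]$
  $= \sigma^2 + \Bigl[f(x_0) - \frac{1}{k}\sum_{l=1}^{k} f(x_{(l)})\Bigr]^2 + \frac{\sigma^2}{k}$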
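The three terms of the k-NN decomposition can be evaluated directly when the true function is known. The sketch below assumes fixed one-dimensional training inputs and a hypothetical true function f(x) = x²; `knn_error_terms` is an illustrative name.

```python
import numpy as np

def knn_error_terms(x0, x_train, f, sigma, k):
    """Return (irreducible error, squared bias, variance) of a k-NN fit at x0 with fixed inputs."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]            # the k nearest training inputs
    bias_sq = (f(x0) - np.mean(f(x_train[nearest]))) ** 2     # [f(x0) - average of f over neighbors]^2
    variance = sigma ** 2 / k                                  # noise averaged over k neighbors
    return sigma ** 2, float(bias_sq), variance

x_train = np.array([-2.0, -1.0, 0.5, 1.0, 2.0])   # hypothetical fixed training inputs
f = lambda x: x ** 2                               # hypothetical true regression function

terms_k1 = knn_error_terms(0.0, x_train, f, sigma=1.0, k=1)
terms_k5 = knn_error_terms(0.0, x_train, f, sigma=1.0, k=5)
print(terms_k1, sum(terms_k1))
print(terms_k5, sum(terms_k5))
```

Going from k = 1 to k = 5 lowers the variance term from σ² to σ²/5 but raises the squared bias, since more distant neighbors are averaged in.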