Supervised Learning: Overview...n Local regression fits piecewise linear models by locally weighted...

1

Supervised Learning: Overview

2

Outline

- Supervised Learning
- Regression
- Classification
- Least Squares and Nearest-Neighbor (N-N) procedures
- Terminology
- Decision Theory Framework
- Local Methods in High Dimensions
- Curse of dimensionality
- Bias/Variance tradeoff

3

Outline-2

- Statistical Models
- Function Approximation
- Basis for function expansion
- Need for Structured Regression Models
- How to restrict function class
- Model Selection and Bias/Variance Tradeoff

4

Terminology

- Input(s) X: measured or preset
  - Predictor variable(s)
  - Independent variable(s)
- Output(s) Y (quantitative) or G (categorical)
  - Response
  - Dependent variable
  - Target
- Types of variables
  - Quantitative (infinite set of values)
  - Categorical (finite set)
    - Group labels
    - Codes (dummy variables)
  - Ordered categorical (no metric), e.g. small, medium, large

5

Regression and Classification

- Both tasks are similar
  - Given the value of an input vector X, make a good prediction of the response Y.
  - Function approximation: $y \sim f(x)$
- Given
  - A set of examples (training set): $(X_i, Y_i),\ i = 1, \dots, n$
  - A performance evaluation criterion, e.g.,
    - Least squares error
    - Classification error
- Find an optimal prediction procedure
  - An algorithm
  - Black box
  - Analytic expression

6

Out of Sample Performance

- Given x, the prediction (forecast) for the response y is $f(x) = \hat{Y}$
  - If $Y \in \mathbb{R}$ (or $G \in \mathcal{G}$), the prediction $\hat{Y}$ (or $\hat{G}$) should lie in $\mathbb{R}$ (or $\mathcal{G}$) too.
- Out-of-(training-)sample performance is crucial
  - A small training set error may imply good overall performance.
  - However, over-fitting (interpolation) to the training sample may lead to really bad overall performance.
- Treat classification as a regression problem, e.g.,
  - Denote the binary coded target as Y.
  - Treat it as a quantitative output.
  - If the predictions $\hat{Y}$ are in [0, 1], then: if $\hat{y} > 0.5$, $\hat{G} = 1$; otherwise $\hat{G} = 0$.

7

Supervised Learning - Classification

- Discriminant Analysis (DA)
  - Linear, Quadratic, Flexible, Penalized, Mixture
- Logistic Regression
- Support Vector Machines (SVM)
- k-Nearest Neighbors (k-NN)
- Adaptive k-NN
- Bayesian Classification
- Genetic Algorithms

8

Supervised Learning – Classification and Regression

- Linear Models, GLM, Kernel methods
- Generalized Additive Models (Hastie & Tibshirani, 1990)
- Decision Trees
  - CART (Classification and Regression Trees) (Breiman et al., 1984)
  - MARS (Multivariate Adaptive Regression Splines) (Friedman, 1991)
  - QUEST (Quick, Unbiased, Efficient Statistical Tree) (Loh, 1997)
- Decision Forests
  - Bagging (Breiman, 1996)
  - Boosting (Freund and Schapire, 1997)
  - MART (Multiple Additive Regression Trees) (Friedman, 1999)
- Neural Networks (Adaptive Non-linear Models)

9

Least Squares and Nearest Neighbors

- Linear model fit by least squares
  - Makes a huge structural assumption: a linear relationship
  - Yields stable but possibly inaccurate predictions
- Method of k-nearest neighbors
  - Makes very mild structural assumptions: points in close proximity in the feature space have similar responses (needs a distance metric)
  - Its predictions are often accurate, but can be unstable

10

Least Squares

- Linear model: $f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$
- Intercept $\beta_0$: called the bias in machine learning
- Include a constant variable in X.
- In matrix notation, $\hat{Y} = X^T \hat{\beta}$, an inner product of $X$ and $\hat{\beta}$.
- In the (p+1)-dimensional input-output space, $(X, \hat{Y})$ is a hyperplane; with the constant included in X, it includes the origin.

11

Least Squares

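- Choose the coefficient vector $\beta$ to minimize the residual sum of squares:
  $RSS(\beta) = \sum_{i=1}^{N} (y_i - f(x_i))^2 = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2$
- Differentiating with respect to $\beta$ (here X is the data matrix and y the response vector): $X^T (y - X \hat{\beta}) = 0$
- If $X^T X$ is non-singular, $\hat{\beta} = (X^T X)^{-1} X^T y$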
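The normal equations can be illustrated with a small, dependency-free sketch. The helper names (`solve_linear_system`, `least_squares`) and the toy data are assumptions made for illustration; a real analysis would normally rely on a numerical linear algebra library rather than hand-written elimination.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is (numerically) singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(xs, ys):
    """Solve (X^T X) beta = X^T y, with a constant 1 prepended to each input."""
    design = [[1.0] + list(x) for x in xs]
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Toy data generated exactly from y = 1 + 2 x (no noise).
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
beta_hat = least_squares(xs, ys)
print(beta_hat)  # intercept first, then slope
```

Because the toy responses lie exactly on a line, the recovered coefficients match the generating intercept and slope up to rounding.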

12

Least Squares-Geometrical Insight

13

Linear model applied to Classification

- Figure 2.1: Classification example.
- The classes are coded as a binary variable (GREEN = 0, RED = 1) and then fit by linear regression.
- The line is the decision boundary defined by $x^T \hat{\beta} = 0.5$.
- The red shaded region denotes the part of input space classified as RED, while the green region is classified as GREEN.

14

Nearest Neighbors

- Nearest-neighbor methods use those observations in the training set closest in input space to x.
- k-NN fit for $\hat{Y}$: $\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$, where $N_k(x)$ is the set of the k training points closest to x.
- k-NN requires a parameter k and a distance metric.
- For k = 1, training error is zero, but test error could be large (saturated model).
- As k increases ($k \uparrow$), training error tends to increase, while test error tends to decrease at first and then increase.
- For a reasonable k, both the training and test errors could be smaller than those of the linear decision boundary.
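A minimal sketch of the k-NN average, assuming a Euclidean distance metric and a tiny made-up training set (`knn_predict`, `train_x`, and `train_y` are illustrative names, not part of the slides). It also shows why the training error is zero for k = 1: each training point is its own nearest neighbor.

```python
import math


def knn_predict(train_x, train_y, x, k):
    """Average the responses of the k training points closest to x."""
    # Indices of training points sorted by Euclidean distance to x.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    neighbors = order[:k]
    return sum(train_y[i] for i in neighbors) / k


# Hypothetical one-dimensional training set with a 0/1 coded response.
train_x = [(0.0,), (1.0,), (2.0,), (3.0,)]
train_y = [0.0, 0.0, 1.0, 1.0]

# With k = 1 the fit reproduces the training responses exactly.
one_nn_fit = [knn_predict(train_x, train_y, x, 1) for x in train_x]

# With k = 3 the prediction at x = 1.4 averages the responses at 1, 2 and 0.
three_nn = knn_predict(train_x, train_y, (1.4,), 3)
print(one_nn_fit, three_nn)
```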

15

NN Example

16

K vs misclassification error

- How to choose k? Cross-validation
- Bayes error: the lowest expected loss, attainable if the underlying joint distribution were known
- Test error could be larger or smaller than the training error.
- Popular methods are variants of one of these two methods (least squares or nearest neighbors).

17

Decision Theory Framework

- Inputs and output are random variables
  - Input (feature) vector: $X \in \mathbb{R}^p$
  - Real-valued output: Y
  - Joint probability distribution Pr(Y, X) is unknown.
- Seek a function $f(X)$ for predicting Y given values of the input X.
- Loss (cost) function L(Y, f(X)): penalizes errors in prediction
- Most common and convenient is squared error loss (L2 loss): $L(Y, f(X)) = (Y - f(X))^2$
- Criterion for choosing f: minimize the expected loss
  $E\, L(Y, f(X)) = E\, (Y - f(X))^2 = \int (y - f(x))^2 \Pr(dx, dy) = ESPE(f)$

18

Decision Theory Framework-2

- Iterated expectation rule: $ESPE(f) = E_X \big[ E_{Y|X} \{ (Y - f(X))^2 \mid X \} \big]$
- Suffices to minimize ESPE pointwise for each X = x: $f(x) = \arg\min_c E_{Y|X} \{ (Y - c)^2 \mid X = x \}$
- Optimal solution: $f(x) = E[Y \mid X = x]$
  - Conditional mean of Y given X = x
  - Regression function
- The best predictor of Y at any point x is the conditional mean, if one wishes to minimize average squared prediction error.
- The optimal solution is criterion dependent
  - Average absolute prediction error (L1 loss) leads instead to the conditional median.
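The dependence of the optimal constant on the loss can be checked numerically. The sketch below is only an illustration under stated assumptions: the five responses stand in for draws of Y at a single x, and a grid search over candidate constants c replaces the analytic argmin.

```python
import statistics

# Hypothetical responses observed at one fixed input x.
ys = [1.0, 2.0, 2.0, 3.0, 10.0]

# Candidate constants c on a grid over [0, 12] with step 0.01.
candidates = [i / 100 for i in range(0, 1201)]


def average_squared_error(c):
    return sum((y - c) ** 2 for y in ys) / len(ys)


def average_absolute_error(c):
    return sum(abs(y - c) for y in ys) / len(ys)


best_l2 = min(candidates, key=average_squared_error)   # squared error (L2) loss
best_l1 = min(candidates, key=average_absolute_error)  # absolute error (L1) loss

print(best_l2, statistics.mean(ys))
print(best_l1, statistics.median(ys))
```

The squared-error minimizer lands on the sample mean and the absolute-error minimizer on the sample median, which is less affected by the outlying response of 10.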

19

Linear Regression versus NN

- Linear regression, assumed model: $f(x) \approx x^T \beta$
- Then $\beta = [E(X X^T)]^{-1} E(X Y)$
- The corresponding solution may not be the conditional mean if our assumption is wrong!
- Not conditioning on x.
- Estimates are based on pooling over all x's, assuming a parametric model for $f(X)$.
- NN methods attempt to estimate the regression function directly, assuming only that the responses for all x's in a small neighborhood are close.
- Typically, we have at most one observation at any particular point, so $\hat{f}(x) = \mathrm{Ave}[\, y_i \mid x_i \in N_k(x) \,]$
- Conditioning at a point is relaxed to conditioning on a region close to the target point x.

20

Linear Regression and NN

- In both approaches, the conditional expectation over the population of x-values has been replaced by the average over the training sample.
  - Empirical Risk Minimization (ERM) principle.
- Least squares assumes f(x) is well approximated by a global linear function [low variance (stable estimates), high bias].
- k-NN only assumes f(x) is well approximated by a locally constant function, adaptable to any situation [high variance (decision boundaries change from sample to sample), low bias].

21

Popular Variations & Enhancements

- Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the 0/1 weights used by k-NN methods.
- In high-dimensional spaces, kernels are modified to emphasize some features more than others
  - Variable (feature) selection
  - Kernel design, possibly a kernel with compact support
- Local regression fits piecewise linear models by locally weighted least squares, rather than fitting constants locally.
- Linear models fit to a basis expansion of the measured inputs allow arbitrarily complex models.
- Projection pursuit and artificial neural network models consist of sums of non-linearly transformed linear models.
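The contrast between smooth kernel weights and the 0/1 weights of k-NN can be sketched as a kernel-weighted average. This is one possible illustration, assuming a Gaussian kernel and a one-dimensional input; the function names and the `bandwidth` parameter are chosen here for the example.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian weight; decreases smoothly with |u|."""
    return math.exp(-0.5 * u * u)


def kernel_smooth(train_x, train_y, x, bandwidth):
    """Kernel-weighted average of the responses around the target point x."""
    weights = [gaussian_kernel((xi - x) / bandwidth) for xi in train_x]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)


# Hypothetical training data lying on the line y = x.
train_x = [0.0, 1.0, 2.0]
train_y = [0.0, 1.0, 2.0]

# At the middle point the two outer points receive equal weight.
middle = kernel_smooth(train_x, train_y, 1.0, bandwidth=0.5)

# A very small bandwidth puts almost all weight on the closest point.
near_zero = kernel_smooth(train_x, train_y, 0.0, bandwidth=0.1)
print(middle, near_zero)
```

The bandwidth plays the role that k plays in k-NN: a larger bandwidth averages over more of the training sample.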

22

Categorical Output G

- $y - f(x)$ is not a meaningful error; need a different loss function.
- When G has K categories, the loss function can be expressed as a K x K matrix with 0 on the diagonal and non-negative entries elsewhere.
- L(k, j) is the cost paid for erroneously classifying an object in class k as belonging to class j.
- 0-1 loss is used most often: all misclassifications cost the same unit amount.
- Expected prediction error: $EPE = E_{(G, X)} [ L(G, \hat{G}(X)) ]$
- As before, it suffices to minimize EPE pointwise: $\hat{G}(x) = \arg\min_{g \in \mathcal{G}} \sum_{k=1}^{K} L(g_k, g) \Pr(g_k \mid X = x)$
- For 0-1 loss, the Bayes classifier is the same as choosing the most probable class under the conditional distribution Pr(G|X). Its error rate is called the Bayes rate. (Not the same as Bayesian inference.)
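The pointwise minimization can be sketched for a single x whose class probabilities are taken as known. The posterior vector and both loss matrices below are invented for illustration; `loss[k][g]` is the cost of predicting class g when the true class is k.

```python
def bayes_classify(posterior, loss):
    """Return the class minimizing expected loss, plus all expected losses.

    posterior[k] = Pr(G = k | X = x); loss[k][g] = cost of predicting g
    when the true class is k.
    """
    n_classes = len(posterior)
    expected = [
        sum(loss[k][g] * posterior[k] for k in range(n_classes))
        for g in range(n_classes)
    ]
    best = min(range(n_classes), key=lambda g: expected[g])
    return best, expected


# Hypothetical conditional class probabilities at one point x.
posterior = [0.5, 0.3, 0.2]

# 0-1 loss: every misclassification costs one unit.
zero_one = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
label_01, risk_01 = bayes_classify(posterior, zero_one)

# Asymmetric loss: missing class 2 is ten times as costly.
asymmetric = [[0, 1, 1], [1, 0, 1], [10, 10, 0]]
label_asym, risk_asym = bayes_classify(posterior, asymmetric)
print(label_01, label_asym)
```

Under 0-1 loss the rule picks the most probable class (index 0), while the asymmetric loss shifts the decision to the least probable but most costly class (index 2).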

23

Bayes Classifier - Example

- Knowing the true joint distribution in the simulated example, we can get the Bayes optimal classifier.
- The k-NN classifier approximates the Bayes solution: a majority vote in a nearest neighborhood amounts to replacing the conditional probability at a point with the conditional probability in a neighborhood, and estimating this probability from the training sample proportion.

24

Classification via regression with dummy response variable

- For the two-class problem, code g by a binary Y (Y = 1 if in group 1, 0 otherwise), followed by squared error loss estimation.
- For the K-class problem, use K dummy variables.
  - Classify to the class j with the largest fitted value.
- The representation is exact, but with linear regression the fitted function may not be positive, and thus not an estimate of the class probability for a given x.
- Modeling Pr(G|X) will be discussed in Chapter 4.
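A small two-class sketch of regression on a dummy response, assuming one input variable so that the least squares line has a simple closed form. The data and the `fit_line` helper are made up for the example; it also shows fitted values falling outside [0, 1].

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a single input variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical inputs with a 0/1 coded class label.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]

intercept, slope = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]

# Classify to class 1 when the fitted value exceeds 0.5.
predicted = [1 if value > 0.5 else 0 for value in fitted]
print(fitted)
print(predicted)
```

The thresholded predictions separate the two groups here, but the smallest fitted value is negative and the largest exceeds 1, so the fitted values cannot be read directly as class probabilities.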

25

Local Methods in High Dimensions

- With a reasonably large set of training data, intuitively we should be able to find a fairly large neighborhood of observations close to any x.
- Could estimate the optimal conditional expectation by averaging the k nearest neighbors.
- In high dimensions, this intuition breaks down. Points are spread sparsely even for very large N.
  - This phenomenon is called the “curse of dimensionality.”
- Inputs uniformly distributed on a unit hypercube in p dimensions
  - The volume of a hypercube in p dimensions with edge length a is $\propto a^p$
- For a hypercubical neighborhood about a target point chosen at random to capture a fraction r of the observations, the expected edge length will be $e_p(r) = r^{1/p}$.

26

Curse of dimensionality

27

Curse of dimensionality-2

- As p increases, even for a very small r, $e_p(r)$ approaches 1 fast.
- To capture 1% of the data for local averaging:
  - For 10 (50) dimensions, 63% (91%) of the range of each variable needs to be used.
  - Such neighborhoods are no longer local.
  - Using a very small r leads to a very small k and a high-variance estimate.
- Consequences of sampling points in high dimensions
  - Sampling uniformly within a unit hypersphere:
    - Most points are close to the boundary of the sample space.
    - Prediction is much more difficult near the edges of the training sample: extrapolation rather than interpolation.
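The edge-length figures follow directly from $e_p(r) = r^{1/p}$. A short check, with `edge_length` as an illustrative helper name:

```python
def edge_length(r, p):
    """Expected edge length of a hypercube capturing a fraction r in p dimensions."""
    return r ** (1.0 / p)


# Fraction of each coordinate's range needed to capture 1% of uniform data.
for p in (1, 2, 10, 50):
    print(p, round(edge_length(0.01, p), 2))
```

For p = 10 and p = 50 this reproduces the 63% and 91% quoted for a 1% neighborhood.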

28

Curse of dimensionality-3

- Sampling density is proportional to $N^{1/p}$
  - Thus if 100 observations in one dimension are dense, the sample size required for the same denseness in 10 dimensions is $100^{10}$ (infeasible!)
  - In high dimensions, all feasible training samples sparsely populate the sample space.
- The bias-variance trade-off for NN methods depends on the complexity of the function, which can grow exponentially with the dimension.
- Consider the examples for two models in Figures 2.7 and 2.8

29

Bias-variance of 1-NN True model complex

30

Bias-variance of 1-NN Small complexity model

31

Linear Model vs 1-NN

32

Summary-NN versus model based prediction

- When its rigid model assumptions hold, the linear model has no bias at all and small variance, while the error in 1-NN is substantially larger.
- If the assumptions are wrong, all bets are off and 1-NN may dominate.
- There is a whole spectrum of models between rigid linear models and flexible 1-NN models, each with its own assumptions and biases. They avoid the exponential growth in complexity of functions in high dimensions by drawing heavily on these assumptions.

33

Statistical Models

- Assume the response follows the model
  - Y = signal + noise
  - Signal = $f(x)$
- In some problems the noise may be zero. Many problems in ML deal with learning a deterministic function, but randomness enters through the x-location of the training points.
- The noise need not be i.i.d., but that assumption is useful for examining the EPSE criterion.
- In general, the conditional distribution Pr(Y|X) can depend on X in a complicated manner, but the additive error model precludes this.
- Additive error models are not meaningful for a qualitative output G.
  - Directly model the target function Pr(G|X), e.g. Bernoulli trials for the two-class problem

34

Supervised Learning

- Function-fitting paradigm in ML
  - Additive error, model $y = f(x) + \varepsilon$
- Supervised learning (learning f by example) through a teacher.
- Observe the system under study, both the inputs and outputs.
- Assemble a training set $T = \{ (x_i, y_i),\ i = 1, \dots, N \}$
- Feed the observed inputs $x_i$ into a learning algorithm, which produces $\hat{f}(x_i)$
- The learning algorithm can modify its input/output relationship in response to the differences between the observed and fitted outputs.
- Upon completion of the process, hopefully the artificial and real outputs will be close enough to be useful for all sets of inputs likely to be encountered in practice.

35

Function Approximation

- In statistics and applied math, the training set is considered as N points in (p+1)-dimensional Euclidean space.
- The function f has the p-dimensional input space as its domain, and is related to the data via the model $y = f(x) + \varepsilon$
- The domain is $\mathbb{R}^p$
- Goal: obtain a useful approximation to f for all x in some region of $\mathbb{R}^p$
- Assume that f is a linear function of the x's, or is expressed in terms of a linear basis expansion: $f_\theta(x) = \sum_{k=1}^{K} h_k(x) \theta_k$
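A linear basis expansion is simply a weighted sum of fixed functions of x. The sketch below evaluates such an expansion for a hypothetical polynomial basis; the basis and the coefficient values are illustrative and not fitted to any data.

```python
def basis_expansion(x, basis_functions, theta):
    """Evaluate f_theta(x) = sum over k of h_k(x) * theta_k."""
    return sum(t * h(x) for h, t in zip(basis_functions, theta))


# Hypothetical polynomial basis: h_1(x) = 1, h_2(x) = x, h_3(x) = x^2.
polynomial_basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]

# Illustrative coefficients (an assumption, not estimates from data).
theta = [1.0, 0.0, -2.0]

value = basis_expansion(3.0, polynomial_basis, theta)  # 1 - 2 * 3^2
print(value)
```

The model is non-linear in x but linear in theta, which is why least squares can still be applied once the basis is fixed.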

36

Bases and Criteria for function estimation

- The basis functions h(.) could be
  - Polynomial (Taylor series expansion)
  - Trigonometric (Fourier expansion)
  - Any other basis (wavelets)
  - Non-linear functions, such as the sigmoid function in neural network models
- Minimize the residual sum of squares (least squares error)
  - Closed-form solution
    - Linear model
    - If the basis functions do not involve any hidden parameters
  - Otherwise, need
    - Iterative methods
    - Numerical (stochastic) optimization

37

Criteria for function estimation

- More general estimation method
  - Maximum likelihood estimation: estimate the parameter so as to maximize the probability of the observed sample, $L(\theta) = \sum_{i=1}^{N} \ln \Pr_\theta(y_i)$
- Least squares for the additive error model with Gaussian noise is the MLE using the conditional likelihood.
- Multinomial likelihood for the regression function Pr(G|X)
  - L is also called the cross-entropy
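The link between least squares and maximum likelihood can be made explicit with a short derivation. It assumes the additive model $Y = f_\theta(X) + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$ and a parametrized regression function $f_\theta$ (the subscript notation is borrowed from the basis expansion slide).

```latex
\begin{align*}
L(\theta)
  &= \sum_{i=1}^{N} \ln \left[ \frac{1}{\sqrt{2\pi}\,\sigma}
     \exp\!\left( -\frac{(y_i - f_\theta(x_i))^2}{2\sigma^2} \right) \right] \\
  &= -\frac{N}{2} \ln(2\pi) - N \ln \sigma
     - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2
\end{align*}
```

Only the last term involves $\theta$, and it equals the residual sum of squares up to a negative constant factor, so maximizing $L(\theta)$ is the same as minimizing $RSS(\theta)$.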

38

Need for Structured Regression Models

- RSS for an arbitrary function f
  - Infinitely many solutions: interpolation with any function passing through the observed points is a solution [over-fitting]
- Any particular solution chosen might be a poor approximation at test points different from the training set.
- With replications at each value of x, the solution interpolates the weighted mean response at each point.
- If N were sufficiently large, so that repeats were guaranteed and densely arranged, these solutions might tend to the conditional expectations.
- In order to obtain useful results for finite N, the class of functions should be restricted.

39

How to restrict the class of estimators?

- The restrictions may be encoded via a parametric representation of f.
- Or built into the learning algorithm
- Different restrictions lead to different unique optimal solutions
  - Infinitely many possible restrictions, so the ambiguity is transferred to the choice of restrictions.
- Generally, most learning methods impose complexity restrictions of some kind
  - Regularity of $\hat{f}(x)$ in small neighborhoods of x in some metric, such as special structure
    - Nearly constant
    - Linear or low-order polynomial behavior
  - Estimate obtained by averaging or fitting in that neighborhood.

40

Restrictions on function class

- Neighborhood size dictates the strength of the constraints
  - The larger the neighborhood, the stronger the constraint, and the more sensitive the solution is to the particular choice of constraint
- The nature of the constraint depends on the metric
  - Directly specified metric and size of neighborhood
    - Kernel and local regression and tree-based methods
  - Splines, neural networks and basis-function methods implicitly define neighborhoods of local behavior

41

Nature of neighborhoods

- Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions (curse of dimensionality).
- All methods that overcome the dimensionality problems have an associated (implicit and adaptive) metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions.

42

Classes of Restricted Estimators

- Roughness penalty and Bayesian methods
  - Penalized RSS
    - $RSS(f) + \lambda J(f)$
    - User-selected functional J(f), large for functions that vary too rapidly over small regions of input space, e.g., cubic smoothing splines
      - J(f) = integral of the squared second derivative
    - $\lambda$ controls the amount of penalty
- Kernel methods and local regression provide estimates of the regression function or conditional expectation by specifying the nature of the local neighborhood
  - Gaussian kernel
  - k-NN metric
  - Could also minimize kernel-weighted RSS
- These methods need to be modified in high dimensions

43

Basis functions and dictionary methods

- Wide variety of flexible models
  - Model for f is a linear expansion of basis functions: $f_\theta(x) = \sum_{k=1}^{K} h_k(x) \theta_k$
- Prescribed basis functions, e.g. a polynomial basis of degree M.
  - For one-dimensional x, a polynomial spline of degree K can be represented by an appropriate sequence of M spline basis functions, determined by M-K knots.
    - Piecewise polynomials of degree K between the knots, joined with continuity of degree K-1 at the knots.

44

Radial Basis functions and Neural Nets

- Symmetric p-dimensional kernels located at particular centroids
  - Gaussian kernel is popular
  - The centroids and scales have to be determined
  - Splines have knots
- Data should dictate these quantities
  - The regression problem becomes a combinatorially hard non-linear problem. Need greedy algorithms or two-stage procedures.
- A single-layer feed-forward neural net model with linear output weights is an adaptive basis function method
  - These are known as dictionary methods, where one has available a possibly infinite set or dictionary D of candidate basis functions to choose from, and the model is built using a search mechanism.

45

Model Selection and Bias-Variance Tradeoff

- Smoothing or complexity parameter to be determined:
  - Multiplier of the penalty term
  - The width of the kernel
  - Number of basis functions
- Can't minimize the residual sum of squares without any restriction, since that will produce an overfitted (interpolating) fit that will not work for the test set.
- The k-NN regression fit $\hat{f}_k(x_0)$, with the training inputs fixed in advance (non-random), illustrates the competing forces in model selection.

46

Model Selection and Bias-Variance Tradeoff

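- Model $Y = f(X) + \varepsilon$ with $E(\varepsilon) = 0$ and $Var(\varepsilon) = \sigma^2$
- EPSE at $x_0$: test or generalization error
  $EPSE_k(x_0) = \sigma^2 + \big[ Bias^2(\hat{f}_k(x_0)) + Var_T(\hat{f}_k(x_0)) \big]$
  $= \sigma^2 + \Big[ f(x_0) - \frac{1}{k} \sum_{l=1}^{k} f(x_{(l)}) \Big]^2 + \frac{\sigma^2}{k}$
  where $x_{(l)}$ denotes the l-th nearest training input to $x_0$.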
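The three terms of the k-NN error decomposition can be evaluated directly when the true function is known. The sketch below assumes a made-up true function $f(x) = x^2$, fixed one-dimensional training inputs, and noise standard deviation 1; the function and variable names are illustrative.

```python
def knn_error_terms(f, train_x, x0, k, sigma):
    """Return (squared bias, variance, expected squared prediction error) at x0.

    Assumes fixed training inputs and additive noise with variance sigma^2.
    """
    # The k training inputs closest to x0.
    nearest = sorted(train_x, key=lambda x: abs(x - x0))[:k]
    mean_fit = sum(f(x) for x in nearest) / k
    bias_sq = (f(x0) - mean_fit) ** 2
    variance = sigma ** 2 / k
    return bias_sq, variance, sigma ** 2 + bias_sq + variance


# Hypothetical setup: true function f(x) = x^2, target point x0 = 0.
train_x = [-2.0, -1.0, 0.0, 1.0, 2.0]
terms = {
    k: knn_error_terms(lambda x: x * x, train_x, 0.0, k, 1.0)
    for k in (1, 3, 5)
}
for k, (bias_sq, variance, total) in terms.items():
    print(k, round(bias_sq, 3), round(variance, 3), round(total, 3))
```

As k grows the variance term $\sigma^2 / k$ shrinks while the squared bias grows, because neighbors farther from $x_0$ have function values further from $f(x_0)$; in this toy setup the total is smallest at k = 3.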
