CN700: HST 10.6-10.13

Transcript of CN700: HST 10.6-10.13

Page 1: CN700: HST 10.6-10.13

CN700: HST 10.6-10.13

Neil Weisenfeld (notes were recycled and modified from Prof. Cohen and an unnamed student)

April 12, 2005

Page 2: CN700: HST 10.6-10.13

Robust Loss Functions for Classification

• Loss functions that lead to simple boosting solutions (squared error and exponential) are not always the most robust.

• In classification (with a –1/1 response), the “margin” y·f(x) plays the role that residuals play in regression.

• An incorrect classification has a negative margin (–1·1 or 1·–1).

• Loss criteria should penalize negative margins more heavily (positive margins are correctly classified).

Page 3: CN700: HST 10.6-10.13

Loss functions for 2-class classification

• Exponential and binomial deviance -> monotone continuous approximations to misclassification loss

• Exponential more heavily penalizes strong negatives, while deviance is more balanced.

• Binomial deviance more robust in noisy situations where Bayes error rate is not close to zero.

• Squared error is a poor choice when classification is the goal: its loss starts increasing again for margins greater than 1, so it penalizes points that are already classified correctly with high confidence (see the sketch below).
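A minimal sketch (not from the slides, names illustrative) comparing these 2-class losses as a function of the margin y·f(x), with ±1 coding for y; the deviance is written in its log(1 + exp(−2·margin)) form, which is one common convention:

```python
import numpy as np

def misclassification(margin):      # 0-1 loss: wrong iff margin < 0
    return (margin < 0).astype(float)

def exponential(margin):            # AdaBoost loss: exp(-y f(x))
    return np.exp(-margin)

def binomial_deviance(margin):      # log(1 + exp(-2 y f(x)))
    return np.log1p(np.exp(-2.0 * margin))

def squared_error(margin):          # (y - f(x))^2 = (1 - y f(x))^2 for y in {-1, +1}
    return (1.0 - margin) ** 2

margins = np.linspace(-2, 2, 9)
for name, loss in [("misclass", misclassification), ("exponential", exponential),
                   ("deviance", binomial_deviance), ("squared", squared_error)]:
    print(f"{name:12s}", np.round(loss(margins), 2))
```

Printing these values shows the exponential loss exploding for strongly negative margins while the deviance grows roughly linearly, and the squared error rising again for margins above 1.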

Page 4: CN700: HST 10.6-10.13

Loss functions for K classes

• Bayes classifier:

$$G(x) = \mathcal{G}_k \quad \text{where} \quad k = \arg\max_l \, p_l(x)$$

• If we are not just interested in the assignment, then the class probabilities are of interest. The logistic model generalizes to K classes:

$$p_k(x) = \frac{e^{f_k(x)}}{\sum_{l=1}^{K} e^{f_l(x)}}, \qquad k = 1, \ldots, K$$

• Binomial deviance extends to the K-class multinomial deviance loss function:

$$L(y, p(x)) = -\sum_{k=1}^{K} I(y = \mathcal{G}_k) \log p_k(x) = -\sum_{k=1}^{K} I(y = \mathcal{G}_k) f_k(x) + \log\!\left( \sum_{l=1}^{K} e^{f_l(x)} \right)$$
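A minimal sketch (assumed notation, not from the slides) of the K-class softmax probabilities and the multinomial deviance loss for a single observation:

```python
import numpy as np

def class_probabilities(f):
    """p_k(x) = exp(f_k(x)) / sum_l exp(f_l(x)) for one observation."""
    e = np.exp(f - np.max(f))           # subtract the max for numerical stability
    return e / e.sum()

def multinomial_deviance(y, f):
    """-sum_k I(y = G_k) log p_k(x); y is the index of the true class."""
    return -np.log(class_probabilities(f)[y])

f = np.array([0.2, 1.5, -0.3])          # hypothetical f_k(x) for K = 3 classes
print(class_probabilities(f))
print(multinomial_deviance(1, f))       # small loss: class 1 has the largest f_k
```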

Page 5: CN700: HST 10.6-10.13

Robust loss functions for regression

• Squared error loss too heavily penalizes large absolute residuals |y − f(x)| and is therefore not robust.

• Absolute error is a better choice.

• Huber loss (below) deals well with outliers and is nearly as efficient as least squares for Gaussian errors:

$$L(y, f(x)) = \begin{cases} [\,y - f(x)\,]^2 & \text{for } |y - f(x)| \le \delta, \\ 2\delta\left(|y - f(x)| - \delta/2\right) & \text{otherwise.} \end{cases}$$
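A minimal sketch of the Huber loss above; `delta` is the cutoff between the quadratic and linear regimes:

```python
import numpy as np

def huber(y, fx, delta=1.0):
    r = np.abs(y - fx)
    return np.where(r <= delta,
                    r ** 2,                           # quadratic for small residuals
                    2.0 * delta * (r - delta / 2.0))  # linear for large residuals

y = np.zeros(4)
fx = np.array([0.1, 0.5, 2.0, 10.0])
print(huber(y, fx))   # large residuals are penalized linearly, not quadratically
```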

Page 6: CN700: HST 10.6-10.13

Inserted for completeness

Page 7: CN700: HST 10.6-10.13

Boosting Trees

• Trees partition the space of joint predictor variable values into disjoint regions R_j, with a constant value γ_j assigned to each region.

• A tree can be formally expressed as:

$$T(x; \Theta) = \sum_{j=1}^{J} \gamma_j \, I(x \in R_j), \qquad \text{with parameters } \Theta = \{R_j, \gamma_j\}_1^J$$

• Parameters are found by minimizing the empirical risk:

$$\hat{\Theta} = \arg\min_{\Theta} \sum_{j=1}^{J} \sum_{x_i \in R_j} L(y_i, \gamma_j)$$

Page 8: CN700: HST 10.6-10.13

Boosting Trees

• Finding these parameters is a formidable combinatorial optimization problem.

• Divide it into two parts:
  1. Find γ_j given R_j: typically trivial.
  2. Find R_j: use a greedy, top-down recursive partitioning algorithm (e.g. the Gini index for misclassification loss when growing the tree).

• The boosted tree model is the sum of such trees (from a forward, stagewise algorithm):

$$f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m)$$

Page 9: CN700: HST 10.6-10.13

Boosting Trees

• The boosted tree model is “induced in a forward, stagewise manner”:

$$f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m)$$

• At each step in the procedure, one must solve:

$$\hat{\Theta}_m = \arg\min_{\Theta_m} \sum_{i=1}^{N} L\big(y_i,\, f_{m-1}(x_i) + T(x_i; \Theta_m)\big)$$

• Given the regions R_{jm} at each step, the optimal constants are found as:

$$\hat{\gamma}_{jm} = \arg\min_{\gamma_{jm}} \sum_{x_i \in R_{jm}} L\big(y_i,\, f_{m-1}(x_i) + \gamma_{jm}\big)$$

Page 10: CN700: HST 10.6-10.13

Boosting Trees

• For squared error loss, this is no harder than for a single tree: at each stage you grow the tree that best predicts the current residuals.

• For two-class classification and exponential loss, we get AdaBoost.

• Absolute error or Huber loss for regression, and deviance for classification, would make for robust trees, but there are no simple boosting algorithms for them.

Page 11: CN700: HST 10.6-10.13

Boosting Trees: Numerical Optimization

A variety of numerical optimization techniques exist for finding the solution to this problem. They all work iteratively: the function is approximated by starting from an initial guess and successively adding functions to it, each of which is computed on the basis of the function from the previous iteration.

Page 12: CN700: HST 10.6-10.13

Boosting Trees: Steepest Descent

• Move down the gradient of L(f).

• Very greedy => can get stuck in local minima.

• Unconstrained => can be applied to any system (as long as the gradient can be calculated).

$$f_m = f_{m-1} - \rho_m g_m, \qquad \rho_m = \arg\min_{\rho} L(f_{m-1} - \rho\, g_m) \quad (\rho_m \text{ is the “learning rate”})$$

$$g_{im} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x_i) = f_{m-1}(x_i)}$$
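A minimal sketch (illustrative, not from the slides) of steepest descent on the fitted values f(x_i) themselves, using squared-error loss so the gradient and the exact line search are easy to write down:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])
f = np.zeros_like(y)                       # initial guess f_0 at the fitted values

for m in range(100):
    g = 2.0 * (f - y)                      # g_im = dL/df(x_i) for L = sum (y_i - f(x_i))^2
    if np.allclose(g, 0.0):                # already at the minimum
        break
    rho = np.dot(g, f - y) / np.dot(g, g)  # exact line search for the "learning rate"
    f = f - rho * g

print(np.round(f, 3))                      # matches y exactly on the training points
```

As the next slides note, this only produces values at the training points; gradient boosting instead fits a tree to the negative gradient so that the step generalizes to new x.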

Page 13: CN700: HST 10.6-10.13

Not a Tree Algorithm, so… (notes from the Professor)

• Calculate the gradient and then fit regression trees to it by least squares.

• Advantage: no need to do a linear fit.

• The gradient is taken only w.r.t. the function values at the data points, so it is as if the problem were one-dimensional.

Page 14: CN700: HST 10.6-10.13

Boosting Trees: Gradient Boosting

• But gradient descent operates solely on the training data. One idea: create boosted trees to approximate steps down the gradient:

$$\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \big( -g_{im} - T(x_i; \Theta) \big)^2$$

• Boosting is like gradient descent, but each added tree moves down the loss gradient evaluated at f_{m−1}, and hence approximates the true gradient.

• Each tree is constrained by the previous one, unlike the true gradient.

Page 15: CN700: HST 10.6-10.13

MART (Multiple Additive Regression Trees): Generic Gradient Tree Boosting Algorithm

• 1. Initialize:

$$f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$$

• 2. For m = 1…M:

– A) For i = 1, 2, …, N compute the pseudo-residuals:

$$r_{im} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}$$

– B) Fit a regression tree to the targets r_im, giving terminal regions R_jm, j = 1, …, J_m.

– C) For j = 1, 2, …, J_m:

$$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i,\, f_{m-1}(x_i) + \gamma\big)$$

– D) Update:

$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} \, I(x \in R_{jm})$$

• 3. Output:

$$\hat{f}(x) = f_M(x)$$
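A minimal sketch of this generic algorithm (not the textbook’s or any library’s exact implementation), using squared-error loss so the pseudo-residuals are simply y − f and the region constants of step 2C are the residual means that a fitted regression tree already returns; scikit-learn’s DecisionTreeRegressor serves as the base learner and the parameter names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mart_fit(X, y, M=100, max_leaf_nodes=4):
    f0 = y.mean()                           # step 1: f_0 = arg min_gamma sum_i L(y_i, gamma)
    f = np.full(len(y), f0)
    trees = []
    for m in range(M):                      # step 2
        r = y - f                           # 2A: pseudo-residuals for squared-error loss
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, r)                      # 2B/2C: regions R_jm and constants gamma_jm
        trees.append(tree)
        f = f + tree.predict(X)             # 2D: update f_m on the training data
    return f0, trees

def mart_predict(X, f0, trees):
    return f0 + sum(t.predict(X) for t in trees)    # step 3: f_hat = f_M

# toy usage
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] + 0.1 * rng.normal(size=200)
f0, trees = mart_fit(X, y)
print(np.mean((y - mart_predict(X, f0, trees)) ** 2))   # training MSE
```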

Page 16: CN700: HST 10.6-10.13

Right-Sized Trees for Boosting

• At issue: for single-tree methods we grow deep trees and then prune them. How do we handle this for these complex, multi-tree methods?

• One option: fix the size of every tree in terms of its number of terminal nodes J.

• The number of terminal nodes relates to the degree of coordinate-variable interactions that can be considered.

• Consider the ANOVA (Analysis of Variance) expansion of the “target” function:

$$\eta = \arg\min_{f} \mathrm{E}_{XY}\, L(Y, f(X))$$

• ANOVA expansion:

$$\eta(X) = \sum_{j} \eta_j(X_j) + \sum_{jk} \eta_{jk}(X_j, X_k) + \sum_{jkl} \eta_{jkl}(X_j, X_k, X_l) + \cdots$$

• This yields an approach to boosting trees in which the number of terminal nodes in each individual tree is set to J, where J − 1 is the largest degree of interaction we wish to capture in the data.

Page 17: CN700: HST 10.6-10.13

Effect of Interaction Order

• Just to show how degree of interaction relates to test error in the simple example of 10.2.

• The ideal here is J = 2, so boosting models with J > 2 incur more variance.

• Note J is not the “number of terms”

Page 18: CN700: HST 10.6-10.13

Regularization

• Aside from J, the other meta-parameter of MART is M, the number of iterations.

• Continued iteration usually reduces training risk, but can lead to overfitting.

• One strategy is to estimate M*, the ideal number of iterations, by monitoring prediction risk as a function of M on a validation sample.

• Other regularization strategies follow…

Page 19: CN700: HST 10.6-10.13

Shrinkage

• The idea of shrinkage is to weight the contribution of each tree by a factor ν between 0 and 1. Thus, the MART update rule can be replaced by:

$$f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J} \hat{\gamma}_{jm} \, I(x \in R_{jm})$$

• There is a clear tradeoff between the shrinkage factor ν and M, the number of iterations.

• Lower values of ν require more iterations and longer computation, but favor better test error. The best strategy seems to be to suck it up and set ν low (< 0.1).
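In code, the only change to the MART sketch above is the shrinkage factor (called nu here, an illustrative name) scaling each tree’s contribution in the update:

```python
import numpy as np

def shrunken_update(f_prev, tree_predictions, nu=0.05):
    """f_m(x) = f_{m-1}(x) + nu * sum_j gamma_jm I(x in R_jm), with nu < 1."""
    return f_prev + nu * tree_predictions

f = np.zeros(3)
print(shrunken_update(f, np.array([1.0, -0.5, 2.0])))   # a damped step toward the residuals
```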

Page 20: CN700: HST 10.6-10.13

Shrinkage and Test Error

• Again, the example of 10.2

• The effect is especially pronounced when using the binomial deviance loss measure, but shrinkage always looks nicer.

• HS&T have led us down the primrose path.

Page 21: CN700: HST 10.6-10.13

Penalized Regression

• Taking the set of all possible J-terminal-node trees realizable on the data set as basis functions, the linear model is:

$$f(x) = \sum_{k=1}^{K} \alpha_k T_k(x)$$

• Penalized regression adds a penalty J(α) on the coefficients:

$$\hat{\alpha}(\lambda) = \arg\min_{\alpha} \; \sum_{i=1}^{N} \left( y_i - \sum_{k=1}^{K} \alpha_k T_k(x_i) \right)^{2} + \lambda \cdot J(\alpha)$$

Page 22: CN700: HST 10.6-10.13

Penalized Regression

• Penalties can be, for example, ridge or lasso:

$$J(\alpha) = \sum_{k=1}^{K} \alpha_k^{2} \;\;\text{(ridge)}, \qquad\qquad J(\alpha) = \sum_{k=1}^{K} |\alpha_k| \;\;\text{(lasso)}$$

• However, direct implementation of this procedure is computationally infeasible, because it requires that all possible J-terminal-node trees have been found. Forward stagewise linear regression provides a close approximation to the lasso and is similar to boosting and Algorithm 10.2.
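A minimal sketch (illustrative only) of evaluating the penalized criterion above, with the tree basis precomputed as a matrix T with entries T[i, k] = T_k(x_i):

```python
import numpy as np

def penalized_loss(alpha, y, T, lam, penalty="lasso"):
    resid = y - T @ alpha                                  # y_i - sum_k alpha_k T_k(x_i)
    J = np.sum(np.abs(alpha)) if penalty == "lasso" else np.sum(alpha ** 2)
    return np.sum(resid ** 2) + lam * J

rng = np.random.default_rng(0)
T = rng.normal(size=(50, 10))          # stand-in for the tree basis evaluated on the data
y = T[:, 0] + 0.1 * rng.normal(size=50)
print(penalized_loss(np.zeros(10), y, T, lam=1.0))
```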

Page 23: CN700: HST 10.6-10.13

Forward Stagewise Linear Regression

1. Initialize: $\hat{\alpha}_k = 0$, $k = 1, \ldots, K$. Set $\varepsilon > 0$ to some small constant and M large.

2. For m = 1 to M:

$$(\beta^*, k^*) = \arg\min_{\beta,\, l} \sum_{i=1}^{N} \left( y_i - \sum_{k=1}^{K} \hat{\alpha}_k T_k(x_i) - \beta\, T_l(x_i) \right)^{2}, \qquad \hat{\alpha}_{k^*} \leftarrow \hat{\alpha}_{k^*} + \varepsilon \cdot \mathrm{sign}(\beta^*)$$

3. Output:

$$f_M(x) = \sum_{k=1}^{K} \hat{\alpha}_k T_k(x)$$

Increasing M is like decreasing λ. Many coefficients will remain at zero; the others will tend to have absolute values smaller than their least squares values.
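A minimal sketch of this procedure (not the textbook’s exact algorithm), again with the tree basis precomputed as a matrix T[i, k] = T_k(x_i); eps and M are the small step size and the number of iterations:

```python
import numpy as np

def forward_stagewise(y, T, eps=0.01, M=2000):
    N, K = T.shape
    alpha = np.zeros(K)                           # step 1: all coefficients start at zero
    for m in range(M):                            # step 2
        resid = y - T @ alpha
        num = T.T @ resid                         # inner product of each T_l with the residual
        denom = np.sum(T ** 2, axis=0)
        k_star = np.argmax(num ** 2 / denom)      # basis that best fits the current residual
        alpha[k_star] += eps * np.sign(num[k_star])   # nudge its coefficient by +/- eps
    return alpha                                  # step 3: f_M(x) = sum_k alpha_k T_k(x)

rng = np.random.default_rng(0)
T = rng.normal(size=(100, 8))                     # stand-in for the tree basis
y = 2 * T[:, 1] - 1.5 * T[:, 4] + 0.1 * rng.normal(size=100)
print(np.round(forward_stagewise(y, T), 2))       # most coefficients stay near zero
```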

Page 24: CN700: HST 10.6-10.13

Lasso vs. Forward-Stagewise (but not on trees)

• Just as a demonstration, try this out with the original variables, instead of trees, and compare to the Lasso solutions.

Page 25: CN700: HST 10.6-10.13

Importance of Predictor Variables (IPV)

• Find out which variable reduces the error most.

• Normalize the other variables’ influence w.r.t. this variable.

$$\mathcal{I}_l^2 = \sum_{t=1}^{J-1} \hat{\imath}_t^{\,2} \, I\big(v(t) = l\big), \qquad \hat{l} = \arg\max_l \, \mathcal{I}_l^2$$

where v(t) is the split variable at node t, $\hat{\imath}_t^{\,2}$ is the improvement in squared loss from that split (the extra squared loss incurred when not splitting on that variable), and we consider all nodes but the leaves.
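A minimal sketch (with a hypothetical per-node data structure, not a real library API) of accumulating this importance measure and normalizing it against the most important variable:

```python
import numpy as np

# (split variable v(t), squared-loss improvement i_t^2) for each internal node of one tree;
# the numbers here are made up for illustration.
internal_nodes = [(0, 5.2), (2, 1.1), (0, 0.7), (1, 0.3)]

def relative_importance(internal_nodes, n_vars):
    imp = np.zeros(n_vars)
    for v, i2 in internal_nodes:          # sum i_t^2 over nodes where v(t) = l
        imp[v] += i2
    return 100.0 * imp / imp.max()        # normalize w.r.t. the most important variable

print(relative_importance(internal_nodes, n_vars=3))   # variable 0 reduces the error most
```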

Page 26: CN700: HST 10.6-10.13

IPV: Hints

• To overcome greedy splits, average over many boosted trees

• To prevent masking, where important variables are highly correlated with other important ones, use shrinkage.

Page 27: CN700: HST 10.6-10.13

IPV: Classification

• For K-class classification, fit a function for each class, and see which variables are important within each class.

• If a few variables are important across all classes: 1) laugh your way to the bank, and 2) give me some $ for teaching you this.

Page 28: CN700: HST 10.6-10.13

IPV: Classification

Arrange each variable’s importance in a matrix with p rows and K columns (one per class).

• Columns compare variables within a class.

• Rows compare a variable’s importance between classes.

Page 29: CN700: HST 10.6-10.13

Partial Dependence Plots (PDP’s)

• Problem: After we’ve determined our important variables, how can we visualize their effects?

• Solution 1: Give up and become another (clueless but rich) manager.

• Solution 2: Just pick a few and keep at it (who likes the Bahamas anyway?).

Page 30: CN700: HST 10.6-10.13

What they are in the limit

• PDP (for a model f(x, y), where x is the variable of interest and y stands for the remaining inputs): in the limit, the partial dependence is the expectation over the marginal distribution of y,

$$\bar{f}_{PD}(x) = \mathrm{E}_{Y}\big[f(x, Y)\big] = \int f(x, y)\, p(y)\, dy$$

• Additive and multiplicative effects are recovered up to a constant:

$$\mathrm{E}_{Y}\big[f(x) + g(Y)\big] = f(x) + c, \qquad \mathrm{E}_{Y}\big[f(x)\, g(Y)\big] = f(x) \cdot c$$

• Note that this is not the conditional expectation:

$$\mathrm{E}\big[f(x, Y) \mid X = x\big] = \frac{\int f(x, y)\, p(x, y)\, dy}{\int p(x, y)\, dy}$$
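A minimal sketch (illustrative, not a library function) of the empirical version: fix the chosen variable at each grid value, substitute it into every training row, and average the model’s predictions; the remaining variables keep their observed values, which is what distinguishes this from the conditional expectation above:

```python
import numpy as np

def partial_dependence(model, X, var, grid):
    pd = []
    for value in grid:
        Xv = X.copy()
        Xv[:, var] = value                    # clamp the chosen variable at this grid value
        pd.append(model.predict(Xv).mean())   # average the predictions over the training data
    return np.array(pd)

# usage with any fitted regressor exposing .predict (e.g. a scikit-learn model):
# grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
# print(partial_dependence(model, X, var=0, grid=grid))
```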

Page 31: CN700: HST 10.6-10.13

PDP’s: Conditioning

• To visualize d (> 3) dimensions, condition on a few input variables:
– like looking at slices of the d-dimensional surface;
– set ranges for the conditioning variables if necessary.

• Especially useful when:
– interactions are limited, and
– those variables have additive or multiplicative effects.

Page 32: CN700: HST 10.6-10.13

PDP’s: Finding Interactions

• To find interactions, compare partial dependence plots with their relative importance

• If a variable’s importance is high yet its partial dependence plot appears flat, pair it with another important variable and examine their joint partial dependence; an interaction is likely hiding there.