Logistic Regression for Binary Data

Page 1

Logistic Regression for Binary Data

Discriminant analysis: Model f_k(x) and use Bayes' theorem to get p_k(x)

Logistic regression: Model p_k(x) directly — enough to focus on p_1(x) as p_2(x) = 1 − p_1(x).

Assume: The two classes are coded as 0/1 — 1 for ‘success’, 0 for ‘failure.’ Thus, the response Y ∼ Bernoulli(p), where p = P(Y = 1) = E(Y). Letting p depend on the covariate vector X, we get p(x) = P(Y = 1|X = x) = E(Y |X = x).

A common statistical modeling principle: Find a function g of E(Y |X) whose value can be any real number and use

g{E(Y |X = x)} = x^T β.

RHS is linear in β — linear model structure

Linear model: g is identity, i.e., g{E(Y |X)} = E(Y |X)

What is the need for g, aka link function?


Page 2

How to model p(x)?

Linear regression: p(x) = x^T β — identity link

No guarantee that p(x) ∈ (0, 1) for all x

Logistic regression (our focus): logit{p(x)} = x^T β, where

logit{p(x)} = log{p(x) / (1 − p(x))}

is the logit or log-odds function — logit link, i.e., g = logit

p(x) = exp(x^T β) / {1 + exp(x^T β)} — logistic function — always in (0, 1)

As p ↑ in (0, 1), odds ↑ in (0, ∞). Thus, odds close to zero or ∞ indicate very small or very large probabilities.
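To see the link function at work numerically, here is a minimal R sketch (the values of β0, β1, and the x grid are made up purely for illustration):

```r
# Logit (log-odds) link and its inverse, the logistic function
logit    <- function(p) log(p / (1 - p))
logistic <- function(eta) exp(eta) / (1 + exp(eta))

beta0 <- -3; beta1 <- 0.8          # hypothetical coefficients
x   <- seq(-5, 10, by = 0.5)
eta <- beta0 + beta1 * x           # linear predictor, can be any real number
p   <- logistic(eta)               # mapped back into (0, 1)
range(p)                           # strictly between 0 and 1

odds <- p / (1 - p)                # rises from near 0 toward Inf as p increases
all.equal(logit(p), eta)           # the logit undoes the logistic transform
```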


Page 3

FIGURE 4.2. Classification using the Default data. Left: Estimated probability of default using linear regression. Some estimated probabilities are negative! The orange ticks indicate the 0/1 values coded for default (No or Yes). Right: Predicted probabilities of default using logistic regression. All probabilities lie between 0 and 1.

For the Default data, logistic regression models the probability of default. For example, the probability of default given balance can be written as

Pr(default = Yes|balance).

The values of Pr(default = Yes|balance), which we abbreviate p(balance), will range between 0 and 1. Then for any given value of balance, a prediction can be made for default. For example, one might predict default = Yes for any individual for whom p(balance) > 0.5. Alternatively, if a company wishes to be conservative in predicting individuals who are at risk for default, then they may choose to use a lower threshold, such as p(balance) > 0.1.

4.3.1 The Logistic Model

How should we model the relationship between p(X) = Pr(Y = 1|X) and X? (For convenience we are using the generic 0/1 coding for the response). In Section 4.2 we talked of using a linear regression model to represent these probabilities:

p(X) = β0 + β1X. (4.1)

If we use this approach to predict default=Yes using balance, then we obtain the model shown in the left-hand panel of Figure 4.2. Here we see the problem with this approach: for balances close to zero we predict a negative probability of default; if we were to predict for very large balances, we would get values bigger than 1. These predictions are not sensible, since of course the true probability of default, regardless of credit card balance, must fall between 0 and 1. This problem is not unique to the credit default data. Any time a straight line is fit to a binary response that is coded as 0 or 1, the fitted line can produce values below 0 or above 1 for some values of X.

Source: ISL
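A sketch of the comparison in Figure 4.2, assuming the Default data shipped with the ISLR package (data frame and variable names as in ISL; the code itself is illustrative, not the book's):

```r
library(ISLR)                                # provides the Default data frame
y <- as.numeric(Default$default == "Yes")    # 0/1 coding of the response

# Linear probability model: fitted values can escape (0, 1)
lin_fit <- lm(y ~ balance, data = Default)
min(fitted(lin_fit))                         # negative for balances near zero

# Logistic regression: fitted probabilities stay in (0, 1)
log_fit <- glm(default ~ balance, data = Default, family = binomial)
range(fitted(log_fit))
```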

Page 4

Interpreting Logistic Regression Coeffs

Case 1: One continuous predictor X. The model is

log{p(x) / (1 − p(x))} = β0 + β1x — logit is linear in x

Here β0 and β1 are the intercept and slope of the logit function. Thus, β1 is the change in logit when x ↑ by one unit. Since

β1 = β0 + β1(x + 1) − β0 − β1x
   = log{p(x + 1) / (1 − p(x + 1))} − log{p(x) / (1 − p(x))}
   = log(odds ratio), where odds ratio = new odds / old odds,

it follows that new odds = exp(β1) × old odds.

β1 > 0: odds and hence p(x) increase with x

β1 < 0: odds and hence p(x) decrease with x

β1 = 0: p(x) is free of x — X is not useful for predicting Y

If p(x) is small, odds ratio ≈ p(x+ 1)/p(x) — relative risk
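A quick numeric check of this interpretation, with made-up values of β0 and β1:

```r
beta0 <- -5; beta1 <- 0.5                     # hypothetical coefficients
p    <- function(x) exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))
odds <- function(x) p(x) / (1 - p(x))

x <- 1.7                          # any x gives the same ratio
odds(x + 1) / odds(x)             # odds ratio for a one-unit increase in x
exp(beta1)                        # identical: exp(0.5) is about 1.649

p(x + 1) / p(x)                   # p(x) is small here, so the relative risk is close to exp(beta1)
```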


Page 5

Case 2: One categorical predictor X with C levels, coded using C − 1 indicator variables, Z1, . . . , ZC−1. The model is

log{p(z) / (1 − p(z))} = β0 + β1z1 + . . . + βC−1zC−1.

As before, logit for base level = β0 and logit for level j = β0 + βj. Therefore,

βj = logit for level j − logit for base level
   = log(odds ratio), where odds ratio = odds for level j / odds for base level,

meaning odds for level j = exp(βj) × odds for base level.

Odds and hence p(z) are larger than those for base if βj > 0 and they are smaller than those for base if βj < 0

Odds ratio of level j vs k = exp(βj − βk)

No effect of X: β1 = . . . = βC−1 = 0 (simultaneously)

If p(z) is small, odds ratio ≈ relative risk
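As a concrete sketch with a single categorical predictor, one can again assume the ISLR Default data and use student (base level No):

```r
library(ISLR)
fit_s <- glm(default ~ student, data = Default, family = binomial)
coef(fit_s)            # (Intercept) = logit for non-students; studentYes = beta1
exp(coef(fit_s)[2])    # odds ratio: odds of default for students vs the base level
```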


Page 6

Ex: Suppose Y = indicator of lung cancer (1 = yes, 0 = no) and X = smoking history with three levels — never-smoker (base), occasional smoker (z1), and serious smoker (z2). We also know that the proportion p of lung cancer patients in the general population is small. For example, in 2014, there were an estimated 527,228 people living with lung cancer in the US. Therefore, we can interpret an odds ratio as a relative risk. Now, assume that (β1, β2) = (1.1, 3.0). Thus:

odds ratio for occasional vs never smokers = exp(1.1) ≈ 3 =⇒ an occasional smoker is about three times more likely than a never-smoker to get lung cancer.

odds ratio for serious vs never smokers = exp(3.0) ≈ 20 =⇒ a serious smoker is about 20 times more likely than a never-smoker to get lung cancer.

odds ratio for serious vs occasional smokers = exp(3.0 − 1.1) ≈ 6.7 =⇒ a serious smoker is about 7 times more likely than an occasional smoker to get lung cancer.
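The three odds ratios quoted above are just exponentiated coefficients (or coefficient differences):

```r
beta1 <- 1.1; beta2 <- 3.0   # assumed coefficients for occasional and serious smokers
exp(beta1)                   # occasional vs never:   about 3
exp(beta2)                   # serious vs never:      about 20
exp(beta2 - beta1)           # serious vs occasional: about 6.7
```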


Page 7

Case 3: Multiple predictors X1, . . . , Xp — continuous or categorical (represented using indicators) and their interactions. In this case, we proceed in the same way as in the case of the linear model except that

logit{p(x)} = x^T β,

rather than E(Y |x) = x^T β, is the model equation, and

p(x) = exp(x^T β) / {1 + exp(x^T β)}
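A sketch with several predictors, again assuming the ISLR Default data (balance and income are quantitative, student is categorical):

```r
library(ISLR)
fit <- glm(default ~ balance + income + student, data = Default, family = binomial)
summary(fit)$coefficients        # estimates, standard errors, z statistics
exp(coef(fit))                   # exponentiated coefficients = odds ratios

# Estimated p(x) for a new observation (values chosen only for illustration)
newx <- data.frame(balance = 1500, income = 40000, student = "Yes")
predict(fit, newdata = newx, type = "response")
```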


Page 8

Decision Rule Based on Logistic Regression

Estimate β from the training data and plug in to get logit{p̂(x)} = x^T β̂ and p̂(x).

Estimated Bayes classifier predicts class 1 if p̂(x) ≥ 0.5, or equivalently, logit{p̂(x)} ≥ 0 (because logit(0.5) = 0)

Decision boundary:

{x : p̂(x) = 0.5} ≡ {x : logit{p̂(x)} = x^T β̂ = 0}

The decision boundary is linear — just like LDA

Differs from LDA only in fitting — maximum likelihood for logistic regression and method of moments assuming normality for LDA

With quantitative predictors, logistic regression and LDA tend to give similar classification performance

May use cutoffs other than 0.5 for p̂(x) to get specified sensitivity and specificity performance
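The resulting classifier is just a cutoff on p̂(x); a sketch, refitting the hypothetical Default model from the previous page:

```r
library(ISLR)
fit   <- glm(default ~ balance + income + student, data = Default, family = binomial)
p_hat <- predict(fit, type = "response")            # p-hat(x) on the training data

pred <- ifelse(p_hat >= 0.5, "Yes", "No")           # same as checking logit >= 0
table(predicted = pred, observed = Default$default)

# A lower cutoff flags more true defaulters (higher sensitivity, lower specificity)
pred_low <- ifelse(p_hat >= 0.1, "Yes", "No")
table(predicted = pred_low, observed = Default$default)
```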


Page 9

Pros and Cons of Logistic Regression

Pros:

Can be used for both inference (e.g., to select useful predictors) and prediction (whereas LDA and QDA are designed only for prediction)

Works with both quantitative and qualitative predictors (although LDA and QDA are also often used with qualitative predictors)

Does not have any distributional assumptions for predictors (whereas LDA and QDA work under a normality assumption, which makes sense only for quantitative predictors)

Cons:

Unstable if classes are well-separated (in fact, it will fail in case of perfect separation) or if n is small (whereas LDA and QDA do not have this issue)

Can be generalized to K > 2 but LDA and QDA are more common in this case


Page 10

Comparison of Various Classifiers

Classifiers: KNN, LDA, QDA, and logistic regression

KNN is nonparametric (the others have assumptions)

Logistic regression can be used for inference

No single method dominates all others in every situation. When the decision boundary is linear, LDA and logistic regression tend to work well. When the true boundary is moderately non-linear, QDA may be better. For more complicated boundaries, a non-parametric approach such as KNN with optimal K may be better.


Page 11

Takeaways

Logistic regression: Model p(x) directly by assuming a linear model for its logit.

Use for both inference and prediction

Linear decision boundary

Can deal with a non-linear decision boundary by using a non-linear transformation of X as a predictor

Use cross-validation to estimate its test error rate

Common for binary classification


Page 12

Part III:

Tree-Based Methods


Page 13

Tree-Based Methods

Setup: Regression or classification of a response Y based on p predictors X1, . . . , Xp, denoted by X

Regression: Y is quantitative, f(X) = E(Y |X), Ŷ = f̂(X), and accuracy is measured by the residual sum of squares, RSS = ∑_{i=1}^{n} (Yi − Ŷi)².

Data: (Yi, Xi), i = 1, . . . , n from n independent subjects

We will learn some decision-tree methods. In particular:

Classification and Regression Trees (CART): algorithm to grow a tree — divide the predictor space into a set of rectangles (or boxes) and fit a constant in each box

Bagging and random forests (and boosting; not covered here): procedures that grow multiple trees and combine them to produce a single consensus prediction. These methods are competitive with other methods we have seen.

We will begin with regression trees.

Page 14

Some tree terminology: The tree is drawn upside down so that the leaves are at the bottom and the root is at the top.

Internal nodes: points along the tree where the predictor space is split

Branches: segments of the trees that connect the nodes

Terminal nodes: leaves — each terminal node is associated with a region (subset) of the predictor space and together they form a partition of the predictor space

Binary tree: each internal node is split into two branches

Size of a tree: # terminal nodes

Size of a terminal node: # of training observations falling in the corresponding region of the predictor space

Ex: Hitters data: These consist of data on Salary of n = 263 baseball players. We consider 9 predictors, including Years and Hits. The goal is to predict Salary. Since Salary is right-skewed, we will work with log(Salary) as the response.
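A sketch of this fit, assuming the ISLR Hitters data and the rpart package (ISL itself uses the tree package; either gives a similar tree, and the object names hit and fit below are made up):

```r
library(ISLR)
library(rpart)

hit <- Hitters[!is.na(Hitters$Salary), ]   # drop players with missing Salary (263 remain)
hit$logSalary <- log(hit$Salary)           # Salary is in thousands of dollars

fit <- rpart(logSalary ~ Years + Hits, data = hit)
fit                                        # printed splits such as Years < 4.5; pruning (later)
                                           # reduces this to a small tree like Figure 8.1

1000 * exp(5.107)                          # back-transform a leaf mean: roughly $165,000
```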


Page 15

[Figure 8.1 tree: root split Years < 4.5 (left leaf mean 5.11); the right branch splits at Hits < 117.5 (leaf means 6.00 and 6.74).]

FIGURE 8.1. For the Hitters data, a regression tree for predicting the log salary of a baseball player, based on the number of years that he has played in the major leagues and the number of hits that he made in the previous year. At a given internal node, the label (of the form Xj < tk) indicates the left-hand branch emanating from that split, and the right-hand branch corresponds to Xj ≥ tk. For instance, the split at the top of the tree results in two large branches. The left-hand branch corresponds to Years<4.5, and the right-hand branch corresponds to Years>=4.5. The tree has two internal nodes and three terminal nodes, or leaves. The number in each leaf is the mean of the response for the observations that fall there.

8.1.1 Regression Trees

In order to motivate regression trees, we begin with a simple example.

Predicting Baseball Players’ Salaries Using Regression Trees

We use the Hitters data set to predict a baseball player's Salary based on Years (the number of years that he has played in the major leagues) and Hits (the number of hits that he made in the previous year). We first remove observations that are missing Salary values, and log-transform Salary so that its distribution has more of a typical bell-shape. (Recall that Salary is measured in thousands of dollars.)

Figure 8.1 shows a regression tree fit to this data. It consists of a series of splitting rules, starting at the top of the tree. The top split assigns observations having Years<4.5 to the left branch.[1] The predicted salary

[1] Both Years and Hits are integers in these data; the tree() function in R labels the splits at the midpoint between two adjacent values.

Source: ISL

Page 16

FIGURE 8.2. The three-region partition for the Hitters data set from the regression tree illustrated in Figure 8.1.

for these players is given by the mean response value for the players in the data set with Years<4.5. For such players, the mean log salary is 5.107, and so we make a prediction of e^5.107 thousands of dollars, i.e. $165,174, for these players. Players with Years>=4.5 are assigned to the right branch, and then that group is further subdivided by Hits. Overall, the tree stratifies or segments the players into three regions of predictor space: players who have played for four or fewer years, players who have played for five or more years and who made fewer than 118 hits last year, and players who have played for five or more years and who made at least 118 hits last year. These three regions can be written as R1 = {X | Years<4.5}, R2 = {X | Years>=4.5, Hits<117.5}, and R3 = {X | Years>=4.5, Hits>=117.5}. Figure 8.2 illustrates the regions as a function of Years and Hits. The predicted salaries for these three groups are $1,000×e^5.107 = $165,174, $1,000×e^5.999 = $402,834, and $1,000×e^6.740 = $845,346 respectively.

In keeping with the tree analogy, the regions R1, R2, and R3 are known as terminal nodes or leaves of the tree. As is the case for Figure 8.1, decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree. The points along the tree where the predictor space is split are referred to as internal nodes. In Figure 8.1, the two internal nodes are indicated by the text Years<4.5 and Hits<117.5. We refer to the segments of the trees that connect the nodes as branches.

We might interpret the regression tree displayed in Figure 8.1 as follows: Years is the most important factor in determining Salary, and players with less experience earn lower salaries than more experienced players. Given that a player is less experienced, the number of hits that he made in the previous year seems to play little role in his salary. But among players who have been in the major leagues for five or more years, the number of hits made in the previous year does affect salary, and players with more hits tend to earn more.

Sides of the rectangles will be parallel to the axes.

Source: ISL

Page 17

FIGURE 8.3. Top Left: A partition of two-dimensional feature space that could not result from recursive binary splitting. Top Right: The output of recursive binary splitting on a two-dimensional example. Bottom Left: A tree corresponding to the partition in the top right panel. Bottom Right: A perspective plot of the prediction surface corresponding to that tree.

Therefore, a better strategy is to grow a very large tree T0, and then prune it back in order to obtain a subtree. How do we determine the best way to prune the tree? Intuitively, our goal is to select a subtree that leads to the lowest test error rate. Given a subtree, we can estimate its test error using cross-validation or the validation set approach. However, estimating the cross-validation error for every possible subtree would be too cumbersome, since there is an extremely large number of possible subtrees. Instead, we need a way to select a small set of subtrees for consideration.

Cost complexity pruning — also known as weakest link pruning — gives us a way to do just this. Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter α.

Source: ISL

Page 18

Growing a Regression Tree

Partition the predictor space into J distinct and non-overlapping regions, R1, . . . , RJ, each corresponding to a terminal node j = 1, . . . , J. Model the response as a constant cj in each Rj. This gives the mean function

f(X) = ∑_{j=1}^{J} cj I(X ∈ Rj).

Criterion: Minimize wrt Rj and cj, j = 1, . . . , J,

RSS = ∑_{i=1}^{n} (Yi − f(Xi))² = ∑_{j=1}^{J} ∑_{i: Xi ∈ Rj} (Yi − cj)² = ∑_{j=1}^{J} nj Qj,

where nj = # observations in Rj, and

Qj = (1/nj) ∑_{i: Xi ∈ Rj} (Yi − cj)²

is a measure of impurity of node j.

Page 19

For given Rj, the minimum wrt cj occurs when

cj = ave{Yi | Xi ∈ Rj},

implying that Qj is simply the mean squared error (MSE) of node j. Thus, the problem reduces to that of finding the best regions. This is not computationally feasible since it involves partitioning the predictor space into all possible regions, which can have any shape. So we restrict attention to regions that are rectangles (or boxes). However, even this is not computationally feasible. Therefore, we take a recursive binary splitting approach that:

is top-down — begins at the top of the tree where all observations belong to a single region,

successively splits the predictor space into two regions — each split makes two new branches down the tree, and

is greedy — makes the split that is the best at that particular step rather than looking ahead and picking a split that may lead to a better tree in some future step.


Page 20

Fitting algorithm: Start with all the data.

Find the best splitting variable m and split point s, giving the best split at the top of the tree.

Repeat the process, looking for the best split in each of the two regions identified in the previous step.

Continue until a stopping criterion is reached, e.g., the size of each terminal node falls below 5.

Once the regions R1, . . . , RJ have been created, the predicted response Ŷ for a given observation X is simply the average of the training observations in the region in which X falls, i.e.,

Ŷ = ave{Yi | X ∈ Rj}.
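To make the search for the best splitting variable and split point concrete, here is a minimal from-scratch sketch of one greedy step (an exhaustive search over observed values; real implementations such as rpart are far more efficient):

```r
# One greedy step: find the (variable, split point) pair minimizing RSS
best_split <- function(X, y) {
  best <- list(rss = Inf)
  for (var in colnames(X)) {
    for (s in sort(unique(X[[var]]))) {
      left  <- y[X[[var]] <  s]
      right <- y[X[[var]] >= s]
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(variable = var, split = s, rss = rss)
    }
  }
  best
}

# e.g., with the hypothetical Hitters objects from the earlier sketch:
# best_split(hit[, c("Years", "Hits")], hit$logSalary)
```

Recursive binary splitting simply repeats this step within each of the two regions produced by the previous split.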

Issue: The resulting tree may be too complex — too many splits (i.e., too many regions), overfitting the data.


Page 21

Cost-complexity pruning: Grow a very large tree T0 and prune it to obtain a subtree T ⊆ T0. Pruning means collapsing some internal nodes. Let |T| denote the size of the subtree T — # of terminal nodes (or leaves) in T, and α (≥ 0) be a tuning parameter. For a given α, obtain the subtree that minimizes the objective function:

RSS + α|T |.

A penalized RSS criterion

α = 0 =⇒ T = T0 (i.e., no pruning)

α = ∞ =⇒ |T| = 1 (i.e., only one region with all data)

Larger α =⇒ smaller subtree

Have one subtree for each value of α — Tα

Use the weakest link pruning algorithm to obtain Tα

Choose α by cross-validation
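In rpart, the complexity parameter cp plays the role of α (rescaled by the root-node error), so cost-complexity pruning looks like this sketch (the hit data frame is the hypothetical Hitters object built earlier):

```r
library(ISLR); library(rpart)
hit <- Hitters[!is.na(Hitters$Salary), ]
hit$logSalary <- log(hit$Salary)

# Grow a large tree T0 (no complexity penalty, terminal nodes of at least 5 observations)
big <- rpart(logSalary ~ Years + Hits, data = hit,
             control = rpart.control(cp = 0, minbucket = 5))

printcp(big)                    # nested sequence of subtrees, one row per cp (alpha) value
small <- prune(big, cp = 0.05)  # larger cp (alpha) gives a smaller subtree
small
```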


Page 22


Algorithm 8.1 Building a Regression Tree

1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.

2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.

3. Use K-fold cross-validation to choose α. That is, divide the training observations into K folds. For each k = 1, . . . , K:

(a) Repeat Steps 1 and 2 on all but the kth fold of the training data.

(b) Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.

Average the results for each value of α, and pick α to minimize the average error.

4. Return the subtree from Step 2 that corresponds to the chosen value of α.

For each value of α there corresponds a subtree T ⊂ T0 such that

∑_{m=1}^{|T|} ∑_{i: xi ∈ Rm} (yi − ŷRm)² + α|T|     (8.4)

is as small as possible. Here |T| indicates the number of terminal nodes of the tree T, Rm is the rectangle (i.e. the subset of predictor space) corresponding to the mth terminal node, and ŷRm is the predicted response associated with Rm — that is, the mean of the training observations in Rm. The tuning parameter α controls a trade-off between the subtree's complexity and its fit to the training data. When α = 0, then the subtree T will simply equal T0, because then (8.4) just measures the training error. However, as α increases, there is a price to pay for having a tree with many terminal nodes, and so the quantity (8.4) will tend to be minimized for a smaller subtree. Equation 8.4 is reminiscent of the lasso (6.7) from Chapter 6, in which a similar formulation was used in order to control the complexity of a linear model.

It turns out that as we increase α from zero in (8.4), branches get pruned from the tree in a nested and predictable fashion, so obtaining the whole sequence of subtrees as a function of α is easy. We can select a value of α using a validation set or using cross-validation. We then return to the full data set and obtain the subtree corresponding to α. This process is summarized in Algorithm 8.1.

Splitting a qualitative predictor: Assign one or more categories to one branch and the remaining to the other.
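For a predictor with q unordered categories there are 2^(q−1) − 1 such splits (each nonempty proper subset of categories versus its complement, counting mirror images once). A small illustrative sketch, using shortened labels for the four ChestPain categories discussed later:

```python
from itertools import combinations

def candidate_splits(categories):
    """All ways to send a nonempty proper subset of categories left
    (its complement goes right): 2**(q-1) - 1 splits in total."""
    cats = sorted(categories)
    splits = []
    for r in range(1, len(cats)):
        for left in combinations(cats, r):
            right = tuple(c for c in cats if c not in left)
            if left < right:               # keep one of each mirror-image pair
                splits.append((left, right))
    return splits

chest_pain = ["typical", "atypical", "nonanginal", "asymptomatic"]
splits = candidate_splits(chest_pain)
print(len(splits))                         # 2**3 - 1 = 7 candidate splits
```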


Hitters data

Use nine features

Equal split in training and test sets

Grow a large unpruned tree on the training set

Apply cost-complexity pruning: vary α to create subtrees with different sizes (i.e., numbers of terminal nodes)

Perform six-fold CV (six is a factor of 132) to estimate the CV error of the subtrees as a function of α

Compute the test error as a function of α

Plot results as a function of size of the subtree instead of α
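A rough Python analogue of these steps (ISL works in R; the CSV path, the particular nine features, and the column spellings below are assumptions rather than details given in the slides):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor

hitters = pd.read_csv("Hitters.csv").dropna(subset=["Salary"])   # hypothetical path
y = np.log(hitters["Salary"])                                    # log-transform Salary
feature_guess = ["Years", "Hits", "RBI", "PutOuts", "Walks",
                 "Runs", "AtBat", "HmRun", "Assists"]            # nine features (a guess)
X = hitters[feature_guess]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

big = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)
alphas = big.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas

rows = []
for a in alphas:
    t = DecisionTreeRegressor(ccp_alpha=a, random_state=0)
    cv_mse = -cross_val_score(t, X_tr, y_tr, cv=6,
                              scoring="neg_mean_squared_error").mean()
    t.fit(X_tr, y_tr)
    test_mse = float(np.mean((t.predict(X_te) - y_te) ** 2))
    rows.append((t.get_n_leaves(), cv_mse, test_mse))
# plot CV and test MSE against tree size (the first entry of each row)
```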


FIGURE 8.4 (ISL). Regression tree analysis for the Hitters data: the unpruned tree that results from top-down greedy splitting on the training data. The splits involve Years, Hits, Walks, Runs, RBI, and Putouts, and each leaf shows the mean log salary of the training observations falling there.

Figures 8.4 and 8.5 display the results of fitting and pruning a regression tree on the Hitters data, using nine of the features. First, we randomly divided the data set in half, yielding 132 observations in the training set and 131 observations in the test set. We then built a large regression tree on the training data and varied α in (8.4) in order to create subtrees with different numbers of terminal nodes. Finally, we performed six-fold cross-validation in order to estimate the cross-validated MSE of the trees as a function of α. (We chose to perform six-fold cross-validation because 132 is an exact multiple of six.) The unpruned regression tree is shown in Figure 8.4. The green curve in Figure 8.5 shows the CV error as a function of the number of leaves (see the footnote below), while the orange curve indicates the test error. Also shown are standard error bars around the estimated errors. For reference, the training error curve is shown in black. The CV error is a reasonable approximation of the test error.

Footnote: Although CV error is computed as a function of α, it is convenient to display the result as a function of |T|, the number of leaves; this is based on the relationship between α and |T| in the original tree grown to all the training data.

Source: ISL


FIGURE 8.5 (ISL). Regression tree analysis for the Hitters data: the training, cross-validation, and test MSE are shown as a function of the number of terminal nodes in the pruned tree, with standard error bands. The minimum cross-validation error occurs at a tree size of three.

The CV error takes on its minimum for a three-node tree, while the test error also dips down at the three-node tree (though it takes on its lowest value at the ten-node tree). The pruned tree containing three terminal nodes is shown in Figure 8.1.

8.1.2 Classification Trees

A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one. Recall that for a regression tree, the predicted response for an observation is given by the mean response of the training observations that belong to the same terminal node. In contrast, for a classification tree, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs. In interpreting the results of a classification tree, we are often interested not only in the class prediction corresponding to a particular terminal node region, but also in the class proportions among the training observations that fall into that region.

The task of growing a classification tree is quite similar to the task of growing a regression tree. Just as in the regression setting, we use recursive binary splitting to grow a classification tree. However, in the classification setting, RSS cannot be used as a criterion for making the binary splits. A natural alternative to RSS is the classification error rate. Since we plan to assign an observation in a given region to the most commonly occurring class of training observations in that region, the classification error rate is simply the fraction of the training observations in that region that do not belong to the most common class.

CV error is minimized for a 3-node tree

Test error is minimized for a 10-node tree

Source: ISL


FIGURE 8.1 (ISL). For the Hitters data, a regression tree for predicting the log salary of a baseball player, based on the number of years that he has played in the major leagues (Years) and the number of hits that he made in the previous year (Hits). At a given internal node, the label (of the form Xj < tk) indicates the left-hand branch emanating from that split, and the right-hand branch corresponds to Xj ≥ tk. For instance, the split at the top of the tree sends observations with Years < 4.5 to the left-hand branch and Years >= 4.5 to the right-hand branch. The tree has two internal nodes and three terminal nodes, or leaves; the number in each leaf (5.11, 6.00, 6.74) is the mean of the response for the observations that fall there.

8.1.1 Regression Trees

In order to motivate regression trees, we begin with a simple example.

Predicting Baseball Players’ Salaries Using Regression Trees

We use the Hitters data set to predict a baseball player's Salary based on Years (the number of years that he has played in the major leagues) and Hits (the number of hits that he made in the previous year). We first remove observations that are missing Salary values, and log-transform Salary so that its distribution has more of a typical bell shape. (Recall that Salary is measured in thousands of dollars.)

Figure 8.1 shows a regression tree fit to these data. It consists of a series of splitting rules, starting at the top of the tree. The top split assigns observations having Years < 4.5 to the left branch.

Footnote: Both Years and Hits are integers in these data; the tree() function in R labels the splits at the midpoint between two adjacent values.

Source: ISL


Growing a Classification Tree

Setup: Same as before but Y is a qualitative response with K classes, indexed by k = 1, . . . , K.

Recall for a regression setup: RSS = ∑_{j=1}^{J} nj Qj, where the node impurity measure Qj is the MSE in node j. This Qj was used for both growing a tree and pruning it.

Class probability: For a terminal node j, representing the region Rj with nj observations, let

pjk = (1/nj) ∑_{i: xi ∈ Rj} I(Yi = k)

be the proportion of class k observations in region Rj.

Class prediction: The observations in region Rj are classified to the majority class, i.e., the class k for which pjk is maximum.
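A minimal sketch of these two quantities for a single terminal node; the toy response vector is illustrative:

```python
import numpy as np

def node_class_proportions(y_node, classes):
    """p_jk = (1/n_j) * sum_i I(Y_i = k) over observations in region R_j."""
    y_node = np.asarray(y_node)
    return {k: float(np.mean(y_node == k)) for k in classes}

def node_prediction(y_node, classes):
    """Predict the majority class, i.e., the k maximizing p_jk."""
    p = node_class_proportions(y_node, classes)
    return max(p, key=p.get)

# e.g. a node with 7 "Yes" and 4 "No" training observations
y_node = ["Yes"] * 7 + ["No"] * 4
print(node_class_proportions(y_node, ["Yes", "No"]))   # {'Yes': 0.636..., 'No': 0.363...}
print(node_prediction(y_node, ["Yes", "No"]))          # 'Yes'
```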


Node impurity measures for classification: For the jth terminal node, j = 1, . . . , J, we have:

Misclassification error: 1 − max_k pjk, the proportion of observations in Rj that do not belong to the majority class

Gini index: ∑_{k ≠ l} pjk pjl = ∑_{k=1}^{K} pjk (1 − pjk)

Cross-entropy (or deviance): −∑_{k=1}^{K} pjk log(pjk)

The measures are ≥ 0 and take values near zero when one of the pjk is close to one and the others are close to zero

A near-zero value indicates that the node predominantly consists of observations from a single class

The last two tend to be similar and are more sensitive than the first one to changes in the tree structure.

Use either the Gini index or the deviance for tree growing and the misclassification error for pruning (if the final tree will be used for prediction)
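A small sketch computing the three impurity measures from a node's class proportions pjk; the helper names and example proportions are illustrative:

```python
import numpy as np

def misclassification_error(p):
    # 1 - max_k p_jk
    return 1.0 - float(np.max(p))

def gini_index(p):
    # sum_k p_jk * (1 - p_jk)
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def cross_entropy(p):
    # -sum_k p_jk * log(p_jk), with 0 * log(0) taken to be 0
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

for p in ([0.5, 0.5], [0.9, 0.1], [0.99, 0.01]):   # purer nodes => all three measures shrink
    print(misclassification_error(p), gini_index(p), cross_entropy(p))
```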


Ex: Heart data: The data contain a binary outcome HD, an indicator of heart disease (Yes or No), for 303 patients who presented with chest pain. There are 13 predictors, including Age, Sex, and Chol. Cross-validation results in a tree with six terminal nodes.
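A hedged Python sketch of this analysis. Note that scikit-learn, unlike R's tree(), needs the qualitative predictors encoded numerically; the file name and the AHD column label are assumptions about the CSV, not details from the slides:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

heart = pd.read_csv("Heart.csv").dropna()                 # hypothetical file
y = heart["AHD"]                                          # the binary HD outcome (Yes/No)
X = pd.get_dummies(heart.drop(columns=["AHD"]), drop_first=True)

big = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5,
                             random_state=0).fit(X, y)
alphas = big.cost_complexity_pruning_path(X, y).ccp_alphas

# pick alpha (equivalently, tree size) by cross-validated accuracy
cv_acc = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=10).mean() for a in alphas]
best = DecisionTreeClassifier(ccp_alpha=alphas[int(np.argmax(cv_acc))],
                              random_state=0).fit(X, y)
print(best.get_n_leaves())
```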


FIGURE 8.6 (ISL). Heart data. Top: the unpruned classification tree (its splits involve Thal, Ca, MaxHR, RestBP, Chol, ChestPain, Sex, Slope, Age, Oldpeak, and RestECG). Bottom left: cross-validation, training, and test error for different sizes of the pruned tree. Bottom right: the pruned tree corresponding to the minimal cross-validation error.

As noted earlier, a qualitative predictor is split by assigning some of its categories to one branch and the remaining categories to the other branch. In Figure 8.6, some of the internal nodes correspond to splitting qualitative variables. For instance, the top internal node corresponds to splitting Thal. The text Thal:a indicates that the left-hand branch coming out of that node consists of observations with the first value of the Thal variable (normal), and the right-hand node consists of the remaining observations (fixed or reversible defects). The text ChestPain:bc two splits down the tree on the left indicates that the left-hand branch coming out of that node consists of observations with the second and third values of the ChestPain variable, where the possible values are typical angina, atypical angina, non-anginal pain, and asymptomatic.

Source: ISL


Issue: Some terminal nodes in the tree have the same predicted value; see, e.g., the split RestECG < 1 on the bottom right of the unpruned tree. Thus, regardless of whether or not RestECG < 1, the predicted outcome is the same: Yes.

Q: If this split is not performed, will it change the misclassification error rate? A: No

Q: Then, why is the split performed at all? A: It leads to greater node purity. In particular, we can see that:

all 9 of the observations corresponding to the right-hand leaf have the response Yes. If a test observation falls here, we can be pretty certain of the outcome.

7/11 of the observations corresponding to the left-hand leaf have the response Yes. If a test observation falls here, the response is probably Yes, but we are less certain.

The split is performed because it improves (lowers) the Gini index or the deviance, which is what we generally use for growing a tree.
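To see the purity gain numerically, note that the leaf counts quoted above imply the parent node had 16 Yes and 4 No observations; a quick check that the split leaves the misclassification error unchanged while lowering the Gini index:

```python
def gini(yes, no):
    p = yes / (yes + no)
    return 2 * p * (1 - p)          # two-class Gini index

def misclass(yes, no):
    return min(yes, no) / (yes + no)

# parent node: 16 Yes, 4 No; children after the RestECG split: (7 Yes, 4 No) and (9 Yes, 0 No)
print(misclass(16, 4))                                      # 0.20
print((11/20) * misclass(7, 4) + (9/20) * misclass(9, 0))   # 0.20 -> error unchanged
print(gini(16, 4))                                          # 0.32
print((11/20) * gini(7, 4) + (9/20) * gini(9, 0))           # ~0.255 -> purer after the split
```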


Pros and Cons of Trees

Why binary splits? We could instead consider multiway splits, i.e., splits into more than two groups, but these fragment the data too quickly, leaving insufficient data at the next level down.

Trees are easy to explain and they provide a simple display for visualizing the data (even easier to interpret than linear regression)

Trees can handle qualitative predictors without the need to create dummy variables

Trees tend to be unstable: a small change in the data can result in a very different tree

Generally, trees have less predictive accuracy than the other regression and classification approaches we have seen

Techniques such as bagging, random forests, and boosting alleviate these problems of instability and limited predictive accuracy.
