Lecture 7 Statistical Lecture- Tree Construction.

Transcript of Lecture 7 Statistical Lecture- Tree Construction.

Page 1: Lecture 7 Statistical Lecture- Tree Construction.

Lecture 7

Statistical Lecture - Tree Construction

Page 2: Lecture 7 Statistical Lecture- Tree Construction.

Example

• Yale Pregnancy Outcome Study

• Study subjects: women who made a first prenatal visit (產檢, prenatal checkup) between May 12, 1980, and March 12, 1982

• Total number: 3861

• Question: Which pregnant women are most at risk of preterm deliveries?

Page 3: Lecture 7 Statistical Lecture- Tree Construction.

Variable                      Label  Type        Range/levels
================================================
Maternal age                  x1     Continuous  13-46
Marital status                x2     Nominal     Currently married, widowed, never married
Race                          x3     Nominal     White, Black, Hispanic, Asian, Others
Marijuana use                 x4     Nominal     Yes, no
Times of using marijuana      x5     Ordinal     >=5, 3-4, 2, 1 (daily); 4-6, 1-3 (weekly); 2-3, 1, <1 (monthly)
Years of education            x6     Continuous  4-27
Employment                    x7     Nominal     Yes, no

Page 4: Lecture 7 Statistical Lecture- Tree Construction.

Variable                      Label  Type        Range/levels
===============================================
Smoker                        x8     Nominal     Yes, no
Cigarettes smoked             x9     Continuous  0-66
Passive smoking               x10    Nominal     Yes, no
Gravidity                     x11    Ordinal     1-10
Hormones/DES used by mother   x12    Nominal     None, hormones, DES, both, uncertain
Alcohol (oz/day)              x13    Ordinal     0-3
Caffeine (mg)                 x14    Continuous  12.6-1273
Parity                        x15    Ordinal     0-7

Page 5: Lecture 7 Statistical Lecture- Tree Construction.
Page 6: Lecture 7 Statistical Lecture- Tree Construction.

The Elements of Tree Construction (1)

• Three layers of nodes

• The first layer is the unique root node, the circle on the top

• One internal (the circle) node is in the second layer, and three terminal (the boxes) nodes are respectively in the second and third layers

• The root node can also be regarded as an internal node

Page 7: Lecture 7 Statistical Lecture- Tree Construction.

The Elements of Tree Construction (2)

• Both the root and the internal nodes are partitioned into two nodes in the next layer that are called left and right daughter nodes

• The terminal nodes do not have offspring nodes

Page 8: Lecture 7 Statistical Lecture- Tree Construction.

Three Questions

• What are the contents of the nodes?

• Why and how is a parent node split into two daughter nodes?

• When do we declare a terminal node?

Page 9: Lecture 7 Statistical Lecture- Tree Construction.

Node (1)

• The root node contains the sample of subjects from which the tree is grown

• Those subjects constitute the so-called learning sample

• The learning sample can be the entire study sample or a subset of it

• From our example, the root node contains all 3861 pregnant women

Page 10: Lecture 7 Statistical Lecture- Tree Construction.

Node (2)

• All nodes in the same layer constitute a partition of the root node

• The partition becomes finer and finer as the layer gets deeper and deeper

• Every node in a tree is a subset of the learning sample

Page 11: Lecture 7 Statistical Lecture- Tree Construction.

Example (1)

• Let a dot denote a preterm delivery and a circle stand for a term delivery

• The two coordinates represent two covariates, x1 (age) and x13 (daily alcohol consumption)

• We can draw two line segments to separate the dots from the circles

Page 12: Lecture 7 Statistical Lecture- Tree Construction.

Example (2)

Three disjoint regions:

(I) x13 ≤ c2

(II) x13 > c2 and x1 ≤ c1

(III) x13 > c2 and x1 > c1

Region I is not divided by x1, and regions I and II are identical in response but described differently (a small sketch follows).
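As a minimal illustration, the three-region assignment can be written directly from these conditions. This is only a sketch in Python; c1 and c2 are the unspecified cutoffs for age and alcohol, and the values used in the example call are arbitrary.

```python
def region(x1, x13, c1, c2):
    """Assign a subject to one of the three disjoint regions above.
    x1 = age, x13 = daily alcohol; c1, c2 are the chosen cutoffs."""
    if x13 <= c2:
        return "I"                      # not divided further by x1
    return "II" if x1 <= c1 else "III"  # x13 > c2, split by age

print(region(x1=30, x13=0.5, c1=32, c2=1.0))  # arbitrary illustrative values -> "I"
```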

Page 13: Lecture 7 Statistical Lecture- Tree Construction.

Recursive Partitioning (1)

• The purpose is to make the terminal nodes homogeneous in the sense that each contains either only dots or only circles

• The two internal nodes are heterogeneous because they contain both dots and circles

• All pregnant women older than a certain age and drinking more than a certain amount of alcohol daily deliver preterm infants

• This would show an ideal association of preterm delivery to the age and alcohol consumption

Page 14: Lecture 7 Statistical Lecture- Tree Construction.

Recursive Partitioning (2)

• Complete homogeneity of terminal nodes is an ideal

• The numerical objective of partitioning is to make the contents of the terminal nodes as homogeneous as possible

• Node impurity can indicate the extent of node homogeneity

Page 15: Lecture 7 Statistical Lecture- Tree Construction.

Recursive Partitioning (3)

Consider the ratio: (number of women having a preterm delivery in a node) / (total number of women in the node)

The closer this ratio is to 0 or 1, the more homogeneous is the node

Page 16: Lecture 7 Statistical Lecture- Tree Construction.

Splitting a Node

• We focus on the root node, since the same process applies to the partition of any node

• All allowable splits are considered for the predictor variables

• The continuous variables should be appropriately categorized

Page 17: Lecture 7 Statistical Lecture- Tree Construction.

Example (1)

• The variable of age has 32 distinct values in the range of 13 to 46

• It may result in 32−1=31 allowable splits

• One split can be whether or not age is more than 35 years (x1>35)

• For an ordinal or a continuous predictor, the number of allowable splits is one fewer than the number of its distinctly observed values

Page 18: Lecture 7 Statistical Lecture- Tree Construction.

Example (2)

• There are 153 different levels of daily caffeine intake ranging from 0 to 1273

• We can split the root node in 152 different ways based on caffeine intake

• Splitting nominal predictors is more complicated

• Any nominal variable that has k levels contributes 2^(k−1) − 1 allowable splits

• x3 denotes 5 ethnic groups, which yields 2^(5−1) − 1 = 15 allowable splits, enumerated in the table on the next page and in the sketch below
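A small sketch in Python of how these allowable splits can be counted and enumerated; the five race levels are taken from the table that follows, and the function and variable names are illustrative.

```python
from itertools import combinations

def nominal_splits(levels):
    """All allowable splits of a nominal variable with k levels: each split sends a
    nonempty proper subset to the left daughter node and its complement to the right.
    Complementary pairs describe the same split, so there are 2**(k-1) - 1 of them."""
    splits = []
    for size in range(1, len(levels)):
        for left in combinations(levels, size):
            right = tuple(l for l in levels if l not in left)
            if (right, left) not in splits:   # skip the mirror image of a split already listed
                splits.append((left, right))
    return splits

races = ["White", "Black", "Hispanic", "Asian", "Others"]
print(len(nominal_splits(races)))             # 2**(5-1) - 1 = 15

# For an ordinal or continuous predictor, the count is one fewer than the number of
# distinct observed values; e.g. 32 distinct ages give 31 allowable splits.
distinct_ages = list(range(13, 45))           # placeholder for 32 distinct observed ages
print(len(distinct_ages) - 1)                 # 31
```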

Page 19: Lecture 7 Statistical Lecture- Tree Construction.

Left daughter node Right daughter node

White Black,Hispanic,Asian,Others

Black White,Hispanic,Asian,Others

Hispanic White,Black,Asian,Others

Asian White,Black,Hispanic,Others

White,Black Hispanic,Asian,Others

White,Hispanic Black,Asian,Others

White,Asian Black,Hispanic,Others

Black,Hispanic White,Asian,Others

Black,Asian White, Hispanic,Others

Hispanic,Asian White,Black,Others

Black,Hispanic,Asian White,Others

White,Hispanic,Asian Black,Others

White,Black,Asian Hispanic,Others

White,Black,Hispanic Asian,Others

White,Black,Hispanic,Asian Others

Page 20: Lecture 7 Statistical Lecture- Tree Construction.

Example (3)

• We have 347 possible ways to divide the root into two subnodes from 15 predictors

• The total number of the allowable splits is usually not small

• How do we select one or several preferred splits?

• We must define the goodness of a split

Page 21: Lecture 7 Statistical Lecture- Tree Construction.

Goodness of a Split (1)

• What we want is a split that results in two pure (or homogeneous) daughter nodes

• The goodness of a split must weigh the homogeneities (or the impurities) in the two daughter nodes

Page 22: Lecture 7 Statistical Lecture- Tree Construction.

Goodness of a Split (2)

Take age as a tentative splitting covariate and consider its cutoff at c, as a result of the question "Is x1 > c?"

                          Term   Preterm
Left Node (τL)   x1 ≤ c   n11    n12      n1.
Right Node (τR)  x1 > c   n21    n22      n2.
                          n.1    n.2

Page 23: Lecture 7 Statistical Lecture- Tree Construction.

Goodness of a Split (3)

• Let Y=1 if a woman has a preterm delivery and Y=0 otherwise

• We estimate P{Y=1|τL}=n12/n1. and P{Y=1|τR}=n22/n2.

• The notion of entropy impurity in the left daughter node is

Page 24: Lecture 7 Statistical Lecture- Tree Construction.

Goodness of a Split (4)

• The entropy impurity in the left daughter node is

  i(τL) = −(n11/n1.) log(n11/n1.) − (n12/n1.) log(n12/n1.)

• The impurity in the right daughter node is

  i(τR) = −(n21/n2.) log(n21/n2.) − (n22/n2.) log(n22/n2.)

Page 25: Lecture 7 Statistical Lecture- Tree Construction.

Goodness of a Split (5)

The goodness of a split s of node τ is measured by

  ΔI(s, τ) = i(τ) − P{τL} i(τL) − P{τR} i(τR),

where τ is the parent of τL and τR, and P{τL} and P{τR} are the probabilities that a subject falls into nodes τL and τR.

Page 26: Lecture 7 Statistical Lecture- Tree Construction.

Goodness of a Split (5)

P{τL} = n1. / (n1. + n2.)

P{τR} = n2. / (n1. + n2.)

Page 27: Lecture 7 Statistical Lecture- Tree Construction.

Goodness of a Split (6)

If we take c=35 as the age threshold,

                 Term   Preterm   Total
Left Node (τL)   3521   198       3719
Right Node (τR)  135    7         142
Total            3656   205       3861

Page 28: Lecture 7 Statistical Lecture- Tree Construction.

Goodness of a Split (7)

i(τL) = −(3521/3719) log(3521/3719) − (198/3719) log(198/3719) = 0.2079

i(τR) = −(135/142) log(135/142) − (7/142) log(7/142) = 0.1964

i(τ) = −(3656/3861) log(3656/3861) − (205/3861) log(205/3861) = 0.20753

ΔI(s, τ) = i(τ) − P{τL} i(τL) − P{τR} i(τR) ≈ 0.00001 (that is, 1000Δ ≈ 0.01, the entry for split value 35 in the table that follows)
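The computation above can be checked with a short Python sketch, using the counts from the 2×2 table for the split at age 35 and natural logarithms, as in the impurity values shown; the function name is illustrative.

```python
import math

def entropy_impurity(n_event, n_nonevent):
    """i(tau) = -p*log(p) - (1-p)*log(1-p), with p the proportion of preterm deliveries."""
    n = n_event + n_nonevent
    i = 0.0
    for count in (n_event, n_nonevent):
        p = count / n
        if p > 0:
            i -= p * math.log(p)
    return i

# Counts (preterm, term) from the table for the question "Is x1 > 35?"
i_root  = entropy_impurity(205, 3656)    # ~0.20753
i_left  = entropy_impurity(198, 3521)    # ~0.2079
i_right = entropy_impurity(7, 135)       # ~0.1964

p_left, p_right = 3719 / 3861, 142 / 3861
delta = i_root - p_left * i_left - p_right * i_right
print(round(1000 * delta, 2))            # ~0.01, the tabulated 1000*goodness at split value 35
```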

Page 29: Lecture 7 Statistical Lecture- Tree Construction.

Split value   Left node impurity   Right node impurity   1000*Goodness of the split (1000Δ)

13 0.00000 0.20757 0.01

14 0.00000 0.20793 0.14

15 0.31969 0.20615 0.17

16 0.27331 0.20583 0.13

17 0.27366 0.20455 0.23

18 0.31822 0.19839 1.13

19 0.30738 0.19508 1.40

20 0.28448 0.19450 1.15

21 0.27440 0.19255 1.15

22 0.26616 0.18965 1.22

Page 30: Lecture 7 Statistical Lecture- Tree Construction.

Split value   Left node impurity   Right node impurity   1000*Goodness of the split (1000Δ)

23 0.25501 0.18871 1.05

24 0.25747 0.18195 1.50

25 0.24160 0.18479 0.92

26 0.23360 0.18431 0.72

27 0.22750 0.18344 0.58

28 0.22109 0.18509 0.37

29 0.21225 0.19679 0.06

30 0.20841 0.20470 0.00

31 0.20339 0.22556 0.09

32 0.20254 0.23871 0.18

Page 31: Lecture 7 Statistical Lecture- Tree Construction.

Split value   Left node impurity   Right node impurity   1000*Goodness of the split (1000Δ)

33 0.20467 0.23524 0.09

34 0.20823 0.19491 0.01

35 0.20795 0.19644 0.01

36 0.20744 0.21112 0.00

37 0.20878 0.09804 0.18

38 0.20857 0.00000 0.37

39 0.20805 0.00000 0.18

40 0.20781 0.00000 0.10

41 0.20769 0.00000 0.06

42 0.20761 0.00000 0.03

43 0.20757 0.00000 0.01

Page 32: Lecture 7 Statistical Lecture- Tree Construction.

Goodness of a Split (8)

• The greatest reduction in impurity comes from splitting age at 24

• It might be more interesting to force an age split at age 19, stratifying the sample into teenagers and adults

• This kind of interactive process is fundamentally important in producing the most interpretable trees

Page 33: Lecture 7 Statistical Lecture- Tree Construction.

Variable   1000*Goodness of the split (1000Δ)
x1         1.5
x2         2.8
x3         4.0
x4         0.6
x5         0.6
x6         3.2
x7         0.7

Page 34: Lecture 7 Statistical Lecture- Tree Construction.

Variable   1000*Goodness of the split (1000Δ)
x8         0.6
x9         0.7
x10        0.2
x11        1.8
x12        1.1
x13        0.5
x14        0.8
x15        1.2

Page 35: Lecture 7 Statistical Lecture- Tree Construction.

Splitting the Root

• The best of the best splits comes from race (x3), with ΔI = 0.004

• The best split divides the root node according to whether a pregnant woman is Black or not

• After splitting the root node, we continue to divide its two daughter nodes

• The partitioning principle is the same

Page 36: Lecture 7 Statistical Lecture- Tree Construction.
Page 37: Lecture 7 Statistical Lecture- Tree Construction.

Second Splitting (node 2)

• The partition uses only 710 Black women

• The total number of allowable splits decreases from 347 to 332 (the 15 race splits no longer apply)

• After the split of node 2, we have three nodes, 3, 4, and 5 ready to be split

Page 38: Lecture 7 Statistical Lecture- Tree Construction.

Second Splitting (node 3)

• In the same way, we can divide node 3

• We consider only 3151 non-Black women

• There are potentially 2^(4−1) − 1 = 7 race splits (Whites, Hispanics, Asians, and others)

• There are 339 allowable splits for node 3

• After we finish node 3, we go on to nodes 4 and 5

• This is so-called recursive partitioning

• The resulting tree is called a binary tree

Page 39: Lecture 7 Statistical Lecture- Tree Construction.

Terminal Nodes (1)

• The partitioning process may proceed until the offspring nodes subject to further division cannot be split (saturated)

• For example, when there is only one subject in a node

• The total number of allowable splits for a node drops as we move from one layer to the next

• The number of allowable splits eventually reduces to zero, and the tree cannot be split any further

• Any node that cannot be split is a terminal node

Page 40: Lecture 7 Statistical Lecture- Tree Construction.

Terminal Nodes (2)

• The saturated tree is usually too large to be useful, because the terminal nodes are so small

• It is unnecessary to wait until the tree is saturated

• A minimum size of a node is set a priori

• The choice of the minimum size depends on the sample size (e.g., one percent) or can simply be taken as 5 subjects

• Stopping rules should be proposed before the tree becomes too large (see the sketch below)
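The pieces described so far can be put together in a compact sketch of recursive partitioning. This is only an illustrative Python sketch, not the software used in the lecture: it handles ordered predictors only (nominal splits are omitted), assumes the data are rows of numeric covariates with 0/1 outcome labels, uses the entropy impurity and goodness-of-split criterion from the earlier slides, and stops when a node is pure, too small, or has no allowable split.

```python
import math

def impurity(labels):
    """Entropy impurity of a node from its 0/1 outcome labels."""
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def best_split(rows, labels, features):
    """Best question of the form 'is x_j <= c?' by the reduction in weighted impurity."""
    best, parent = None, impurity(labels)
    for j in features:
        values = sorted({r[j] for r in rows})
        for c in values[:-1]:                      # one fewer split than distinct values
            left  = [y for r, y in zip(rows, labels) if r[j] <= c]
            right = [y for r, y in zip(rows, labels) if r[j] > c]
            delta = parent - (len(left) / len(labels)) * impurity(left) \
                           - (len(right) / len(labels)) * impurity(right)
            if best is None or delta > best[0]:
                best = (delta, j, c)
    return best

def grow(rows, labels, features, min_size=5):
    """Recursive partitioning with an a priori minimum node size as the stopping rule."""
    split = None
    if len(labels) >= 2 * min_size and impurity(labels) > 0.0:
        split = best_split(rows, labels, features)
    if split is None:
        return {"prob": sum(labels) / len(labels)}             # declare a terminal node
    delta, j, c = split
    go_left = [r[j] <= c for r in rows]
    return {"split": (j, c), "goodness": delta,
            "left":  grow([r for r, g in zip(rows, go_left) if g],
                          [y for y, g in zip(labels, go_left) if g], features, min_size),
            "right": grow([r for r, g in zip(rows, go_left) if not g],
                          [y for y, g in zip(labels, go_left) if not g], features, min_size)}
```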

Page 41: Lecture 7 Statistical Lecture- Tree Construction.
Page 42: Lecture 7 Statistical Lecture- Tree Construction.

Node Impurity

• The impurity of a node τ is defined as a nonnegative function of the probability, P{Y=1|τ}

• The least impure node should have only one class of the outcome (i.e., P{Y=1|τ}=0 or 1), and its impurity is 0

• Node τ is most impure when P{Y=1|τ}=1/2

Page 43: Lecture 7 Statistical Lecture- Tree Construction.

Impurity Function (1)

The impurity function has a concave shape and can be formally defined as

  i(τ) = φ(P{Y=1|τ}),

where the function φ has the properties

  (i) φ(p) ≥ 0,
  (ii) φ(p) = φ(1 − p) for any p in (0, 1),
  (iii) φ(0) = φ(1) < φ(p) for p in (0, 1).

Page 44: Lecture 7 Statistical Lecture- Tree Construction.

Impurity Function (2)

Common choices of φ include

  (i) φ(p) = min(p, 1 − p), the Bayes error (the minimum error; rarely used)
  (ii) φ(p) = −p log(p) − (1 − p) log(1 − p), the entropy function
  (iii) φ(p) = p(1 − p), the Gini index (also has some problems)
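A brief Python sketch of the three choices, showing that each satisfies the properties listed on the previous slide (zero at p = 0 and 1, symmetric about 1/2, largest at 1/2); the function names are illustrative.

```python
import math

def bayes_error(p):
    return min(p, 1.0 - p)                                    # (i)  phi(p) = min(p, 1-p)

def entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)   # (ii) entropy function

def gini(p):
    return p * (1.0 - p)                                       # (iii) Gini index

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p={p:.1f}  Bayes={bayes_error(p):.3f}  entropy={entropy(p):.3f}  Gini={gini(p):.3f}")
```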

Page 45: Lecture 7 Statistical Lecture- Tree Construction.
Page 46: Lecture 7 Statistical Lecture- Tree Construction.

Comparison Between Tree-Based and Logistic Regression Analyses

Page 47: Lecture 7 Statistical Lecture- Tree Construction.

Logistic Regression

Let πi = P{Yi = 1}. The logistic regression model is

  log(πi / (1 − πi)) = β0 + Σj=1..p βj xij,

or equivalently

  πi = exp(β0 + Σj=1..p βj xij) / (1 + exp(β0 + Σj=1..p βj xij))

Page 48: Lecture 7 Statistical Lecture- Tree Construction.

Example (Yale study)

Selected variable   Degrees of freedom   Coefficient estimate   Standard error   p-value

============================================

Intercept 1 -2.344 0.4584 0.0001

x6 (educ.) 1 -0.076 0.0313 0.0156

z6 (Black) 1 0.699 0.1688 0.0001

x11 (grav.) 1 0.115 0.0466 0.0137

z10 (horm.) 1 1.539 0.4999 0.0021

z6=1 for an African-American, =0 otherwise.

z10=1 if a subject’s mother used hormones only, =0 otherwise.

Page 49: Lecture 7 Statistical Lecture- Tree Construction.

Example

π̂i = exp(−2.344 − 0.076 xi6 + 0.699 zi6 + 0.115 xi,11 + 1.539 zi,10)
     / (1 + exp(−2.344 − 0.076 xi6 + 0.699 zi6 + 0.115 xi,11 + 1.539 zi,10))
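A minimal sketch in Python of how this fitted equation converts one subject's covariates into a predicted probability of preterm delivery; the coefficients come from the table above, while the example subject's values are hypothetical.

```python
import math

def preterm_prob(x6, z6, x11, z10):
    """Fitted logistic model from the table above: years of education (x6),
    Black indicator (z6), gravidity (x11), maternal hormone-use indicator (z10)."""
    eta = -2.344 - 0.076 * x6 + 0.699 * z6 + 0.115 * x11 + 1.539 * z10
    return math.exp(eta) / (1.0 + math.exp(eta))

# Hypothetical subject: 12 years of education, Black, gravidity 2, no maternal hormone use
print(round(preterm_prob(x6=12, z6=1, x11=2, z10=0), 3))   # about 0.09
```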

Page 50: Lecture 7 Statistical Lecture- Tree Construction.
Page 51: Lecture 7 Statistical Lecture- Tree Construction.

Comparison (1)

• The area under the curve is 0.622 for the tree-based model

• The area under the curve is 0.637 for the logistic model

• Both models have lower predictive power when they are applied to further test samples

• There is much room to improve our understanding of the determinants of preterm deliveries; for example, new risk factors should be sought

Page 52: Lecture 7 Statistical Lecture- Tree Construction.

Comparison (2)

• We used only one predictor at a time when partitioning a node

• A linear combination of the predictors can also be considered to split a node

Page 53: Lecture 7 Statistical Lecture- Tree Construction.

Shortcomings

• It is computationally difficult to find an optimal combination

• The resulting split is not as intuitive as before

• The combination is much more likely to be missing, since it is missing whenever any of its component predictors is missing

Page 54: Lecture 7 Statistical Lecture- Tree Construction.

First Strategy (1)

• Take the linear equation derived from the logistic regression as a new predictor

• This new predictor is more powerful than any individual predictor

  x16 = −2.344 − 0.076 x6 + 0.699 z6 + 0.115 x11 + 1.539 z10

Page 55: Lecture 7 Statistical Lecture- Tree Construction.

First Strategy (2)

• Education shows a protective effect (through x16 and also on the left-hand side of the tree)

• Age has emerged as a risk factor. In the fertility literature, whether a woman is at least 35 years old is a common standard for pregnancy screening. The threshold of 32 is very close to 35

• The risk is not monotonic with respect to x16. The risk is lower when −2.837 < x16 ≤ −2.299 than when −2.299 < x16 ≤ −2.062

• The area under the curve is 0.661

Page 56: Lecture 7 Statistical Lecture- Tree Construction.
Page 57: Lecture 7 Statistical Lecture- Tree Construction.

Second Strategy (1)

• Run the logistic regression after a tree is grown

• We can create five dummy variables, each corresponds to one of the five terminal nodes

Page 58: Lecture 7 Statistical Lecture- Tree Construction.

Dummy Variables

Variable label   Specification
========================================
z13              Black, unemployed
z14              Black, employed
z15              non-Black, <=4 pregnancies, DES not used
z16              non-Black, <=4 pregnancies, DES used
z17              non-Black, >4 pregnancies
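A small sketch in Python of the dummy coding implied by the table above; it assumes z17 is the remaining non-Black node (more than 4 pregnancies), and the argument names are illustrative.

```python
def terminal_node_dummies(black, employed, pregnancies, des_used):
    """One-hot coding of the five terminal nodes from the table above."""
    z = {"z13": 0, "z14": 0, "z15": 0, "z16": 0, "z17": 0}
    if black:
        z["z14" if employed else "z13"] = 1
    elif pregnancies <= 4:
        z["z16" if des_used else "z15"] = 1
    else:
        z["z17"] = 1      # assumption: the remaining non-Black, >4 pregnancies node
    return z

print(terminal_node_dummies(black=False, employed=True, pregnancies=2, des_used=False))
# {'z13': 0, 'z14': 0, 'z15': 1, 'z16': 0, 'z17': 0}
```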

Page 59: Lecture 7 Statistical Lecture- Tree Construction.

Second Strategy (2)

π̂i = exp(−1.341 − 0.071 xi6 − 0.885 zi,15 + 1.016 zi,16)
     / (1 + exp(−1.341 − 0.071 xi6 − 0.885 zi,15 + 1.016 zi,16))

Page 60: Lecture 7 Statistical Lecture- Tree Construction.

Second Strategy (3)

• The equation is very similar to the previous result

• The variables z15 and z16 are an interactive version of z6, x11, and z10

• The coefficient for x6 stays nearly the same

• The area under the new curve is 0.642, which is slightly higher than 0.639

Page 61: Lecture 7 Statistical Lecture- Tree Construction.
Page 62: Lecture 7 Statistical Lecture- Tree Construction.

• http://peace.med.yale.edu/pub

• RTREE

• CART (SPSS & S-Plus)