Download - TWIX Trees WIth eXtra splits · 2005. 4. 7. · 2 Martin Theus, Sergej Potapov – Lehrstuhl für Rechnerorientierte Statistik und Datenanalyse, Universität Augsburg Simon Urbanek

Martin Theus, Sergej Potapov – Lehrstuhl für Rechnerorientierte Statistik und Datenanalyse, Universität AugsburgSimon Urbanek – AT&T Research Labs, Florham Park, NJ

TWIX – Trees WIth eXtra splits

Sergej PotapovMartin Theus

Simon Urbanek

TWIX

Trees WIth eXtra splits

2



Motivation

• Where the classical CART algoritm fails– Greedy algorithms never go for a (locally) second best solution,

which would result in a better overall (global) solution.

Example: XOR-data

-3 -2 -1 0 1 2 3 4 5 6 7 8

0

>=3.532669

1

< 3.532669

2

1.437037

>=5.679445

y

< 3.151441

1

>=3.151441

2

1.290780

< 3.248369

x

>=2.703216

1

< 2.703216

2

1.806452

>=3.248369

x

1.532075

< 5.679445

y

1.500000

x

-3 -2 -1 0 1 2 3 4 5 6 7 8

-3

-2

-1

0

1

2

3

4

5

6

7

8

9

10

3



Trees: More Problems

• Non-orthogonal splitting directions …

-3 -2 -1 0 1 2 3

-3

-2

-1

0

1

2

3

-3 -2 -1 0 1 2 3

-3

-2

-1

0

1

2

3

< -0.05426408

1

>=0.09766749

1

< 0.09766749

2

1

>=-0.05426408

V1

1

>=-0.5117182

V2

>=-1.571775

1

< -1.758858

1

>=-1.758858

2

1

< -1.571775

V2

1

< -0.8793875

V1

>=-0.8793875

2

1

< -0.5117182

V2

1

< 0.2145846

V1

< 1.029791

1

>=1.029791

2

2

>=0.8353015

V2

< 0.8353015

2

2

>=0.2145846

V1

1

V2

-3 -2 -1 0 1 2 3

-3

-2

-1

0

1

2

3

… can not be handled by single trees, no matter how we split

4



Bagging Revisited

Bagging = Boostrap Aggegation, tries to simulate an infinite sam-

ple by bootstrapping, i.e. sampling from the original sample with

replacement.

Repeat N times:

1. Generate a bootstrap sample Di of size n.

2. Fit model f̂Di.

Depending on the problem the N results are aggregated:

• Classification: g(x) = argmaxc∈C

N∑

i=1

I(fDi(x) = c)

• Regression: g(x) =1

N

N∑

i=1

fDi(x)

5



Ensembles

• General IdeaUse many “different” classifier and combine them to get more accurate results.

• Bagging: Instability of trees yields different models

• Random Forests: Restrict input space randomly to get wider range of models

• Boosting: Iterate to down-weight “bad” points

Question:Why use randomly generated (sub-optimal) models?

6



Tree Mechanics

• CART is a recursive partitioning algorithm

• Each node is split according to the maximum gain in the loss function

• Mountain plots shows the loss function for a variable for all possible split points

Loss Functions Mountain Plot

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.1

0.2

0.3

0.4

0.5

p

Entropy

Gini

Missclassification

6400 6600 6800 7000 7200 7400 7600 7800 8000 8200 8400

0

100

200

300

6400 6600 6800 7000 7200 7400 7600 7800 8000 8200 8400

1

3

5

7

9

7



Idea behind TWIX

• Since the greedy CART algorithm not necessarily finds the “optimal” tree, try second best splits.

• Use these forests for bagging

• Expect better results for both single trees and aggregations

• How to find “good” candidates for second best splits?

• Number of inner nodes grows exponentially with the number of levels in the tree

⇒ so does the number of alternative trees

Problems

8



Second Best Splits: South African Heart Data

100 120 140 160 180 200

05

10

15

20

sbp

Deviance

0 5 10 15 20 25 30

05

10

15

20

tobacco

Deviance

2 4 6 8 10 12 14

05

10

15

20

ldl

Deviance

10 15 20 25 30 35 40

05

10

15

20

adiposity

Deviance

20 30 40 50 60 70

05

10

15

20

typea

Deviance

20 25 30 35 40 45

05

10

15

20

obesityDeviance

0 20 40 60 80 100 120

05

10

15

20

alcohol

Deviance

20 30 40 50 60

05

10

15

20

age

Deviance

9



Second Best Splits: Global vs. Local

• When searching for a “best” split point, we can either look for– all top n greatest deviance gains, or– only look for local maxima

• ExampleTop 6 splits

100 120 140 160 180 200

05

10

15

20

sbp

Deviance

0 5 10 15 20 25 30

05

10

15

20

tobacco

Deviance

2 4 6 8 10 12 14

05

10

15

20

ldl

Deviance

10 15 20 25 30 35 40

05

10

15

20

adiposity

Deviance

20 30 40 50 60 70

05

10

15

20

typea

Deviance

20 25 30 35 40 45

05

10

15

20

obesityDeviance

0 20 40 60 80 100 120

05

10

15

20

alcohol

Deviance

20 30 40 50 60

05

10

15

20

age

Deviance

10



Second Best Splits: Forcing Variables

• Often a single variable dominates the potential deviance gain, and shadows all other variables⇒ Many probably good split points are lost.

• Solution:Force a minimum number of split points for each variable.

• Example: top 6 vs. top 3

100 120 140 160 180 200

05

10

15

20

sbp

Deviance

0 5 10 15 20 25 30

05

10

15

20

tobacco

Deviance

2 4 6 8 10 12 14

05

10

15

20

ldl

Deviance

10 15 20 25 30 35 40

05

10

15

20

adiposity

Deviance

20 30 40 50 60 70

05

10

15

20

typea

Deviance

20 25 30 35 40 45

05

10

15

20

obesity

Deviance

0 20 40 60 80 100 120

05

10

15

20

alcohol

Deviance

20 30 40 50 60

05

10

15

20

age

Deviance

100 120 140 160 180 200

05

10

15

20

sbp

Deviance

0 5 10 15 20 25 30

05

10

15

20

tobaccoDeviance

2 4 6 8 10 12 14

05

10

15

20

ldl

Deviance

10 15 20 25 30 35 40

05

10

15

20

adiposity

Deviance

20 30 40 50 60 70

05

10

15

20

typea

Deviance

20 25 30 35 40 45

05

10

15

20

obesity

Deviance

0 20 40 60 80 100 120

05

10

15

20

alcohol

Deviance

20 30 40 50 60

05

10

15

20

age

Deviance

56

11



Second Best Splits: Grid Search

• In some situations good split points might not even associated with some (local) maximum in deviance gain.

(Remember the XOR Example)

• Grid searches are most exhaustive, but also most expensive.

100 120 140 160 180 200

05

10

15

20

sbp

Deviance

0 5 10 15 20 25 30

05

10

15

20

tobacco

Deviance

2 4 6 8 10 12 14

05

10

15

20

ldl

Deviance

10 15 20 25 30 35 40

05

10

15

20

adiposity

Deviance

20 30 40 50 60 70

05

10

15

20

typea

Deviance

20 25 30 35 40 45

05

10

15

20

obesity

Deviance

0 20 40 60 80 100 120

05

10

15

20

alcohol

Deviance

20 30 40 50 60

05

10

15

20

age

Deviance

12



Implementation: The Grid

• If we allow sj splits per node on level j of the tree, we get a max-imum of

S =k∏

i=1

s2i−1

i

trees for a tree with no more than k levels. Example:

s = (7, 4, 2) ⇒ S = 720

· 421

· 222

= 7 · 16 · 16 = 1792

⇒Work on a grid of computers

PVMController

13



Implementation: The R-Package

TWIX {TWIX} R Documentation

Top Classification Trees

Description

Top Classification Trees

Usage

TWIX(formula, data = NULL, test.data = NULL, subset = NULL,

method = "deviance", topn.method = "complete", cluster = NULL,

minsplit = 20, minbucket = round(minsplit/3), Devmin = 0.01,

topN = 1, level = 6, st = 1, cl.level = 2, tol = 0.01, ...)

Arguments

formula formula of the form y ~ x1 + x2 + ..., where y must be a factor and x1,x2,... arenumeric.

data an optional data frame containing the variables in the model(training data).

test.data a data frame containing new data.

subset an optional vector specifying a subset of observations to be used.

method Which split points will be used? This can be "deviance" (default), "grid" or "local".If the method is set to: local the program uses the local maxima of the splitfunction(entropy), deviance all values of the entropy, grid grid points.

topn.method one of "complete"(default) or "single". A specification of the consideration of the splitpoints. If set to "complete" it uses split points from all variables, else it uses split pointsper variable.

cluster name of the cluster, if parallel computing will be used.

minsplit the minimum number of observations that must exist in a node.

minbucket the minimum number of observations in any terminal <leaf> node.

Devmin the minimum improvement on entropy by splitting.

topN integer vector. How many splits will be selected and at which level? If length 1, the samesize of splits will be selected at each level. If length > 1, for example topN=c(3,2), 3splits will be chosen at first level, 2 splits at second level and for all next levels 1 split.

level maximum depth of the trees. If level set to 1, trees consist of root node.

st step parameter for method "grid".

cl.level parameter for parallel computing.

tol parameter, which will be used, if topn.method is set to "single".

... further arguments to be passed to or from methods.

Value

a list with the following components :

call the call generating the object.

trees a list of all constructed trees, which include ID, Dev ... for each tree.

greedy.tree greedy tree

14



“Driving the Beast”

• The most important tuning parameters are– method

Which split points will be used? This can be "deviance" (default), "grid" or "local". If the method is set to: local the program uses the local maxima of the split function(entropy), deviance all values of the entropy, grid grid points.

– topn.methodone of "complete"(default) or "single". A specification of the consideration of the split points. If set to "complete" it uses split points from all variables, else it uses split points per variable.

– topNinteger vector. How many splits will be selected and at which level? If length 1, the same size of splits will be selected at each level. If length > 1, for example topN=c(3,2), 3 splits will be chosen at first level, 2 splits at second level and for all next levels 1 split.

– levelmaximum depth of the trees. If level set to 1, trees consist of root node.

– Stopping Rules:■ minsplit

the minimum number of observations that must exist in a node.■ minbucket

the minimum number of observations in any terminal <leaf> node.■ Devmin

the minimum improvement on entropy by splitting.

15



South African Heart Desease Data cont.

• To get a “fair”, i.e. generalizable and not too overfitted classifier, we usually split the data into 3 chunks:

– TrainingAll models are trained using the training data

– ValidationThe “best” model is selected using the validation data(The chosen model is then estimated with training+validation)

– TestThe performance is then assessed with the test data

Training Validation Test

What about trees?Model Structure = Model parameters

16



The Dataset

• 10 Variables, 462 Observations

• Target: Coronary Heart Disease (chd), 34,63% = 160 cases

• Inputs:continuous– sbp systolic blood pressure– tobacco cumulative tobacco (kg)– ldl low density lipoprotein cholesterol– adiposity– typea type-A behavior– obesity– alcohol current alcohol consumption– age age at onset

discrete– famhist family history of heart disease (Present, Absent)omi

tted

101 218101 218101 218

0.0

1.0

17



The Dataset: Univariate

sbp

18



The Dataset: Univariate

101 218

0.0

1.0

0 31.2

0.0

1.0

0.98 15.33

0.0

1.0

6.7 42.5

0.0

1.0

13 78

0.0

1.0

14.7 46.6

0.0

1.0

0 147.2

0.0

1.0

15 64

0.0

1.0

sbp tobacco ldl adiposity

typea obesity alcohol age

19



The Dataset:Bivariate

sbp

0 5 10 15 20 25 30 10 20 30 40 15 25 35 45 20 30 40 50 60

100

140

180

220

05

1015202530

tobacco

ldl

24

68

10

14

10

20

30

40

adiposity

typea

20

40

60

80

15

25

35

45

obesity

alcohol

050

100

150

100 140 180 220

20

30

40

50

60

2 4 6 8 10 14 20 40 60 80 0 50 100 150

age

20



The Dataset: Multivariate

sbp

tobacco

ldl

adiposity

typea

obesity

alcohol

age

sbp

tobacco

ldl

adiposity

typea

obesity

alcohol

age

21



The Competitors: On 100 random samples

• Logistic Regression

glm(response~., data=dataTrain, family="binomial")

• Traditional CART

rpart(response ~ ., data=dataTrain, parms=list(split='information'))

• Bagging

bagging(response~., data=dataTrain, nbagg=100)

• SVM

svm(response~.,data=dataTrain)

!! None of the methods has been further tuned !!

22



Competitors Results: Error Rates

• Trad. Trees

• Bagging

• Support Vektor

Machines

• Log. Regression ! !

!

! !

log..reg

svm.te

bagging

rpart.te

0.25 0.30 0.35 0.40 0.45

23



TWIX: Diagnostics

• For a given “Multitree” we can compare deviance and classification rate on training and test/validation data.

Example:

!!!

!! !

! !!!

!!

!!

!

!

!

!

!!

!

!!

!

!!

!

!

!

!! !

!!

!!

!

!!

!!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!!

!

!

!!!

!

!

!

!

!!

!!

!

!

!!

!! !

!!

!

!

!! !

!

!!!

!!!

!!

!

!

!

!

!

!

!

!!

!!

!!!

!

!!

!

!

!

!

!

!

!!

!

!

!

!

!!

!

!

! !!

!

!!!

!

!!!

!!

!

!

!

!!

!!

!!

!!

!!

!!!

!!!

!

!!!!!

!

!

!!

!

!!!!

!!

!!

!!

!!

!

!

!

!!

!! !

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!

!!

!!!

!!

!

!!

!

!!!!

!

!

!

!!!

!!

!!

!

!!

!

!

!!

!!!

!

!

!

!!

!!

!

!!

!

!

!

!

!

!

!

!

!

!!!

!!

!

!

!

!!

!

!

!

!!

!!

!

!

!!!

!

!!

!

!

!!!!!

!!!!

!

!!

!

! !!! !

!!

!

!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!

!

!!

! !!!

! !

!!

!!

!

!!

!

!

!!

! !

!

!!

!

!

!

!!!!

!!

!

!!

!!!!

!!

!

!

!

!!

!

!!

!

!

!!!!

!

!

!!!!

!!

!!

!!

!!!!!!

!

!!

!!

!

!

!

!

!!!!

!

!!

!

!

!

!!

!

!!!

!

!

!

!

!

!!

!

!

!!

!

!

!

!!

!

!

!

!

!

!

!

!

!!!!

!

!!

!!

! !!

!!

!

!

!

!!

!!

!

!!

!

!

!

!

!

!

!!

!

!!

!

!

!!

!!!!

!!

!!

!

!

!!!

!

!

!!

!

!

!

!

!

!!!

!

!!

!

!! !

!!

!

!

!

!

!

!!

!

!!

!

!

!

!

!

!!

!

!!

!!

!

!!

!

!!

!

!

!

!!

!

!

!!!!!!!!

!

!

!

!

!

!

!

!!

!

!

!!

!

!

!!

!

!

!!

!!

!!

!

!

!

!

!!

!!

!

!

!!

!

!

!!

!

!

!

!

!

!

!!!

!!

!!!

!

!!

!

!

!!

!!!!

!

! !

!

!

!

!

!

!

!!!

!

!

!!

!

!

!

!

!

!

!

!

!!

!

!

!

!!!

!!!!

!!

!

!

!!

!!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!

!!!!!

!

!

!

!

!!

!!

!!

!

!

!

!

!

!

!

!!

!!

!!!

!!

!!

!!

!

!

!!!!

!!

!!

!

!

!!

!!

!

!

!!

!

!!

!!

!

!

!

!

!

!

!

!!

!

!

!

!

!

!

!

!!

!

!

!

!! !

!!

!!

!!!!!

!!

!!

!!

!

!

!

!

!

!

!!

!

!!!!!!

!

!

!

!

!!

!!

!

!

!

!!

!

!!!!

!

!

!!!

!!

!

!!

!

!

!

!!

!

!

!

!

!

!

!!

!

!

!

!!

!

!

!!

!

!!

!

! !

!

!

!

!

!

!

!

!

!

50 100 150 200

10

15

20

25

30

35

40

Training Deviance

!

Deviances of trees (n= 870 )

50 100 150 200

050

100

150

!

!

! !

! !

! ! !

! !!

! ! !

! !! ! !! !!

! ! ! ! ! !

! ! !!! ! !! !! !! !

!! !! !!! ! ! ! !!!!

!! !!!! !!!! !! ! !

! ! !!!!! ! !!!!!!!!!!!!!! ! ! !

!!!!!!!!!!!! !!! !

!!! !!!!!!!!!!!!!!!! !! !!!!!!!!!!

!! !! ! ! !!!!!!!!!!!!!!!!

!! !! !!!!!!!!!!!!!!!! !

!!!!!!!!!!!!!! !!! !

!! !!!!!!!!!!!!! !

! !!!!!!!!!!!!!!!!!!!!! !!

!!!!!!!!!! !!!! !!! !!!

!!! !!!!!!!!!!!!!!!! !

! !! ! ! !! !!!!! !

! ! !! !!!!!!! !!!!!!! !!! ! ! !! ! !! ! !

!! !! ! !! ! !

! ! !

! ! ! !

!! !

0.65 0.70 0.75 0.80 0.85

0.6

50

.70

0.7

50

.80

0.8

5

CCR of 870 trees

CCR of training data

CC

R o

f te

tst

da

ta

!

+rpart Tree

24



TWIX: Tree Selection

• The CCR (Correct Classification Rate) of the top TWIX trees are far better than those of greedy trees and most other classification methods

Quest:How to find the “best” trees from the validation data?

• Currently:

Sort trees according to– training deviance– test deviance– training CCR– test CCR

and pick the best!

25



TWIX: Results

• Local Maxima across all variables (50 samples)

– 1,1

– 2,1

– 4,2

– 6,3

– 8,4

– 10,5 !! ! !

!! !!! !

!!! !! ! !

!!! !

X10.5

X8.4

X6.3

X4.2

X2.1

X1.1

0.25 0.30 0.35 0.40 0.45

26



TWIX: Results

• Local Maxima, within all variables (50 samples)

– 1,1

– 2,1

– 4,2

– 6,3

– 8,4

– 10,5 !

!

!

! !!

!

X10.5

X8.4

X6.3

X4.2

X2.1

X1.1

0.25 0.30 0.35 0.40 0.45

27



TWIX: Results

• 100 runs of a (10,5,2) tree

!

!!

! !! !

! !!!

! !

!

! !

!

rpart.tr

TWIX.Tr

log..reg

TWIX.Bag

bagging

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45

rpart (training)

svm (training)

TWIX (training)

TWIX (validation)

logistic (test)

svm (test)

TWIX bag (test)

TWIX (test)

Bagging (test)

rpart (test)

28



Bagged TWIX

• What bagging should look like:

#trees

erro

r ra

te

29



Bagged TWIX: Example

• Bagging TWIX trees does often not improve the CCR.

0 50 100 150 200 250 300

0.30

0.35

0.40

Index

ps

error of single trees

0 10 20 30 40 50

0.30

0.35

0.40

Index

bagseq

error of bagged trees

30



Conclusion

• For the South-African-Heart Data– Single TWIX-trees out-perform traditional tree and usually bagged trees

– Bagged TWIX beats bagged trees and reaches top performance

– TWIX gives good single alternative tree models

• Still much room for performance improvement– Stopping rules and pruning– Better tree selection– Improved selection of second best splits

– More tests on more datasets– Better understanding of the tree-families

• Computational effort is high, but parallel computing speeds up dramatically

• Complex methods are hard to implement and hard to test