Advanced cart 2007


Transcript of Advanced cart 2007

Page 1: Advanced cart 2007

Introduction to CART® Salford Systems ©2007

Data Mining with Decision Trees: An Introduction to CART®

Dan Steinberg

Mikhail Golovnya
[email protected]

http://www.salford-systems.com

Page 2: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 2

In The Beginning…

In 1984 Berkeley and Stanford statisticians announced a remarkable new classification tool which:
- Could separate relevant from irrelevant predictors
- Did not require any kind of variable transforms (logs, square roots)
- Was impervious to outliers and missing values
- Could yield relatively simple and easy-to-comprehend models
- Required little to no supervision by the analyst
- Was frequently more accurate than traditional logistic regression, discriminant analysis, and other parametric tools

Page 3: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 3

The Years of Struggle

CART was slow to gain widespread recognition in the late 1980s and early 1990s for several reasons:
- The monograph introducing CART is challenging to read: a brilliant book overflowing with insights into tree-growing methodology, but fairly technical and brief
- The method was not expounded in any textbooks
- It was originally taught only in advanced graduate statistics classes at a handful of leading universities
- The original standalone software came with slender documentation, and its output was not self-explanatory
- The method was such a radical departure from conventional statistics

Page 4: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 4

The Final Triumph

Rising interest in data mining
- Availability of huge data sets requiring analysis
- Need to automate or accelerate and improve the analysis process

Comparative performance studies

Advantages of CART over other tree methods
- Handling of missing values
- Assistance in interpretation of results (surrogates)
- Performance advantages: speed, accuracy

New software and documentation make the techniques accessible to end users

Word of mouth generated by early adopters

Easy-to-use professional software available from Salford Systems

Page 5: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 5

So What is CART?

Best illustrated with a famous example: the UCSD Heart Disease study

Given the diagnosis of a heart attack based on:
- Chest pain
- Indicative EKGs
- Elevation of enzymes typically released by damaged heart muscle

Predict who is at risk of a 2nd heart attack and early death within 30 days

The prediction will determine the treatment program (intensive care or not)

For each patient about 100 variables were available, including:
- Demographics, medical history, lab results
- 19 noninvasive variables were used in the analysis

Page 6: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 6

Typical CART Solution

Example of a CLASSIFICATION tree
- The dependent variable is categorical (SURVIVE, DIE)
- We want to predict class membership

Tree structure:

PATIENTS = 215 (SURVIVE 178, 82.8%; DEAD 37, 17.2%)
Is BP <= 91?
  Yes -> Terminal Node A: SURVIVE 6 (30.0%), DEAD 14 (70.0%); NODE = DEAD
  No  -> PATIENTS = 195 (SURVIVE 172, 88.2%; DEAD 23, 11.8%)
         Is AGE <= 62.5?
           Yes -> Terminal Node B: SURVIVE 102 (98.1%), DEAD 2 (1.9%); NODE = SURVIVE
           No  -> PATIENTS = 91 (SURVIVE 70, 76.9%; DEAD 21, 23.1%)
                  Is SINUS <= .5?
                    Yes -> Terminal Node D: SURVIVE 56 (88.9%), DEAD 7 (11.1%); NODE = SURVIVE
                    No  -> Terminal Node C: SURVIVE 14 (50.0%), DEAD 14 (50.0%); NODE = DEAD

Page 7: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 7

How to Read It
- The entire tree represents a complete analysis or model
- It has the form of a decision tree
- The root of the inverted tree contains all the data
- The root gives rise to child nodes
- Child nodes can in turn give rise to their own children
- At some point a given path ends in a terminal node
- The terminal node classifies the object
- The path through the tree is governed by the answers to QUESTIONS or RULES

Page 8: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 8

Decision Questions

A question is of the form: is the statement TRUE?
- Is continuous variable X <= c?
- Does categorical variable D take on levels i, j, or k?
  - e.g., is geographic region 1, 2, 4, or 7?

Standard split: if the answer to the question is YES, a case goes left; otherwise it goes right
- This is the form of all primary splits

The question is formulated so that only two answers are possible
- This is called binary partitioning
- In CART the YES answer always goes left

Page 9: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 9

The Tree is a Classifier

Classification is determined by following a case's path down the tree

Terminal nodes are associated with a single class
- Any case arriving at a terminal node is assigned to that class

The implied classification threshold is constant across all nodes and depends on the currently set priors, costs, and observation weights (discussed later)
- Other algorithms often use only the majority rule or a limited set of fixed rules for node class assignment and are thus less flexible

In standard classification, tree assignment is not probabilistic
- Another type of CART tree, the class probability tree, does report distributions of class membership in nodes (discussed later)
- With large data sets we can take the empirical distribution in a terminal node to represent the distribution of classes

Page 10: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 10

The Importance of Binary Splits

Some researchers have suggested using 3-way, 4-way, etc. splits at each node (univariate)

There are strong theoretical and practical reasons why we feel strongly against such approaches:
- An exhaustive search for a binary split has linear algorithmic complexity; the alternatives have much higher complexity
- Choosing the best binary split is "less ambiguous" than the suggested alternatives
- Most importantly, binary splits result in the smallest rate of data fragmentation, thus allowing a larger set of predictors to enter the model
- Any multi-way split can be replaced by a sequence of binary splits

There are numerous practical illustrations of the validity of the above claims

Page 11: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 11

Accuracy of a Tree

For classification trees:
- All objects reaching a terminal node are classified in the same way
  - e.g., all heart attack victims with BP > 91 and AGE less than 62.5 are classified as SURVIVORS
- The classification is the same regardless of other medical history and lab results

The performance of any classifier can be summarized in terms of the Prediction Success Table (a.k.a. the Confusion Matrix)

The key to comparing the performance of different models is how one "wraps" the entire prediction success matrix into a single number:
- Overall accuracy
- Average within-class accuracy
- Most generally, the Expected Cost

Page 12: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 12

Prediction Success Table

                               Classified As
TRUE Class                 Class 1          Class 2
                           "Survivors"      "Early Deaths"    Total    % Correct
Class 1 "Survivors"            158               20            178       88%
Class 2 "Early Deaths"           9               28             37       75%
Total                          167               48            215       86.51%

A large number of different terms exist to name various simple ratios based on the table above:
- Sensitivity, specificity, hit rate, recall, overall accuracy, false/true positive, false/true negative, predictive value, etc.
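As a quick illustration, several of these ratios can be read directly off the table; the short sketch below (variable names are ours, and "Survivors" is treated as the positive class) computes a few of them.

```python
# Simple ratios from the prediction success (confusion) table above.
# Rows are true classes; "Survivors" is treated as the positive class here.
tp, fn = 158, 20    # true Survivors predicted as Survivors / Early Deaths
fp, tn = 9, 28      # true Early Deaths predicted as Survivors / Early Deaths

sensitivity = tp / (tp + fn)                          # 158/178, Class 1 % correct
specificity = tn / (fp + tn)                          # 28/37,  Class 2 % correct
overall_accuracy = (tp + tn) / (tp + fn + fp + tn)    # 186/215 = 86.51%
average_within_class = (sensitivity + specificity) / 2

print(sensitivity, specificity, overall_accuracy, average_within_class)
```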

Page 13: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 13

Tree Interpretation and Use

Practical decision tool
- Trees like this are used for real-world decision making
  - Rules for physicians
  - Decision tool for a nationwide team of salesmen needing to classify potential customers

Selection of the most important prognostic variables
- Variable screen for parametric model building
- The example data set had 100 variables
- Use the results to find variables to include in a logistic regression

Insight: in our example AGE is not relevant if BP is not high
- Suggests which interactions will be important

Page 14: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 14

Binary Recursive Partitioning

Data is split into two partitions, hence a "binary" partition

Partitions can also be split into sub-partitions, hence the procedure is recursive

A CART tree is generated by repeated dynamic partitioning of the data set, as sketched below
- Automatically determines which variables to use
- Automatically determines which regions to focus on
- The whole process is totally data driven
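The following is a minimal, self-contained sketch of binary recursive partitioning on a 0/1 target, scored with Gini improvement. It is illustrative only (function names such as grow_tree and best_split are ours) and omits priors, costs, surrogates, and pruning.

```python
# Minimal sketch of binary recursive partitioning on a binary (0/1) target.
def gini(y):
    """Gini impurity of a list of 0/1 labels: 2 p (1 - p)."""
    p = sum(y) / len(y)
    return 2.0 * p * (1.0 - p)

def best_split(rows, y):
    """Examine every variable and every threshold; return the largest improvement."""
    best, parent_imp = None, gini(y)
    for j in range(len(rows[0])):
        for c in sorted({r[j] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[j] <= c]     # YES goes left
            right = [i for i, r in enumerate(rows) if r[j] > c]
            if not left or not right:
                continue
            p_l, p_r = len(left) / len(rows), len(right) / len(rows)
            imp = parent_imp - p_l * gini([y[i] for i in left]) \
                             - p_r * gini([y[i] for i in right])
            if best is None or imp > best[0]:
                best = (imp, j, c, left, right)
    return best

def grow_tree(rows, y, depth=0, max_depth=3, min_node=5):
    """Split recursively until nodes are pure, too small, or the depth limit is hit."""
    if depth == max_depth or len(rows) < min_node or gini(y) == 0.0:
        return {"class": round(sum(y) / len(y))}                    # terminal node
    found = best_split(rows, y)
    if found is None:
        return {"class": round(sum(y) / len(y))}
    _, j, c, left, right = found
    return {"var": j, "threshold": c,
            "left":  grow_tree([rows[i] for i in left],  [y[i] for i in left],  depth + 1, max_depth, min_node),
            "right": grow_tree([rows[i] for i in right], [y[i] for i in right], depth + 1, max_depth, min_node)}
```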

Page 15: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 15

Three Major Stages

Tree growing
- Consider all possible splits of a node
- Find the best split according to the given splitting rule (best improvement)
- Continue sequentially until the largest tree is grown

Tree pruning – creating a sequence of nested trees (the pruning sequence) by systematic removal of the weakest branches

Optimal tree selection – using a test sample or cross-validation to find the best tree in the pruning sequence

Page 16: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 16

General Workflow

(Diagram) Historical data is divided into Learn, Test, and Validate samples:
- Stage 1 (Learn): build a sequence of nested trees
- Stage 2 (Test): monitor performance and select the best tree
- Stage 3 (Validate): confirm the findings

Page 17: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 17

Large Trees and Nearest Neighbor Classifier

Fear of large trees is unwarranted!
- To reach a terminal node of a large tree, many conditions must be met
- Any new case meeting all these conditions will be very similar to the training case(s) in that terminal node
- Hence the new case should be similar on the target variable as well

Key result: the largest possible tree has a misclassification rate R asymptotically not greater than twice the Bayes rate R*
- The Bayes rate is the error rate of the most accurate classifier possible given the data
- Thus it is useful to check the largest tree's error rate
- For example, if the largest tree has a relative error rate of .58, the theoretically best possible is .29

The exact analysis follows

Page 18: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 18

Derivation

Assume a binary outcome and let p(x) be the conditional probability of the focus class given x

Then the conditional Bayes risk at point x is
    r*(x) = min[ p(x), 1 − p(x) ]

Now suppose that the actual prediction at x is governed by the class membership of the nearest neighbor xn; then the conditional NN risk is
    r(x) = p(x) [1 − p(xn)] + [1 − p(x)] p(xn) ≈ 2 p(x) [1 − p(x)]

Hence r(x) ≈ 2 r*(x) [1 − r*(x)] on large samples

Or in the global sense:  R* ≤ R ≤ 2 R* (1 − R*)

Page 19: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 19

Illustration

The error of the largest tree is asymptotically squeezed between the green and red curves

(Figure: the largest-tree error R plotted against the Bayes rate R* over [0, 0.5], showing the curves R*, 2R*(1−R*), and 2R*)
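To make the squeeze concrete, the short sketch below (ours, purely illustrative) tabulates the bounding curves from this slide over the plotted range of R*.

```python
# Tabulate the asymptotic bounds on the largest-tree error R for Bayes rates R*:
#   R*  <=  R  <=  2 R* (1 - R*)  <=  2 R*
for i in range(0, 55, 5):
    r_star = i / 100
    tight = 2 * r_star * (1 - r_star)    # nearest-neighbor style bound
    loose = 2 * r_star                   # looser bound without the (1 - R*) factor
    print(f"R*={r_star:4.2f}   2R*(1-R*)={tight:5.3f}   2R*={loose:4.2f}")
```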

Page 20: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 20

Impact on the Relative Error

RE<x> stands for the asymptotically largest possible relative error of the largest tree, where <x> is the initial proportion of the smallest target class (the smallest of the priors)

More unbalanced data might substantially inflate the relative errors of the largest trees

(Figure: the curves RE0.5, RE0.4, RE0.3, RE0.2, and RE0.1 plotted against R* over [0, 0.5], with relative error running from 0 to 2)

Page 21: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 21

Searching All Possible Splits

For any node CART will examine ALL possible splits
- Computationally intensive, but there are only a finite number of splits

Consider first the variable AGE, which in our data set has minimum value 40
- The rule "Is AGE <= 40?" will separate out these cases to the left (the youngest persons)

Next, increase the AGE threshold to the next youngest person
- "Is AGE <= 43?" will direct six cases to the left

Continue increasing the splitting threshold value by value

Page 22: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 22

Split Tables

Split tables are used to locate continuous splits
- There will be as many splits as there are unique values minus one

Sorted by Age:                          Sorted by Blood Pressure:
AGE   BP   SINUST   SURVIVE             AGE   BP   SINUST   SURVIVE
 40    91    0      SURVIVE              43    78    1      DEAD
 40   110    0      SURVIVE              40    83    1      DEAD
 40    83    1      DEAD                 40    91    0      SURVIVE
 43    99    0      SURVIVE              43    99    0      SURVIVE
 43    78    1      DEAD                 40   110    0      SURVIVE
 43   135    0      SURVIVE              49   110    1      SURVIVE
 45   120    0      SURVIVE              48   119    1      DEAD
 48   119    1      DEAD                 45   120    0      SURVIVE
 48   122    0      SURVIVE              48   122    0      SURVIVE
 49   150    0      DEAD                 43   135    0      SURVIVE
 49   110    1      SURVIVE              49   150    0      DEAD
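A small sketch (ours) of how the sorted AGE column above yields candidate thresholds: one split per unique value except the last, i.e. unique values minus one.

```python
# Candidate splits for a continuous variable, using the toy AGE column above.
ages = [40, 40, 40, 43, 43, 43, 45, 48, 48, 49, 49]

unique_ages = sorted(set(ages))          # [40, 43, 45, 48, 49]
candidate_splits = unique_ages[:-1]      # one split per unique value, minus one

for c in candidate_splits:
    n_left = sum(a <= c for a in ages)   # cases answering YES go left
    print(f"Is AGE <= {c}?  ->  {n_left} cases go left")
# Four candidate splits for five unique values; the first two send 3 and 6 cases left.
```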

Page 23: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 23

Categorical Predictors

Categorical predictors spawn many more splits
- For example, only four levels A, B, C, D result in 7 splits
- In general, there will be 2^(K-1) - 1 splits
- CART considers all possible splits based on a categorical predictor unless the number of categories exceeds 15

     Left      Right
  1  A         B, C, D
  2  B         A, C, D
  3  C         A, B, D
  4  D         A, B, C
  5  A, B      C, D
  6  A, C      B, D
  7  A, D      B, C
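The enumeration can be sketched in a few lines of code (ours, illustrative): fixing one level on the left avoids counting each split twice and reproduces the 2^(K-1) - 1 count.

```python
from itertools import combinations

# Enumerate the distinct left/right splits of a categorical predictor's levels.
levels = ["A", "B", "C", "D"]
first, rest = levels[0], levels[1:]

splits = []
for size in range(len(rest) + 1):
    for combo in combinations(rest, size):
        left = {first, *combo}              # fix level A on the left to avoid duplicates
        right = set(levels) - left
        if right:                           # skip the degenerate "everything left" case
            splits.append((sorted(left), sorted(right)))

for k, (left, right) in enumerate(splits, 1):
    print(k, left, "|", right)
print("total:", len(splits))                # 7 for K = 4, i.e. 2**(K-1) - 1
```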

Page 24: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 24

Binary Target Shortcut

When the target is binary, special algorithms reduce compute time to linear in the number of levels (K-1 splits instead of 2^(K-1) - 1)
- 30 levels takes twice as long as 15, not 10,000+ times as long

When you have high-level categorical variables in a multi-class problem:
- Create a binary classification problem
- Try different definitions of DPV (which groups to combine)
- Explore the predictor groupings produced by CART
- From a study of all results, decide which are most informative
- Create a new grouped predictor variable for the multi-class problem

Page 25: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 25

GYM Model Example

The original data come from a financial institution and are disguised as a health club data set

Problem: we need to understand a market research clustering scheme

The clusters were created externally using 18 variables and conventional clustering software

We need to find simple rules to describe cluster membership

A CART tree provides a nice way to arrive at an intuitive story

Page 26: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 26

Variable Definitions

ID        Identification # of member (not used)
CLUSTER   Cluster assigned from clustering scheme (10-level categorical coded 1-10) (target)
ANYRAQT   Racquet ball usage (binary indicator coded 0, 1)
ONRCT     Number of on-peak racquet ball uses
ANYPOOL   Pool usage (binary indicator coded 0, 1)
ONPOOL    Number of on-peak pool uses
PLRQTPCT  Percent of pool and racquet ball usage
TPLRCT    Total number of pool and racquet ball uses
ONAER     Number of on-peak aerobics classes attended
OFFAER    Number of off-peak aerobics classes attended
SAERDIF   Difference between number of on- and off-peak aerobics visits
TANNING   Number of visits to tanning salon
PERSTRN   Personal trainer (binary indicator coded 0, 1)
CLASSES   Number of classes taken
NSUPPS    Number of supplements/vitamins/frozen dinners purchased
SMALLBUS  Small business discount (binary indicator coded 0, 1)
OFFER     Terms of offer
IPAKPRIC  Index variable for package price
MONFEE    Monthly fee paid
FIT       Fitness score
NFAMMEN   Number of family members
HOME      Home ownership (binary indicator coded 0, 1)

ID is not used; CLUSTER is the target; all remaining variables are predictors

Page 27: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 27

Initial Runs

- Dataset: gymsmall.syd
- Set the target and predictors
- Set a 50% test sample
- Navigate the largest, smallest, and optimal trees
- Check [Splitters] and [Tree Details]
- Observe the logic of pruning
- Check [Summary Reports]
- Understand competitors and surrogates
- Understand variable importance

Page 28: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 28

Pruning Multiple Nodes

Note that Terminal Node 5 has class assignment 1

All the remaining nodes have class assignment 2

Pruning merges Terminal Nodes 5 and 6 thus creating a “chain reaction” of redundant split removal

Page 29: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 29

Pruning Weaker Split First

Eliminating nodes 3 and 4 results in "lesser" damage: losing one correct dominant-class assignment

Eliminating nodes 1 and 2, or nodes 5 and 6, would result in losing one correct minority-class assignment

Page 30: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 30

Understanding Variable Importance

Both FIT and CLASSES are reported as equally important

Yet the tree only includes CLASSES

ONPOOL is one of the main splitters, yet it has very low importance

We want to understand why

Page 31: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 31

Competitors and Surrogates

Node details for the root node split are shown above
- Improvement measures class separation between the two sides
- Competitors are the next best splits in the given node, ordered by improvement value
- Surrogates are the splits most closely resembling (case by case) the main split
- Association measures the degree of resemblance
- Looking for competitors is free; finding surrogates is expensive
- It is possible to have an empty list of surrogates

FIT is a perfect surrogate (and competitor) for CLASSES

Page 32: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 32

The Utility of Surrogates

Suppose that the main splitter is INCOME and it comes from a survey form

People with very low or very high income are likely not to report it: informative missingness in both tails of the income distribution

The splitter itself wants to separate low income on the left from high income on the right; hence it is unreasonable to treat missing INCOME as a separate unique value (the usual way to handle such a case)

Using a surrogate helps resolve the ambiguity and redistribute the missing-income records between the two sides

For example, all low-education subjects will join the low-income side while all high-education subjects will join the high-income side

Page 33: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 33

Competitors and Surrogates Are Different

By symmetry, both splits must result in identical improvements

Yet one split cannot be substituted for the other
- Hence Split X and Split Y are perfect competitors (same improvement) but very poor surrogates (low association)
- The reverse does not hold: a perfect surrogate is always a perfect competitor

(Diagram) The parent node contains three nominal classes A, B, C, equally present (equal priors, unit costs, etc.):
- Split X: A and B go left, C goes right
- Split Y: A and C go left, B goes right

Page 34: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 34

Calculating Association

Ten observations in a node

The main splitter is X; Split Y mismatches the main splitter on 2 observations

The default rule sends all cases to the dominant side, mismatching 3 observations

(Diagram: the ten cases as routed by Split X, Split Y, and the Default rule)

Association = (Default − Split Y) / Default = (3 − 2) / 3 = 1/3
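A small sketch (ours) of this computation; the two routing vectors are made up so that the surrogate disagrees with the main splitter on 2 of the 10 cases while the default rule misses 3.

```python
# Association of a candidate surrogate with the main splitter:
# (default mismatches - surrogate mismatches) / default mismatches.
main_split      = ["L", "L", "L", "L", "L", "L", "L", "R", "R", "R"]   # splitter X: 7 left, 3 right
surrogate_split = ["L", "L", "L", "L", "L", "R", "L", "R", "R", "L"]   # split Y: disagrees on 2 cases

dominant = max(set(main_split), key=main_split.count)                  # default rule sends all cases here
default_mismatch = sum(side != dominant for side in main_split)                 # 3
surrogate_mismatch = sum(a != b for a, b in zip(main_split, surrogate_split))   # 2

association = (default_mismatch - surrogate_mismatch) / default_mismatch
print(association)   # (3 - 2) / 3 = 0.333...
```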

Page 35: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 35

Notes on Association

- The measure is local to a node and non-parametric in nature
- Association is negative for splits with very poor resemblance; it is not bounded below by -1
- Any positive association is useful (a better strategy than the default rule)
- Hence surrogates are defined as the splits with the highest positive association
- In calculating case mismatches, check the possibility of split reversal (if the rule is met, send cases to the right)
- CART marks straight splits with "s" and reverse splits with "r"

Page 36: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 36

Calculating Variable Importance

- List all main splitters as well as all surrogates, along with the resulting improvements
- Aggregate the list by variable, accumulating improvements
- Sort the result in descending order
- Divide each entry by the largest value and scale to 100%
- The resulting importance list is based exclusively on the given tree; if a different tree is grown, the importance list will most likely change

Variable importance reveals a deeper structure, in stark contrast to a casual look at the main splitters
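A sketch (ours) of this aggregation; the split records below are invented solely to show the mechanics, with CLASSES and FIT receiving nearly identical totals as in the GYM example.

```python
from collections import defaultdict

# Each record: (variable, improvement), covering every main splitter AND every surrogate.
split_records = [
    ("CLASSES", 0.40), ("FIT", 0.40),       # root node: main splitter and a near-perfect surrogate
    ("ONPOOL", 0.05), ("TANNING", 0.03),    # deeper nodes contribute much smaller improvements
    ("CLASSES", 0.02),
]

totals = defaultdict(float)
for var, improvement in split_records:
    totals[var] += improvement                        # accumulate improvements per variable

top = max(totals.values())
importance = {var: 100.0 * value / top                # scale so the best variable reads 100%
              for var, value in sorted(totals.items(), key=lambda kv: -kv[1])}
print(importance)    # CLASSES 100.0, FIT ~95.2, ONPOOL ~11.9, TANNING ~7.1
```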

Page 37: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 37

Mystery Solved

In our example the improvements in the root node dominate the whole tree

Consequently, the variable importance list reflects the surrogates table of the root node

Page 38: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 38

Penalizing Individual Variables

We use the variable penalty control to penalize CLASSES a little bit

This is enough to ensure that FIT takes the main splitter position

The improvement associated with CLASSES is now downgraded by a factor of one minus the penalty

Page 39: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 39

Penalizing Missing and Categorical Predictors

Since splits on variables with missing values are evaluated on a reduced set of observations, such splits may have an unfair advantage

Introducing a systematic penalty set to the proportion of the variable that is missing effectively removes the possible bias

The same holds for categorical variables with a large number of distinct levels

We recommend that the user always set both controls to 1

Page 40: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 40

Splitting Rules - Introduction

The splitting rule controls the tree-growing stage

At each node, a finite pool of all possible splits exists
- The pool is independent of the target variable; it is based solely on the observations (and the corresponding predictor values) in the node

The task of the splitting rule is to rank all splits in the pool according to their improvements
- The winner becomes the main splitter
- The top runners-up become the competitors

It is possible to construct different splitting rules emphasizing various approaches to the definition of the "best split"

Page 41: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 41

Priors Adjusted Probabilities

Assume K target classes

Introduce prior probabilities πi, i = 1, …, K

Assume the number of LEARN observations within each class is Ni, i = 1, …, K

Consider node t containing ni, i = 1, …, K LEARN records of each target class

Then the priors-adjusted probability that a case lands in node t (the node probability) is:
    p(t) = Σj (nj / Nj) πj

The conditional probability of class i given that a case reached node t is:
    pi(t) = (ni / Ni) πi / p(t)

The conditional node probability of a class is called the within-node probability and is used as an input to the splitting rule
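A short sketch (ours) of these two formulas, using PRIORS EQUAL on a binary target; the node counts are those of Terminal Node A in the heart-attack tree (6 SURVIVE, 14 DEAD) with root totals 178 and 37.

```python
# Priors-adjusted node probability p(t) and within-node probabilities p_i(t).
priors      = {"SURVIVE": 0.5, "DEAD": 0.5}    # pi_i, here PRIORS EQUAL
class_total = {"SURVIVE": 178, "DEAD": 37}     # N_i: LEARN counts in the root
node_count  = {"SURVIVE": 6,   "DEAD": 14}     # n_i: LEARN counts in node t

# p(t) = sum_j (n_j / N_j) * pi_j  -- probability that a case lands in node t
p_t = sum(node_count[c] / class_total[c] * priors[c] for c in priors)

# p_i(t) = (n_i / N_i) * pi_i / p(t) -- within-node probability of class i
within = {c: node_count[c] / class_total[c] * priors[c] / p_t for c in priors}
print(p_t, within)   # the within-node probabilities sum to 1
```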

Page 42: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 42

GINI Splitting Rule

Introduce the node t impurity as
    I(t) = 1 − Σi pi(t)² = Σ(i≠j) pi(t) pj(t)
- Observe that a pure node has zero impurity
- The impurity function is convex
- The impurity function is maximized when all within-node probabilities are equal

Consider any split and let p_left be the probability that a case goes left given that we are in the parent node, and p_right the probability that a case goes right:
    p_left = p(t_left) / p(t_parent)   and   p_right = p(t_right) / p(t_parent)

Then the GINI split improvement is defined as:
    Δ = I(t_parent) − p_left I(t_left) − p_right I(t_right)
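A compact sketch (ours) of the improvement formula, with made-up within-node probabilities for a parent node and its two children.

```python
def gini_impurity(probs):
    """I(t) = 1 - sum_i p_i(t)^2, from the within-node class probabilities."""
    return 1.0 - sum(p * p for p in probs)

# Illustrative within-node probabilities and left/right node probabilities.
parent, left, right = [0.5, 0.5], [0.9, 0.1], [0.1, 0.9]
p_left, p_right = 0.5, 0.5

improvement = (gini_impurity(parent)
               - p_left * gini_impurity(left)
               - p_right * gini_impurity(right))
print(improvement)   # 0.5 - 0.5*0.18 - 0.5*0.18 = 0.32
```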

Page 43: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 43

Illustration

(Figure: the GINI impurity function for a 3-class target plotted over the probability simplex; the maximum is reached at the equal mix (0.33, 0.33))

Page 44: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 44

GINI Properties

The impurity reduction is weighted according to the size of the node upon which it operates

The GINI splitting rule sequentially minimizes the overall tree impurity (defined as the sum over all terminal nodes of the node impurity times the node probability)

The splitting rule favors end-cut splits (where one class is separated from the rest), which follows from its mathematical expression

Page 45: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 45

Example

GYM cluster example, using 10-level target

GINI splitting rule separates classes 3 and 8 from the rest in the root node

Page 46: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 46

ENTROPY Splitting Rule

Similar in spirit to GINI
- The only difference is in the definition of the impurity function:
    I(t) = − Σi pi(t) log pi(t)
- The function is convex, zero for pure nodes, and maximized for an equal mix of within-node probabilities

(Figure: normalized impurity on a binary target as a function of the class probability, comparing the GINI and ENTROPY curves)

Because of the different shape, a tree may evolve differently
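The two curves from the figure can be compared numerically with the short sketch below (ours); each impurity is normalized by its value at p = 0.5 so both peak at 1.

```python
import math

def gini(p):
    return 2.0 * p * (1.0 - p)                 # 1 - p^2 - (1-p)^2

def entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

# Normalize each curve by its maximum (at p = 0.5) to mirror the slide's plot.
for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    print(f"p={p:4.2f}  GINI={gini(p) / gini(0.5):5.3f}  ENTROPY={entropy(p) / entropy(0.5):5.3f}")
```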

Page 47: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 47

Example

GYM cluster example, using the 10-level target

The ENTROPY splitting rule separates six classes on the left from four classes on the right in the root node

Page 48: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 48

TWOING Splitting Rule

Group the K target classes into two super-classes in the parent node according to the rule:
    C1 = { i : pi(tleft) ≥ pi(tright) }
    C2 = { i : pi(tleft) < pi(tright) }

Compute the GINI improvement assuming that we had a binary target defined by the super-classes above

This usually results in more balanced splits

Target classes of a "similar" nature will tend to group together
- e.g., all minivans are separated from the pickup trucks, ignoring the fine differences within each group
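A sketch (ours) of the super-class construction for one candidate split; the within-node probabilities of the two children are invented for illustration.

```python
# Twoing: group classes into two super-classes for a candidate split, then score
# the split with GINI as if the target were binary (super-class 1 vs super-class 2).
p_left  = {"A": 0.45, "B": 0.35, "C": 0.15, "D": 0.05}   # within-node probs, left child
p_right = {"A": 0.10, "B": 0.20, "C": 0.40, "D": 0.30}   # within-node probs, right child

super1 = [c for c in p_left if p_left[c] >= p_right[c]]   # classes more prevalent on the left
super2 = [c for c in p_left if p_left[c] <  p_right[c]]   # classes more prevalent on the right
print(super1, super2)    # ['A', 'B'] ['C', 'D'] -- then compute the GINI improvement
                         # for the binary target "super1 vs super2"
```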

Page 49: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 49

Example

GYM cluster example, using the 10-level target

The TWOING splitting rule separates five classes on the left from five classes on the right in the root node, thus producing a very balanced split

Page 50: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 50

Best Possible Split

Suppose that we have a K-class multinomial problem, with pi the within-node probabilities in the parent node and p_left the probability of the left direction

Further assume that we have the largest possible pool of binary splits available (say, using a categorical splitter with all unique values)

Question: what is the resulting best split under each of the splitting rules?
- GINI: separate the major class j from the rest, where pj = maxi(pi) (the end-cut approach!)
- TWOING and ENTROPY: group classes together so that |p_left − 0.5| is minimized (the balanced approach!)

Page 51: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 51

Example

GINI trees usually grow narrow and deep, while TWOING trees usually grow shallow and wide

(Diagrams) Parent node with classes A 40, B 30, C 20, D 10:
- GINI best split: A 40 | B 30, C 20, D 10 (end-cut)
- TWOING best split: A 40, D 10 | B 30, C 20 (balanced 50/50)

Page 52: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 52

Symmetric GINI

One possible way to introduce the cost matrix actively into the splitting rule:
    I(t) = Σ(i,j) Cij pi(t) pj(t)
where Cij stands for the cost of class i being misclassified as class j

Note that the costs are automatically symmetrized, rendering the rule less useful for nominal targets with asymmetric costs

The rule is most useful when handling an ordinal outcome with costs proportional to the rank distance between the levels

Another "esoteric" use is to cancel asymmetric costs during the tree-growing stage for some reason while allowing asymmetric costs during pruning and model selection

Page 53: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 53

Ordered TWOING

Assume an ordered categorical target

Original twoing ignores the order of classes during the creation of super-classes (a strictly nominal approach)

Ordered twoing restricts super-class creation to ordered sets of levels only; the rest remains identical to twoing

The rule is most useful when an ordered separation of target classes is desirable
- But recall that the splitting rule has no control over the pool of splits available; it is only used to evaluate and suggest the best possible split, so the "magic" of the ordered twoing splitting rule is somewhat limited

Using symmetric GINI with properly introduced costs usually provides more flexibility in this case

Page 54: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 54

Class Probability Rule

During the pruning stage, any pair of adjacent terminal nodes with identical class assignments is automatically eliminated

This is because by default CART builds a classifier, and keeping such a pair of nodes would contribute nothing to the model's classification accuracy

In reality, we might be interested in seeing such splits
- For example, having a "very likely" responder node next to a "likely" responder node could be informative

The class probability splitting rule does exactly that
- However, it comes at a price: for convoluted theoretical reasons, optimal class probability trees become over-pruned
- Therefore, only experienced CART users should try this

Page 55: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 55

Example

Page 56: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 56

Favor Even Splits Control

It is possible to further refine splitting-rule behavior by introducing the "Favor Even Splits" parameter (a.k.a. Power)

This control is most sensitive with the twoing splitting rule and was actually "inspired" by it

We recommend that the user always consider at least the following minimal set of modeling runs:
- GINI with Power=0
- ENTROPY with Power=0
- TWOING with Power=1

BATTERY RULES can be used to accomplish this automatically

Page 57: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 57

Fundamental Tree Building Mechanisms

Splitting rules have a somewhat limited impact on the overall tree structure, especially with binary targets

We are about to introduce a powerful set of PRIORS and COSTS controls that have profound implications for all stages of tree development and evaluation

These controls lie at the core of successful model-building techniques

Page 58: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 58

Key Players

Using historical data, build a model for the appropriately chosen sets of priors and costs

(Diagram) The Analyst chooses the priors πi and the costs Cij; a Dataset is drawn from the Population; the DM Engine combines the Dataset, priors, and costs to produce the Model

Page 59: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 59

Fraud Example

There are 1909 observations with 42 outcomes of interest

Having such an extreme class distribution poses a lot of modeling challenges

The effect of key control settings is usually magnified on such data

We will use this dataset to illustrate the workings of priors and costs below

Page 60: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 60

Fraud Example – PRIORS EQUAL (0.5, 0.5)

- High accuracy at 93%
- Richest node has 37%
- Small relative error: 0.145
- Low class assignment threshold at 2.2%

Page 61: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 61

Fraud Example – PRIORS SPECIFY (0.9, 0.1)

- Reasonable accuracy at 64%
- Richest node has 57%
- Larger relative error: 0.516
- Class assignment threshold around 15%

Page 62: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 62

Fraud Example – PRIORS DATA (0.98, 0.02)

- Poor accuracy at 26%
- Richest node has 91%
- High relative error: 0.857
- Class assignment threshold at 50%

Page 63: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 63

Internal Class Assignment Rule

Recall the definition of the within-node probability:
    pi(t) = (ni / Ni) πi / p(t)

The internal class assignment rule is always the same: assign the node to the class with the highest pi(t)

For PRIORS DATA this simplifies to the simple node majority rule: pi(t) ∝ ni

For PRIORS EQUAL this simplifies to assigning the class with the highest lift with respect to its root node presence: pi(t) ∝ ni / Ni
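The difference between the two simplified rules can be seen in a small sketch (ours); the root totals are those of the heart-attack example, while the node counts are hypothetical and chosen so that the two rules disagree.

```python
# Node class assignment: majority rule (PRIORS DATA) vs highest lift (PRIORS EQUAL).
root_counts = {"SURVIVE": 178, "DEAD": 37}    # N_i in the root (heart-attack example)
node_counts = {"SURVIVE": 30,  "DEAD": 10}    # n_i in some node (hypothetical)

data_winner  = max(node_counts, key=lambda c: node_counts[c])                    # p_i(t) ~ n_i
equal_winner = max(node_counts, key=lambda c: node_counts[c] / root_counts[c])   # p_i(t) ~ n_i / N_i

print(data_winner, equal_winner)   # SURVIVE vs DEAD: the two priors settings disagree here
```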

Page 64: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 64

Putting It All Together

As the prior on a given class increases:
- The class assignment threshold decreases
- Node richness goes down
- But class accuracy goes up

PRIORS EQUAL uses the root node class ratio as the class assignment threshold; hence, the most favorable conditions for building a tree

PRIORS DATA uses the majority rule as the class assignment threshold; hence, difficult modeling conditions on unbalanced classes

Page 65: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 65

The Impact of Priors

We have a mixture of two overlapping classes

The vertical lines show the root node splits for different sets of priors (the left child is classified as red, the right child is classified as blue)

Varying the priors provides effective control over the tradeoff between class purity and class accuracy

(Figure: splits shown for r=0.9, b=0.1; r=0.5, b=0.5; and r=0.1, b=0.9)

Page 66: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 66

Model Evaluation

Knowing the model's Prediction Success (PS) table on a given dataset, calculate the Expected Cost for the given cost matrix, assuming the prior distribution of classes in the population

(Diagram) The Analyst supplies the priors πi, the costs Cij, and the model PS table (cells nij); the Evaluator combines these with the Dataset drawn from the Population to produce the Expected Cost

Page 67: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 67

Estimating Expected Cost

The following estimator is used:
    R = Σ(i,j) Cij (nij / Ni) πi

For PRIORS DATA with unit costs (πi = Ni / N):
    R = [ Σ(i≠j) nij ] / N, the overall misclassification rate

For PRIORS EQUAL with unit costs (πi = 1 / K):
    R = Σi [ (Σ(j≠i) nij) / Ni ] / K, the average within-class misclassification rate

The relative cost is computed by dividing by the expected cost of the trivial classifier (every case is assigned to the dominant cost-adjusted class)
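The estimator can be sketched directly in code (ours). The cell counts below are reconstructed from the PRIORS EQUAL fraud run reported on the relative cost slide below (138 of 1867 good cases and 3 of 42 fraud cases misclassified); the class labels are illustrative and unit costs are assumed.

```python
# Expected cost R = sum_ij C_ij * (n_ij / N_i) * pi_i for a two-class PS table.
counts = {                                  # n_ij: true class -> predicted class -> count
    "good":  {"good": 1867 - 138, "fraud": 138},
    "fraud": {"good": 3,          "fraud": 42 - 3},
}
N = {c: sum(counts[c].values()) for c in counts}            # N_i: 1867 and 42

def expected_cost(priors, costs=None):
    costs = costs or {(i, j): float(i != j) for i in counts for j in counts}  # unit costs
    return sum(costs[(i, j)] * counts[i][j] / N[i] * priors[i]
               for i in counts for j in counts)

print(expected_cost({"good": 0.5, "fraud": 0.5}))              # PRIORS EQUAL: ~0.073 (7.3%)
print(expected_cost({c: N[c] / sum(N.values()) for c in N}))   # PRIORS DATA on the same table
```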

Page 68: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 68

PRIORS DATA – Relative Cost

Expected cost: R = (31 + 5) / 1909 = 36 / 1909 = 1.9%

Trivial model: R0 = 42 / 1909 = 2.2%
- This is already a very small error value

Relative cost: Rel = R / R0 = 36 / 42 = 0.857
- The relative cost will be less than 1.0 only if the total number of cases misclassified is below 42, a very difficult requirement

Page 69: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 69

PRIORS EQUAL – Relative Cost

Expected cost: R = (138/1867 + 3/42) / 2 = 7.3%

Trivial model: R0 = (0/1867 + 42/42) / 2 = 50%
- A very large error value, independent of the initial class distribution

Relative cost: Rel = R / R0 = 138/1867 + 3/42 = 0.145
- The relative cost will be less than 1.0 for a wide variety of modeling outcomes
- The minimum usually tends to equalize the individual class accuracies

Page 70: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 70

Using Equivalent Costs

Setting the ratio of costs to mimic the ratio of the DATA priors reproduces the PRIORS EQUAL run, as seen both from the tree structure and from the relative error value

On binary problems both mechanisms produce an identical range of models

Page 71: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 71

Automated CART Runs

Starting with CART 6.0 we have added a new feature called batteries

A battery is a collection of runs obtained via a predefined procedure

A trivial set of batteries varies one of the key CART parameters:
- RULES – run every major splitting rule
- ATOM – vary the minimum parent node size
- MINCHILD – vary the minimum node size
- DEPTH – vary the tree depth
- NODES – vary the maximum number of nodes allowed
- FLIP – flip the learn and test samples (two runs only)

Page 72: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 72

More Batteries

- DRAW – each run uses a randomly drawn learn sample from the original data
- CV – vary the number of folds used in cross-validation
- CVR – repeated cross-validation, each time with a different random number seed
- KEEP – select a given number of predictors at random from the initial larger set
  - Supports a core set of predictors that must always be present
- ONEOFF – one-predictor model for each predictor in the given list

The remaining batteries are more advanced and will be discussed individually

Page 73: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 73

Battery MCT

The primary purpose is to test for possible model overfitting by running a Monte Carlo simulation:
- Randomly permute the target variable
- Build a CART tree
- Repeat a number of times

The resulting family of profiles measures the intrinsic overfitting due to the learning process

Compare the actual run profile to the family of random profiles to check the significance of the results

This battery is especially useful when the relative error of the original run exceeds 90%
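The permutation idea can be sketched with any tree learner; the code below (ours) uses scikit-learn's DecisionTreeClassifier as a generic stand-in for a CART-style tree, and the function name mct_profile is made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier   # generic stand-in for a CART-style tree

def mct_profile(X, y, X_test, y_test, n_permutations=20, seed=0):
    """Compare the real test error with errors obtained after permuting the target."""
    rng = np.random.default_rng(seed)
    real_error = 1.0 - DecisionTreeClassifier(max_depth=5).fit(X, y).score(X_test, y_test)

    permuted_errors = []
    for _ in range(n_permutations):
        y_perm = rng.permutation(y)                          # destroy any real signal
        tree = DecisionTreeClassifier(max_depth=5).fit(X, y_perm)
        permuted_errors.append(1.0 - tree.score(X_test, y_test))

    # If the model captures real structure, real_error should sit well below
    # the family of permuted errors.
    return real_error, permuted_errors
```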

Page 74: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 74

Battery MVI

Designed to check the impact of missing value indicators (MVIs) on model performance:
- Build a standard model; missing values are handled via the mechanism of surrogates
- Build a model using MVIs only – this measures the amount of predictive information contained in the missing value indicators alone
- Combine the two
- Same as above with missing-value penalties turned on

Page 75: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 75

Battery PRIOR

Vary the prior for the specified class within a given range
- For example, vary the prior on class 1 from 0.2 to 0.8 in increments of 0.02

This results in a family of models with drastically different performance in terms of class richness and accuracy (see the earlier discussion)

Can be used for hot spot detection

Can also be used to construct a very efficient voting ensemble of trees

Page 76: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 76

Battery SHAVING

Automates the variable selection process

Shaving from the top – at each step the most important variable is eliminated
- Capable of detecting and eliminating "model hijackers": variables that appear to be important on the learn sample but in general may hurt ultimate model performance (for example, an ID variable)

Shaving from the bottom – at each step the least important variable is eliminated
- May drastically reduce the number of variables used by the model

SHAVING ERROR – at each iteration all current variables are tried for elimination one at a time to determine which predictor contributes the least
- May result in a very long sequence of runs (quadratic complexity)

Page 77: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 77

Battery SAMPLE

Build a series of models in which the learn sample is systematically reduced

Used to determine the impact of the learn sample size on the model performance

Can be combined with BATTERY DRAW to get additional spread

Page 78: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 78

Battery TARGET

Attempt to build a model for each variable in the given list as a target using the remaining variables as predictors

Very effective in capturing inter-variable dependency patterns

Can be used as an input to variable clustering and multi-dimensional scaling approaches

When run on MVIs only, can reveal missing value patterns in the data

Page 79: Advanced cart 2007

Introduction to CART® Salford Systems ©2007 79

Recommended Reading

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Pacific Grove: Wadsworth.

Breiman, L. (1996). Bagging Predictors. Machine Learning, 24, 123-140.

Breiman, L. (1996). Stacked Regressions. Machine Learning, 24, 49-64.

Friedman, J. H., Hastie, T., and Tibshirani, R. (1998). Additive Logistic Regression: A Statistical View of Boosting.