Copyright © SAS Institute Inc. All rights reserved.
Ensemble Models and Partitioning Algorithms in SAS® Enterprise Miner™
Goals
• Introduce ensemble models
• Increase awareness of capabilities in SAS® Enterprise Miner™ supporting ensemble modeling
• Share resources for learning more
SAS Enterprise Miner: Ensemble Models and Partitioning Algorithms
• Ensemble Models
• Decision Trees
• Perturb and Combine
• Bagging
• Boosting
• Gradient Boosting
• Random Forests (SAS 9.4)
• Ensemble Forests
• Stacked Ensembles
“Wisdom of the Crowd”
Experiment: How Many Jelly Beans Are in the Jar?
Ensemble Models
Ensemble Modeling
Introduction
Two or more predictive models combined to create a potentially more accurate model
Works better when model predictions are uncorrelated
“Wisdom of the crowd” – Aristotle (‘Politics’)
The collective wisdom of many is likely more accurate than that of any one individual
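The deck has no code, but the averaging intuition is easy to demonstrate. A minimal Python sketch (all numbers invented; the "models" are just a true value plus independent noise) showing that averaging uncorrelated predictions lowers the expected error:

```python
import random

random.seed(0)
truth = 10.0

# Three hypothetical "models": each predicts the truth plus its own
# independent (uncorrelated) noise.
def model_prediction():
    return truth + random.gauss(0, 2.0)

trials = 10_000
single_err = 0.0
ensemble_err = 0.0
for _ in range(trials):
    preds = [model_prediction() for _ in range(3)]
    single_err += abs(preds[0] - truth)              # one model alone
    ensemble_err += abs(sum(preds) / 3 - truth)      # average of three

# Averaging uncorrelated predictions shrinks the expected error.
print(single_err / trials > ensemble_err / trials)  # True
```

When the models' errors are correlated, the averaging benefit shrinks, which is why the slide stresses uncorrelated predictions.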
Ensemble Modeling: Applications
http://www.nhc.noaa.gov/
Ensemble Modeling: Approaches to Build Models
Different algorithms
• Example: Decision Tree + SVM + Neural Network
One algorithm, different configurations
• Example: Various configurations of Neural Networks
One algorithm, different data samples
• Example: Random Forest, Gradient Boosting
Combine Models
Build Predictive Models
Ensemble Modeling: An Ensemble Model Is a Combination of Multiple Models.
Ensemble Modeling: Approaches to Combine Models
• Averaging or Voting
• Stacking/Blending
• Cluster-based selection
Ensemble Modeling: Approaches to Combine Models
• Averaging or Voting
• Stacking/Blending
• Cluster-based selection
Averaging example: combine a Decision Tree, an SVM, and a Neural Network by averaging their predictions, P = (P1 + P2 + P3) / 3.
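A tiny illustrative Python sketch of the averaging and voting combinations above (the posterior values are made up; this is not SAS Enterprise Miner code):

```python
# Hypothetical posterior probabilities (class 1) from three classifiers.
p_tree, p_svm, p_nnet = 0.8, 0.6, 0.7

# Averaging: the ensemble posterior is the mean of the individual posteriors.
p_avg = (p_tree + p_svm + p_nnet) / 3

# Voting: each model votes for the class its posterior favors.
votes = sum(p >= 0.5 for p in (p_tree, p_svm, p_nnet))
vote_class = 1 if votes >= 2 else 0

print(round(p_avg, 2), vote_class)  # 0.7 1
```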
Ensemble Models
The result of combining models can sometimes lead to a more accurate model.*
* It is important to note that the ensemble model can be more accurate than the individual models only if the individual models disagree with one another.
Trade-Off
Decision Trees
Decision Trees: What Is It?
• Linear separation of data using "if-then-else" logic
• Separation is performed via an exhaustive search of splitting points for each variable.
• Many architectural variations build on this basic approach.
• Users might refer to them as
• CHAID Trees
• CART Trees
• C4.5 Trees
• C5.0 Trees
• Each of these is simply a variation on the tree architecture.
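As an illustrative sketch (plain Python with toy data, not SAS Enterprise Miner), the exhaustive split search for one interval input can look like this:

```python
def gini(labels):
    # Gini impurity for binary labels (0/1).
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(x, y):
    # Exhaustive search: try the midpoint between every pair of adjacent
    # sorted values and keep the split with the lowest weighted impurity.
    pairs = sorted(zip(x, y))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val < cut]
        right = [lab for val, lab in pairs if val >= cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score

x = [1, 2, 3, 10, 11, 12]   # one interval input (invented)
y = [0, 0, 0, 1, 1, 1]      # binary target
print(best_split(x, y))     # (6.5, 0.0): a perfect "if x < 6.5" rule
```

A full tree applies this search to every input variable at every node and splits on the winner, which is what makes training both exhaustive and fast in practice.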
Decision Tree
Easy to Visualize
Decision Trees: Decision Rules
Node = 9
if Saving Balance < 2500 or MISSING
AND Money Market Balance >= 7000
then
Tree Node Identifier = 9
Number of Observations = 917.56099466
Predicted: INS=1 = 0.05
Predicted: INS=0 = 0.95
Analytics Life Cycle: Decision Trees Can Help in Various Stages.
Formulate Problem
Data Preparation
Data Exploration
Transform & Select
Develop Models
Validate Models
Deploy Model
Evaluate & Monitor Model
Decision Trees: Uses
• Data exploration
• Generating business rules
• Segmentation
• Missing value imputation
• Variable transformation and variable selection
• Predictive models
• Comparison Model
• Test decision trees versus regression to determine whether there are two (or more) different populations in the data and possibly two models need to be built.
Decision Trees: Multivariate Step Function
Decision Trees: Advantages
• Fast training time
• Can handle outliers and missing values
• Simple to interpret
• Simple to deploy models into production
• Wide range of uses (models, fix missing values, variable selection, and so on)
• Consistently gives the same accuracy when data changes.
Decision Trees: Disadvantages
• Coarse segmenting (everybody in the same leaf gets same prediction).
• Small change in data can result in a completely different looking tree.
• Highly linear. It is difficult to discover non-linear transformations and multi-factor interactions.
Ensemble Trees
Decision Trees: Instability
Disadvantage? Or "Feature to Exploit"?
Small change in data can result in a completely different looking tree.
Ensemble Trees
Perturb and Combine (P&C) methods generate multiple models by manipulating the distribution of the data or altering the construction method and then averaging the results.
Bagging
Bagging
Bagging (bootstrap aggregation) is the original P&C method (Breiman).
A bootstrap sample is a random sample of size n drawn from the training data with replacement.
• Some observations will be left out of the sample.
• Some observations will be represented more than once.
A tree is built on each sample.
Vote or average the posterior probabilities.
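A minimal Python sketch of the bagging steps above (a mean predictor stands in for the tree, and the data are invented):

```python
import random

random.seed(1)
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # invented training values

# A bootstrap sample: n draws from the training data WITH replacement,
# so some observations repeat and others are left out entirely.
def bootstrap(rows):
    return [random.choice(rows) for _ in rows]

# "Fit" one model per sample (a mean predictor stands in for a tree),
# then average the models' predictions.
models = [sum(s) / len(s) for s in (bootstrap(data) for _ in range(25))]
bagged_prediction = sum(models) / len(models)
print(round(bagged_prediction, 2))
```

Breiman's "back end that does the aggregation" (quoted later in the deck) is exactly the final averaging line.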
Bagging
Start and End Groups: SAS Enterprise Miner
Bagging: SAS Enterprise Miner
Ensemble Trees
“Bagging goes a long way towards making a silk purse out of a sow’s ear, especially if the sow’s ear is twitchy. It is a relatively easy way to improve an existing method, since all that needs adding is a loop in front that selects the bootstrap sample and sends it to the procedure and a back end that does the aggregation. What one loses, with the trees, is a simple and interpretable structure. What one gains is increased accuracy.”
Leo Breiman (1996)
Boosting
Boosting: Reweighted Sampling
Arcing (adaptive resampling and combining) methods sequentially perturb the training data based on the results of previous models.
Cases that are incorrectly classified are given more weight in subsequent models.
Boosting
Boosting
For the ith case, the arc-x4 weights are calculated as

p_i = (1 + m_i^4) / Σ_j (1 + m_j^4)

where 0 <= m_i <= k is the number of times that the ith case is misclassified in the preceding steps.
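A quick Python sketch of the arc-x4 weight formula (the misclassification counts are invented):

```python
# arc-x4: a case misclassified m times in the preceding steps gets weight
# proportional to 1 + m**4, normalized so the weights sum to 1.
def arc_x4_weights(misclass_counts):
    raw = [1 + m ** 4 for m in misclass_counts]
    total = sum(raw)
    return [r / total for r in raw]

# Four cases; the last has been misclassified twice so far.
w = arc_x4_weights([0, 0, 1, 2])
print([round(v, 3) for v in w])  # [0.048, 0.048, 0.095, 0.81]
```

The fourth power makes repeatedly missed cases dominate the resampling quickly, which is the point of adaptive resampling.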
Boosting: SAS Enterprise Miner
Comparison: Single, Bagged, and Boosted Trees
Gradient Boosting
Gradient Boosting: What Is It?
• A combination of several “decision trees.”
• Gradient boosting consists of a forest of small decision trees (“shrubs”, “stumps”).
• Each shrub is a poor predictor of the target by itself, but each subsequent shrub tries to fit the remaining error.
• Eventually converges to a good solution.
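A toy Python sketch of the fit-the-residuals idea (squared-error loss; the "shrub" here is just the residual mean, a stand-in for a small tree):

```python
# Toy targets; the "shrub" is the residual mean (a stand-in for a tiny
# tree fit to the residuals).
y = [3.0, 5.0, 8.0, 12.0]
shrinkage = 0.5
pred = [0.0] * len(y)

for _ in range(20):
    residuals = [t - p for t, p in zip(y, pred)]
    shrub = sum(residuals) / len(residuals)       # fit the weak learner
    pred = [p + shrinkage * shrub for p in pred]  # add its shrunken output

# With a mean-only shrub, every prediction converges to the target mean;
# a real tree would also capture per-leaf structure.
print([round(p, 2) for p in pred])  # [7.0, 7.0, 7.0, 7.0]
```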
Gradient Boosting: Uses
Any type of predictive model (often used in fraud and customer behavior analytics); widely used in search engine ranking and the general field of learning to rank
• Variable selection
• Comparison model
- Test gradient boosting versus regression to determine whether there are two different populations in the data and possibly two models need to be built.
Gradient Boosting
Example: the fitted model shown at iterations 0, 1, 10, 25, 50, 75, 100, 200, and 300
Gradient Boosting: Advantages
• Fast training time
• Can handle outliers and missing values
• Can handle complex functions
• Consistently gives the same accuracy when data changes.
Gradient Boosting: Disadvantages
• Can cause over-fitting
• Difficult to discover (visualize) non-linear transformations and multi-factor interactions
• Slightly more difficult to deploy the model into production
• “Black box” not easy to interpret
• Might not be legal to use in some industries (for example, consumer auto or credit)
Gradient Boosting: SAS Enterprise Miner
N Iterations: how many iterations occur
Shrinkage: how influential each iteration is
Maximum Depth: complexity of each individual model
Random Forests
Random Forest: What Is It?
• A combination of several “decision trees.”
• A random forest consists of a forest of fully trained decision trees (each with a random variation).
• The random forest averages the output of all the decision trees in the “forest.”
Random Forest: Uses
• Any type of predictive model; usually used in applications where a decision tree or gradient boosting model would be used
• Often used with big data
• Variable selection
• Comparison model
- Test random forests versus regression to determine whether there are two different populations in the data and possibly two models need to be built.
Random Forest: Algorithm
Select a number of trees in the random forest.
For each tree in the forest, use the following split algorithm:
• Select a random sample of data.
• Select a random subset of variables.
• Determine the best split from the sample of data and the sample of variables.
• Keep selecting random data and random subsets of variables until the maximum number of trees is trained.
When all the trees are built, the prediction is the average of all trees.
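The algorithm above, sketched in toy Python (one-split stumps stand in for full trees; the data, the mean-based cut, and the empty-leaf fallback are illustrative choices, not SAS Enterprise Miner behavior):

```python
import math
import random

random.seed(2)

# Toy rows: two inputs, binary target (all invented).
rows = [((1, 8), 0), ((2, 9), 0), ((3, 7), 0),
        ((9, 1), 1), ((8, 2), 1), ((7, 3), 1)]
n_inputs = 2
mtry = max(1, int(math.sqrt(n_inputs)))  # default: sqrt of the input count

def train_stump(sample):
    # Split search restricted to a random subset of mtry variables,
    # with a crude cut at the sample mean of the chosen variable.
    var = random.sample(range(n_inputs), mtry)[0]
    cut = sum(x[var] for x, _ in sample) / len(sample)
    left = [y for x, y in sample if x[var] < cut] or [0]   # empty-leaf fallback
    right = [y for x, y in sample if x[var] >= cut] or [0]
    l, r = sum(left) / len(left), sum(right) / len(right)
    return lambda x, v=var, c=cut, l=l, r=r: l if x[v] < c else r

# One tree per bootstrap sample of the rows.
forest = [train_stump([random.choice(rows) for _ in rows]) for _ in range(30)]

def forest_predict(x):
    return sum(tree(x) for tree in forest) / len(forest)  # average all trees

print(forest_predict((9, 1)) > 0.5, forest_predict((1, 8)) < 0.5)
```

Both the row sampling and the per-split variable sampling perturb the trees, which is why a forest varies more than plain bagging.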
Random Forest: Advantages
Fast training time for big data sets with lots of variables
• Can also determine a variable’s importance for predicting a target
Can handle outliers and missing values
Can handle complex functions
Consistently gives the same accuracy when data changes.
• Perturbs training data more than the bagging algorithm, producing more variation in the models.
• Ensembles of a more diverse set of trees often leads to improved predictive accuracy.
Random Forest: Disadvantages
• Difficult to discover (visualize) non-linear transformations and multi-factor interactions
• More difficult to deploy the model into production
• “Black box” not easy to interpret
• Might not be legal to use in some industries (for example, consumer auto or credit)
Random Forests: SAS Enterprise Miner
SAS Enterprise Miner 13.1, 13.2, 14.1, or 14.2 on SAS 9.4
Random Forests: SAS Enterprise Miner
Maximum Number of Trees: how many trees will be in the forest.
Sampling Strategy: specifies number of observations used in each sample and sampling method.
Number vars to consider in split search: how many input variables to consider when splitting each node.
(The default value is the square root of the number of inputs.)
Ensemble Forests
Ensemble Forests
What Is It?
• A combination of “decision trees”
• A collection of two or more decision trees where output is averaged
• Different from a random forest in that the trees are built one at a time by the analyst
• Much smaller than a random forest
• Slower to develop trees one at a time
Ensemble Forests
Uses
Any type of predictive model where decision trees are used
Comparison model
• Test ensemble forests versus regression to determine whether there are two (or more) different populations in the data and possibly two (or more) models need to be built.
Ensemble Forests
Algorithm
• Select a number of trees in the ensemble forest.
• Build two or more decision trees using different parameters so that trees are different from one another.
• When all the trees are built, the prediction is the average of all trees.
Ensemble Forests
Ensemble Forests
Advantages
• Fast training time
• Can handle outliers and missing values
• Can handle complex functions
• Consistently gives the same accuracy when data changes
Ensemble Forests
Disadvantages
• Difficult to discover (visualize) non-linear transformations and multi-factor interactions
• “Black box” not easy to interpret
• Might not be legal to use in some industries (for example, consumer auto or credit)
Ensemble: Modeling Algorithms
Creates new models by combining the posterior probabilities (for class targets) or the predicted values (for interval targets) from multiple predecessor models.
Three methods:
• Average
• Maximum
• Voting
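The three combination methods, sketched in Python for a single case with a binary target (the posterior values are invented; this illustrates the arithmetic, not the Ensemble node's implementation):

```python
# One case, binary target; invented posterior probabilities from three models.
posteriors = [0.9, 0.4, 0.6]

average = sum(posteriors) / len(posteriors)  # Average
maximum = max(posteriors)                    # Maximum
# Voting: the share of models whose posterior favors class 1.
voting = sum(p >= 0.5 for p in posteriors) / len(posteriors)

print(round(average, 2), maximum, round(voting, 2))  # 0.63 0.9 0.67
```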
Stacked Ensembles
Stacked Ensembles: What Is It?
• A variation on the Ensemble node
• Generate as many different models as you like
• Use the prediction values from those models as inputs to a new model
Stacked Ensembles: Algorithm
• Generate many different models on a training data set, each with predicted values or predicted probabilities for the target.
• Generate a metadata set that includes the predicted values or predicted probabilities from each model.
• Can also include the original input variables
• Run another modeling algorithm using the new metadata set (stacked ensemble) to predict the target.
• Stacked ensembles can be as complicated or simple as you want.
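A toy Python sketch of stacking (the level-1 "model" here is a crude accuracy-weighted average, a stand-in for a real second-stage algorithm; in practice the level-1 model should be trained on held-out predictions to avoid leaking the training data):

```python
# Metadata set: each row holds the level-0 models' predicted probabilities
# (all values invented) plus the actual target.
train_meta = [
    ((0.9, 0.8, 0.7), 1),
    ((0.2, 0.3, 0.1), 0),
    ((0.6, 0.7, 0.8), 1),
    ((0.4, 0.2, 0.3), 0),
]

# Level-1 "model": weight each level-0 model by its solo accuracy on the
# metadata set (a crude stand-in for fitting a regression or tree).
def accuracy_of_column(j):
    return sum((row[j] >= 0.5) == y for row, y in train_meta) / len(train_meta)

weights = [accuracy_of_column(j) for j in range(3)]
total = sum(weights)

def stacked_predict(p_tree, p_svm, p_nnet):
    probs = (p_tree, p_svm, p_nnet)
    return sum(w * p for w, p in zip(weights, probs)) / total

print(round(stacked_predict(0.8, 0.7, 0.9), 2))  # 0.8
```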
Stacked Ensembles
Stacked Ensembles: Metadata (Change Predictions to Inputs)
Stacked Ensembles: Decision Tree Output
Stacked Ensembles: Model Comparison Output
Stacked Ensembles: Score Code
Includes score code for all four original models and the final decision tree.
Stacked Ensembles
Advantages
• Enables us to use many different modeling algorithms
• Can handle outliers and missing values
• Can handle complex functions
• Consistently gives the same accuracy when data changes.
Stacked Ensembles
Disadvantages
• Difficult to discover (visualize) non-linear transformations and multi-factor interactions
• “Black box” not easy to interpret
• Might not be legal to use in some industries (for example, consumer auto or credit)
Review
SAS Enterprise Miner: Ensemble Models and Partitioning Algorithms
• Ensemble Models
• Decision Trees
• Perturb and Combine
• Bagging
• Boosting
• Gradient Boosting
• Random Forests (SAS 9.4)
• Ensemble Forests
• Stacked Ensembles
Resources
Learning More: SAS Resources
SAS Global Forum Papers:
• "Leveraging Ensemble Models in SAS® Enterprise Miner™" by Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller
• "Ensemble Modeling: Recent Advances and Application" by Wendy Czika, Miguel Maldonado, and Ye Liu
• "Stacked Ensemble Models for Improved Prediction Accuracy" by Funda Güneş, Russ Wolfinger, and Pei-Yi Ta
Blog: Why do stacked ensemble models win data science competitions?
Learning More
Decision Trees for Analytics Using SAS® Enterprise Miner™
By: Barry de Ville and Padraic Neville
ISBN: 978-1-61290-315-6
Copyright Date: July 2013
SAS Bookstore: https://www.sas.com/store/prodBK_63319_en.html
Available on Amazon
Learning More
Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions
By: Giovanni Seni and John Elder
ISBN-10: 1608452840
Publisher: Morgan and Claypool Publishers (February 24, 2010)
Available on Amazon
Learning More: Academic References
Bauer, E., and R. Kohavi. 1999. "An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants." Machine Learning 36:105-169.
Breiman, L. 1996. "Bagging Predictors." Machine Learning 24:123-140.
Breiman, L. 1998. "Arcing Classifiers (with discussion)." Annals of Statistics 26:801-849.
Breiman, L. 2001. "Random Forests." Machine Learning 45:5-32.
de Ville, Barry. 2006. Decision Trees for Business Intelligence and Data Mining: Using SAS Enterprise Miner. Cary, NC: SAS Institute Inc.
Friedman, Jerome H. 2001. "Greedy Function Approximation: A Gradient Boosting Machine." The Annals of Statistics 29:1189-1232.
Friedman, Jerome H. 2002. "Stochastic Gradient Boosting." Computational Statistics & Data Analysis 38:367-378.
SAS Online Community: communities.sas.com/data-mining
Questions?
Thank you for your time and attention!
Connect with me:
LinkedIn: https://www.linkedin.com/in/melodierush
Twitter: @Melodie_Rush