
Enterprise Miner: Applying Data Mining Techniques

Course Notes


Enterprise Miner™: Applying Data Mining Techniques Course Notes was written by Doug Wielenga, Bob Lucas, and Jim Georges. Additional contributions were made by Manya Eliadis and William Potts.

SAS INSTITUTE TRADEMARKS

The SAS System is an integrated system of software providing complete control over data access, management, analysis, and presentation. Base SAS software is the foundation of the SAS System. Products within the SAS System include SAS/ACCESS, SAS/AF, SAS/ASSIST, SAS/CALC, SAS/CONNECT, SAS/CPE, SAS/DMI, SAS/EIS, SAS/ENGLISH, SAS/ETS, SAS/FSP, SAS/GIS, SAS/GRAPH, SAS/IML, SAS/IMS-DL/I, SAS/INSIGHT, SAS/LAB, SAS/MDDB, SAS/NVISION, SAS/OR, SAS/PH-Clinical, SAS/QC, SAS/REPLAY-CICS, SAS/SESSION, SAS/SHARE, SAS/SPECTRAVIEW, SAS/STAT, SAS/TOOLKIT, SAS/TUTOR, SAS/DB2, SAS/GEO, SAS/IntrNet, SAS/PH-Kinetics, SAS/SECURE, SAS/SHARE*NET, SAS/SQL-DS, and SAS/Warehouse Administrator software. Other SAS Institute products are SYSTEM 2000 Data Management Software, with basic SYSTEM 2000, CREATE, Multi-User, QueX, Screen Writer, and CICS interface software; InfoTap software; JMP, JMP IN, JMP Serve, and StatView software; SAS/RTERM software; the SAS/C Compiler; Video Reality software; Warehouse Viewer software; Budget Vision, Campaign Vision, CFO Vision, Enterprise Miner, Enterprise Reporter, HR Vision, IT Charge Manager, and IT Service Vision software; Scalable Performance Data Server software; SAS OnlineTutor software; and Emulus software. MultiVendor Architecture, MVA, MultiEngine Architecture, MEA, Risk Dimension, and SAS inSchool are trademarks of SAS Institute Inc. SAS Institute also offers SAS Consulting and SAS Video Productions services.

Authorline, Books by Users, The Encore Series, ExecSolutions, JMPer Cable, Observations, SAS Communications, SAS.COM, SAS OnlineDoc, SAS Professional Services, the SASware Ballot, SelecText, and Solutions@Work documentation are published by SAS Institute Inc. The SAS Video Productions logo, the Books By Users SAS Institute's Author Service logo, the SAS Online Samples logo, and The Encore Series logo are registered service marks or registered trademarks of SAS Institute Inc. The Helplus logo, the SelecText logo, the Video Reality logo, the Quality Partner logo, the SAS Business Solutions logo, the SAS Rapid Warehousing Program logo, the SAS Publications logo, the Instructor-based Training logo, the Online Training logo, the Trainer's Kit logo, and the Video-based Training logo are service marks or trademarks of SAS Institute Inc. All trademarks above are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

The Institute is a private company devoted to the support and further development of its software and related services.

Other brand and product names are registered trademarks or trademarks of their respective companies.

Enterprise Miner™: Applying Data Mining Techniques Course Notes

Copyright 1999 by SAS Institute Inc., Cary, NC 27513, USA. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

Book code 56606, course code DMEM, prepared 26JUL99.


Table of Contents

Chapter 1  Getting Started with Enterprise Miner

1.1  Getting Started
1.2  Data Mining Using SEMMA
1.3  Accessing Data in SAS
1.4  Building a Sample Flow (optional)

Chapter 2  Predictive Modeling

2.1  Problem Formulation
2.2  Data Preparation and Investigation
2.3  Fitting and Comparing Candidate Models
2.4  Generating and Using Scoring Code
2.5  Generating a Report Using the Reporter Node

Chapter 3  Variable Selection

3.1  Introduction to Variable Selection
3.2  Using the Variable Selection Node
3.3  Using the Tree Node

Chapter 4  Neural Networks

4.1  Visualizing Neural Networks
4.2  Visualizing Logistic Regression


Chapter 5  Decision Trees

5.1  Introduction to Decision Trees
5.2  Problem Formulation
5.3  Understanding Tree Results
5.4  Understanding and Using Tree Options
5.5  Interactive Training
5.6  Choosing a Decision Threshold

Chapter 6  Cluster Analysis

6.1  Problem Formulation
6.2  K-means Clustering
6.3  Self-Organizing Maps (SOMs)
6.4  Generating and Using Scoring Code

Chapter 7  Associations

7.1  Problem Formulation
7.2  Understanding Association Results
7.3  Dissociation Analysis


Course Description

These course notes cover the basic skills required to assemble analyses using the rich tool set of Enterprise Miner (Version 3).

To learn more…

Prerequisites

Before taking this course,
• you should be familiar with Microsoft Windows and Windows-based software
• it is recommended that you complete the Data Mining Primer: Overview of Applications and Methods course.

A full curriculum of general and statistical instructor-based training is available at any of the Institute's training facilities. Institute instructors can also provide on-site training.

For information on other courses in the curriculum, contact the Professional Services Division at 1-919-677-8000, then press 1-7321, or send email to [email protected]. You can also find this information on the Web at www.sas.com/training/ as well as in the Training Course Catalog.

For a list of other SAS books that relate to the topics covered in these course notes, USA customers can contact our Book Sales Department at 1-800-727-3228 or send email to [email protected]. Customers outside the USA, please contact your local SAS Institute office.

See the Publications Catalog on the Web at www.sas.com/pubs for a complete list of books and a convenient order form.


General Conventions

This section explains the various conventions used in presenting text, SAS language syntax, and examples in these course notes.

Typographical Conventions

These course notes use several type styles. This list displays the meaning of each style:

UPPERCASE ROMAN is used for SAS statements, variable names, and other SAS language elements when they appear in the text.

italic identifies terms or concepts that are defined in text. Italic is also used for book titles when they are referenced in text, as well as for various syntax and mathematical elements.

bold is used for emphasis within text.

monospace is used for examples of SAS programming statements and for SAS character strings. Monospace is also used to refer to field names in windows, information in fields, and user-supplied information.

select indicates selectable items in windows and menus. These course notes also use icons to represent selectable items.

Mouse Conventions

The number of buttons on mouse devices varies. On mouse devices with two or three buttons, one button makes selections and one displays pop-up menus. Because the locations of these buttons vary, these course notes reference them as the mouse select button or the mouse menu button. If you use a mouse device, you can determine which button executes which action by trying them.

[Figure: a two-button mouse with default settings, with the select button and menu button labeled]


Chapter 1: First Steps

1.1  Getting Started
1.2  Data Mining Using SEMMA
1.3  Accessing Data in SAS
1.4  Building a Sample Flow (optional)



1.1 Getting Started

Opening The Enterprise Miner

To start the Enterprise Miner, double-click on the Enterprise Miner icon on your desktop. If no icon is available and you are running on Windows, use the Start menu and select Start → Programs → Enterprise Miner → Enterprise Miner 3.0. If you are in a training class, follow the trainer's instructions to open the Enterprise Miner.

Setting Up The Initial Project and Diagram

1. Select File → New → Project.
2. Type in a name for the project (for example, "My Project").
3. Check the box for Client/server project if needed.

Note: You must have access to a server running the same version of the Enterprise Miner. Do not check this box unless instructed to do so by the instructor.

4. Modify the location of the project folder, if desired, by selecting Browse.
5. Select Create. The project opens with an initial untitled diagram.
6. Click on the diagram title and type in a new title if desired (for example, "My First Flow").

[Screenshots: the diagram after selecting the name, and its final appearance]


Identifying the Workspace Components

7. Observe that the project window opens with the Diagrams tab activated. Select the Tools tab located to the right of the Diagrams tab in the lower-left portion of the project window. This tab enables you to see all of the tools (or "nodes") that are available in the Enterprise Miner.

Many of the commonly used tools are shown on the toolbar at the top of the window. If you desire to have additional tools in this toolbar, you can drag them from the window above onto the toolbar. In addition, you can rearrange the tools on the toolbar by dragging each tool to the desired location on the bar.

8. Select the Reports tab located to the right of the Tools tab. This tab reveals any reports that have been generated for this project. This is a new project, so no reports are currently available.

9. Return to the Tools tab.


1.2 Data Mining Using SEMMA

Understanding SEMMA

The tools are arranged according to the SAS process for data mining, SEMMA.

SEMMA stands for

Sample - identify input data sets (identify input data, sample from a larger data set, partition the data set into training, validation, and test data sets).

Explore - explore data sets statistically and graphically (plot the data, obtain descriptive statistics, identify important variables, perform association analysis).

Modify - prepare the data for analysis (create additional variables or transform existing variables for analysis, identify outliers, impute missing values, modify the way in which variables are used for the analysis, perform cluster analysis, analyze data with SOMs or Kohonen networks).

Model - fit a predictive model (model a target variable using a regression model, a decision tree, a neural network, or a user-defined model).

Assess - compare competing predictive models (build charts plotting the percentage of respondents, percentage of respondents captured, lift charts, profit charts).

Additional tools are available under the Utilities group.


Overview of the Nodes

Sample Nodes

The Input Data Source node reads data sources and defines their attributes for later processing by Enterprise Miner. This node can perform various tasks:

1. It enables you to access SAS data sets and data marts. Data marts can be defined using the SAS Data Warehouse Administrator and set up for Enterprise Miner using the Enterprise Miner Warehouse Add-ins.

2. It automatically creates the metadata sample for each variable when you import a data set with the Input Data Source node.

3. It sets initial values for the measurement level and the model role for each variable. You can change these values if you are not satisfied with the automatic selections made by the node.

4. It displays summary statistics for interval and class variables.
5. It enables you to define target profiles for each target in the input data set.

Note: For the purposes of this document, data sets and data tables are equivalent terms.

The Sampling node enables you to take random, stratified random, and cluster samples of data sets. Sampling is recommended for extremely large databases because it can significantly decrease model training time. If the sample is sufficiently representative, relationships found in the sample can be expected to generalize to the complete data set. The Sampling node writes the sampled observations to an output data set and saves the seed values that are used to generate the random numbers for the samples so that you may replicate the samples.

The Data Partition node enables you to partition data sets into training, test, and validation data sets. The training data set is used for preliminary model fitting. The validation data set is used to monitor and tune the model weights during estimation and is also used for model assessment. The test data set is an additional holdout data set that you can use for model assessment. This node uses simple random sampling, stratified random sampling, or user-defined partitions to create partitioned data sets.
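For intuition, a simple random partition like the node's default could be written by hand as the DATA step sketch below. This is not the code the node generates; the data set names, the seed, and the 40/30/30 split percentages are hypothetical examples (HMEQ is the data set used later in this chapter).

    /* Sketch: simple random 40/30/30 partition. Names and seed are  */
    /* hypothetical; the Data Partition node generates its own code. */
    data work.train work.valid work.test;
       set crssamp.hmeq;
       r = ranuni(12345);                 /* uniform random number, fixed seed */
       if r < 0.4 then output work.train;
       else if r < 0.7 then output work.valid;
       else output work.test;
       drop r;
    run;

Because the seed is fixed, rerunning the step reproduces the same partition, which mirrors the way the nodes save their seed values.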


Explore Nodes

The Distribution Explorer node is a visualization tool that enables you to quickly and easily explore large volumes of data in multidimensional histograms. You can view the distribution of up to three variables at a time with this node. When the variable is binary, nominal, or ordinal, you can select specific values to exclude from the chart. To exclude extreme values for interval variables, you can set a range cutoff. The node also generates summary statistics for the charting variables.

The Multiplot node is another visualization tool that enables you to explore larger volumes of data graphically. Unlike the Insight or Distribution Explorer nodes, the Multiplot node automatically creates bar charts and scatter plots for the input and target variables without making several menu or window item selections. The code created by this node can be used to create graphs in a batch environment, whereas the Insight and Distribution Explorer nodes must be run interactively.

The Insight node enables you to open a SAS/INSIGHT session. SAS/INSIGHT software is an interactive tool for data exploration and analysis. With it you explore data through graphs and analyses that are linked across multiple windows. You can analyze univariate distributions, investigate multivariate distributions, and fit explanatory models using generalized linear models.

The Association node enables you to identify association relationships within the data. For example, if a customer buys a loaf of bread, how likely is the customer to also buy a gallon of milk? The node also enables you to perform sequence discovery if a time stamp variable (a sequence variable) is present in the data set.

The Variable Selection node enables you to evaluate the importance of input variables in predicting or classifying the target variable. To select the important inputs, the node uses either an R-square or a Chi-square (tree-based) selection criterion. The R-square criterion enables you to remove variables in hierarchies, remove variables that have large percentages of missing values, and remove class variables that are based on the number of unique values. The variables that are not related to the target are set to a status of rejected. Although rejected variables are passed to subsequent nodes in the process flow diagram, these variables are not used as model inputs by a more detailed modeling node, such as the Neural Network and Tree nodes. You can reassign the input model status to rejected variables.


Modify Nodes

The Data Set Attributes node enables you to modify data set attributes, such as data set names, descriptions, and roles. You can also use this node to modify the metadata sample that is associated with a data set and specify target profiles for a target. An example of a useful Data Set Attributes application is to generate a data set in the SAS Code node and then modify its metadata sample with this node.

The Transform Variables node enables you to transform variables; for example, you can transform variables by taking the square root of a variable, by taking the natural logarithm, by maximizing the correlation with the target, or by normalizing a variable. Additionally, the node supports user-defined formulas for transformations and provides a visual interface for grouping interval-valued variables into buckets or quantiles. This node also automatically bins interval variables into buckets using a decision tree based algorithm. Transforming variables to similar scale and variability may improve the fit of models and, subsequently, the classification and prediction precision of fitted models.
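As an illustration of the kind of transformation the node applies, a natural-log transformation could be written by hand as the short sketch below. This is not the node's generated code; the output data set and new variable names are hypothetical, and LOAN is from the HMEQ example used later in this chapter.

    /* Sketch: log transformation to reduce right skewness.    */
    /* WORK.TRANSFORMED and LOG_LOAN are hypothetical names.   */
    data work.transformed;
       set crssamp.hmeq;
       if loan > 0 then log_loan = log(loan);   /* log is undefined for values <= 0 */
    run;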

The Filter Outliers node enables you to identify and remove outliers from data sets. Checking for outliers is recommended, as outliers may greatly affect modeling results and, subsequently, the classification and prediction precision of fitted models.

The Replacement node enables you to impute (fill in) values for observations that have missing values. You can replace missing values for interval variables with the mean, median, midrange, mid-minimum spacing, or distribution-based replacement, or use a replacement M-estimator such as Tukey's biweight, Huber's, or Andrews' wave. You can also estimate the replacement values for each interval input by using a tree-based imputation method. Missing values for class variables can be replaced with the most frequently occurring value, distribution-based replacement, tree-based imputation, or a constant.
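For intuition, mean replacement for a single interval variable could be written by hand as follows. This is only a sketch with hypothetical data set names, not the code the Replacement node produces, and it ignores the node's refinements (such as computing the statistic on the training data only).

    /* Sketch: replace missing DEBTINC values with the variable's mean. */
    proc means data=crssamp.hmeq noprint;
       var debtinc;
       output out=work.stats mean=debtinc_mean;
    run;

    data work.imputed;
       if _n_ = 1 then set work.stats(keep=debtinc_mean);
       set crssamp.hmeq;
       if debtinc = . then debtinc = debtinc_mean;
       drop debtinc_mean;
    run;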

The Clustering node enables you to segment your data; that is, it enables you to identify data observations that are similar in some way. Observations that are similar tend to be in the same cluster, and observations that are different tend to be in different clusters. The cluster identifier for each observation can be passed to other nodes for use as an input, ID, or target variable. It can also be passed as a group variable that enables you to automatically construct separate models for each group.

The SOM/Kohonen node generates self-organizing maps, Kohonen networks, and vector quantization networks. Essentially the node performs unsupervised learning in which it attempts to learn the structure of the data. As with the Clustering node, after the network maps have been created, the characteristics can be examined graphically using the results browser. The node provides the analysis results in the form of an interactive map illustrating the characteristics of the clusters. Furthermore, it provides a report indicating the importance of each variable.


Model Nodes

The Regression node enables you to fit both linear and logistic regression models to your data. You can use continuous, ordinal, and binary target variables. You can use both continuous and discrete variables as inputs. The node supports the stepwise, forward, and backward selection methods. A point-and-click interaction builder enables you to create higher-order modeling terms.

The Tree node enables you to perform multi-way splitting of your database based on nominal, ordinal, and continuous variables. This is the SAS System implementation of decision trees, which represents a hybrid of the best of the CHAID, CART, and C4.5 algorithms. The node supports both automatic and interactive training. When you run the Tree node in automatic mode, it automatically ranks the input variables based on the strength of their contribution to the tree. This ranking may be used to select variables for use in subsequent modeling. In addition, dummy variables can be generated for use in subsequent modeling. You may override any automatic step with the option to define a splitting rule and prune explicit nodes or subtrees. Interactive training enables you to explore and evaluate a large set of trees as you develop them.

The Neural Network node enables you to construct, train, and validate multilayer feed-forward neural networks. By default, the Neural Network node automatically constructs a multilayer feed-forward network that has one hidden layer consisting of three neurons. In general, each input is fully connected to the first hidden layer, each hidden layer is fully connected to the next hidden layer, and the last hidden layer is fully connected to the output. The Neural Network node supports many variations of this general form.

The User Defined Model node enables you to generate assessment statistics using predicted values from a model that you built with the SAS Code node (for example, a logistic model using the SAS/STAT LOGISTIC procedure) or the Variable Selection node. The predicted values can also be saved to a SAS data set and then imported into the process flow with the Input Data Source node.


The Ensemble node enables you to combine models. The usual combination function is the mean. Ensemble models are expected to exhibit greater stability than individual models. They are most effective when the individual models exhibit lower correlations. The node creates three different types of ensembles:

1. Combined model - for example, combining a decision tree and a neural network model. The combination function is the mean of the predicted values.

2. Stratified model - performing group processing over variable values. In this case, there is no combination function because each row in the data set is scored by a single model that is dependent on the value of one or more variables.

3. Bagging/Boosting models - performing group processing with resampling. The combination function is the mean of the predicted values. Each observation in the data set is scored by n models and the probabilities are averaged. The only difference between bagging and boosting is that with boosting an intermediary data set is scored for use by the resampling algorithm.

Note: These modeling nodes utilize a directory table facility, called the Model Manager, in which you can store and assess models on demand. The modeling nodes also enable you to modify the target profile(s) for a target variable.

Assess Nodes

The Assessment node provides a common framework for comparing models and predictions from any of the modeling nodes (Regression, Tree, Neural Network, and User Defined Model nodes). The comparison is based on the expected and actual profits or losses that would result from implementing the model. The node produces the following charts that help to describe the usefulness of the model: lift, profit, return on investment, receiver operating characteristic, diagnostic charts, and threshold-based charts.

The Score node enables you to generate and manage predicted values from a trained model. Scoring formulas are created for both assessment and prediction. Enterprise Miner generates and manages scoring formulas in the form of SAS DATA step code, which can be used in most SAS environments even without the presence of Enterprise Miner.
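Because the score code is ordinary DATA step code, applying it outside Enterprise Miner can look like the sketch below; the file path and data set names are hypothetical stand-ins for score code saved from the Score node.

    /* Sketch: apply saved score code to new data.   */
    /* The path and data set names are hypothetical. */
    data work.scored;
       set crssamp.newapps;                    /* data to be scored     */
       %include 'C:\myproject\scorecode.sas';  /* DATA step score code  */
    run;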

The Reporter node assembles the results from a process flow analysis into an HTML report that can be viewed with your favorite web browser. Each report contains header information, an image of the process flow diagram, and a separate report for each node in the flow. Reports are managed in the Reports tab of the Project Navigator.


Utility Nodes

The Group Processing node enables you to perform group-by processing for class variables such as GENDER. You can also use this node to analyze multiple targets, and process the same data source repeatedly by setting the group-processing mode to index.

The Data Mining Database node enables you to create a data mining database (DMDB) for batch processing. For non-batch processing, DMDBs are automatically created as they are needed.

The SAS Code node enables you to incorporate new or existing SAS code into process flow diagrams. The ability to write SAS code enables you to include additional SAS System procedures in your data mining analysis. You can also use a SAS DATA step to create customized scoring code, to conditionally process data, and to concatenate or to merge existing data sets. The node provides a macro facility to dynamically reference data sets used for training, validation, testing, or scoring and variables, such as input, target, and predict variables. After you run the SAS Code node, the results and the data sets can then be exported for use by subsequent nodes in the diagram.
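For example, a DATA step placed in the SAS Code node to concatenate two data sets could be as simple as the following sketch. The data set names are hypothetical; in practice you would typically reference the node's macro facility for the incoming training or scoring data.

    /* Sketch: concatenate two data sets inside a SAS Code node. */
    /* WORK.PART1 and WORK.PART2 are hypothetical names.         */
    data work.combined;
       set work.part1 work.part2;   /* stacks the rows of both data sets */
    run;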

The Control Point node enables you to establish a control point to reduce the number of connections that are made in process flow diagrams. For example, suppose three Input Data Source nodes are to be connected to three modeling nodes. If no Control Point node is used, then nine connections are required to connect all of the Input Data Source nodes to all of the modeling nodes. However, if a Control Point node is used, only six connections are required (three into the control point and three out of it).

The Subdiagram node enables you to group a portion of a process flow diagram into a subdiagram. For complex process flow diagrams, you may want to create subdiagrams to better design and control the process flow.

Some General Usage Rules for Nodes

These are some general rules that govern placing nodes in a process flow diagram (PFD):

• The Input Data Source node cannot be preceded by any other node.

• The Sampling node must be preceded by a node that exports a data set.

• The Assessment node must be preceded by one or more modeling nodes.

• The Score node must be preceded by a node that produces score code. For example, the modeling nodes produce score code.

• The SAS Code node can be defined in any stage of the process flow diagram. It does not require an input data set defined in the Input Data Source node.


1.3 Accessing Data in SAS

Using SAS Libraries

SAS uses libraries to organize files. These libraries point to folders where data and programs are stored. Libraries must conform to the naming conventions used in SAS 6.12. These conventions require the library name to have no more than eight alphanumeric characters, and the name cannot contain special characters such as asterisks (*) and ampersands (&). To create a new library or to view existing libraries, use the Globals menu and select Access → Display libraries.

You can see the files in a library by selecting the library name from the list of libraries in the upper-left portion of the dialog. To create a new library, say CRSSAMP, select New Library and fill in the resulting dialog with the desired library name and associated path. The following library identifies the folder whose path is C:\workshop\bsd\dmem.

Observe that the box for Assign automatically at startup is checked. This library will be reassigned every time the SAS session starts. If you do not check this box and close your SAS session, the library name is not stored, so the data will be unavailable for use by the SAS System or the Enterprise Miner in later sessions unless you reassign the library name. Select Assign to finish assigning the library name.

For the purposes of these notes, assume the raw data is placed in this folder, which is identified by the library name CRSSAMP. Any data set in the folder (say, HMEQ) can then be referenced by the two-part name CRSSAMP.HMEQ.
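The same assignment can also be made in code with a LIBNAME statement, which is convenient in batch programs. The path shown is the one used in these notes.

    /* Assign the CRSSAMP library in code rather than through the dialog. */
    libname crssamp 'C:\workshop\bsd\dmem';

    /* The HMEQ data set can now be referenced as CRSSAMP.HMEQ, e.g.: */
    proc contents data=crssamp.hmeq;
    run;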


1.4 Building a Sample Flow (Optional)

Problem Formulation

The data for your first example comes from a financial services company that extends a line of credit to homeowners. After analyzing the data, a subset of 12 predictor variables was selected to model loan default. The response variable (BAD) indicates whether or not someone defaulted on the home equity line of credit.

Name      Model Role   Measurement Level   Description
BAD       Target       Binary              1=client defaulted on loan, 0=loan repaid
CLAGE     Input        Interval            Age of oldest trade line in months
CLNO      Input        Interval            Number of trade lines
DEBTINC   Input        Interval            Debt-to-income ratio
DELINQ    Input        Interval            Number of delinquent trade lines
DEROG     Input        Interval            Number of major derogatory reports
JOB       Input        Nominal             Six occupational categories
LOAN      Input        Interval            Amount of the loan request
MORTDUE   Input        Interval            Amount due on existing mortgage
NINQ      Input        Interval            Number of recent credit inquiries
REASON    Input        Binary              DebtCon=debt consolidation, HomeImp=home improvement
VALUE     Input        Interval            Value of current property
YOJ       Input        Interval            Years at present job

The HMEQ data set in the CRSSAMP library contains 5,960 observations for building and comparing competing models. This data set will be split into training, validation, and test data sets for analysis.


Building the Initial Flow

Begin building the first flow to analyze this data. The toolbar provides you with easy access to many of the commonly used nodes. You can add additional nodes to the toolbar by dragging the nodes from the Tools tab to the toolbar. All of the nodes will remain available in the Tools tab.

Add an Input Data Source node by dragging the node from the toolbar or from the Tools tab. Since this is a predictive modeling flow, add a Data Partition node to the right of the Input Data Source node. In addition to dragging a node onto the workspace, there are two other ways to add a node to the flow. You can right-click in the workspace where you want the node to appear and select Add node from the pop-up menu that appears, or you can simply double-click where you want the node to appear. In either case, a list of nodes appears and you need only to select the desired node. After selecting Data Partition, your diagram should look as follows.

Observe that the Data Partition node is selected (as indicated by the dotted line around it) but the Input Data Source node is not. If you right-click in any open space on the workspace, all nodes become deselected.

Using the Cursor

The shape of the cursor changes depending on where it is positioned. The behavior of the mouse commands depends on the shape as well as the selection state of the node over which the cursor is positioned. Right-click in an open area to see the menu. The last three menu items (Connect items, Move items, Move and Connect) enable you to modify the ways in which the cursor may be used. Move and Connect is selected by default, and it is highly recommended that you do not change this setting! If your cursor is not performing a desired task, check this menu to make sure that Move and Connect is selected. This selection allows you to move the nodes around the workspace as well as connect them.

Observe that when you put your cursor in the middle of a node, the cursor appears as a hand. You can move the nodes around the workspace as follows:
1. Position the cursor in the middle of the node (until the hand appears).
2. Press the left mouse button and drag the node to the desired location.
3. Release the left mouse button.


Note that after dragging a node, the node will remain selected. To deselect all of the nodes, click in an open area of the workspace. Also note that when you put the cursor on the outside edge of the node, the cursor appears as a cross-hair. You can connect the node where the cursor is positioned (beginning node) to any other node (ending node) as follows:

1. Ensure that the beginning node is deselected. It is much easier to drag a line when the node is deselected. If the beginning node is selected, click in an open area of the workspace to deselect it.

2. Position the cursor on the edge of the icon representing the beginning node (until the cross-hair appears).

3. Press the left mouse button and immediately begin to drag in the direction of the ending node. Note: If you do not begin dragging immediately after pressing the left mouse button, you will only select the node. Dragging a selected node will generally result in moving the node (no line will form).

4. Release the mouse button after reaching the edge of the icon representing the ending node.
5. Click away from the arrow. Initially, the connection appears as a plain line; after clicking away from the line, the finished arrow forms.

[Screenshots: the initial appearance of the connection, and its final appearance after the arrow forms]

Identifying the Input Data

The first example uses the HMEQ data set in the CRSSAMP library. To specify the input data, double-click on the Input Data Source node or right-click on this node and select Open. The Data tab is active. Your window should look like the one below.

Click on Select in order to select the data set. Alternatively, you can enter the name of the data set.


The SASUSER library is selected by default. To view data sets in the CRSSAMP library, click on the down arrow and select CRSSAMP from the list of defined libraries.

Select the HMEQ data set from the list of data sets in the CRSSAMP library and then select OK. The resulting dialog appears below.

Observe that this data set has 5,960 observations (rows) and 13 variables (columns). Observe that the field next to Source Data: contains CRSSAMP.HMEQ. You could have typed in this name instead of selecting it through the dialog. Note that the lower-right corner indicates a metadata sample of size 2,000. What exactly is a metadata sample?

Understanding The Metadata Sample

All analysis packages must determine how to use variables in the analysis. The Enterprise Miner utilizes metadata in order to make a preliminary assessment of how to use each variable. By default, it takes a random sample of 2,000 observations from the data set of interest and uses this information to assign a model role and a measurement level to each variable. If you wish to take a larger sample, you may select the Change button in the lower-right corner of the dialog, but that is unnecessary and is not shown here.

Evaluate (and update, if necessary) the assignments that were made using the metadata sample. Click on the Variables tab to see all of the variables and their respective assignments. Click on the first column heading, entitled Name, to sort the variables by their name. You can see all of the variables if you maximize the window. The following table shows a portion of the information for each of the 13 variables.


Observe that two of the columns are grayed out. These columns represent information from the SAS data set that cannot be changed in this node. The Name must conform to the naming conventions described earlier for libraries. The Type is either character (char) or numeric (num) and affects how a variable can be used. The value for Type and the number of levels in the metadata sample of 2,000 are used to identify the model role and measurement level.

The first variable is BAD, which will be the target variable. Although BAD is a numeric variable in the data set, the Enterprise Miner identifies it as a binary variable since it has only two distinct non-missing levels in the metadata sample. The model role for all binary variables is set to input by default. You will need to change the model role for BAD to target before performing the analysis.

The next three variables (LOAN through VALUE) have the measurement level interval since they are numeric variables in the SAS data set and have more than 10 distinct levels in the metadata sample. The model role for all interval variables is set to input by default.

The variables REASON and JOB are both character variables in the data set, yet they have different measurement levels. REASON is binary since it has only two distinct non-missing levels in the metadata sample. The measurement level for JOB, however, is nominal since it is a character variable with more than two levels.

For the purpose of this analysis, treat the remaining variables (YOJ through DEBTINC) as interval variables. Notice that in the table above, DEROG and DELINQ have been assigned the measurement level ordinal. These two variables are listed as ordinal variables because each is a numeric variable with more than two but no more than ten distinct non-missing levels in the metadata sample. This often occurs with counting variables, such as a variable for the number of children. Since this assignment depends on the metadata sample, the measurement level of DEROG and/or DELINQ for your analysis may be set to interval. All ordinal variables are set to have the input model role; however, you will treat these variables as interval inputs for the purpose of this analysis.


Identifying Target Variables

BAD is the response variable for this analysis. Change the model role for BAD to target. To modify the model role information, proceed as follows:
1. Position the tip of your cursor over the row for BAD in the model role column and right-click.
2. Select Set Model Role → target from the pop-up menu.

Inspecting Distributions

You can inspect the distribution of values in the metadata sample for each of the variables. To view the distribution of BAD, proceed as follows:
1. Position the tip of your cursor over the variable BAD in the Name column.
2. Right-click and observe that you can Sort by name, Find name, or View distribution of BAD.
3. Select View distribution to see the distribution of values for BAD in the metadata sample.

To obtain additional information, select the View Info tool from the toolbar at the top of the window and click one of the bars. The Enterprise Miner displays the level and the proportion of observations represented by the bar. These plots provide an initial overview of the data. For this example, approximately 20% of the observations were loans where the client defaulted. Since the plots are based on the metadata sample, they may vary slightly due to the differences in the sampled observations, but the bar for BAD=1 should represent approximately 20% of the data. Select Close to return to the main dialog when you are finished inspecting the plot. Evaluate the distribution of other variables as desired.
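Outside the Enterprise Miner interface, the same check could be made on the full data set with the FREQ procedure; this supplementary sketch is not part of the flow itself.

    /* Sketch: check the overall default rate on the full HMEQ data.  */
    proc freq data=crssamp.hmeq;
       tables bad;    /* expect roughly 20% of observations with BAD=1 */
    run;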


Modifying Variable Information

Ensure that the remaining variables have the correct model role and measurement level information. If necessary, change the measurement level for DEROG and DELINQ to interval. To modify this information for DEROG, proceed as follows:
1. Position the tip of your cursor over the row for DEROG in the measurement column and right-click.
2. Select Set Measurement → interval from the pop-up menu.
3. Repeat steps 1 and 2 for DELINQ.

Alternatively, you could have updated the information for both variables at the same time by highlighting the rows for DEROG and DELINQ before following steps 1 and 2 above.

Investigating Descriptive Statistics

The metadata sample is used to compute descriptive statistics. Select the Interval Variables tab.

Investigate the minimum value, maximum value, mean, standard deviation, percentage of missing observations, skewness, and kurtosis for the interval variables. Inspecting the minimum and maximum values indicates no unusual values. Observe that DEBTINC has a high percentage of missing values (22%). Select the Class Variables tab.

Investigate the number of levels, percentage of missing values, and the sort order of each variable. Observe that the sort order for BAD is descending, while the sort order for all the others is ascending. This occurs since you have a binary target event. It is common to code a binary target with a "1" when the event occurs and a "0" otherwise. Sorting in descending order makes level "1" the first level, which is the target event for a binary variable. It is useful to sort other similarly coded binary variables in descending order for interpreting parameter estimates in a regression model. Close the Input Data Source node, saving changes when prompted.


Inspecting Default Settings in the Data Partition Node

Open the Data Partition node.

The upper-left corner enables you to choose the method for partitioning.

By default, Enterprise Miner takes a simple random sample of the input data and divides it into training, validation, and test data sets. Although it is not done here, to perform
• stratified sampling, select the Stratified radio button and then use the options in the Stratified tab to set up your strata.
• user-defined sampling, select the User Defined radio button and then use the options in the User Defined tab to identify the variable in the data set that identifies the partitions.

The lower-left corner enables you to specify a random seed for initializing the sampling process. Randomization within computer programs is often started by some type of seed. If you use the same data set with the same seed in different flows, you will get the same partition. Observe that resorting the data will result in a different ordering of the data, and therefore a different partition, which may yield different results.

The right side enables you to specify the percentage of the data to allocate to training, validation,and test data.

Use the default settings for this example. Close the Data Partition node. If you did not make changes, you will not be prompted to save changes. If prompted to save changes when closing this node, select No to retain the default settings of the Data Partition node.


Understanding Data Replacement

Add a Replacement node to the diagram. This node enables you to impute missing values for each variable. This replacement is necessary to utilize all of the observations in the training data when building a regression or neural network model. Decision trees handle missing values directly, while regression and neural network models ignore all observations with missing values on any of the input variables. It is more appropriate to compare models built on the same set of observations, so you should perform data replacement before fitting any regression or neural network model when you plan to compare the results to those obtained from a decision tree model.

Your new diagram should appear as follows:

Fitting A Regression Model

The Regression node will fit models for both continuous and categorical targets. The node requires you to specify a target variable in the Input Data Source node. Since you selected a binary variable (BAD) as the target in the Input Data Source node, the Regression node will by default fit a binary logistic regression model using all main effects. The node will also code your grouping variables using either GLM (or dummy) coding or Deviation (or effect) coding. By default, the node uses effect coding for categorical input variables.
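To see what the two coding schemes mean for a binary input such as REASON, consider the hand-written sketch below. The derived variable names are hypothetical illustrations; the node constructs its design columns internally, and missing values of REASON are not handled in this sketch.

    /* Sketch: GLM (dummy) versus Deviation (effect) coding for REASON. */
    /* REASON_GLM and REASON_EFF are hypothetical illustration names.   */
    data work.coded;
       set crssamp.hmeq;
       reason_glm = (reason = 'DebtCon');               /* dummy: 1=DebtCon, 0=HomeImp */
       if reason = 'DebtCon' then reason_eff = 1;       /* effect: 1=DebtCon           */
       else if reason = 'HomeImp' then reason_eff = -1; /* effect: -1=HomeImp          */
    run;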

Connect a Regression node. The diagram should now look like the illustration below.

Evaluating the Model

Add an Assessment node to the diagram. Your flow should now look as follows.

Right-click on the Assessment node and select Run. View the results when prompted.


Observe that each node becomes green as it runs. Since you ran the flow from the Assessment node, you are prompted to see the Assessment results. Select Tools → Lift Chart.

A Cumulative %Response chart is shown by default. By default, this chart arranges people into deciles based on their predicted probability of response, and then plots the actual percentage of respondents. To see actual values, click on the View Info tool and then click on the red line. Clicking on the red line near the upper-left corner of the plot indicates a %Response of 65.88, but what does that mean? To interpret the Cumulative %Response chart, consider how the chart is constructed.

1. For this example, a responder is defined as someone who defaulted on a loan (BAD=1). For each person, the fitted model (in this case, a regression model) predicts the probability that the person will default. Sort the observations by the predicted probability of response, from the highest probability of response to the lowest probability of response.

2. Group the people into ordered bins, each containing approximately 10% of the data.

3. Using the target variable BAD, count the percentage of actual responders in each bin.

If the model is useful, the proportion of responders (defaulters) will be relatively high in bins where the predicted probability of response is high. The cumulative response curve shown above shows the percentage of respondents in the top 10%, top 20%, top 30%, and so on. In the top 10%, almost 2/3 of the people were defaulters. In the top 20%, the proportion of defaulters has dropped to just over 1/2 of the people. The blue line represents the baseline rate (approximately 20%) for comparison purposes, which is an estimate of the percentage of defaulters that you would expect if you were to take a random sample. The plot above represents cumulative percentages, but you can also see the proportion of responders in each bin by selecting the radio button next to Non-cumulative on the left side of the graph.
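The same decile table that underlies the chart could be tabulated by hand along the following lines. This is a supplementary sketch, not what the Assessment node runs; WORK.PREDS with the predicted probability P_BAD is a hypothetical output from an earlier model fit.

    /* Sketch: build response-rate deciles from predicted probabilities. */
    proc rank data=work.preds out=work.deciles groups=10 descending;
       var p_bad;        /* predicted probability of default  */
       ranks decile;     /* 0 = highest-scoring 10% of people */
    run;

    proc means data=work.deciles mean;
       class decile;
       var bad;          /* mean of BAD = actual response rate per decile */
    run;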


Select the radio button next to Non-cumulative and inspect the plot.

[Charts: Cumulative %Response and Non-Cumulative %Response]

Select the Cumulative button and then select Lift Value. Lift charts plot the same information on a different scale. Recall that the population response rate is about 20%. A lift chart can be obtained by dividing the response rate in each decile by the population response rate. The lift chart, therefore, plots relative improvement over baseline.

[Charts: Cumulative %Response and Cumulative Lift Value]

Recall that the percentage of respondents in the first decile was 65.88%. Dividing 65.88% by about 20% (the baseline rate) gives a number slightly higher than three, indicating that you would expect to get over three times as many responders in this decile as you would from taking a simple random sample of the same size.


Instead of asking the question, "What percentage of observations in a bin were responders?", you could ask the question, "What percentage of the total number of responders are in a bin?" This can be evaluated using the Captured Response curve. To inspect this curve, select the radio button next to %Captured Response. Use the View Info tool to evaluate how the model performs.

Observe that if the percentage of applications chosen for rejection was approximately
• 20%, you would have identified about half of the people who would have defaulted (a lift of about 2.5!).
• 30%, you would have identified over 60% of the people who would have responded (a lift of over 2!).


Chapter 2: Predictive Modeling

2.1  Problem Formulation
2.2  Data Preparation and Investigation
2.3  Fitting and Comparing Candidate Models
2.4  Generating and Using Scoring Code
2.5  Generating a Report Using the Reporter Node



2.1 Problem Formulation

The data for your first example is from a non-profit organization that relies on fundraising campaigns to support their efforts. After analyzing the data, a subset of 19 predictor variables was selected to model the response to a mailing. Two response variables were stored in the data set. One response variable related to whether or not someone responded to the mailing (TARGET_B), while the other response variable measured how much the person actually donated in US dollars (TARGET_D).

Name       Model Role   Measurement Level   Description
AGE        Input        Interval            Donor's age
AVGGIFT    Input        Interval            Donor's average gift
CARDGIFT   Input        Interval            Donor's gifts to card promotions
CARDPROM   Input        Interval            Number of card promotions
FEDGOV     Input        Interval            % of household in federal government
FIRSTT     Input        Interval            Elapsed time since first donation
GENDER     Input        Binary              F=female, M=male
HOMEOWNR   Input        Binary              H=homeowner, U=unknown
IDCODE     Input        ID                  ID code, unique for each donor
INCOME     Input        Ordinal             Income level (integer values 0-9)
LASTT      Input        Interval            Elapsed time since last donation
LOCALGOV   Input        Interval            % of household in local government
MALEMILI   Input        Interval            % of household males active in the military
MALEVET    Input        Interval            % of household male veterans
NUMPROM    Input        Interval            Total number of promotions
PCOWNERS   Input        Binary              Y=donor owns computer (missing otherwise)
PETS       Input        Binary              Y=donor owns pets (missing otherwise)
STATEGOV   Input        Interval            % of household in state government
TARGET_B   Target       Binary              1=donor to campaign, 0=did not contribute
TARGET_D*  Target       Interval            Dollar amount of contribution to campaign
TIMELAG    Input        Interval            Time between first and second donation

* The variable TARGET_D is not considered for the first flow, so its model role will be set to rejected.

The MYRAW data set in the CRSSAMP library contains 6,974 observations for building and comparing competing models. This data set will be split equally into training and validation data sets for analysis. After evaluating the fitted model, score the data set MYSCORE in the CRSSAMP library to identify those people who would be targeted by the follow-up mailing.


2.2 Data Preparation and Investigation

Building the Initial Flow

Begin building the first flow to analyze this data. The toolbar provides you with easy access to many of the commonly used nodes. You can add additional nodes to the toolbar by dragging the nodes from the Tools tab to the toolbar. All of the nodes will remain available in the Tools tab.

Add an Input Data Source node by dragging the node from the toolbar or from the Tools tab. Since this is a predictive modeling flow, add a Data Partition node to the right of the Input Data Source node. In addition to dragging a node onto the workspace, there are two other ways to add a node to the flow. You can right-click in the workspace where you want the node to appear and select Add node from the pop-up menu that appears, or you can simply double-click where you want the node to appear. In either case, a list of nodes appears and you need only to select the desired node. After selecting Data Partition, your diagram should look as follows.

Observe that the Data Partition node is selected (as indicated by the dotted line around it) but the Input Data Source node is not. If you right-click in any open space on the workspace, all nodes become deselected.

Using the Cursor

The shape of the cursor changes depending on where it is positioned. The behavior of the mouse commands depends on the shape as well as the selection state of the node over which the cursor is positioned. Right-click in an open area to see the menu. The last three menu items (Connect items, Move items, Move and Connect) enable you to modify the ways in which the cursor may be used. Move and Connect is selected by default, and it is highly recommended that you do not change this setting! If your cursor is not performing a desired task, check this menu to make sure that Move and Connect is selected. This selection allows you to move the nodes around the workspace as well as connect them.

Observe that when you put your cursor in the middle of a node, the cursor appears as a hand. You can move the nodes around the workspace as follows:
1. Position the cursor in the middle of the node (until the hand appears).
2. Press the left mouse button and drag the node to the desired location.
3. Release the left mouse button.


Note that after dragging a node, the node will remain selected. To deselect all of the nodes, click in an open area of the workspace. Also note that when you put the cursor on the outside edge of the node, the cursor appears as a cross-hair. You can connect the node where the cursor is positioned (beginning node) to any other node (ending node) as follows:

4. Ensure that the beginning node is deselected. It is much easier to drag a line when the node is deselected. If the beginning node is selected, click in an open area of the workspace to deselect it.

5. Position the cursor on the edge of the icon representing the beginning node (until the cross-hair appears).

6. Press the left mouse button and immediately begin to drag in the direction of the ending node. Note: If you do not begin dragging immediately after pressing the left mouse button, you will only select the node. Dragging a selected node will generally result in moving the node (no line will form).

7. Release the mouse button after reaching the edge of the icon representing the ending node.

8. Click away from the arrow. Initially, the connection appears as a plain line; after clicking away from the line, the finished arrow forms.

[Screenshots: the initial appearance of the connection, and its final appearance after the arrow forms]

Identifying the Input Data

The first example uses the MYRAW data set in the CRSSAMP library. To specify the input data, double-click on the Input Data Source node or right-click on this node and select Open. The Data tab is active. Your window should look like the one below.

Click on Select in order to select the data set. Alternatively, you can enter the name of the data set.

The SASUSER library is selected by default. To view data sets in the CRSSAMP library, click on the arrow and select CRSSAMP from the list of defined libraries.

Select the MYRAW data set from the list of data sets in the CRSSAMP library and then select OK. The resulting dialog appears below.

Observe that this data set has 6,974 observations (rows) and 21 variables (columns). Observe that the field next to Source Data: contains CRSSAMP.MYRAW. You could have typed in this name instead of selecting it through the dialog. Note that the lower-right corner indicates a metadata sample of size 2,000. What exactly is a metadata sample?

Understanding The Metadata Sample

All analysis packages must determine how to use variables in the analysis. The Enterprise Miner utilizes metadata in order to make a preliminary assessment of how to use each variable. By default, it takes a random sample of 2,000 observations from the data set of interest and uses this information to assign a model role and a measurement level to each variable. If you wish to take a larger sample, you may select the Change button in the lower-right corner of the dialog, but that is unnecessary and is not shown here.
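For intuition only, a metadata sample is nothing more than a random subset of the data. A minimal base SAS sketch of the idea follows (this is an illustration, not the code the Enterprise Miner actually runs, and the output data set name is made up):

   data work.metasample;
      set crssamp.myraw nobs=ntotal;      /* ntotal = total observation count */
      if ranuni(12345) <= 2000 / ntotal;  /* keep roughly 2,000 rows at random */
   run;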

Evaluate (and update, if necessary) the assignments that were made using the metadata sample. Click on the Variables tab to see all of the variables and their respective assignments. Click on the first column heading, entitled Name, to sort the variables by their name. A portion of the table showing the first 10 variables is shown below.

Observe that two of the columns are grayed out. These columns represent information from the SAS data set that cannot be changed in this node. The Name must conform to the naming conventions described earlier for libraries. The Type is either character (char) or numeric (num) and affects how a variable can be used. The value for Type and the number of levels in the metadata sample of 2,000 are used to identify the model role and measurement level.

The first several variables (AGE through FIRSTT) have the measurement level interval since they are numeric in the data set and have more than 10 distinct levels in the metadata sample. The model role for all interval variables is set to input by default. The variables GENDER and HOMEOWNR have the measurement level binary since they only have two different non-missing levels in the metadata sample. The model role for all binary variables is set to input by default.

It is common to represent a person with a "1" if he has the condition and a "0" otherwise. In this way, HOMEOWNR could have been coded as a numeric variable in the data set. A person would be coded with a "1" if he were a homeowner and a "0" otherwise. Even if HOMEOWNR were a numeric variable, it would still appear with the binary measurement level since it only contains two non-missing levels in the metadata sample.

The variable IDCODE is listed as a nominal variable since it is a character variable with more than two non-missing levels in the metadata sample. Furthermore, since it is nominal and has a distinct value for every observation in the sample, the IDCODE variable has the model role ID. If the ID value had been stored as a number, it would have been assigned an interval measurement level and an input model role. Why?

The variable INCOME is listed as an ordinal variable because it is a numeric variable with more than two but no more than ten distinct levels in the metadata sample. This often occurs with counting variables, such as a variable for the number of children. In many of these situations it may be appropriate to treat the variable as an interval variable, transforming it if necessary. All ordinal variables are set to have the input model role.
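For reference, the default assignment rules can be summarized as follows (levels are counted in the metadata sample; the unary rule is discussed just below):

   Type   Distinct non-missing levels   Measurement level   Default model role
   num    more than 10                  interval            input
   any    exactly 2                     binary              input
   char   more than 2                   nominal             input (ID if every value is distinct)
   num    3 to 10                       ordinal             input
   any    exactly 1                     unary               rejected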

Scroll down to see the rest of the variables.

The variables PCOWNERS and PETS are both identified as having the unary measurement level. This is because there is only one non-missing level in the metadata sample. It does not matter in this case whether the variable is character or numeric; the measurement level is set to unary and the model role is set to rejected.

These variables do have useful information, however, and it is the way in which they are coded that makes them seem useless. Both variables contain the value "Y" for a person if the person has that condition (pet owner for PETS, computer owner for PCOWNERS) and a missing value otherwise. Decision trees handle missing values directly, so no data modification needs to be done for fitting a decision tree; however, neural networks and regression models would ignore any observation with a missing value, so you will need to recode these variables to get at the desired information. Perhaps you could recode the missing values as a "U" for unknown. You will do this later using the Replacement node.
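The recoding itself is simple; in a DATA step it would amount to something like the sketch below (an illustration only, since in this course the Replacement node does the work for you, and the output data set name is made up):

   data work.recoded;
      set crssamp.myraw;
      if PETS = ' ' then PETS = 'U';           /* missing character value -> unknown */
      if PCOWNERS = ' ' then PCOWNERS = 'U';
   run;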

Identifying Target Variables

Note that the variables TARGET_B and TARGET_D are the response variables for this analysis. TARGET_B is binary even though it is a numeric variable, since there are only two non-missing levels in the metadata sample. TARGET_D has the interval measurement level. Both variables are set to have the input model role (just like any other binary or interval variable). Your first analysis will focus on TARGET_B, so you need to change the model role for TARGET_B to target and the model role for TARGET_D to rejected, since you should not use a response variable as a predictor.

Change the model role for TARGET_B to target. Then repeat the steps for TARGET_D but change the model role to rejected. To modify the model role information, proceed as follows:
1. Position the tip of your cursor over the row for TARGET_B in the model role column and right-click.
2. Select Set Model Role → target from the pop-up menu.

Inspecting Distributions

You can inspect the distribution of values in the metadata sample for each of the variables. To view the distribution of TARGET_B, proceed as follows:
1. Position the tip of your cursor over the variable TARGET_B in the Name column.
2. Right-click and observe that you can Sort by name, Find name, or View distribution of TARGET_B.
3. Select View distribution to see the distribution of values for TARGET_B in the metadata sample.

To obtain additional information, select the View Info tool from the toolbar at the top of the window and click one of the bars. The Enterprise Miner displays the level and the proportion of observations represented by the bar. These plots provide an initial overview of the data. Since the plots are based on the metadata sample, they may vary slightly due to differences in the sampled observations. Select Close to return to the main dialog when you are finished inspecting the plot.

Investigate the distribution of the unary variables, PETS and PCOWNERS. What percentage of the observations have pets? What percentage of the observations own personal computers? Recall that these distributions depend on the metadata sample. The numbers may be slightly different if you refresh your metadata sample; however, these distributions are only being used for a quick overview of the data. Later in the course, you will use the Insight node to obtain more detailed information on your data. Your goal in the Input Data Source node is to select the data set for analysis and specify the correct model role and measurement level for each variable.

Evaluate the distribution of other variables as desired. For example, consider the distribution of INCOME. Some analysts would assign the interval measurement level to this variable. If this were done and the distribution were highly skewed, a transformation of this variable might lead to better results.

Modifying Variable Information

Earlier you changed the model role for TARGET_B to target. Now modify the model role and measurement level for PCOWNERS and PETS. To modify the model role and measurement level information for PCOWNERS, proceed as follows:
1. Position the tip of your cursor over the row for PCOWNERS in the model role column and right-click.
2. Select Set Model Role → input from the pop-up menu.
3. Position the tip of your cursor over the row for PCOWNERS in the model role column and right-click.
4. Select Set Measurement → binary from the pop-up menu.

In a similar fashion, modify the model role and measurement level information for PETS to input and binary, respectively. Optionally, you could have highlighted both rows and performed the actions on PCOWNERS and PETS simultaneously.

Some analysts would suggest it may be appropriate to change the measurement level for INCOME to interval, but that is not done here.

Understanding the Target Profiler for a Binary Target

When building predictive models, the "best" model often varies according to the criteria used for evaluation. One criterion might suggest that the best model is the one that most accurately predicts the response. Another criterion might suggest that the best model is the one that generates the highest expected profit. These criteria can lead to quite different results.

In the first analysis, you are analyzing a binary variable. The accuracy criterion would choose the model that best predicts whether or not someone actually responded; however, there are different profits and losses associated with different types of errors. Specifically, it costs less than a dollar to send someone a mailing, but you receive a median of $13.00 from those that respond. Therefore, sending a mailing to someone who would not respond costs less than a dollar, but failing to mail to someone who would have responded costs over $12.00 in lost revenue.

In addition to considering the ramifications of different types of errors, it is important to consider whether or not the sample is representative of the population. In your sample, approximately 50% of the observations represent responders. In the population, however, the response rate was much closer to 5% than 50%. In order to obtain appropriate predicted values, you must specify the prior probabilities in the target profiler. In this situation, accuracy alone would yield a very poor model indeed, since you would be correct approximately 95% of the time in concluding that nobody will respond. Unfortunately, this does not satisfactorily solve your problem of trying to identify the "best" subset of a population for your mailing.

Using the Target Profiler

When building predictive models, the choice of the "best" model depends on the criteria you use to compare competing models. The Enterprise Miner allows you to specify information about the target that can be used to compare competing models. To generate a target profile for a variable, you must have already set the model role for the variable to target. This analysis focuses on the variable TARGET_B. To set up the target profile for TARGET_B, proceed as follows:

1. Position the tip of your cursor over the row for TARGET_B and right-click.
2. Select Edit Target Profile. The message shown below appears.

3. Select Yes.

The Target Profiler opens with the Profiles tab active. You can use the default profile or you can create your own.

4. Select Edit → Create New Profile to create a new profile.

5. Enter My Profile as the description for this new profile (currently called Profile1).

Although you have created a new profile, the existing profile is still chosen for use, as indicated by the asterisk in the Use column.

To set the newly created profile for use, proceed as follows:
6. Position your cursor in the row corresponding to your new profile in the Use column and right-click.
7. Select Set to use.

The values stored in the remaining tabs of the target profiler may vary according to which profile is selected. Make sure that the desired profile is selected and that the associated tabs have been set as desired before exiting the dialog. If the row corresponding to your new profile is highlighted, investigate the Target tab.

8. Select the Target tab.

This tab shows that TARGET_B is a binary target variable using the BEST12 format. It also shows that the two levels are sorted in descending order, and that the first listed level and modeled event is level 1 (the value next to Event). To see the levels and associated frequencies for the target, investigate the Levels subtab.

9. Select the Levels subtab to see this information. Close the Levels window when you are done.

Incorporate profit and cost information into this profile.

10. Select the Assessment Information tab.

By default, the target profiler assumes you are trying to maximize profit using the default profit vector. This profit vector assigns a profit of 1 for each responder you correctly identify and a profit of 0 for every non-responder you predict to respond. The best model has the highest profit. You could also build your model based on loss.

11. Select the row for Loss vector.

This loss vector assigns a profit of 0 for each responder you correctly identify and a loss of 1 for every non-responder you predict to respond. The best model has the lowest cost.

There are also Default profit and Default loss matrices.

12. Select the row for Default profit.

This profit matrix assigns a profit of 1 to correctly predicting a responder or a non-responder.

13. Select the row for Default loss.

This loss matrix assigns a cost of 1 for misclassifying a responder or a non-responder.

There are several ways to specify the same information in a problem. For this problem, create a new matrix.

14. Right-click in the open area where the vectors and matrices are listed and select Add.

A new matrix is formed. The new matrix is the same as the default profit matrix, but you can edit the fields and change the values, if desired. You can also change the name of the matrix.

15. Type My matrix in the name field and press the Enter key.

For this problem, responders gave a median of $13.00, and it costs approximately 68 cents to mail to each person; therefore, the net profit for
• mailing to a responder is 13.00 - 0.68 = 12.32
• mailing to a non-responder is 0.00 - 0.68 = -0.68
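As a rough check of what these figures imply, recall that the population response rate is close to 5%. Under that assumption, the expected net profit of mailing to a randomly chosen person is

   0.05 × 12.32 + 0.95 × (-0.68) = 0.616 - 0.646 ≈ -0.03

so an untargeted mailing loses about three cents per piece. A model that ranks likely responders is what makes the campaign profitable.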

16. Enter the profits associated with the vector for action (LEVEL=1). Your matrix should appear as shown below. You may need to maximize your window to see all of the cells simultaneously.

The original profit vector is still active, as indicated by the asterisk next to Profit vector in the list on the right side of the dialog.

Make your newly created matrix active.

17. Left-click on My matrix to highlight it.
18. Right-click on My matrix and select Set to use.

19. Select the Edit Decisions subtab.

By default, you attempt to maximize profit. Since your costs have already been built into your matrix, do not specify them here. Optionally, you could have specified profits of 13 and 0 (rather than 12.32 and -0.68) and then used a fixed cost of 0.68 for Decision=1 and 0 for Decision=0, but that is not done in this example. If the cost is not constant for each person, the Enterprise Miner allows you to specify a cost variable. The radio buttons enable you to choose one of three ways to use the matrix or vector that is activated.

You can choose to
• maximize profit (default) - use the active matrix on the previous page as a profit matrix, but do not use any information regarding a fixed cost or cost variable.
• maximize profit with costs - use the active matrix on the previous page as a profit matrix in conjunction with the cost information.
• minimize loss - consider the matrix or vector on the previous page as a loss matrix.

20. Close the Decisions subtab without modifying the table.

This data set has been oversampled, and the proportions in the population are not represented in the sample.

21. Select the Prior tab.

By default, there are three predefined prior vectors in the Prior tab:
• Equal Probability - contains equal probability prior values for each level of the target.
• Proportional to data - contains prior probabilities that are proportional to the data.
• None - (default) prior class probabilities are not applied.

You can add a new prior vector in the same way you added a new profit matrix.

22. Right-click in the open area where the prior profiles are listed and select Add. A new prior profile is added to the list, named Prior vector.

23. Highlight the new prior profile by selecting Prior vector.

24. Modify the prior vector to represent the true proportions in the population (based on the roughly 5% population response rate noted earlier, approximately 0.05 for level 1 and 0.95 for level 0).

Make the prior vector you specified the active vector.

25. Left-click on Prior vector in the prior profiles list to highlight it.

26. Right-click on Prior vector and select Set to use.

27. Close the target profiler, selecting Yes to save changes when prompted.

Investigating Descriptive Statistics

The metadata sample is used to compute descriptive statistics for every variable. Select the Interval Variables tab.

Investigate the minimum value, maximum value, mean, standard deviation, percentage of missing observations, skewness, and kurtosis for the interval variables. Inspecting the minimum and maximum values indicates no unusual values (such as AGE=0 or TARGET_D<0). AGE has a high percentage of missing values (23%), while TIMELAG has a somewhat smaller percentage (8%).

Select the Class Variables tab.

Investigate the number of levels, percentage of missing values, and the sort order of each variable. Observe that the sort order for TARGET_B is descending while the sort order for all the others is ascending. This occurs because you have a binary target event. It is common to code a binary target with a "1" when the event occurs and a "0" otherwise. Sorting in descending order makes "1" the first level, and this identifies the target event for a binary variable. It is useful to sort other similarly coded binary variables in descending order as well for interpreting the results of a regression model.

Close the Input Data Source node, saving changes when prompted.

Setting up the Data Partition Node

Open the Data Partition node.

The upper-left corner enables you to choose the method for partitioning.

By default, Enterprise Miner takes a simple random sample of the input data and divides it into training, validation, and test data sets. Although it is not done here, to perform
• Stratified sampling, select the Stratified radio button and then use the options in the Stratified tab to set up your strata.
• User Defined sampling, select the User Defined button and then use the options in the User Defined tab to identify the variable in the data set that identifies the partitions.

The lower-left corner enables you to specify a random seed for initializing the sampling process. Randomization within computer programs is often started by some type of seed. If you use the same data set with the same seed in different flows, you will get the same partition. Observe, however, that resorting the data results in a different ordering of the data and therefore a different partition, which can yield different results.

The right side enables you to specify the percentage of the data to allocate to training, validation, and test data. Enter 50 for the values of training and validation. Observe that when you enter 50 for training, the total percentage (110) turns red, indicating an inconsistency in the values. The number changes color again when the total percentage is 100. Close the Data Partition node, saving changes when prompted.

Preliminary Investigation

Add an Insight node to the workspace and connect it to the Data Partition node as illustrated below. Run the flow from the Insight node by right-clicking on Insight and selecting Run. Select Yes when prompted to see the results. A portion of the output is shown below.

Observe that the upper-left corner has the numbers 2000 and 21, which indicates there are 2000 rows (observations) and 21 columns (variables). This represents a sample from either the training data set or the validation data set, but how would you know which one? Close Insight to return to the workspace. Open the Insight node by right-clicking on the node in the workspace and selecting Open. The Data tab is initially active, and the contents of the tab are displayed below.

Observe that the selected data set is the training data set. The name of the data set is composed of key letters (in this case, TRN) and some random alphanumeric characters (in this case, RSUMU); the TRNRSUMU data set is stored in the EMDATA library. The bottom of the tab indicates that Insight, by default, is generating a random sample of 2000 observations from the training data based on the random seed 12345. To change which data set Insight is using, select the Select tab.

You can see the predecessors listed in a table. The Data Partition node is the only predecessor.

Click on the + next to Data Partition and then click on the + next to SAS_DATA_SETS. Two data sets are shown, representing the training and validation data sets.

Select OK to return to the Data tab.

Select the button labeled Properties.

The Information tab is active. This tab provides information about when the data set was constructed as well as the number of rows and columns.

Select the Table View tab.

This tab enables you to view the data for the currently selected data set in tabular form. The check box enables you to see the column headings using the variable labels. Unchecking the box would cause the table to use the SAS variable names for column headings. If no label is associated with a variable, the box displays the SAS variable name. Close the data set details window when you are finished to return to the main Insight dialog.

Select the radio button next to Entire data set to run Insight using the entire data set.

You can run Insight with the new settings by proceeding as follows:
1. Close the main Insight dialog window.
2. Select Yes when prompted to save changes.
3. Run the diagram from the Insight node.
4. Select Yes when prompted to see the results.

Note: You can also run Insight without closing the main dialog by selecting the run icon from the toolbar and selecting Yes when prompted to see the results.

Look at the distribution of each of the variables as follows:
1. Select Analyze → Distribution (Y).
2. Highlight all of the variables except IDCODE in the variable list (IDCODE is the last variable in the list).
3. Select Y.
4. Select IDCODE.
5. Select Label.
6. Select OK.

Charts for continuous variables include histograms, box and whisker plots, and assorted descriptive statistics.

The distribution of AGE is not overly skewed, so no transformation seems necessary.

Charts for categorical variables include histograms.

The variable HOMEOWNR has the value "H" when the person is a homeowner and a value of "U" when the ownership status is unknown. The bar at the far left represents a missing value for HOMEOWNR. These missing values indicate that the value for homeowner is unknown, so recoding these missing values into the level "U" would remove the redundancy in this set of categories. You do this later in the Replacement node.

Some general comments about other distributions appear below.
1. INCOME is treated like a continuous variable since it is a numeric variable.
2. There are more females than males in the training data set, and the observations with missing values for GENDER should be recoded to "M" or "F" for regression and neural network models.
3. The variable MALEMILI is a numeric variable, but the information may be better represented if the values are binned into a new variable.
4. The variable MALEVET does not seem to need a transformation, but there is a spike in the graph near MALEVET=0.
5. The variables LOCALGOV, STATEGOV, and FEDGOV may benefit from a log transformation.
6. The variables PETS and PCOWNERS only contain the values "Y" and ".". Recoding the missing values to "U" for unknown would make these variables more useful for regression and neural network models.
7. The distributions of CARDPROM and NUMPROM do not need any transformation.
8. The variables CARDGIFT and TIMELAG may benefit from a log transformation.
9. The variable AVGGIFT may yield better results if its values are binned.

You can see how responders are distributed within Insight. To do so, proceed as follows:
1. Scroll to the distribution of TARGET_B.
2. Select the bar corresponding to TARGET_B=1.
3. Scroll to the other distributions and inspect the highlighting pattern.

Examples of the highlighting pattern for TIMELAG and PCOWNERS are shown below. No clear relationships are obvious from these graphs.

When you are finished, return to the main process flow diagram as follows:
1. Close the distribution window when finished.
2. Close the INSIGHT data table.
3. If you ran INSIGHT without closing the node, close the INSIGHT node (saving changes if prompted).

Performing Variable Transformations

Some input variables have highly skewed distributions. In highly skewed distributions, a small percentage of the points may have a great deal of influence. On occasion, performing a transformation on an input variable may yield a better fitting model. This section demonstrates how to perform some common transformations.

Add a Transform Variables node to the flow as shown below. After connecting it, open the node by right-clicking on it and selecting Open. The Variables tab is shown by default, which displays statistics for the interval level variables, including the mean, standard deviation, skewness, and kurtosis (calculated from the metadata sample). The Transform Variables node enables you to rapidly transform interval valued variables using standard transformations. You can also create new variables whose values are calculated from existing variables in the data set. Click on the column heading for the Name column to sort by Name. Observe that the only non-grayed column in this dialog is the Keep column.

You can view the distribution of each variable just as you did in the Input Data Source node. Begin by viewing the distribution of AGE. The distribution of AGE is not highly skewed, so no transformation is performed. Close the distribution of AGE.

Investigate the distribution of AVGGIFT.

This variable has the majority of its observations near zero, and very few observations appear to be higher than 30. Consider creating a new grouping variable that creates bins for the values of AVGGIFT. You can create just such a grouping variable in several different ways.
1. Bucket - creates cutoffs at approximately equally spaced intervals.
2. Quantile - creates bins with approximately equal frequencies.
3. Optimal Binning for Relationship to Target - creates cutoffs that yield an optimal relationship to the target (for binary targets).

The Optimal Binning for Relationship to Target transformation uses the DMSPLIT procedure to optimally split a variable into n groups with regard to a binary target. This binning transformation is useful when there is a nonlinear relationship between the input variable and the binary target. An ordinal measurement level is assigned to the transformed variable.

To create the n optimal groups, the node applies a recursive process of splitting the variable into groups that maximize the association with the target values. To determine the optimum groups and to speed processing, the node uses the metadata as input. For more detail, see the Help files for the Transform Variables node in the Enterprise Miner.

The Enterprise Miner provides a great deal of help, organized in files for easy access. To obtain software-based help on this topic, proceed as follows:
1. Select Help → Enterprise Miner Reference.
2. Double-click on the book Enterprise Miner Version 3.0 Reference Help.
Note: Optionally, select the book titled Enterprise Miner Version 3.0 Reference Help and then select Open.
3. Double-click on the Transform Variables Node.
Note: Optionally, select Transform Variables Node and then select Display.
4. Select Creating Transformed variables.
5. Select Binning Transformations.
6. Close the help window when you are finished.

Create bins for AVGGIFT. Suppose your earlier analysis suggested binning the values into the intervals 0-10, 10-20, and 20+. To create the binning variable, proceed as follows.
1. Position your cursor over the row for AVGGIFT.
2. Right-click and choose Transform → Bucket.
Note: Optionally, select Transform → Quantile.
3. The default number of buckets is 4. Change this value to 3 using the arrows.

4. Select Close.

5. Enter 10 in the Value field for Bin 1.
6. Use the arrow to change from Bin 1 to Bin 2.
7. Enter 20 in the Value field for Bin 2. The result appears as pictured below:

8. Close the plot to return to the previous window.

A new variable is added to the table. The new variable has the truncated name of the original variable followed by a random string of digits. Note that the Enterprise Miner set the value of Keep to No for the original variable. If you wanted to use both the binned variable and the original variable in the analysis, you would need to modify this attribute for AVGGIFT and set the value of Keep to Yes, but that is not done here.

Examine the distribution of the new variable.

The View Info tool reveals that over 40% of the data is in each of the two lowest categories, while approximately 10% of the data is in the highest category.

Note: For a workshop, consider using this variable as a response variable in another analysis. In order to do so, you will have to use the Data Set Attributes node to change the model role of TARGET_B and TARGET_D to rejected and the role of the newly created variable to target.

Recall that the distributions of LOCALGOV, STATEGOV, FEDGOV, CARDGIFT, and TIMELAG were highly skewed to the right. A log transformation of these variables may provide more stable results.

Begin by transforming CARDGIFT. To do so, proceed as follows:
1. Position the tip of the cursor on the row for CARDGIFT and right-click.
2. Select Transform → Log.

Inspect the resulting table.

The formula shows that the Enterprise Miner has actually performed the log transformation after adding one to the value of CARDGIFT. Why has this occurred? Recall that CARDGIFT has a minimum value of zero. The logarithm of zero is undefined, and the logarithm of something close to 0 is extremely negative. The Enterprise Miner takes this information into account and actually uses the transformation log(CARDGIFT+1), creating a new variable with values greater than or equal to zero (since log(1)=0).
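For reference, the generated transformation is equivalent to the following DATA step logic (an illustration only; the Enterprise Miner writes its own code and names the new variable by truncating the original name and appending random digits, so the data set and variable names below are made up):

   data work.transformed;
      set work.train;
      CARDGIFT_LOG = log(CARDGIFT + 1);  /* always >= 0, since log(1) = 0 */
   run;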

Inspect the distribution. It is much less skewed than before.

You can perform log transformations on the other variables (LOCALGOV, STATEGOV, FEDGOV, and TIMELAG) by proceeding as follows:
1. Select one of the variables, say FEDGOV.
2. Press and hold the Ctrl key on the keyboard.
3. While holding the Ctrl key, select each of the other variables.
4. When all have been selected, release the Ctrl key.
5. Right-click on one of the selected rows and select Transform → Log.
6. View the distributions of these newly created variables.

It may be appropriate at times to keep both the original variable and the created variable, although that is not done here. It is also not commonly done when the original variable and the transformed variable have the same measurement level.

Close the node when you are finished, saving changes when prompted.

Understanding Data Replacement

Add a Replacement node to the diagram. This will allow you to modify the missing values for the class variables. The Replacement node allows you to replace missing values with new values. This replacement is necessary to utilize all of the observations in the training data for building a regression or neural network model. Decision trees handle missing values directly, while regression and neural network models ignore all observations with missing values on any of the input variables. It is more appropriate to compare models built on the same set of observations, so you should perform data replacement before any regression or neural network model when you plan on comparing the results to those obtained from a decision tree model. Your new diagram should appear as follows:

Open the Replacement node. The Defaults tab is displayed first. Check the box for Create imputed indicator variables and use the arrow to change the Role field to input.

This box requests the creation of new variables, each having a prefix "M_", which have a value of "1" when an observation has a missing value for the associated variable and "0" otherwise. Observe that if the missingness of a variable is related to the response variable, the regression and the neural network model can use these newly created indicator variables to identify observations that had missing values originally.
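The indicator logic itself is one line per variable. A sketch of what an "M_" variable amounts to (illustration only; the node generates these automatically, and the data set name is made up):

   data work.flagged;
      set work.train;
      M_AGE = (AGE = .);  /* 1 if AGE is missing, 0 otherwise */
   run;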

Note: The Replacement node allows you to replace certain values before imputing. Perhaps a data set has coded all missing values as 999. In this situation, check the box next to Replace before imputation and then have the value replaced before imputing.

Using Data Replacement

Select the Data subtab. Most nodes have a Data tab that enables you to see the names of the data sets being processed as well as a view of the data in each one. The radio button next to Training is selected. The Enterprise Miner has assigned the name TRNTP31E to the training data set, and it is stored in the EMDATA library. The name of the data set is composed of some key characters (for this example, TRN) followed by several randomly generated alphanumeric characters (for this example, TP31E).

This type of naming is necessary since SAS 6.12 only allows names with a maximum of 8 alphanumeric characters. The description of the data set allows you to identify the data set as the transformed training data. You can see the name assigned to the validation data set by clicking on the radio button next to Validation.

Since the process involves random assignment, the name of your data set will almost certainly be different; however, the library name (EMDATA) will be the same for every project. The EMDATA library name points to the EMDATA folder in the project library for this project. When opening a new project, the EMDATA library name is reassigned to the EMDATA folder in the new project folder.

To view a data set, select the appropriate radio button and then select Properties. Information about the data set appears. Select Table View to see the data. The training data set is shown below. Uncheck the Variable labels box to see the variable names.

Close the Data Set details window.

Select the Training subtab under the Data tab.

By default, the imputation is based on a random sample of the training data. The seed is used to initialize the randomization process. Generating a new seed will create a different sample. To use the entire training data set, select the button next to Entire data set. The subtab information now appears as pictured below.

Return to the Defaults tab and select the Imputation Methods subtab. This shows that the default imputation method for Interval Variables is the mean (of the random sample from the training data set or of the entire training data set, depending on the settings in the Data tab). By default, imputation for class variables is done using the most frequently occurring level (or mode) in the same sample. If the most commonly occurring value is missing, the second most frequently occurring level in the sample is used.

Click on the arrow next to the method for interval variables. The Enterprise Miner provides the following methods for imputing missing values for interval variables:
1. Mean - (default) the arithmetic average.
2. Median - the 50th percentile.
3. Midrange - the maximum plus the minimum, divided by two.
4. Distribution-based - replacement values are calculated based on the random percentiles of the variable's distribution.

5. Tree imputation - replacement values are estimated using a decision tree built from the remaining input and rejected variables that have a status of use as the predictors.

6. Tree imputation with surrogates - same as above, but using surrogate variables for splitting whenever a split variable has a missing value. This prevents forcing everyone with a missing value for a variable into the same node.

7. Mid-min spacing - the mid-minimum spacing. To calculate this statistic, the data is trimmed using N percent of the data, as specified in the Proportion for mid-minimum spacing entry field. By default, 90% of the data is used to trim the original data. The mid-minimum spacing is then the maximum plus the minimum of the trimmed distribution, divided by two.

8. Tukey's biweight, Huber's, and Andrew's wave - these are robust M-estimators of location. This class of estimators minimizes functions of the deviations of the observations from the estimate that are more general than the sum of squared deviations or the sum of absolute deviations. M-estimators generalize the idea of the maximum-likelihood estimator of the location parameter in a specified distribution.

9. Default constant - you can set a default value to be imputed for some or all variables. You set up the default constant using the Constant values subtab, described next.

10. None - turns off the imputation for the interval variables.
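As a point of reference, the simplest of these methods has a direct base SAS analogue. A sketch of mean imputation with PROC STDIZE follows (an illustration under made-up data set names, not the code the node generates):

   proc stdize data=work.train out=work.imputed
               method=mean reponly;   /* reponly: replace missing values only */
      var AGE TIMELAG;
   run;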

Click on the arrow next to the method for class variables. The Enterprise Miner provides several of the same methods for imputing missing values for class variables, including distribution-based, tree imputation, tree imputation with surrogates, default constant, and none. You can also choose most frequent value (count), which uses the mode of the data being used for imputation. If the most commonly occurring value for a variable is missing, the Enterprise Miner uses the next most frequently occurring value.

Select Tree imputation as the imputation method for both types of variables.

When using tree imputation for imputing missing values, use the entire training data set for more consistent results.

Regardless of the values set in this section, you can select any imputation method for any variable. This tab merely controls the default settings.

Select the Constant values subtab. This subtab enables you to replace certain values (before imputing, if desired, using the check box on the Defaults tab). It also enables you to specify constants for imputing missing values. Enter U in the field for character variables.

The constant is used for imputing the missing values of numeric (character) variables when you select default constant as the default method for numeric (character) variables on the Imputation methods subtab. The default imputed value for a variable is based on whether the variable is character or numeric in the data set. In this example, you are using a tree to impute missing values.

Select the Tree Imputation tab. This tab enables you to set the variables that will be used for tree imputation. Observe that target variables are not available, and rejected variables are not used by default. To use a rejected variable, you can set the Status to use, but that is not done here.

Select the Interval Variables tab. Suppose you want to change the imputation method for AGE to mean and for CARDPROM to the value 20. First click on the column heading for Name to sort the variables by name (this positions AGE and CARDPROM at the top of the list).

To specify the imputation method for AGE, proceed as follows:
1. Position the tip of your cursor on the row for AGE in the Imputation Method column and right-click.
2. Select Select Method → Mean.

To specify the imputation method for CARDPROM, proceed as follows:
1. Position the tip of your cursor on the row for CARDPROM in the Imputation Method column and right-click.
2. Select Select Method → Set Value.
3. Type 20 in the New Value field.
4. Select OK.

Specify none as the imputation method for TARGET_D in like manner. Inspect the resulting window. A portion of the window appears below.

Select the Class Variables tab.

To modify the imputation method for HOMEOWNR, proceed as follows:
1. Right-click on the row for HOMEOWNR in the Imputation Method column.
2. Select Select Method → Set Value.
3. Select the radio button next to Data Value.
4. Use the arrow to choose "U" from the list of data values.

5. Select OK to accept the value and return to the previous dialog.

To modify the imputation method for PETS and PCOWNERS, proceed as follows:
1. Right-click on the row for PETS in the Imputation Method column.
2. Select Select Method → Default Constant.
3. Repeat steps 1 and 2 for PCOWNERS.

To change the imputation for TARGET_B to none, proceed as follows:
1. Right-click on the row for TARGET_B in the Imputation Method column.
2. Choose Select Method → None. This prevents the addition of imputation code for the target variable in the Score node. Inspect the resulting window.

Select the Output tab. While the Data tab shows the input data, the Output tab shows the output data set information.

Close the Replacement node, saving the changes when prompted.

2.3 Fitting and Comparing Candidate Models

Fitting A Regression Model

Connect a Regression node. The diagram should now look like the illustration below.

Open the Regression node. Find the Tools menu at the top of the session window and select Tools → Interaction Builder. This tool enables you to easily add interactions and higher-order terms to the model, although you do not do so now.

The input variables are shown on the left, while the terms in the model are shown on the right. The Regression node fits a model containing all main effects by default. If no variable selection has been done, this model will probably not overfit the training data set. Close the Interaction Builder window when you are finished inspecting it.

Select the Selection Method tab. This tab enables you to perform different types of variable selection using various criteria. No selection is done by default.

You can choose from the following variable selection techniques:
1. Backward - begins, by default, with all candidate effects in the model and then systematically removes effects that are not significantly associated with the target until no other effect in the model meets the Stay Significance Level or until the Stop criterion is met. This method is not recommended when the target is binary or ordinal and there are many candidate effects or many levels for some classification input variables.
2. Forward - begins, by default, with no candidate effects in the model and then systematically adds effects that are significantly associated with the target until none of the remaining effects meet the Entry Significance Level or until the Stop criterion is met.
3. Stepwise - as in the Forward method, Stepwise selection begins, by default, with no candidate effects in the model and then systematically adds effects that are significantly associated with the target. However, after an effect is added to the model, Stepwise may remove any effect already in the model that is not significantly associated with the target.
4. None - (default) all candidate effects are included in the final model.

Choose Stepwise using the arrow next to the Method field.

The stopping criteria field enables you to set the maximum number of steps before the Stepwise method stops. The default is set to the number of effects in the model.

The Stepwise method uses cutoffs for variables entering the model and for variables leaving the model.

Changing these values may impact the final variables included in the model.

Inspect the Effect Hierarchy options in the lower-left corner of the window.

Model hierarchy refers to the requirement that, for any effect in the model, all effects that it contains must also be in the model. For example, in order for the interaction A*B to be in the model, the main effects A and B must also be in the model. The Effect Hierarchy options enable you to control how a set of effects is entered into or removed from the model during the effect selection process.

Consider the Number of Variables subsection in the lower-right corner of the window.

This enables you to select a specified number of variables to begin the selection process (for Forward) or a minimum number of variables to remain in the model. The order depends on the order displayed in the Interaction Builder. To change the order of effects, you can select Tools → Model Ordering, but that is not done here.

Close the Regression node, saving the changes when prompted. Since you have changed the default settings for the node, it prompts you to change the default model name. Enter StepReg in the Model Name field.

Select OK.

Evaluating the Model

Add an Assessment node to the diagram. Your flow should now look as follows.

Right-click on the Assessment node and select Run. This will allow you to generate gains charts for the regression. Observe that each node becomes green as it runs. Since you ran the flow from the Assessment node, you are prompted to see the Assessment results. Select Tools → Lift Chart.

A Cumulative %Response chart is shown by default. This chart groups people based on their predicted probability of response and then plots the percentage of respondents. To see the exact percentage, click on the View Info tool and then click on the red line. The probability of response in the first decile (top 10%) is 9.38%, but what does that mean?

To interpret the Cumulative %Response chart, consider how the chart is constructed.
1. For this example, a responder is defined as someone who responded to the mailing (TARGET_B=1). For each person, the fitted model (in this case, a regression model) predicts the probability that the person will respond. Sort the observations by the predicted probability of response, from the highest probability of response to the lowest probability of response.
2. Group the people into ordered bins, each containing approximately 10% of the data.
3. Using the target variable TARGET_B, count the percentage of actual responders in each bin.
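The same construction is easy to reproduce in base SAS once you have a scored data set. A sketch follows (an illustration only; the data set names are made up, and TARGET_B and P_TARGE1 are the target and predicted-probability variables named elsewhere in this chapter):

   proc rank data=work.scored out=work.ranked groups=10 descending;
      var P_TARGE1;    /* predicted probability of response */
      ranks decile;    /* 0 = highest-scoring 10% of people  */
   run;

   proc means data=work.ranked mean;
      class decile;
      var TARGET_B;    /* mean of a 0/1 target = %response per bin */
   run;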

If the model is useful, the proportion of responders will be relatively high in bins where the predicted probability of response is high. The blue line represents the baseline rate (5%) for comparison purposes. You can also see the %Response in each bin. Select the radio button next to Non-cumulative on the left side of the graph.

Cumulative %Response Non-Cumulative %Response

Select the Cumulative button and then select Lift Value. Lift charts plot the same information on a different scale. Recall that the population response rate is 5%. A lift chart can be obtained by dividing the response in each decile by 5% to get a relative improvement over baseline. Recall that the cumulative percentage of respondents in the first decile was 9.38%. The lift chart plots the value 9.38/5.00=1.88 in the first decile. This indicates that if you chose this bin for action, you would expect to get 1.88 times as many responders as you would from taking a simple random sample of the same size.

Cumulative %Response Cumulative Lift Value

Instead of asking the question, "What percentage of observations in a bin were responders?", you could ask the question, "What percentage of the total number of responders are in a bin?" This can be evaluated using the Captured Response curve. To inspect this curve, select the radio button next to %Captured Response. You can use the View Info tool to evaluate how the model performs. For example, if you mailed to the top 20% of the observations as scored by your model, you would capture just over 35% of the people who would have responded (a lift of about 1.75!).

You can also consider cumulative and non-cumulative profit charts by selecting the radio button next to Profit.

Cumulative Profit Non-Cumulative Profit

You can temporarily reassign the profits using the portion of the interface pictured below, but that is not done here.

If you use the Edit button to modify the target profile, you must click on Apply after making your modifications to see the results of the modification in the chart. The Apply button is grayed out here since no changes have been made.

Select the View Lift Data icon on the toolbar and inspect the resulting table.

The first ten rows have information about the baseline model, while the next ten rows have information about the regression model. Observe that the Expected Profit drops to 0 beyond the third decile of the regression model. Enterprise Miner uses the target profiler to identify which predicted response yields the greatest profit. The Expected Profit is then calculated based on the predicted response. The profit drops to zero for people beyond the third decile since these people would not be targeted for the mailing.

Fitting A Default Decision Tree

Add a default Tree node to the workspace. Connect the Data Partition node to the Tree node, and then connect the Tree node to the Assessment node. The flow should now appear like the one pictured below.

A decision tree handles missing values directly, so it does not need data replacement. Monotonic transformations of interval numeric variables will probably not improve the tree fit, since the tree bins numeric variables. In fact, the tree may actually perform worse if you connect it after binning a variable in the Transform Variables node, since binning reduces the splits the tree can consider (unless you include both the original variable and the binned variable in the model).

Run the flow from the Assessment node and select Yes when you are prompted to view the results. The Assessment node opens with two models displayed. The values in the Model ID column will probably be different in your flow.

You can change the name for the model by editing the Name column. This feature is especially useful when you fit several competing models of the same type (regression, decision tree). Enter the name DefTree in the Name column for the Tree tool to indicate that you have fit a default tree. Your window should now look like the one below.

To generate a lift chart, highlight both rows in the Assessment node. You can do this by selecting the row for one model and then Ctrl-clicking on the row for the other model. You can also drag through both rows to highlight them simultaneously. Your resulting window should appear like the diagram below.

Select Tools → Lift Chart to compare how the models perform on the validation data set. Observe that the regression model outperforms the default tree throughout the graph.

Occasionally the entire legend is not visible. You can view the entire legend by maximizing the window, or by modifying the graph window until you see the entire legend.

To modify the graph window,
• click on the Move/Resize Legend icon on the toolbar at the top and move the legend or resize the legend vertically and/or horizontally.
• click on the Move Graph icon on the toolbar and reposition the graph.

The default legend and formatted legend are pictured below. To use the names that you entered, select Format → Model Name from the menu at the top of the window.

Legend Using Tool Name (default) Legend Using Model Name

Close the Assessment node when you have finished inspecting the various lift charts.

Fitting A Default Neural Network

Add a default Neural Network node to the workspace. Connect the Replacement node to the Neural Network node, and then connect the Neural Network node to the Assessment node. The flow should now appear like the one pictured below.

The neural network, like the regression model, requires data replacement. If replacement is not done, the model is fit using only complete observations. Run the flow from the Assessment node and select Yes when prompted to view the results. Highlight all three models and select Tools → Lift Chart to compare these models. The lift chart appears like the one below.

The neural network and the tree models do not perform as well as the regression model.

2.4 Generating and Using Scoring Code

The Score! node can be used to evaluate, save, and combine scoring code from different models. To see scoring code from each of the models, you would have to add a Score! node to the workspace and then connect every modeling node to it. Connecting all of these nodes to the scoring node could make the workspace rather confusing. You can use a Control Point node to accomplish the same task and still have a clear diagram.

Modify your workspace to appear like the diagram below as follows:
1. Delete all lines connected to the Assessment node. You can delete a line by right-clicking on it and selecting Delete. Optionally, you can delete a line by selecting the line and then pressing Backspace or Delete on your keyboard.
2. Move the Assessment node to the right to make room for the Control Point.
3. Drag a Control Point node onto the workspace and position it where the Assessment node was previously positioned.
4. Drag a Score! node onto the workspace and position it above the Assessment node.
5. Connect the modeling nodes to the Control Point.
6. Connect the Control Point node to the Assessment node and the Score! node.

Open the Score! node. The Settings tab is active.

The Settings tab provides options for when you run the Score! node in a path.

The following radio button choices are available:
1. Inactive (default) - exports the most recently created scored data sets.
2. Apply training data score code to score data set - applies scoring code to the score data set.
3. Accumulate data sets by type - copies and exports data sets imported from predecessor nodes. If you use this action in a path containing a group processing node, the output data sets are concatenated.
4. Merge data sets by type - merges data sets imported from predecessor nodes. For example, you can use this action to merge the training data sets from two modeling nodes to compare the predicted values. If the number of observations in each score data set is not the same, then an error condition is created.

The Score Code tab enables you to see the scoring code for each model connected to the Score! node.

Click on the arrow to see the available management functions. By default, the Current Imports are listed in the left list box of the Score Code tab. The other list options include
• Current imports - (default) lists the scoring code currently imported from node predecessors.
• Accumulated runs - lists scoring code that is exported by the node's predecessors during the most recent path run (training action). If the training action involves group processing, a separate score entry is listed for each group iteration for each predecessor node. This is the only access to score code generated from group processing.
• Saved - lists saved or merged score code entries.
• All - lists all score code entries that are managed by the node.


To see the scoring code for a model, double-click on the desired model in the list on the left, and the associated scoring code is displayed in the window on the right. The code is a SAS program that performs a SAS data step. You can use the scoring code on any system having base SAS.

If you modify the settings in a modeling node and run the flow, the scoring code associated with the affected model is updated. To keep modifications in the workspace from affecting the scoring code, you can save the scoring code as follows:
1. Select the name of the model used for developing the scoring code you want to save from the list on the left-hand side of the window. For this example, save the scoring code for the regression model.

2. Right-click on the selected model and select Save.

A dialog opens that enables you to name the saved source file. You can enter a name if desired, although this is not necessary.

3. Type in a name, such as My Regression Code.

4. Press OK. The Enterprise Miner now displays the Saved runs.

The code is now saved within Enterprise Miner. To use the code outside of Enterprise Miner in a SAS session, you need to export the scoring code from Enterprise Miner. You can export the scoring code as follows:
1. Highlight the name representing the desired code in the list on the left side.
2. Right-click on the highlighted name and select Export.
3. Enter a name for the saved program, such as MyCode, and select Save.


Scoring Using base SAS (optional)

You can use the saved scoring code to score a data set using base SAS. Enterprise Miner runs on top of a SAS session. You can use this SAS session regardless of the current window in the Enterprise Miner. Use SAS to score the MYSCORE data set in the CRSSAMP library. This is a data set with all of the inputs for the model but no response information. To do so, proceed as follows:
1. Select Window → Program Editor to make the SAS session active.
2. Select File → Open.
3. Find and select the program that you just saved (named MYCODE in this example).

Note: If you used the default folder when saving the code, it will be in the same folder that opens when you select File → Open.

4. Select Open. The scoring code appears in the Program Editor of the SAS session. A portion of the code appears below.
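As a rough sketch (the real exported program is much longer and model-specific), the code has this overall shape:

   data &_PREDICT;   /* output data set that receives the predictions */
      set &_SCORE;   /* input data set to be scored                   */
      /* imputation, transformation, and model statements appear      */
      /* here, ending in predicted values such as P_TARGE1            */
   run;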

The data set _PREDICT is the data set that is created containing the predicted values. The data set represented by _SCORE is the data set you want to score. Since these data sets are referenced through macro variables (preceded by "&_"), the macro variables need to be initialized.


5. Score the MYSCORE data set in the CRSSAMP library. To do so, first initialize _PREDICT and _SCORE using the following code:
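One initialization consistent with the notes that follow (the first line points _SCORE at the data set to score; the second supplies a placeholder for _PREDICT):

   %let _SCORE   = CRSSAMP.MYSCORE;   /* data set to be scored        */
   %let _PREDICT = X;                 /* dummy name; the scoring code */
                                      /* recreates this data set      */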

The second line will initialize _PREDICT. There is actually no "X" data set. It is just a dummy name. The actual _PREDICT data set is recreated by the scoring code.

6. To see the results of scoring, add the following code at the end of the program:
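A PROC PRINT step along these lines does the job (a sketch, assuming the predictions land in the &_PREDICT data set):

   proc print data=&_PREDICT;
      var IDCODE P_TARGE1;   /* ID and predicted response probability */
   run;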

This code will print IDCODE and P_TARGE1 (predicted probability of response).

7. Submit the scoring code by selecting Locals → Submit or by selecting the Submit icon from the toolbar. Inspect the resulting output.

The cutoff for the regression model was approximately 0.055. To implement this model, send the mailing to people with a predicted probability of response (P_TARGE1) greater than or equal to 0.055.
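To extract that mailing list in the same SAS session, a data step such as the following would work (MAIL_LIST is a hypothetical name):

   data mail_list;
      set &_PREDICT;
      where P_TARGE1 >= 0.055;   /* keep prospects at or above the cutoff */
   run;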

8. Select Window → Score! to return to Enterprise Miner.
9. Close the Score! node to return to the Enterprise Miner workspace.


Scoring Within The Enterprise Miner

You have just used the saved scoring code to score a data set using base SAS. Now score the same data set using the Enterprise Miner. Begin by adding another Input Data Source node to the flow and connect it to the Score! node. Add an Insight node and connect the Score! node to it as pictured below.

Select the MYSCORE data set from the CRSSAMP library.

Change the role of the data set from RAW to SCORE.

Observe that the data set has 12,765 rows and 19 columns.

Inspect the variables if you desire. There is no need to modify the variables here, since the role and level of each variable is built into the scoring code. After inspection, close this Input Data Source node, saving changes when prompted.


Open the Score! node. By default, the Score! node is inactive when running items in a path. Select the radio button next to Apply training data score code to score data set. The Score! node will now add prediction information to the data set that it is scoring.

After requesting that the Score! node apply the score code, the Output variables subtab becomes available. This subtab enables you to control what values are added to the scored data set. All variables are included by default, but the options shown below allow you to drop certain variables, if desired. No variables are dropped in this example.

Recall that you fit three different models. Which one of these models will be used for scoring? You can control this by specifying the desired model in the Data tab. Select the Data tab.

Choose the Select button to see the list of predecessors.


A window opens displaying the predecessor nodes.

Use the browser to find the data set associated with the regression node. Highlight the data set as shown below.

Select OK to accept this selection. The regression code will now be used for scoring the new data set.

Close the Score! node, saving changes when prompted.

Next open the Insight node. Choose the Select option on the Data tab to select the data set associated with the score data. This data set will typically have an SD prefix followed by a string of random alphanumeric characters.


The selected data set is SD_4YNZJ. The Description field indicates that this data set represents score data.

Select OK to return to the Data tab in Insight. Choose the option to use the entire data set. Then close Insight, saving changes when prompted. Run Insight and view the results.

The scored data set now has 48 variables. Only 19 variables were in the original data set, so the scoring code has added 29 additional variables to this data set. If you only want to add selected variables when scoring, you can specify fewer variables in the Score! node as described earlier. You can see some of the newly created variables by scrolling to the right.


2.5 Generating a Report Using the Reporter Node

To see the results of your analysis, you can add a Reporter node to your workspace. Add the Reporter node after the Assessment node so that the Assessment results are included in the report. Run the flow from the Reporter node. Observe that the nodes become yellow as they are activated.

When the run is finished, you can select OK to acknowledge the creation of the report or Open to open the report using your default browser.

If you do not open the report now, you can view it later by selecting the Reports subtab.


Chapter 3: Variable Selection

3.1 Introduction to Variable Selection
3.2 Using the Variable Selection Node
3.3 Using the Tree Node


3.1 Introduction to Variable Selection

When analyzing data, you are often faced with choosing a subset of variables. While an earlier example used stepwise regression to select a subset of input variables on which to build the model, this method may not perform as well when evaluating data sets with dozens (or hundreds!) of potential input variables. You can perform variable selection using two other nodes in the Enterprise Miner. This chapter explores techniques to identify important variables using the Variable Selection node and the Decision Tree node.

For the purposes of this chapter, consider the first flow you constructed. To begin,
• add a Variable Selection node after the Replacement node.
• add a Tree node after the Data Partition node.

Your workspace should now appear as follows:


3.2 Using the Variable Selection Node

Open the Variable Selection node. The Variables tab is active.

Select the Manual Selection tab. This tab enables you to force variables to be included in or excluded from future analyses. By default, the role assignment is automatic, which means that the role will be set based on the analysis performed in this node.

Select the Target Associations tab. This tab enables you to choose one of two selection criteria and specify options for the chosen criterion. By default, the node will remove variables unrelated to the target (according to the settings used for the selection criterion) and score the data sets. Consider the settings associated with the default R-square criterion first.


Selection using R-square Criterion

Since R-square is already selected as the selection criterion, click on the Settings button on the Target Associations tab.

The R-square criterion uses a goodness-of-fit criterion to evaluate variables. It uses a stepwise method of selecting variables that stops when the improvement in the R-square is less than 0.00050. By default, the method rejects variables whose contribution is less than 0.005.

The following three-step process is performed when you apply the R-square variable selection criterion to a binary target. If the target is non-binary, only the first two steps are performed. (A rough base SAS sketch of the first step appears after the list.)

1. Enterprise Miner computes the squared correlation of each variable with the target and then assigns the rejected role to those variables that have a value less than the Squared correlation criterion (default 0.00500).

2. Enterprise Miner evaluates the remaining significant (chosen) variables using a forward stepwise R-square regression. Variables that have a stepwise R-square improvement less than the Threshold criterion (default 0.00050) are assigned the rejected role.

3. For binary targets, Enterprise Miner performs a logistic regression using the predicted values output from the forward stepwise regression as the independent input.
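Here is that rough sketch of the first step, screening candidate inputs by their squared correlation with the target. The data set and input names are hypothetical, and the node performs all three steps internally; this only illustrates the screening idea.

   proc corr data=train outp=corrout noprint;
      var lastt cardgift;   /* candidate inputs (hypothetical list) */
      with target_b;        /* the binary target                    */
   run;

   data _null_;
      set corrout;
      where _type_ = 'CORR';
      array ins{*} lastt cardgift;
      do i = 1 to dim(ins);
         if ins{i}**2 < 0.005 then
            put 'Reject candidate input ' i '(squared correlation < 0.005)';
      end;
   run;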

Additional options enable you to
• test 2-way interactions - when selected, this option requests Enterprise Miner to evaluate 2-way interactions for categorical inputs.
• bin interval variables in up to 16 bins - when selected, this option requests Enterprise Miner to bin interval variables into 16 equally spaced groups (AOV16). The AOV16 variables are created to help identify nonlinear relationships with the target. Bins with zero observations are eliminated, so an AOV16 variable can have fewer than 16 bins.
• use only grouped class variables - when selected, Enterprise Miner uses only the grouped class variable to evaluate variable importance. Deselecting this option requests Enterprise Miner to use the grouped class variable as well as the original class variable in evaluating variable importance, which may greatly increase processing time.

Leave the default settings and close the node. Run the flow from the Variable Selection node and view the results.


Click on the Role column heading. Then click on the Rejection Reason column heading. Inspect the results.

LASTT and CARD_2N2 (from CARDGIFT) are retained. Select the R-square tab.

This tab shows the R-square for each effect with TARGET_B. Select the Effects tab.

This tab shows the total R-square as each variable is added into the model.


Selection using Chi-square Criterion

Open the Variable Selection node and choose the Chi-square criterion.

Click on the Settings button and inspect the options.

Variable selection is performed using binary splits that maximize the Chi-square value of a 2x2 frequency table.
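For reference, for a 2x2 table with cell counts a, b, c, d and total N = a+b+c+d, the Pearson chi-square statistic being maximized is

   chi-square = N*(a*d - b*c)**2 / ((a+b)*(c+d)*(a+c)*(b+d))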

• Bins option - determines the number of categories into which the range of each interval variable is divided for splits. By default, interval inputs are binned into 50 levels.
• Chi-square option - governs the number of splits that are performed. By default, the Chi-square value is set to 3.84 (the 0.95 quantile of a chi-square distribution with one degree of freedom, corresponding to a 0.05 significance level). As you increase the Chi-square value, the procedure performs fewer splits.
• Passes option - determines how many passes the node makes through the data to find the optimum splits. By default, the node makes 6 passes.

Close this node, saving changes when prompted, and rerun the flow with the default Chi-square settings. Inspect the results.

If the resulting number of variables is too high, consider increasing the Chi-square cutoff value. Increasing the Chi-square cutoff value will generally reduce the number of retained variables.


3.3 Using the Tree Node

Open the Tree node. The Variables tab is active by default. Select the Basic tab.

Select Gini reduction for the splitting criterion.

Select the Advanced tab.

Change the model assessment measure to Total Leaf Impurity.


Select the Score tab.

Check the box next to Training, Validation, and Test. This option requests Enterprise Miner to modify the status of the variables before passing them on to subsequent nodes.

Select the Variables subtab. This tab controls in part how the Tree node will modify the data set when it is run.

Close the Tree node, saving changes when prompted. It is unnecessary to specify a different name for the model, but you may do so if desired when prompted. Select OK to exit.


Run the flow from the Tree node and view the results. The All tab is active by default.

Thirty-nine trees were fit to the training data. The 29-leaf tree had the lowest total impurity on the validation data set, so it is selected by default. You will consider the other tree results in a later chapter. Select the Score tab and then select the Variables subtab. This subtab enables you to see the variables that have been retained and those that have been rejected.

The chosen tree retains 12 variables for analysis. You could add a Regression node or Neural Network node to the flow following this Tree node; however, since no data replacement or variable transformations have been performed, you should consider doing these things first (for the input variables identified by the variable selection). In general, a tree with more leaves will retain a greater number of variables, while a tree with fewer leaves will retain a smaller number of variables.


Chapter 4: Neural Networks

4.1 Visualizing Neural Networks
4.2 Visualizing Logistic Regression


4.1 Visualizing Neural Networks

To allow visualization of the output from an MLP, a network will be constructed with only two inputs. Two inputs permit direct viewing of the trained prediction model and speed up training. Insert a new diagram in the My Projects folder and assemble the diagram shown below.

An Input Data Source node connects to a Data Partition node. The Data Partition node connects to a Replacement node. The Replacement node connects to a Neural Network node. The Neural Network node connects to an Insight node.

Select the data for this example.

1. Open the Input Data Source node.

2. Select the BUYRAW data set from the CRSSAMP library.

3. Set the model role of RESPOND to target.

4. Set the model role of all other variables, except AGE and INCOME, to rejected.

5. Close and save changes to the Input Data Source node.

Now partition the data.

Open the Data Partition node.

1. Set Validation to 60 and Test to 0.

No test set will be needed for this example. For efficiency, the test data will be grouped with the validation data.

2. Close and save changes to the Data Partition node.


Now construct the MLP.
1. Open the Neural Network node. The Variables tab is active.

2. Select the General tab.

You can specify one of the following criteria for selecting the best model:
• Average Error - chooses the model that has the smallest average error for the validation data set.
• Misclassification Rate - chooses the model that has the smallest misclassification rate for the validation data set.
• Profit/Loss - chooses the model that maximizes the profit or minimizes the loss for the cases in the validation data set.

You can also specify options regarding the training history and the training monitor.


3. Select the Basic tab. The Basic tab contains options for specifying network architecture, preliminary runs, training technique, and runtime limits.

4. Select the arrow next to Network architecture. The default network is a Multilayer Perceptron.

Hidden neurons perform the internal computations, providing the nonlinearity that makes neural networks so powerful. To set the hidden neurons criterion, select the Hidden neurons drop-down arrow and select one of the following items:

• High noise data
• Moderate noise data
• Low noise data
• Noiseless data
• Set number

If you select the number of hidden neurons based on the noise in the data (any of the first four items), the number of neurons is determined at run time based on the total number of input levels, the total number of target levels, and the number of training data rows, in addition to the noise level.

To explicitly set the number of hidden neurons, select the Set number item and type the number of neurons in the entry field. For this example, specify a multilayer perceptron with three hidden neurons.


5. Select the drop-down arrow next to Hidden neurons and select Set number.
6. Enter 3 in the field to the right of the drop-down arrow. Your dialog should now look like the one pictured below.

By default, the network does not include direct connections. In this case, each input unit is connected to each hidden unit, and each hidden unit is connected to each output unit. If you set the Direct connections value to Yes, each input unit is also connected to each output unit. Direct connections define linear layers, whereas hidden neurons define nonlinear layers. Do not change the default settings for this example.

The Network architecture field allows you to specify a wide variety of neural networks, including
• Generalized linear model
• Multilayer perceptron (default)
• Ordinary radial basis function with equal widths
• Ordinary radial basis function with unequal widths
• Normalized radial basis function with equal heights
• Normalized radial basis function with equal volumes
• Normalized radial basis function with equal widths
• Normalized radial basis function with equal widths and heights
• Normalized radial basis function with unequal widths and heights

The use of these network architectures is discussed at length in the course Neural Network Modeling (course code DMNN); they are not discussed here due to time and space constraints.

7. Select OK to return to the Basic tab.

The remaining options on the Basic tab enable you to specify options for
• Preliminary runs - preliminary runs that attempt to identify good starting values for training the neural network.
• Training technique - the methodology used to iterate from the starting values to a solution.
• Runtime limit - limits the time spent training the network.

Use the default options for this analysis.


8. Select the Output tab.

9. Select the checkbox next to Training, Validation, and Test.

10. Close the Neural Network node, saving changes when prompted.
11. Enter the name NN3 in the model name field when prompted.
12. Select OK.

Run the flow from the Neural Network node and view the results when prompted.

The Data tab is displayed first. Additional information about the estimates, the statistics, and the data sets is available from the drop-down arrow.


Select the Weights tab. You may need to maximize or resize the window in order to see all of the weights. This table shows the coefficients used to construct each piece of the neural network model.
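In equation form, the weights parameterize a model of the following shape (a sketch assuming the default hyperbolic tangent hidden activation; w denotes hidden-layer weights and v denotes output-layer weights):

   H_j = tanh( w_0j + w_1j*AGE + w_2j*INCOME ),  j = 1, 2, 3
   P(RESPOND=1) = 1 / ( 1 + exp( -( v_0 + v_1*H_1 + v_2*H_2 + v_3*H_3 ) ) )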

Select the Graph subtab. The size of the square is proportional to the weight, and the color indicates sign. Red squares indicate positive weights, and yellow squares indicate negative weights.


Select the Plot tab. This plots the error on the training and validation data sets. While additional iterations improve the fit on the training data set (top line), the performance on the validation data set does not continue to improve beyond the first few iterations. A line is drawn at the model that performs best on the validation data set.

Note: This plot is best viewed with the window maximized, although that was not done for the plot pictured above.

Close the results window.

You can use Insight to visualize the surface of this neural network:
1. Open the Insight node.
2. Select the Validation data set.
3. Select the option to use the entire data set.
4. Close Insight, saving changes when prompted.
5. Run Insight and view the results.
6. Select Analyze → Rotating Plot (Z Y X).
7. Select P_RESPO1 → Y.
8. Select AGE → Z.
9. Select INCOME → X.
10. Select Output → At Minima.
11. Select OK to return to the main rotating plot dialog.
12. Select OK to generate the plot.
13. Resize the display as desired.
14. Right-click in the plot and select Marker Sizes → 3.


4.2 Visualizing Logistic Regression

A standard logistic regression model is an MLP with zero hidden layers and a logistic output activation function. Visualize a fitted logistic regression surface by dragging a Regression node onto the workspace and connecting it as shown below.
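In equation form, with hypothetical fitted coefficients b0, b1, and b2, the surface you are about to visualize is

   P(RESPOND=1) = 1 / ( 1 + exp( -( b0 + b1*AGE + b2*INCOME ) ) )

that is, the network of the previous section with the hidden layer removed.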

Modify the Regression node to add prediction information to the data sets.
1. Open the Regression node.
2. Select the Output tab.

3. Select the Training, Validation, and Test check box.
4. Close and save changes to the Regression node. By default, the Regression model is named Untitled. You may edit this name if desired.
5. Run the diagram from the Regression node but do not view the results.
6. Open the Insight node.
7. Select the name of the scored validation data set for the regression model. You will open this data set from within Insight.
8. Choose the option to use the entire data set if it is not already selected.
9. Close Insight, saving changes when prompted.
10. Run the flow from the Insight node.
11. Generate a rotating scatter plot as you did in the previous section.

Note: To see the plots for the regression and neural network models simultaneously, you must note the name of each data set. Select one of the data sets from within the Insight node, and open the other data set from within Insight.


Chapter 5: Decision Trees

5.1 Introduction to Decision Trees
5.2 Problem Formulation
5.3 Understanding Tree Results
5.4 Understanding and Using Tree Options
5.5 Interactive Training
5.6 Choosing a Decision Threshold


5.1 Introduction to Decision Trees

Decision trees are widely used for predictive modeling. Decision trees have several advantages, including ease of interpretation, the ability to model complex input/target associations, and the ability to automatically handle missing values without imputation.

For interval targets, they are usually referred to as regression trees. When the target is categorical, they are usually referred to as classification trees. This chapter covers the use of the Decision Tree node for growing and interpreting classification trees.


5.2 Problem Formulation

The consumer credit department of a bank wants to automate the decision-making process for approval of home equity lines of credit. To do this, they will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit scoring model. The model will be based on data collected from recent applicants granted credit through the current process of loan underwriting. The model will be built from predictive modeling tools, but the created model must be sufficiently interpretable so as to provide a reason for any adverse actions (rejections).

The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable indicating whether an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded.

Name      Model Role  Measurement Level  Description
BAD       Target      Binary             1=defaulted on loan, 0=paid back loan
REASON    Input       Binary             HomeImp=home improvement, DebtCon=debt consolidation
JOB       Input       Nominal            Six occupational categories
LOAN      Input       Interval           Amount of loan request
MORTDUE   Input       Interval           Amount due on existing mortgage
VALUE     Input       Interval           Value of current property
DEBTINC   Input       Interval           Debt-to-income ratio
YOJ       Input       Interval           Years at present job
DEROG     Input       Interval           Number of major derogatory reports
CLNO      Input       Interval           Number of trade lines
DELINQ    Input       Interval           Number of delinquent trade lines
CLAGE     Input       Interval           Age of oldest trade line in months
NINQ      Input       Interval           Number of recent credit inquiries

The credit scoring model will give a probability of a given loan applicant defaulting on loan repayment. A threshold will be selected such that all applicants whose probability of default is in excess of the threshold will be recommended for rejection.


5.3 Understanding Tree Results

Open a new workspace from the Available Projects window and assemble the diagram shown below.

If you are interested in viewing the scoring code generated by the trees or using Insight to visualize certain relationships, you may consider structuring your workspace as pictured below. This flow also makes it easy to obtain exploratory analysis (using Insight), assessment information, and scoring code for any additional models you wish to fit. You can simply add the modeling node to the workspace in an appropriate place and connect it to the Control Point.

The choice of standard regression and decision tree models was not made arbitrarily. The Equal Credit Opportunity Act mandates interpretability for credit scoring models.


Set up the Input Data Source node.
1. Open the Input Data Source node.
2. Select the HMEQ data set from the CRSSAMP library.
3. Set the Model Role for BAD to target.
4. Verify the measurement level and model role for the remaining input variables. Enterprise Miner assumes that numeric variables with fewer than 10 levels in the metadata sample represent ordinal data. Check to make sure that variables such as DEROG or DELINQ have interval for their measurement level.

For this analysis, you did not set up a target profile. The data was not oversampled, so there was no need to specify prior information. Your initial study will focus on accuracy, so the default target profile is appropriate.

Examine the distribution plots for the individual variables as desired. For example, examine the distribution plot for DEBTINC. Most of the debt-to-income ratios in the data set are less than about 45.

Select the Interval Variables tab.

Note the high missing rate for DEBTINC. More than 20% of the applicants have a missing value for this variable. Some method will be needed to handle the missing values in the data set.

Close the Input Data Source node, saving changes when prompted.


Now partition the HMEQ data for modeling. Once again, create training and validation data sets and omit the test data.
1. Open the Data Partition node.
2. Set Train, Validation, and Test to 67, 33, and 0, respectively.
3. Close the Data Partition node, saving changes when prompted.

For this first analysis pass, use the default settings for the Decision Tree node. Since the regression is fitting a model with all main effects, and no variable selection has been done, the default regression model will probably be overfit. Open the Regression node and modify the node to perform variable selection. (A rough base SAS analogue of the resulting model appears after the steps below.)

To set up the Regression node, proceed as follows:
1. Open the Regression node and specify Stepwise as the method in the Selection Method tab.
2. Close the Regression node, saving changes when prompted.

Note: When modifying a modeling node for the first time, you are prompted to rename the model upon exiting.

3. Enter StepReg in the model name field.
4. Select OK.
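Here is that rough base SAS analogue: a stepwise logistic regression on the HMEQ inputs (a sketch only; the node performs its own data preparation, and TRAIN is a hypothetical name for the training partition):

   proc logistic data=train descending;
      class reason job;
      model bad = loan mortdue value debtinc yoj derog
                  clno delinq clage ninq reason job
                  / selection=stepwise;
   run;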


Run the diagram from the Assessment node and select Yes when prompted to view the results. Generate a lift chart to compare the two models and inspect the resulting plot.

The chart shows the decision tree model dominating the regression model. The decision tree's first decile contains more than 80% defaulters. By comparison, the regression's first decile contains only about 64% defaulters. Select the %Captured Response radio button.

By rejecting the worst 30% of all applications using the decision tree, you eliminate about 78% of the bad loans from your portfolio. The regression model would require rejecting almost half of the applications to identify this percentage of bad loans.


Clearly the default decision tree model is a better choice. Several questions remain:
• Why does the tree outperform the standard regression model?
• Can you grow a better tree than the one grown using the default settings?
• What threshold should be used as a cutoff for rejecting loans?

One reason for the superiority of the tree may be its innate ability to handle nonlinear associations between the inputs and the target. To test this idea, try modeling the data with another flexible, nonlinear regression method, a neural network. Although you may not be able to use the neural network model for the purposes of credit scoring, it may give some insight into the differences between the standard regression and neural network models.

Add a Neural Network node to the workspace and connect it after the Data Replacement node. Connect the Neural Network node to the Assessment node. The completed diagram should appear as pictured below.


Generate a lift chart for all three models and compare performance.

The neural network model does not perform appreciably better than the stepwise regression model, and the decision tree model is still the best. Nonlinear association does not explain the difference between the decision tree and the stepwise regression model. Close the Assessment node.

Open the Decision Tree node results window by right-clicking on the node and selecting Results. The All tab is active and displays a summary of several of the subtabs.

From the table in the lower-left corner of the tab, you see that a seven-leaf tree is selected, producing an accuracy of 88.92% on the validation data set.


View the tree by selecting View → Tree from the menu bar. A portion of the tree appears below.

Although the selected tree was supposed to have seven leaves, not all seven leaves are visible. By default, the decision tree viewer displays three levels deep.

To modify the levels that are visible, proceed as follows:
1. Select View → Tree Options.
2. Type 6 in the Tree depth down field.
3. Select OK.
4. Verify that all seven leaves are visible.

Observe that the colors in the tree ring diagram and the decision tree itself indicate node purity by default. If the node contains all ones or all zeros, the node is colored red. If the node contains an equal mix of ones and zeros, it is colored yellow.


You can change the coloring scheme as follows:
1. Select Tools → Define Colors.

2. Select the Proportion of a target value radio button.

3. Select 0 in the Select a target value table.
4. Select OK.

Inspect the tree diagram to identify the terminal nodes with a high percentage of bad loans (colored red) and those with a high percentage of good loans (colored green). Close the tree diagram when you are done.

You can also see splitting information using the Tree Ring tab in the Results-Tree window. Using the View Info tool, you can click on the partitions in the tree ring plot to see the variable and cutoff value used for each split. The sizes of the resulting nodes are proportional to the size of the segments in the tree ring plot. You can see the split statistics by selecting View → Probe tree ring statistics. You can view a path to any node by selecting it and then selecting View → Path.


5.4 Understanding and Using Tree Options

There are adjustments you can make to the default tree algorithm that will cause your tree to grow differently. These changes will not necessarily improve the classification performance of the tree, but they may improve its interpretability.

The Tree node splits a node into two nodes by default (called binary splits). In theory, trees using multi-way splits are no more flexible or powerful than trees using binary splits. The primary goal of multi-way splits is to increase the interpretability of the final result.

Consider a competing tree that allows up to 4-way splits.
1. Add another Tree node to the workspace.
2. Connect the Data Partition node to the Tree node.
3. Connect the Tree node to the Assessment node.
4. Open the Decision Tree node.
5. Select the Basic tab.
6. Enter 4 in the Maximum number of branches from a node field. This option will allow binary, 3-way, and 4-way splits to be considered.
7. Close the Tree node, saving changes when prompted.
8. Enter the name DT4way in the model name field when prompted, to remind you that you specified up to 4-way splits.
9. Select OK.
10. Run the flow from this Tree node and view the results.


The number of leaves in the selected tree has increased from 7 to 23. It is a matter of personal taste as to whether this tree is more comprehensible than the binary-split tree. The increased number of leaves suggests to some a lower degree of comprehensibility. The accuracy on the validation set is only 0.1% higher than the default model in spite of greatly increased complexity.

If you inspect the tree diagram, there are many nodes containing only a few applicants. You can employ additional cultivation options to limit this phenomenon.


Limiting Tree Growth

Various stopping or stunting rules (also known as pre-pruning) can be used to limit the growth of a decision tree. For example, it may be deemed beneficial not to split a node with fewer than 50 cases and to require that each node have at least 25 cases.

Modify the most recently created Tree node and employ these stunting rules to keep the tree from generating so many small terminal nodes.
1. Open the Tree node.
2. Select the Basic tab.
3. Type 25 in the Minimum number of observations in a leaf field and then press the Enter key.
4. Type 50 in the Observations required for a split search field and then press the Enter key.

Note: The Decision Tree node requires that (Observations required for a split search) ≥ 2∗(Minimum number of observations in a leaf). In this example, the observations required for a split search must be at least 2∗25=50. A node with fewer than 50 observations cannot be split into two nodes, each having at least 25 observations.

5. Close and save your changes to the Tree node.
Note: If the Tree node does not prompt you to save changes when you close, the settings have not been changed. Reopen the node and modify the settings again.

6. Rerun the Tree node and view the results as before.

Inspect the results.

The optimal tree now has 8 leaves. The validation accuracy has dropped slightly to 88.56%. Select View → Tree to see the tree diagram.


The tree diagram opens.

Note that the initial split on DEBTINC has produced four branches. You may wonder which branch contains the missing values. To find out, proceed as follows:
1. Position the tip of the cursor above the variable name DEBTINC directly below the root node in the tree diagram.
2. Right-click and select View competing splits. The Input Selection window opens. The table lists the top five inputs considered for splitting, as ranked by a measure of worth.
3. Select the row for the variable DEBTINC.
4. Select Browse rule.


The Interval Variable Splitting Rule window opens.

The table presents the selected ranges for each of the four branches as well as the branch number that contains the missing values (in this case, it is the fourth branch).

Close the Interval Variable Splitting Rule window, the Input Selection window, the tree diagram, and the Tree-Results window.


5.5 Interactive Training

Decision tree splits are selected on the basis of an analytic criterion. Sometimes it is necessary or desirable to select splits on the basis of a practical business criterion. For example, the best split for a particular node may be on an input that is difficult or expensive to obtain. If a competing split on an alternative input has a similar worth and is cheaper and easier to obtain, it makes sense to use the alternative input for the split at that node.

Likewise, splits may be selected that are statistically optimal but may be in conflict with an existing business practice. For example, the credit department may treat applications where debt-to-income ratios are not available differently from those where this information is available. You can incorporate this type of business rule into your decision tree using interactive training in the Tree node.

To do interactive training, proceed as follows:
1. Select the most recently edited Decision Tree node with the right mouse button.
2. Select Interactive. The Interactive Training window opens.

Note: You may need to maximize your window to see all of the windows completely.

3. Select View → Tree from the menu bar.


The most recently fit decision tree is displayed.

Your goal is to modify the initial split so that one branch contains all the applications with missing debt-to-income data and the other branch contains the rest of the applications. From this initial split, you will use the decision tree's analytic method to grow the remainder of the tree.

4. Select the Explore Rules icon on the toolbar.
5. Select the root node of the tree. The Input Selection window opens, listing a dozen potential splitting variables and a measure of the worth of each input.

6. Select the row corresponding to DEBTINC.


7. Select Modify rule. The Interval Variable Splitting Rule window opens, as before.

8. Select the row for range 3.
9. Select Remove range.
10. Repeat steps 8 and 9 for range 2. The split is now defined to put all nonmissing values of DEBTINC into node 1 and all missing values of DEBTINC into node 2.
11. Select OK to close the Interval Variable Splitting Rule window.
12. Select Apply Rule in the Input Selection window. The Input Selection window closes and the tree diagram is updated as shown.

and the tree diagram is updated as shown.

The left node contains any value of DEBTINC, and the right node contains only missing values for DEBTINC.

13. Close the tree diagram.
14. Close the Interactive Training window.


15. Select Yes to save the tree as input for subsequent training.

16. Run the modified Tree node and view the results. The selected tree has seven nodes. Its validation accuracy is 88.66%. The interpretation is extremely straightforward.

Close the Results-Tree window.

Compare the original and modified tree models. To do so, proceed as follows:
1. Open the Assessment node.
2. Enter DefTree as the name for the default tree model (currently Untitled).
3. Select the row corresponding to one of the tree models.
4. Press and hold the Ctrl key.


5. Select the row corresponding to the other tree model.

6. Select Tools → Lift Chart.
7. Select Format → Model Name.

Note: You may have to maximize the window or resize the legend in order to see the entire legend.

The performance of the two tree models is not appreciably different. Close the lift chart when you are finished inspecting the results.


5.6 Choosing a Decision Threshold

An appropriate threshold for rejecting loan applications can be obtained both theoretically and empirically. Both approaches require specification of misclassification costs. For the credit-scoring example, assume every two dollars loaned eventually returns three dollars. Rejecting a good loan for two dollars forgoes the expected one-dollar profit (the cost of a false positive). Accepting a bad loan for two dollars forgoes the two-dollar loan itself (the cost of a false negative, assuming that the default is early in the repayment period). The theoretical approach uses the plug-in Bayes rule. Using simple decision theory, the optimal threshold is given by

   θ = 1 / (1 + cost of false negative / cost of false positive)

Using the cost structure defined above, the optimal threshold is simply 1/(1+2) = 1/3. That is, reject all applications whose predicted probability of default exceeds 0.33. You can obtain the same result using the Assessment node. As a bonus, you can estimate the fraction of loan applications you must reject when using the selected threshold.
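Applied to a scored data set, the rule is a one-line filter, as in the sketch below (SCORED and P_BAD1 are hypothetical names for the scored data set and the predicted probability of default):

   data decisions;
      set scored;
      reject = (p_bad1 > 1/3);   /* 1 = recommend rejection */
   run;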

1. Select the original binary split decision tree model in the Assessment node.

2. Select Tools → Lift Chart from the menu bar.
3. Select Edit to define a target profile.
4. Add a profit matrix and set it to use.
5. Enter the values in the matrix as pictured below.


A profit matrix for credit screening is different from one for direct marketing.
• In the direct marketing framework, a target value of 1 implies a purchase and thus a positive profit. A target value of 0 implies no purchase and hence no profit. Each solicitation has a fixed and substantial cost.
• For credit screening, a target value of 1 implies a default and hence a loss. A target value of 0 implies a repaid loan and hence a profit. The fixed cost of processing each loan application is insubstantial and taken to be zero.

6. Close the profit matrix definition window, saving changes when prompted.
7. Select Apply.

8. Select the Profit radio button.
9. Select the Non-Cumulative radio button.

The plot shows the expected profit for each decile of loan applications as ranked by the decision tree model. Both the first and second deciles have an expected profit for rejecting the applicants. Therefore, it makes sense to reject the top 20% of loan applications.


Chapter 6: Clustering Tools

6.1 Problem Formulation
6.2 K-means Clustering
6.3 Self-Organizing Maps (SOMs)
6.4 Generating and Using Scoring Code


6.1 Problem Formulation

A catalog company periodically purchases lists of prospects from outside sources. They want to design a test mailing to evaluate the potential response rates for several different products. Based on their experience, they know that customer preference for their products depends on geographic and demographic factors. Consequently, they want to segment the prospects into groups that are similar to each other with respect to these attributes. This process is called stratification in survey sampling, and it is called blocking in classical experimental design.

After the prospects have been segmented, a random sample of prospects within each segment will be mailed one of several offers. The results of the test campaign will allow the analyst to evaluate the potential profit of prospects from the list source overall as well as for specific segments.

The data that was obtained from the vendor is tabled below. The prospects' name and mailing address (not shown) were also provided.

Name     Model Role  Measurement Level  Description
AGE      Input       Interval           Age in years
INCOME   Input       Interval           Annual income in thousands
MARRIED  Input       Binary             1=married, 0=not married
SEX      Input       Binary             F=female, M=male
OWNHOME  Input       Binary             1=homeowner, 0=not a homeowner
LOC      Rejected    Nominal            Location of residence (A-H)
CLIMATE  Input       Nominal            Climate code for residence (10, 20, & 30)
FICO     Input       Interval           Credit score
ID       ID          Nominal            Unique customer identification number

Observe that all variables except ID and LOC should be set to input. No target variables are used in a cluster analysis or SOM. If you want to identify groups based on a target variable, consider a predictive modeling technique and specify a categorical target. This type of modeling is often referred to as supervised classification since it attempts to predict group or class membership for a specific categorical response variable. Clustering, on the other hand, is referred to as unsupervised classification since it identifies groups or classes within the data based on all the input variables.

In this chapter, the formal discussion will be limited to the use of K-means clustering and Self-Organizing Maps (SOMs) to form the clusters that can be used for stratification or blocking.


6.2 K-means Clustering

Building the Initial Flow

Assemble the following diagram and connect the nodes as shown.

Setting Up the Input Data Source

Set up the initial Input Data Source as follows:
1. Open the Input Data Source node.
2. Select the PROSPECT data set from the CRSSAMP library.

Because the CLIMATE variable is a grouping of the LOC variable, it is redundant to use both. CLIMATE was chosen because it had fewer levels (3 versus 8) and business knowledge suggested that these 3 levels were sufficient.

3. Set the Model Role of LOC to rejected.
4. Explore the distributions and descriptive statistics as desired.

Select the Interval Variables tab and observe that there are only a few missing values for AGE, INCOME, and FICO. Select the Class Variables tab and observe that only a small percentage of the demographic variables' values are missing.

5. Close the Input Data Source, saving changes when prompted.

Setting Up the Replacement Node

Because of the small number of missing values, use the defaults for the Replacement node.


Setting Up the Clustering Node

1. Open the Clustering node.

The Variables tab is active when you open the Clustering node. K-means clustering is very sensitive to the scale of measurement of the different inputs. Consequently, it is recommended to use one of the standardization options if the data has not been standardized previously in the flow.
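The Std Dev. option used in the next step rescales each input to unit standard deviation so that no input dominates the distance calculations merely because of its measurement units. A rough base SAS analogue for the interval inputs (a sketch; PROSPECT_STD is a hypothetical output name) is:

   proc standard data=crssamp.prospect out=prospect_std std=1;
      var age income fico;   /* rescale interval inputs to unit std dev */
   run;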

2. Select the Std Dev. radio button on the Variables tab.

3. Select the Clusters tab.
4. Observe that the default method is Automatic.

By default, the Clustering node uses the Cubic Clustering Criterion (CCC), based on a sample of 2,000 observations, to estimate the appropriate number of clusters. You can change the default sample size by selecting the Data tab and then selecting the Preliminary Training and Profiles tab. The Automatic selection of the number of clusters can be overridden by selecting the User specify radio button.

5. Close the Clustering node, saving changes when prompted.
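As an aside, the k-means step itself has a rough base SAS analogue in PROC FASTCLUS (a sketch; it assumes the standardized data set from the earlier sketch and fixes six clusters, the number the Automatic method arrives at here). The OUT= data set carries CLUSTER and DISTANCE columns analogous to the _SEGMNT_ and DISTANCE columns discussed below.

   proc fastclus data=prospect_std maxclusters=6 out=clustered;
      var age income fico;   /* interval inputs; the node also uses */
                             /* dummy-coded class inputs internally */
   run;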


Run the diagram from the Clustering node and view the results.

Select the Tilt icon from the toolbar and tilt the pie chart as shown below.

Inspect the chart in the left window of the Partition tab.

This chart summarizes three statistics of the six clusters. The height of the slice indicates the number of cases in each cluster. Clusters 3 and 5 contain the most cases, and cluster 4 contains the fewest.


The right side of the window shows the Normalized Means for each input variable (the mean divided by its standard deviation). You may need to maximize or resize the window to see the complete plot. Inspect the plot at the right-hand side of the window.

Note that there are two variables associated with CLIMATE. In general, the Cluster node constructs n dummy variables for a categorical variable with n levels; however, the node only constructs one dummy variable for categorical variables with two levels.

Initially, the overall normalized means are plotted for each input; however, some of the variables do not appear in the window.

Select the scroll icon from the toolbar and scroll to view the others.


The Normalized Mean plot can be used to compare the overall normalized means with thenormalized means in each cluster. To do so, proceed as follows:

1. Select the Select Points icon from the toolbar.
2. Select one of the clusters, say cluster 5.

3. Select the Refresh Input Means Plot icon from the toolbar.

Inspect the Normalized Mean Plot.

The circles indicate the normalized means for the selected cluster, and the triangles represent the overall normalized means. Note that cluster 5
• has no residents of climate zone 30 or climate zone 10
• has higher-than-average incomes
• has a higher rate of home ownership (note the lower normalized mean for OWNHOME: 0)
• has lower rates of marriage (note the lower mean for MARRIED: 0).

The other clusters can be compared with the overall average by repeating the aforementioned steps.


Inspect the normalized mean plot for cluster 4, the cluster with the fewest cases. Scroll in the plot, if necessary, to see CLIMATE: 10. The plot is shown below.

Observe that prospects in cluster 4
• tend to live in climate zone 10
• have lower FICO scores
• are less likely to be married
• tend to be younger.

Close the Cluster node when you have finished exploring the results.

The Insight node can also be used to compare the differences among the attributes of the prospects. Open the Insight node and choose the option to use the entire data set. Close the Insight node, saving changes when prompted. Run the flow from the Insight node and view the results.

All of the observations in the original data set are present, but the number of columns has increased from 9 to 11.


Scroll to identify the two new columns.

The column _SEGMNT_ identifies the cluster, and the column DISTANCE gives the distance from each observation to its cluster mean. Use the analytical tools within Insight to evaluate and compare the clusters. The following steps represent one way to make these comparisons:
1. Change the measurement scale for MARRIED, OWNHOME, CLIMATE, and _SEGMNT_ to nominal by selecting the measurement scale directly above the variable name.
2. Select Analyze → Box Plot/Mosaic Plot.
3. Highlight _SEGMNT_.
4. Select X.
5. Highlight CLIMATE.
6. Press and hold the Ctrl key.
7. Highlight MARRIED, OWNHOME, and SEX.
8. Select Y.
9. Select OK.


The Mosaic plots should appear as below. The width of the columns indicates the number of cases in each cluster. The colors indicate the percentage of cases for each level of the variable on the vertical axis.

CLIMATE is important in distinguishing among the clusters, with five of the six clusters entirely or almost entirely containing cases that live in only one climate zone. Note that clusters 3 and 5 contain prospects that live only in climate zone 20. Similarly, clusters 1 and 4 contain prospects that live mostly in climate zone 10. Consequently, they must differ by other attributes.

Clusters 3 and 5 differ substantially by the percentage of married persons, as do clusters 1 and 4. Cluster 6 appears to be evenly distributed between climate zones 10 and 30. Cluster 6, however, has a much higher percentage of females and unmarried persons than most of the other clusters.

The Insight node can also be used to compare the distributions of the interval inputs among the clusters to ascertain their contribution to explaining the cluster differences.
1. Select Analyze → Box Plot/Mosaic Plot.
2. Select _SEGMNT_ → X.
3. Ctrl-click to select AGE, INCOME, and FICO, and then select Y.
4. Select OK.


The Box plots should appear as below.

Cluster 6 appears to be the lowest-income group and is among the clusters with younger members.

Cluster 4 contains members with lower FICO scores.

In summary, the six clusters can be described as follows:
1. Married persons living in climate zone 10.
2. Married persons living in climate zone 30.
3. Married persons living in climate zone 20.
4. Unmarried persons living in climate zone 10.
5. Unmarried men living in climate zone 20.
6. Unmarried women living in climate zone 20 or 30.

These clusters may or may not be useful for marketing strategies, depending on the line of business and planned campaigns.


6.3 Self-Organizing Maps (SOMs)

Overview of SOMs

SOMs are an analytical tool that provides a "topological" mapping from the input space to the clusters. Kohonen says that SOMs are intended for clustering, visualization, and abstraction. In a SOM, the clusters are organized into a grid. The grid is usually two-dimensional, but sometimes it is one-dimensional, and (rarely) three-dimensional or higher. In the Enterprise Miner, only one-dimensional and two-dimensional grids are available.

The grid exists in a space that is separate from the input space; any number of inputs may be used, but the dimension of the input space should be larger than the dimension of the grid. In this situation, dimension is not referring to the number of inputs, but to the number of cases or observations. The default grid size is 4 x 6, or 24 clusters. Consequently, when working on small problems, you should consider reducing the grid dimensions. Also, visually comparing a large number of clusters is difficult. Smaller grid dimensions will result in fewer clusters and make interpretation easier. Ease of interpretation and the usefulness of the clusters need to be balanced against the homogeneity of the clusters.

SOMs differ from K-means clustering in the following manner. In K-means clustering, cases are grouped together based on their distance from each other in the input space. A SOM tries to find clusters such that any two clusters that are close to each other in the grid space have seeds that are close in the input space. The converse is not true, however: seeds close in the input space do not necessarily correspond to seeds close in the grid space.

1. Add a SOM/Kohonen node and connect it between the Replacement node and the Insight node. Your diagram should now look like the one pictured below.

2. Open the SOM/Kohonen node.

The Variables tab appears first. Inspect the options.

As in K-means clustering, the scale of the measurements can heavily influence the determination of the clusters, so standardizing the inputs is recommended. To facilitate comparing the clusters from the SOM/Kohonen node to those determined in the Cluster node, choose a grid size that corresponds to six clusters.
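Concretely, standardization replaces each input value x with (x - mean) / (standard deviation), computed separately for each variable. With made-up figures (not the course data): if INCOME has mean 48,000 and standard deviation 12,000, an income of 60,000 standardizes to (60000 - 48000) / 12000 = 1.0, putting INCOME on a scale comparable to AGE and FICO. Outside the node, the same transformation can be sketched in SAS as follows (data set and variable names are illustrative):

/* Standardize inputs to mean 0 and standard deviation 1. */
proc standard data=work.prospects mean=0 std=1 out=work.std;
   var age income fico;
run;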

3. Select the Std Dev. radio button from the Variables tab to standardize the input variables.

4. Select the General tab. The default method is a Batch Self-Organizing Map.

You can specify three options in the Method field using the drop-down arrow: Batch Self-Organizing Map (the default SOM), Kohonen Self-Organizing Map (a different type of SOM), and Kohonen Vector Quantization (a clustering method).

Note that the Cluster node is recommended over Kohonen Vector Quantization for clustering. Also note that for many situations batch SOMs obtain satisfactory results, but Kohonen SOMs are recommended for highly nonlinear data.

Consider the following comments regarding SOMs:
• larger maps are usually better, as long as most clusters have at least 5-10 cases, but larger maps take longer to train
• the final neighborhood size should usually be set in proportion to the map size
• choosing a useful map size and final neighborhood size generally requires trial and error.

Additional information on all methods is available through the online help reference. This chapter considers only the batch SOM (default) option. For a batch SOM, the most important options to specify include
• the number of rows and columns in the topological map (in the General tab)
• the final neighborhood size (under Neighborhood Options in the Advanced tab, but not modified for this example).

5. Use the arrows to specify 2 for the number of rows and 3 for the number of columns.

6. Close the SOM/Kohonen node and save the settings.
7. Run the diagram from the SOM/Kohonen node and view the results.

The SOM/Kohonen results Map window contains two parts. The left side displays the grid. The colors of the clusters indicate the number of cases in each node, with red indicating the cluster with the most cases and yellow the cluster with the fewest.

The Normalized Mean plot functionality and interpretation are identical to those described for the Cluster node. Browse a few of the clusters to compare the normalized means for a cluster with the overall normalized means.

Select the Statistics tab to see how the rows and columns of the map relate to the cluster number.

Close the SOM/Kohonen node after inspecting these results.

Use Insight to compare the distribution of the categorical and interval inputs among the clusters from the SOM/Kohonen node. To do so, proceed as follows:
1. Open the Insight node.
2. Select the option to use the entire data set.
3. Specify the data set associated with the SOM/Kohonen node.
4. Close the Insight node, saving changes when prompted.

Run the Insight node and view the results.
1. Change the measurement scale for MARRIED, OWNHOME, CLIMATE, and SEGMNT to nominal by selecting the measurement scale directly above the variable name.
2. Select Analyze → Box Plot/Mosaic Plot.
3. Highlight _SEGMNT_.
4. Select X.
5. Highlight CLIMATE.
6. Press and hold the Ctrl key.
7. Highlight MARRIED, OWNHOME, and SEX.
8. Select Y.
9. Select OK.


Observe the following:
• no cluster contains persons living in more than one climate zone. In fact, climate zones 30 and 10 are equivalent to clusters 1 and 5, respectively
• all of the prospects in four of the clusters live in climate zone 20. Consequently, these clusters must differ by other attributes
• cluster 2 contains homeowners, most of whom are married
• cluster 3 contains unmarried women
• cluster 4 contains married renters
• cluster 6 contains unmarried males.

Now compare the distributions of the interval inputs to understand the cluster differences.
1. Select Analyze → Box Plot/Mosaic Plot.
2. Select SEGMNT → X.
3. Select AGE.
4. Press and hold the Ctrl key.
5. Select INCOME and FICO → Y.
6. Select OK.

The box plots appear.

Observe that
• persons in clusters 3 and 6 tend to be younger than those in the other clusters (recall that cluster 3 and cluster 6 consisted almost entirely of unmarried persons)
• FICO score does not appear to differ among the clusters
• INCOME varies more than FICO score, but not substantially.

A summary of attributes of the six clusters follows.
1. Map(1,1) -- persons living in climate zone 30.
2. Map(1,2) -- married homeowners living in climate zone 20.
3. Map(1,3) -- unmarried women living in climate zone 20.
4. Map(2,1) -- married renters living in climate zone 20.
5. Map(2,2) -- persons living in climate zone 10.
6. Map(2,3) -- unmarried men living in climate zone 20.

The numeric ordering of the clusters is arbitrary, but the map coordinates are not. SOMs differ from K-means in that clusters close to each other in the map space are more similar to each other than those farther apart. This property is most obvious when comparing Map(1,3) with Map(2,3), where the two clusters differ only by gender. The reason that Map(1,1) and Map(1,2) are adjacent in the map space is not so obvious; examining the mosaic plots, one can see that these two clusters are very similar in their gender distribution.

6.4 Generating and Using Score Code

Add a Score! node below the Insight node. Connect the Cluster node and the SOM/Kohonen node to the Score! node. Your diagram should appear like the one pictured below.

Run the Score! node and inspect the scoring code. At first glance, the scoring code for both nodes appears essentially the same; however, the details of the formulas that calculate the distance of a case from each cluster seed are substantially different. Consequently, the cluster assignments can be substantially different.

Chapter 7: Association Analysis

7.1 Problem Formulation
7.2 Understanding Association Results
7.3 Dissociation Analysis

7.1 Problem Formulation

A bank seeks to examine its customer base and understand which of its products the same customer owns. It has chosen to conduct a market-basket analysis of a sample of its customer base.

The BNKSERV data set lists the banking products/services used by 7,991 customers. Thirteen possible services are represented:

ATM     automated teller machine debit card
AUTO    automobile installment loan
CCRD    credit card
CD      certificate of deposit
CKCRD   check/debit card
CKING   checking account
HMEQLC  home equity line of credit
IRA     individual retirement account
MMDA    money market deposit account
MTG     mortgage
PLOAN   personal/consumer installment loan
SVG     savings account
TRUST   personal trust account

There are 24,375 rows in the data set. Each row of the data set represents a customer-service combination. The median number of services per customer is three.
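These figures are easy to verify with a short SAS step. The sketch below assumes the BNKSERV data set is available in the CRSSAMP library (as it is assigned later in this section):

/* Count services per customer, then find the median count. */
proc sql;
   create table work.svc_counts as
   select acct, count(*) as n_services
   from crssamp.bnkserv
   group by acct;          /* one row per customer */
quit;

proc means data=work.svc_counts median;
   var n_services;         /* median number of services */
run;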

7.2 Understanding Association Results

Construct the following diagram.

Specify the settings for the Input Data Source node.
1. Open the Input Data Source node.
2. Select the BNKSERV data set from the CRSSAMP library.
3. Set the Model Role for ACCT to id and for SERVICE to target.
4. Close and save changes to the Input Data Source node.

Open the Association node. The Variables tab is active by default and lists the same information that is found in the Variables tab of the Input Data Source node.

Select the General tab. This tab enables you to modify the analysis mode and control how many rules are generated.

Inspect the Analysis mode options. Observe that the Analysis mode is set to By Context by default.

Understanding Analysis Modes

The default analysis mode is By Context. This mode uses information specified in the input data source to determine the appropriate analysis. If the input data set contains
• an id variable and a target variable, the node automatically performs an association analysis
• a sequence variable that has a status of use, the node performs a sequence analysis.

A sequence analysis takes into account the order in which the items are purchased when calculating the associations. A sequence analysis requires the specification of a variable whose model role is sequence. An association analysis ignores this ordering.
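For illustration, the toy transaction data below (made-up rows, not from BNKSERV) shows the layout a sequence analysis expects. With the model role sequence assigned to SEQ, the orderings CKING-then-SVG and SVG-then-CKING are treated as different sequences, whereas an association analysis would treat both accounts as the same {CKING, SVG} item set.

/* Toy transaction data with a sequence variable (illustrative). */
data work.seq_example;
   input acct $ seq service $;
   datalines;
A1 1 CKING
A1 2 SVG
A2 1 SVG
A2 2 CKING
;
run;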

Other options include
• Minimum Transaction Frequency to Support Associations - specifies the minimum level of support needed to claim that items are associated (that is, occur together in the database). The default frequency is 5%.
• Maximum number of items in an association - determines the maximum size of the item set to be considered. For example, the default of 4 items indicates that up to 4-way associations are performed.
• Minimum confidence for rule generation - specifies the minimum confidence level required to generate a rule. The default level is 10%. This option is grayed out if you are performing a sequence discovery.

Use the default Association settings. Run the diagram from the Association node and view the results. The Rules tab is displayed first.

The Rules tab contains information for each rule. Consider the rule A=>B; then the
• Support of A=>B is the probability that a customer has both A and B
• Confidence of A=>B is the probability that a customer has B given that the customer has A
• Lift of A=>B is a measure of the strength of the association. If Lift=2 for the rule A=>B, then a customer having A is twice as likely to have B as a customer chosen at random.
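These three measures reduce to simple ratios of counts. The data step below illustrates the arithmetic; the counts for A, B, and A-and-B are made-up round numbers, not figures from the BNKSERV data.

/* Illustrating support, confidence, and lift with hypothetical counts. */
data _null_;
   n    = 7991;      /* total customers              */
   n_a  = 4800;      /* customers with A             */
   n_b  = 4000;      /* customers with B             */
   n_ab = 3000;      /* customers with both A and B  */
   support    = n_ab / n;                /* P(A and B)   */
   confidence = n_ab / n_a;              /* P(B given A) */
   lift       = confidence / (n_b / n);  /* conf / P(B)  */
   put support= confidence= lift=;
run;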

Click on the Support(%) column with the right mouse button and select Sort → Descending.

The support is the percentage of customers who have all the services involved in the rule. For example, 54% of the 7,991 customers have both a checking and a savings account, and 25% have a checking account, a savings account, and an ATM card. Click on the Confidence(%) column with the right mouse button and select Sort → Descending.

The confidence represents the percentage of customers who have the right-hand-side (RHS) item among those who have the left-hand-side (LHS) item. For example, all customers who have a check card also have a checking account, and 97.81% of those with a mortgage also have a checking account and a CD.

Lift, in the context of association rules, is the ratio of the confidence of a rule to the confidence the rule would have if the RHS were independent of the LHS. Consequently, lift is a measure of association between the LHS and RHS of the rule. Values greater than one represent positive correlation between the LHS and RHS, values equal to one represent independence, and values less than one represent negative correlation between the LHS and RHS.

Close the Association node.

7.3 Dissociation Analysis

A dissociation rule is a rule involving the negation of some item. For example, the LHS may be not having a checking account and the RHS might be an auto loan. Dissociation rules may be particularly interesting when the items involved are highly prevalent. The Association node will include dissociation rules if the data is modified to include the negation of selected items. The SAS Code node can be used for such data modification.
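To make the negation idea concrete, the step below sketches one way to append a negated item for a single service. It is only an illustration under assumed names, not the course's Dissoc.sas program (used later), which handles several services at once via macro variables.

/* Append a 'notCKING' row for accounts lacking CKING.        */
/* Illustrative only; assumes WORK.BNKSERV is sorted by ACCT. */
data work.augmented;
   length service $ 12;       /* room for the 'not' prefix    */
   set work.bnkserv;
   by acct;
   retain has_cking;
   if first.acct then has_cking = 0;
   if service = 'CKING' then has_cking = 1;
   output;                    /* keep every original row      */
   if last.acct and not has_cking then do;
      service = 'notCKING';   /* add the negated item         */
      output;
   end;
run;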

Creating Dissociations

Augment the data with services not present in each account.
1. Disconnect the Input Data Source node and the Association node.
2. Drag a SAS Code node to the workspace and connect it between the Input Data Source node and the Association node. The diagram should appear as shown.

3. Open the SAS Code node.
4. Select the Macros tab. Observe that the name of the training data set is &_TRAIN.

You can use the macro names in the programs used in this node. It is unnecessary to know the exact name that the Enterprise Miner has assigned to each data set.

5. Select the Export tab.

6. Select Add → TRAIN. Note that the name of the exported data set is &_TRA.
7. Deselect Pass imported data sets to successors.
8. Select the Program tab.
9. Select File → Import File. A browser opens; use it to locate the program Dissoc.sas, which is in the same directory as the raw data for this class.
10. Highlight the program Dissoc.sas.
11. Select OK.
12. Modify the %let statements at the beginning of the imported program as shown.

%let id=ACCT;
%let target=SERVICE;
%let values='SVG','CKING','MMDA';
%let in=&_TRAIN;
%let out=&_TRA;

The first two lines identify the id and target variables, respectively. The third line identifies the values of the target for which negations are created. The values must be enclosed in quotes and separated by commas. The final two lines provide generic macro names for the training data and the augmented (exported) data.

This SAS program scans each id (ACCT) to see whether the items (services) specified in values are present. If not, the data is augmented with the negated items.
1. Close the SAS Code node.
2. Run the diagram from the SAS Code node but do not view the results.
3. Open the Association node.
4. Select the Data tab.
5. Select Properties and then select the Table View tab. The listing shows the augmented data that was exported from the SAS Code node.
6. Close the Association node.
7. Run the Association node and view the results.

The results now list both association and dissociation rules. For example, among customers without a money market account, 65.58% have a savings account (rule 4).

Close the Association node.

Node Cloning

You can add custom nodes to the Tools palette for tasks such as the data modifications needed for dissociation rules. Clone the SAS Code node and add it to the Node Types palette. To do so, proceed as follows:
1. Select the SAS Code node.
2. Select Edit → Clone.

3. Type Dissociations in the Description field.

4. Select the right arrow next to the Image field.

5. Select an appropriate icon from the palette.

6. Select OK to accept the image.
7. Select OK to close the cloned node.
8. Select the Tools tab on the left side of the Enterprise Miner application window. Scroll down to the bottom of the tools. A new tool appears at the bottom of the Tools palette with the icon you selected.

The cloned tool can be used in the diagram in place of the SAS Code node. Note that this cloned node has variable and level names that are specific to the BNKSERV data set, so one may prefer to clone a node prior to modifying the program. A cloned tool is saved in the project library. Consequently, every diagram created within the project will have the Dissociations node available for use.