Predictive Modeling Using Enterprise Miner

Course Notes


Predictive Modeling Using Enterprise Miner Course Notes was developed by Jim Georges. Additional contributions were made by Bob Lucas, Mike Patetta, and Will Potts. Editing and production support was provided by the Curriculum Development and Support Department.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.

Predictive Modeling Using Enterprise Miner Course Notes

Copyright 2002 by SAS Institute Inc., Cary, NC 27513, USA. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

Book code 58575, course code PMEM, prepared date 25Sep01.


Table of Contents

Course Description .............................................................................................................v

Prerequisites......................................................................................................................vi

General Conventions........................................................................................................vii

Chapter 1 Introduction to Predictive Modeling .........................................................1-1

1.1 Starting the Analysis .............................................................................................1-3

1.2 Preparing the Tools ..............................................................................................1-15

1.3 Constructing a Predictive Model .........................................................................1-23

1.4 Adjusting Predictions ...........................................................................................1-43

1.5 Making Optimal Decisions...................................................................................1-49

1.6 Parametric Prediction ..........................................................................................1-62

1.7 Tuning a Parametric Model .................................................................................1-72

1.8 Comparing Predictive Models..............1-82

1.9 Deploying a Predictive Model ............1-101

1.10 Summarizing the Analysis.................1-112

Chapter 2 Flexible Parametric Models .......................................................................2-1

2.1 Defining Flexible Regression Models ....................................................................2-3

2.2 Constructing Neural Networks............................................................................2-14

2.3 Deconstructing Neural Networks ........................................................................2-25

Chapter 3 Predictive Algorithms ................................................................................3-1

3.1 Constructing Trees .................................................................................................3-3

3.2 Constructing Trees ...............................................................................................3-23


3.3 Applying Decision Trees.......................................................................................3-28

Appendix A Exercises.................................................................................................... A-1

A.1 Introduction to Predictive Modeling..................................................................... A-3

A.2 Flexible Parametric Models .................................................................................. A-9

A.3 Predictive Algorithms ......................................................................................... A-13


Course Description

Predictive Modeling Using Enterprise Miner™ is the foundation for further courses in the data mining curriculum. It is designed to give data analysts the skills to build effective predictive models using Enterprise Miner. Methods for overcoming common data mining challenges are illustrated on actual business data.

To learn more…

A full curriculum of general and statistical instructor-based training is available at any of the Institute’s training facilities. Institute instructors can also provide on-site training.

For information on other courses in the curriculum, contact the Professional Services Division at 1-919-677-8000, then press 1-7321, or send email to [email protected]. You can also find this information on the Web at www.sas.com/training/ as well as in the Training Course Catalog.

For a list of other SAS books that relate to the topics covered in this Course Notes, USA customers can contact our Book Sales Department at 1-800-727-3228 or send email to [email protected]. Customers outside the USA, please contact your local SAS Institute office.

Also, see the Publications Catalog on the Web at www.sas.com/pubs for a complete list of books and a convenient order form.


Prerequisites

Before attending this course, you should be familiar with simple regression modeling concepts and have some experience with creating and managing SAS data sets, which you can gain from the Getting Started with SAS® Software: A Nonprogramming Approach course or the SAS® Programming I: Essentials course.


General Conventions

This section explains the various conventions used in presenting text, SAS language syntax, and examples in this book.

Typographical Conventions

You will see several type styles in this book. This list explains the meaning of each style:

UPPERCASE ROMAN is used for SAS statements, variable names, and other SAS language elements when they appear in the text.

italic identifies terms or concepts that are defined in text. Italic is also used for book titles when they are referenced in text, as well as for various syntax and mathematical elements.

bold is used for emphasis within text.

monospace is used for examples of SAS programming statements and for SAS character strings. Monospace is also used to refer to field names in windows, information in fields, and user-supplied information.

select indicates selectable items in windows and menus. This book also uses icons to represent selectable items.

Syntax Conventions

The general forms of SAS statements and commands shown in this book include only that part of the syntax actually taught in the course. For complete syntax, see the appropriate SAS reference guide.

PROC CHART DATA=SAS-data-set;
   HBAR | VBAR chart-variables </ options>;
RUN;

This is an example of how SAS syntax is shown in text:

• PROC and CHART are in uppercase bold because they are SAS keywords.

• DATA= is in uppercase to indicate that it must be spelled as shown.

• SAS-data-set is in italic because it represents a value that you supply. In this case, the value must be the name of a SAS data set.

• HBAR and VBAR are in uppercase bold because they are SAS keywords. They are separated by a vertical bar to indicate they are mutually exclusive; you can choose one or the other.

• chart-variables is in italic because it represents a value or values that you supply.

• </ options> represents optional syntax specific to the HBAR and VBAR statements. The angle brackets enclose the slash as well as options because if no options are specified you do not include the slash.

• RUN is in uppercase bold because it is a SAS keyword.


Chapter 1 Introduction to Predictive Modeling

1.1 Starting the Analysis ......................................................................................................1-3

1.2 Preparing the Tools ......................................................................................................1-15

1.3 Constructing a Predictive Model.................................................................................1-23

1.4 Adjusting Predictions ..................................................................................................1-43

1.5 Making Optimal Decisions...........................................................................................1-49

1.6 Parametric Prediction ..................................................................................................1-62

1.7 Tuning a Parametric Model ..........................................................................................1-72

1.8 Comparing Predictive Models .....................................................1-82

1.9 Deploying a Predictive Model....................................................1-101

1.10 Summarizing the Analysis ......................................................... 1-112



1.1 Starting the Analysis

Analytic Objective

(Slide: the analytic objective frames three tasks in sequence: Data Preparation, Predictive Modeling, and Results Integration.)

The task of predictive modeling does not stand by itself. To build a successful predictive model you must first define an unambiguous analytic objective. The predictive model serves as a means of fulfilling the analytic objective.

The predictive modeling effort is surrounded by two other tasks. Before modeling begins, data must be assembled, often from a variety of sources, and arranged in a format suitable for model building. After the modeling is complete, the resulting model (and the modeling results) must be integrated into the business environment that originally motivated the modeling. These tasks often require more effort than the modeling itself.

This course, therefore, is the middle of a trilogy of courses. The data preparation tasks are assumed complete. The integration tasks remain unexplored. Both tasks deserve a course of their own.


Analytic Objective Examples

Risk assessment

Attrition

Up-sell and cross-sell

Lifetime value

Response

Analytic objectives that involve predictive modeling are limited only by your imagination. In Enterprise Miner's intended domain, however, many fall into one of a limited number of categories (Parr Rud, 2001).

• Response models attempt to identify individuals likely to respond to an offer or solicitation.

• Up-sell and cross-sell models are used to predict the likelihood of existing customers wanting additional products from the same company.

• Risk assessment models quantify the likelihood of events that will adversely affect a business.

• Attrition models gauge the probability of a customer taking his or her business elsewhere.

• Lifetime value models evaluate the overall profitability of a customer over a predetermined length of time.

To achieve a particular objective, it is often necessary to combine several predictive models. For example, a lifetime value model must account not only for the money a customer will spend but also for the length of time the customer will continue spending.
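The lifetime value combination above can be made concrete with a small sketch. Python is not part of the course, and the model outputs and horizon below are invented assumptions, not values from the data:

```python
# Hypothetical sketch: combine the outputs of a spend model and a
# tenure model into a single lifetime value estimate. The predictions
# and the 60-month planning horizon are invented for illustration.

def lifetime_value(monthly_spend_pred, months_remaining_pred, horizon=60):
    """Value over a fixed planning horizon, in the same units as spend."""
    expected_months = min(months_remaining_pred, horizon)
    return monthly_spend_pred * expected_months

# A donor predicted to give $20 per month for 36 more months:
print(lifetime_value(20.0, 36))  # 720.0
```

The point is only that the two models multiply: a customer who spends heavily but leaves quickly may be worth less than a modest, long-lived one.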

In this course, attention centers on a single analytic objective category: response modeling. Many of the concepts learned translate directly to other types of problems.


Modeling Example

Business: National veterans' organization
Objective: From a population of lapsing donors, identify individuals worth continued solicitation.
Source: 1998 KDD-Cup Competition via UCI KDD Archive

A national veterans’ organization seeks to better target its solicitations for donation. By only soliciting the most likely donors, less money will be spent on solicitation efforts and more money will be available for charitable concerns. Solicitations involve sending a small gift to an individual together with a request for donation. Gifts include mailing labels and greeting cards.

The organization has more than 3.5 million individuals in its mailing database. These individuals have been classified by their response behavior to previous solicitation efforts. Of particular interest is the class of individuals identified as lapsing donors. These individuals have made their most recent donation between 12 and 24 months ago. The organization has found that by predicting the response behavior of this group, it can use the model to rank all 3.5 million individuals in its database. With this ranking, a decision can be made to either solicit or ignore an individual in the current solicitation campaign. The current campaign refers to a greeting card mailing sent in June of 1997. It is identified in the raw data as the 97NK campaign.
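The rank-then-decide step can be sketched as follows. This is illustrative only; the control numbers and probabilities are invented stand-ins for real model output, and the selection fraction is an assumption:

```python
# Illustrative sketch of the solicit-or-ignore decision: rank
# individuals by a model's predicted response probability and
# solicit only the top fraction of the list.

def select_for_solicitation(scored, fraction):
    """scored: list of (control_number, predicted_probability) pairs.
    Returns the control numbers of the top `fraction`, best first."""
    ranked = sorted(scored, key=lambda rec: rec[1], reverse=True)
    cutoff = int(len(ranked) * fraction)
    return [cn for cn, _ in ranked[:cutoff]]

donors = [(1, 0.02), (2, 0.11), (3, 0.07), (4, 0.01), (5, 0.09)]
print(select_for_solicitation(donors, 0.4))  # [2, 5]
```

In practice the cutoff fraction would itself be chosen from costs and expected donation amounts, a topic taken up later in the chapter.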

The source of this data is the Association for Computing Machinery’s (ACM) 1998 KDD-Cup competition. The data set and other details of the competition are publicly available at the UCI KDD Archive at http://kdd.ics.uci.edu.


Data Preparation

(Slide: Donor Master, Demographics, and Transaction Detail tables combine into the Raw Analysis Data: 95,412 records, 481 fields.)

Before a predictive model can be built to address the organization’s analytic objective, an analysis data set must be assembled. Usually an analysis data set is assembled from multiple source data sets. Examples of the source data sets include a donor master table containing facts about individual donors, demographic overlay data from external data vendors or public sources (like the U.S. Census Bureau), and transaction detail tables that capture the flow of information and money to and from the organization.

Using a variety of summarization and transformation techniques, these data sets were combined to form a raw analysis data set. The defining characteristic of the analysis data set is the presence of a single record for each individual in the analysis population. The KDD-Cup supplied data, called cup98lrn.txt on the UCI website, is an example of a raw analysis data set. It contains more than 95,000 records and almost 500 fields. Each field provides a fact about an individual in the veterans’ organization’s donor population.
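The defining one-record-per-individual shape can be sketched in miniature. This is a toy illustration only; the field names and values are simplified stand-ins for the actual 481-field data, and the real work was done with SAS summarization and transformation steps:

```python
# Minimal sketch of the assembly step: summarize a transaction detail
# table and join it to a donor master table so the result has exactly
# one record per individual.

donor_master = {1: {"donor_age": 67}, 2: {"donor_age": 54}}
transactions = [(1, 10.0), (1, 15.0), (2, 5.0)]  # (donor_id, gift_amt)

analysis = {}
for donor_id, facts in donor_master.items():
    gifts = [amt for d, amt in transactions if d == donor_id]
    analysis[donor_id] = {
        **facts,
        "gift_count": len(gifts),
        "avg_gift_amt": sum(gifts) / len(gifts) if gifts else None,
    }

print(analysis[1])  # {'donor_age': 67, 'gift_count': 2, 'avg_gift_amt': 12.5}
```

Note how the many-rows-per-donor transaction table is collapsed into per-donor summary fields before the join; that collapse is what makes the result suitable for modeling.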


Additional Data Preparation

(Slide: Raw Analysis Data, 95,412 records and 481 fields, reduced to the Final Analysis Data, 19,372 records and 50 fields.)

The raw analysis data has been reduced for the purpose of this course. A subset of just over 19,000 records has been selected for modeling. As will be seen, this subset was not chosen arbitrarily. In addition, the 481 fields have been reduced to 50. Some fields were eliminated after considering their potential association with the analysis objective (for example, it is doubtful that CD player ownership is strongly correlated with donation potential). Other fields were combined to form summaries of a particular customer behavior.

Appendix 1 presents the program used to transform the raw analysis data into the final analysis data.

It is important to obtain some understanding of the composition of the data before modeling. The following describes the origin and source of the newly created variables.

Analysis Data Definition (donor master data)

CONTROL_NUMBER       Unique donor ID
MONTHS_SINCE_ORIGIN  Elapsed time since first donation
IN_HOUSE             1=Given to In House program, 0=Not In House donor

The donor master data contributes three fields to the final analysis data. The control number uniquely identifies each member of the analysis population. The number of months since origin is a field derived from the first donation date. A final field identifies donors who are part of the organization's In House program.

Analysis Data Definition (demographic and other overlay data)

OVERLAY_SOURCE   M=Metromail, P=Polk, B=both
DONOR_AGE        Age as of June 1997
DONOR_GENDER     Actual or inferred gender
PUBLISHED_PHONE  Published telephone listing
HOME_OWNER       H=homeowner, U=unknown
MOR_HIT          Mail order response hit rate

The next fields come from demographic and external vendor overlays. By matching on the donor's name and address (found in the donor master file), information about the donor can be obtained from commercial data vendors like Metromail and Polk. Most of these fields are self-explanatory with the exception of the mail order response hit rate. This field counts the number of known responses to mail order solicitations from all known sources.

Analysis Data Definition (demographic and other overlay data)

PER_CAPITA_INCOME     Income per capita in dollars
MED_HOUSEHOLD_INCOME  Median income in $100's
CLUSTER_CODE          54 socio-economic cluster codes
SES                   5 socio-economic cluster codes
WEALTH_RATING         10 wealth rating groups
INCOME_GROUP          7 income group levels

Intuitively, an association should exist between affluence and largesse. Based on this intuition, there are six separate fields in the final analysis data set that capture some aspect of wealth. The socio-economic field SES is a roll-up of the socio-economic field CLUSTER_CODE. Income group divides individuals into seven income brackets. Median household income and per-capita income are from U.S. Census data aggregated to the census block level. Wealth rating is a field that measures the wealth of an individual relative to others in his or her state.

Analysis Data Definition (demographic and other overlay data)

MED_HOME_VALUE      Median home value in $100's
PCT_OWNER_OCCUPIED  Percent owner occupied housing
URBANICITY          U=urban, C=city, S=suburban, T=town, R=rural, ?=unknown

Another potential discriminator of donation potential is captured in facts about an individual's domicile. Median home value and percent owner occupied data are taken from U.S. Census data. Urbanicity classifies an individual's address into one of five urbanization categories (or unknown).

Analysis Data Definition (census overlay data)

PCT_MALE_MILITARY     Percent male military in block
PCT_MALE_VETERANS     Percent male veterans in block
PCT_VIETNAM_VETERANS  Percent Vietnam veterans in block
PCT_WWII_VETERANS     Percent WWII veterans in block

The raw modeling data contains almost 300 fields taken from the 1990 U.S. Census. These fields describe the demographic composition of an individual's neighborhood. While vast, this collection of seven-year-old U.S. Census data has limited predictive potential. Thus, the final analysis data includes only four of these fields.


Analysis Data Definition (transaction detail data)

NUMBER_PROM_12  Number promotions last 12 mos.
CARD_PROM_12    Number card promotions last 12 mos.

All of the individuals included in the modeling population have donated to the veterans’ organization before. The transaction detail data captures these donations. While most of the fields described thus far are aggregate measures applied to the individual, the information captured in the transaction detail file speaks directly to the behavior of the individual. It is therefore, perhaps, the richest source of information about future donation potential.

The transaction detail data is aggregated over various time spans. The more recent data is shown here. According to the data's documentation, these fields refer to the total number of promotions and card promotions received between March 1996 and March 1997. Because 97NK is itself a card promotion or mailing, separating this count from the overall total will distinguish individuals more responsive to card promotions.

Analysis Data Definition (transaction detail data)

FREQ_STATUS_97NK     Frequency status, June '97
RECENCY_STATUS_96NK  Recency status, June '96
LAST_GIFT_AMT        Amount of most recent donation
MONTHS_SINCE_LAST    Months since last donation

The frequency status for the 97NK campaign is defined to be the number of donations received between June of 1995 and June of 1996. It is coded as 1, 2, 3, or 4, where 4 implies four or more donations.


Recency status as of June 1996 classifies individuals into one of six categories. The categories are defined as follows:

F First time donor. Anyone who has made their first donation in the last six months and has made only one donation.

N New donor. Anyone who has made their first donation in the last 12 months and is not a first time donor.

A Active donor. Anyone who has made their first donation more than 12 months ago and has made a donation in the last 12 months.

L Lapsing donor. Anyone who has made their last donation between 12 and 24 months ago.

I Inactive donor. Anyone who has made their last donation more than 24 months ago.

S STAR donor. Anyone who has given to three consecutive card mailings.
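The six categories above can be sketched as a rule cascade. This is one illustrative reading of the definitions, checked in the order STAR, first time, new, active, lapsing, inactive; the organization's actual coding rules may order or define edge cases differently:

```python
# Hypothetical classifier for the six recency categories. Inputs are
# months since first/last donation, the total donation count, and the
# number of consecutive card mailings given to (for STAR status).

def recency_status(months_since_first, months_since_last,
                   total_donations, consecutive_card_gifts):
    if consecutive_card_gifts >= 3:
        return "S"  # STAR donor
    if months_since_first <= 6 and total_donations == 1:
        return "F"  # first time donor
    if months_since_first <= 12:
        return "N"  # new donor
    if months_since_last <= 12:
        return "A"  # active donor
    if months_since_last <= 24:
        return "L"  # lapsing donor
    return "I"      # inactive donor

print(recency_status(36, 18, 5, 0))  # L (last gift 12-24 months ago)
```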

Months since last donation and last gift amount describe the most recent donation. In theory, all individuals in the modeling population are lapsing donors as of the 97NK mailing. This implies that none have made a donation between June 96 and June 97. However, for a limited number of cases, the number of months since last gift is fewer than 12. This contradiction is not resolved in the data’s documentation, nor will it be resolved here.

Analysis Data Definition (RECENT transaction detail data)

RESPONSE_PROP       Response proportion since June '94
RESPONSE_COUNT      Response count since June '94
AVG_GIFT_AMT        Average gift amount since June '94
RECENT_STAR_STATUS  STAR (1, 0) status since June '94

Moving further back in time, the next fields describe the donation behavior between June 1994 and June 1996. Recent response proportion measures the ratio of donations to solicitations. Recent response count counts the total number of donations in the time frame. Recent average gift amount takes the total dollars donated in the time frame and divides by the number of donations. Recent star status indicates whether an individual achieved STAR status between June 1994 and June 1996.
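These three summaries can be sketched from a list of solicitation records. The (responded, gift_amount) record layout is invented for illustration, and response count is read here as the number of donations in the time frame:

```python
# Sketch of the RECENT summaries: response proportion, response count,
# and average gift amount, computed from (responded, gift_amount)
# solicitation records.

def recent_summaries(solicitations):
    gifts = [amt for responded, amt in solicitations if responded]
    return {
        "response_prop": len(gifts) / len(solicitations),
        "response_count": len(gifts),
        "avg_gift_amt": sum(gifts) / len(gifts) if gifts else 0.0,
    }

recs = [(True, 10.0), (False, 0.0), (True, 20.0), (False, 0.0)]
print(recent_summaries(recs))
# {'response_prop': 0.5, 'response_count': 2, 'avg_gift_amt': 15.0}
```

Note that the average is taken over donations, not over solicitations, matching the description above.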


Analysis Data Definition (RECENT transaction detail data)

CARD_RESPONSE_PROP   Response proportion since June '94
CARD_RESPONSE_COUNT  Response count since June '94
CARD_AVG_GIFT_AMT    Average gift amount since June '94

These fields are similar to the previous, but they describe only card mailings. They are included to distinguish individuals who are more responsive to card promotions.

Analysis Data Definition (LIFETIME transaction detail data)

PROM          Total number promotions ever
GIFT_COUNT    Total number donations ever
AVG_GIFT_AMT  Overall average gift amount
PEP_STAR      STAR status ever (1=yes, 0=no)


Analysis Data Definition (LIFETIME transaction detail data)

GIFT_AMOUNT  Total gift amount ever
GIFT_COUNT   Total number donations ever
MAX_GIFT     Maximum gift amount
GIFT_RANGE   Maximum less minimum gift amount

These variables summarize behavior over the lifetime of the individual’s association with the veterans’ organization. Most are self-explanatory with the exception of the STAR status ever indicator. Analysis shows that there are individuals with recent STAR status who do not have PEP_STAR=1. This may indicate an error in the data or some change in the definition of STAR status in the past.

Analysis Data Definition (KDD-supplied LIFETIME transaction detail data)

MONTHS_SINCE_LAST   Last donation date from June '97
MONTHS_SINCE_FIRST  First donation date from June '97
FILE_AVG_GIFT       Average gift from raw data
FILE_CARD_GIFT      Average card gift from raw data

Several fields in the raw data were derivable from other fields in the data but were nevertheless included. Curiously, the derived values and the provided values do not always agree. Since it is impossible to determine which are “correct,” these supplied values were also included in the final analysis data.
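The kind of consistency check implied above can be sketched as follows. The tolerance and the sample numbers are assumptions for illustration, not values from the data:

```python
# Sketch of a derived-versus-supplied consistency check: recompute a
# value from the detail fields (for example, an average gift from a
# total and a count) and compare it with the file-supplied value.

def agrees(derived, supplied, tol=0.01):
    return abs(derived - supplied) <= tol

derived_avg = 480.0 / 32           # total gift amount / gift count
print(agrees(derived_avg, 15.0))   # True: derived and supplied match
print(agrees(derived_avg, 14.50))  # False: flag for investigation
```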


Analysis Data Definition (transaction detail data: target definition)

TARGET_B  Response to 97NK solicitation (1=yes, 0=no)
TARGET_D  Response amount to 97NK solicitation (missing if no response)

The final two fields are the two most important in the entire analysis data set. They describe the response behavior to the 97NK campaign. The models to be built will attempt to predict their value in the presence of all the other information in the analysis data set.
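The relationship between the two targets can be sketched from a single raw response record. The function is illustrative, not the course's actual derivation code; None stands in for a missing SAS value:

```python
# Sketch of the target definition: TARGET_B is the 0/1 response flag,
# TARGET_D the donation amount, missing (None here) for non-responders.

def make_targets(donation_amount):
    """donation_amount: None if the individual did not respond to 97NK."""
    if donation_amount is None:
        return {"TARGET_B": 0, "TARGET_D": None}
    return {"TARGET_B": 1, "TARGET_D": donation_amount}

print(make_targets(25.0))  # {'TARGET_B': 1, 'TARGET_D': 25.0}
print(make_targets(None))  # {'TARGET_B': 0, 'TARGET_D': None}
```

TARGET_D is thus defined only on the responders, which is why the two fields support different kinds of models (classification versus amount prediction).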


1.2 Preparing the Tools

In this course, you build predictive models to help decide which lapsing donors are likely to reactivate. Much of this work will be accomplished via Enterprise Miner, SAS’ premium data mining tool.

The Enterprise Miner interface simplifies many common tasks associated with the construction of predictive models. The interface is divided into three components:

• EM Tools Bar – contains a customizable subset of Enterprise Miner tools that are commonly used to build process flow diagrams in the Diagram Workspace. You can add or delete tools from the Tools Bar.

• Project Navigator – enables you to manage projects and diagrams, add tools to the Diagram Workspace, and view HTML reports that are created by the Reporter node.

• Diagram Workspace – is the area for building, editing, running, and saving process flow diagrams.

Enterprise Miner Interface

(Screenshot: the EM Tools Bar, the Diagram Workspace, and the Project Navigator, with its Current Project, Diagram Tools, and Result Summaries panes labeled.)

The Project Navigator is organized into three tabs. The Project tab lists the current project and its associated diagrams. The Tools tab lists Enterprise Miner's modeling tools. Tools are added to the Diagram Workspace and are subsequently referred to as nodes. Arrows connect the nodes to define a process flow diagram. Including a Reporter node in the diagram produces a report summarizing the diagram's settings and results. Reports generated by a Reporter node are listed in the Project Navigator's Reports tab.

The Enterprise Miner interface is a component of the SAS System. To open the Enterprise Miner window, you must first initiate a SAS session on your PC. In this way, all the features of SAS are available to Enterprise Miner, including its powerful programming language, so you can extend the capabilities of Enterprise Miner to include any operation programmable in SAS.


Enterprise Miner Analytic Processing (client-only operation)

(Diagram: raw data from SAS data sets, a data warehouse, or a DBMS flows to the client PC, which stores both the project data and the intermediate data.)

In the simplest configuration, the SAS System and the Enterprise Miner interface are run on a client PC. In client-only operation, all data processing and analysis occurs on this PC. Predictive models are constructed from raw data read in from SAS data sets, data warehouses, or other DBMS tables. The raw data tables do not need to reside on the client PC.

As the analytic processing proceeds, Enterprise Miner creates additional data sets. Some of these data sets describe facts about the project itself (diagram setup, node topology, node settings, and so on). Some contain transformed versions of the original raw data. For client-only operation of Enterprise Miner, both types of data are typically stored on the client PC.

Enterprise Miner Analytic Processing (client-server operation)

(Diagram: raw data from SAS data sets, a data warehouse, or a DBMS flows to the EM server, which stores the intermediate data; the client PC stores the project data and a sampled copy of the intermediate data.)

In more advanced, client-server installations of Enterprise Miner, analytic processing tasks are divided across two computers. The Enterprise Miner interface and its associated SAS session run on a client PC. However, most data and analytic processing is handled by a separate installation of SAS and Enterprise Miner on a (presumably powerful) server.

A typical server has the capability to rapidly process very large data sets. For efficiency, access to raw data and storage of intermediate data occurs on the server. To allow for modeling even in the absence of a server connection, a small sample of the raw data and intermediate data sets are transferred to the client PC.


Creating a Client-Only Project

Your first task is to define a new Enterprise Miner project. Start a SAS session on your PC. With a SAS session initiated, you must next start the Enterprise Miner interface. There are two ways to do this:

• From the SAS menu bar, select Solutions → Analysis → Enterprise Miner.

• At the SAS command box, type miner and then press the Enter key.


The Enterprise Miner interface appears.

The default Enterprise Miner project is client-only. Project and intermediate data is stored in a location prescribed by the installation of SAS. You can use this default project, but it is often useful (for backup and sharing) to create a new project in another location.

1. Select the Enterprise Miner interface window.

2. From the menu bar, select File → New Project…. The Create new project window opens.

This window allows you to name and locate a new Enterprise Miner project.

3. Enter PVA Project in the Name field and select Create. The Project Navigator now shows a project named PVA Project with a diagram named Untitled.

Behind the scenes, Enterprise Miner has created a new directory called PVA Project. Within this new directory are three subdirectories: emdata, emproj, and reports. The emdata directory is a repository for the intermediate results files generated by the analysis. The emproj directory contains operational information pertaining to the newly defined project. The reports directory is a repository for any HTML reports generated within the project.

4. In the Project Navigator window, select the diagram Untitled and enter the new name PVA Analysis.

5. Double-click the newly named PVA Analysis diagram icon. The Diagram Workspace changes from gray to white. You can now add Enterprise Miner tools to the Workspace.


Accessing Raw Modeling Data

Now use Enterprise Miner to build a model to predict the value of TARGET_B based on the supplemental facts contained in the 97NK training data. As outlined above, the first task is to access the raw modeling data. In Enterprise Miner, raw modeling data is usually accessed using the Input Data Source tool.

Place an Input Data Source node in the Diagram Workspace:

1. Select the Tools tab in the Project Navigator. A list of Enterprise Miner diagram tools appears.

2. Drag and drop the Input Data Source tool into the Diagram Workspace. When complete, the Enterprise Miner interface appears as follows:

To point the newly placed Input Data Source node to the raw 97NK data set, you need to change the node’s settings.

1. Double-click the Input Data Source node. The Input Data Source window opens.


2. Select Select…. The SAS Data Set window opens with the SASUSER library selected.

3. Select the Library pop-up menu and select CRSSAMP.

If you do not have the CRSSAMP library defined, you need to create a SAS library pointing to the raw data. To do this:

a. Close the SAS Data Set window by clicking in the upper-right corner or by selecting OK.

b. Type libassign in the SAS command box and press Enter. The New Library window opens.

c. Type CRSSAMP in the Name field.


d. Enter the path (or browse) to the course data directory.

e. Select Enable at startup. This makes CRSSAMP a permanent SAS library.

f. Select OK.

4. Select PVA_RAW_DATA from the CRSSAMP library.

5. Select OK.

The Input Data Source window reopens filled with information about the PVA_RAW_DATA data set.

A link to the raw modeling data has been established. The Input Data Source node will import the raw data, process it (as will be seen in Section 1.3), and output a data object called EMDATA.VIEW_XXX to subsequent nodes.


1.3 Constructing a Predictive Model

Predicting the Unknown

[Slide: a predictive model maps input measurements to an expected target value. Measurement scales: Interval (20.00, 12.50, 5.00, …), Ordinal (Lower, Middle, Upper), Nominal (CA, GA, NY, TX, …), Binary (F, M).]

The fundamental problem in prediction is the correct determination of an unknown quantity in the presence of supplementary facts. In this course, the unknown quantity is called a target and the supplementary facts are called inputs.

The inputs and target typically represent measurements of an observable phenomenon. The measurements found in the input and target variables are recorded on one of several measurement scales. Enterprise Miner recognizes the following measurement scales for the purposes of model construction.1

• Interval measurements are quantitative values permitting certain simple arithmetic or logarithmic transformations (for example, monetary amounts).

• Ordinal measurements are qualitative attributes having an inherent order (for example, income group).

• Nominal measurements are qualitative attributes lacking an inherent order (for example, state or province).

• Binary measurements are qualitative attributes with only two levels (for example, gender).

To solve the fundamental problem in prediction, a mathematical relationship between the inputs and the target is constructed. This mathematical relation is known as a predictive model. Once established, the predictive model can be used to produce an estimate of an unknown target value given a set of input measurements.

1 Additional measurement scale categories are commonly found in the scientific literature. See Sarle (1996).


Training Data

[Slide: previously observed cases, with an unknown target value to predict.]

Construction of predictive models requires training data, a set of previously observed input and target measurements, or cases. The cases of the training data are assumed to be representative of future (unobserved) input and target measurements.2

An extremely simplistic predictive model assumes all possible input and target combinations are recorded in the training data. Given a set of input measurements, you need only to scan the training data for identical measurements and note the corresponding target measurement.

Often in a real set of training data, a particular set of inputs corresponds to a range of target measurements. Because of this noise, predictive models usually provide the expected (average) value of the target for a given set of input measurements. With a qualitative target, (ordinal, nominal, or binary) the expected target value may be interpreted as the probability of each qualitative level. Both situations suggest that there are limits to the accuracy achievable by any predictive model.

Usually, a given set of input measurements does not yield an exact match in the training data. How you compensate for this fact distinguishes various predictive modeling methods.

Perhaps the most intuitive way to predict cases lacking an exact match in the training data is to look for a nearly matching case and note the corresponding target measurement. This is the philosophy behind nearest-neighbor prediction and other local smoothing methods.

2 In statistical terms, all cases in the training data set are assumed to be independent (that is, the measurements in one case were not affected by the measurements of one or more other cases) and the underlying distribution of the inputs and targets is stationary (not changing in time). The failure of either assumption results in poor predictive performance.


Nearest Neighbor Prediction

[Slide: nearest neighbor prediction partitions the space of Input1 and Input2 into cells of distinct target values; the cell edges form the decision boundary.]

Nearest neighbor prediction (classification) has a long history in the statistical literature, starting in the early 1950s. However, you could argue that its philosophical roots date back (at least) to the taxonomists of the 19th century.

In its simplest form, the predicted target value equals the target value of the nearest training data case. You can envision this process as partitioning the input space, the set of all possible input measurements, into “cells” of distinct target values. The edge of these cells, where the predicted value changes, is known as the decision boundary. A nearest neighbor model has a very complex decision boundary.

Generalization

[Slide: the nearest neighbor model achieves 100% accuracy on the training data but only 63% on the validation data.]

A model is only as good as its ability to generalize to new cases. While nearest neighbor prediction perfectly predicts training data cases, performance on new cases (validation data) may be substantially worse. This is especially apparent when the data are noisy (every small region of the input space contains cases with several distinct target values). In the slide above, the true value of each validation data case is indicated by dot color. Any case whose nearest neighbor has a different color is incorrectly predicted, indicated by a red circle surrounding the case.


Tuning a Predictive Model

[Slide: accuracy versus neighborhood size; training accuracy falls from 100% while validation accuracy rises from about 60% as the neighborhood grows.]

Most predictive modeling methods possess tuning mechanisms to improve generalization. One way to tune a nearest neighbor model is to change the number of training data cases used to make a prediction. Instead of using the target value of the nearest training case, the predicted target is taken to be the average of the target values of the k nearest training cases. This interpolation makes the model much less sensitive to noise and typically improves generalization.
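The k-nearest-neighbor idea can be sketched in a few lines of Python. This is a toy illustration of the technique, not Enterprise Miner code; the data and function name here are invented for the example:

```python
import numpy as np

def knn_predict(train_X, train_y, x, k=1):
    """Predict the target for x as the average target of the k nearest training cases."""
    # Euclidean distance from x to every training case
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]    # indices of the k closest cases
    return train_y[nearest].mean()     # average target = expected value

# Toy training data: one input, binary target
X = np.array([[0.1], [0.2], [0.8], [0.9]])
y = np.array([0, 0, 1, 1])

print(knn_predict(X, y, np.array([0.15]), k=1))  # 0.0 (nearest case has target 0)
print(knn_predict(X, y, np.array([0.5]),  k=4))  # 0.5 (average over all four cases)
```

With k=1 the prediction simply copies the nearest case's target; as k grows, the prediction becomes an average and the decision boundary smooths out.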

In general, models are tuned to match the specific signal and noise characteristics of a given prediction problem. When there is a strong signal and little noise, highly sensitive models can be built with complex decision boundaries. Where there is a weak signal and high noise, less sensitive models with simple decision boundaries are appropriate. In Enterprise Miner, monitoring model performance on validation data usually determines the appropriate tuning value.

Curse of Dimensionality

[Slide: training data with an additional extraneous input.]

Another way to tune predictive models is by choosing appropriate inputs for the model. This choice is critical. Including extraneous inputs (that is, those unrelated to the target) can devastate model performance. This phenomenon, known as the curse of dimensionality, is the general observation that the complexity of a data set increases with dimension.

Cases that are nearest neighbors in two dimensions need not be nearest neighbors in three dimensions. When only two of the three dimensions are related to the target, this can degrade the performance of a nearest neighbor model.

[Slide: cases that are nearest neighbors in the relevant inputs need not be nearest neighbors once extraneous inputs are added.]

As the number of extraneous inputs increases, the problem becomes worse. Indeed, in high dimensions, the concept of “nearest” becomes quite distorted.

Suppose there are 1000 cases scattered randomly but uniformly on the range 0 to 1 of 10 independent inputs. On any one input, an interval of length ½ centered at ½ contains about 500 cases. Now take any pair of inputs. How many of the cases are in the center half of both inputs? If the inputs are independent, as assumed, the answer is about 250. For three inputs, there are about 125, and so on. Perhaps surprisingly, with ten inputs only about 1 case out of 1000 is simultaneously in the center half of all inputs. Put another way, 99.9% of the cases are on the outer edges of this 10-dimensional input space.

To maintain some sense of nearness in high dimensions requires a tremendous increase in the number of training cases. For example, a square region containing about 10 of 1,000 cases in two dimensions has sides of length 1/10. To get about 10 cases in a region with sides of length 1/10 in 10 dimensions (assuming a uniform distribution) requires, on average, more than 1,000,000,000 cases!
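The arithmetic above is easy to verify: with d independent uniform inputs, the fraction of cases in the center half of every input is (1/2)^d, and the number of cases needed to populate a small cube grows as the reciprocal of its volume. A quick check (illustrative Python, not part of the course software):

```python
# Fraction of 1000 uniformly scattered cases lying in the "center half" of every input
for d in (1, 2, 3, 10):
    print(d, round(1000 * 0.5 ** d, 1))
# 1 500.0
# 2 250.0
# 3 125.0
# 10 1.0   -- about 1 case out of 1000; 99.9% sit on the outer edges

# Cases needed so a cube with sides of length 1/10 holds ~10 cases in 10 dimensions:
# 10 / (1/10)^10, i.e. roughly 100,000,000,000 -- far more than 1,000,000,000
print(10 / (0.1 ** 10))
```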


Falling Under the Curse

[Slide: with extraneous inputs added, accuracy versus neighborhood size degrades on both training and validation data.]

With two relevant and eight extraneous inputs, the accuracy of the nearest neighbor algorithm decreases, even on the training data.

Matters are worse in a typical prediction problem: for every relevant input there may be dozens of extraneous ones. This devastates the performance of nearest neighbor methods and raises the question of how to proceed.

Breaking the Curse

[Slide: two routes around the curse: predictive algorithms, such as partitioning models (trees), and parametric models, such as regression models.]

To overcome the curse of dimensionality, you must use predictive modeling techniques that capture general trends in the data while ignoring extraneous information. To do this, the focus must shift from individual cases in the training data to the general patterns they create.

Two approaches are widely used to overcome the curse of dimensionality. Predictive algorithms employ simple heuristic rules to reduce dimension. Parametric models are constrained to limit overgeneralization. While this classification is used to group predictive models for the purposes of this course, the distinction is somewhat artificial. Predictive algorithms often utilize predictive models; predictive models often employ predictive algorithms.

Decision Rule

[Slide: create models to extol the obvious and ignore the extraneous; a simple decision rule with a single split reaches 73% accuracy on the training data.]

An example of a predictive algorithm is a simple decision rule. In the example above, a single partition of the input space can lead to a surprisingly accurate prediction. This partition takes advantage of the clustering of solid cases in the right half of the original input space. It isolates cases with like-valued targets in each part of the partition.

Recursive Partitioning

[Slide: accuracy versus partition count; training accuracy climbs toward 100% while validation accuracy peaks and then falls back toward 50%.]

It is not hard to devise techniques that search for and isolate cases with like-valued targets (and many researchers from many independent disciplines have!). The common element of these techniques is the recursive partitioning of the input space.

Partitions of the training data, based on the values of a single input, are considered. The worth of a partition is measured by how well it isolates distinct groups of target values. The input/partition combination with the highest worth is selected and the training data is split accordingly. The process continues by further subdividing each resulting split group. Ultimately, the satisfaction of certain stopping conditions terminates the process.

A predictive model can be built from the partitioning process by averaging the target in each final partition group and assigning this average to every case in the group.

The number of times the partitioning process repeats can be thought of as a tuning parameter for the model. Each iteration subdivides the training data further and increases training data accuracy. However, increasing the training data accuracy often diminishes generalization. As with nearest neighbor models, validation data can be used to pick the optimal tuning value.

Recursive partitioning techniques resist the curse of dimensionality by ignoring inputs not associated with the target. If every partition involving a particular input results in partition groups with similar average target values, the calculated worth of these partitions will be small. The particular input is not selected to partition the data and it is effectively disregarded.
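A minimal sketch of a single partitioning step helps make this concrete. The example below is illustrative Python, not the Tree tool's actual algorithm: it uses reduction in Gini impurity as the worth measure (one of several criteria such tools offer) and shows that a pure-noise input loses the split competition to a relevant one, so it is effectively disregarded:

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary target vector."""
    if len(y) == 0:
        return 0.0
    p = y.mean()
    return 2 * p * (1 - p)

def best_split(X, y):
    """Find the input/threshold pair whose split most reduces impurity (its 'worth')."""
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            worth = gini(y) - (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if worth > best[2]:
                best = (j, t, worth)
    return best

rng = np.random.default_rng(0)
relevant = rng.uniform(size=200)      # input 0: drives the target
extraneous = rng.uniform(size=200)    # input 1: pure noise
y = (relevant > 0.6).astype(int)
X = np.column_stack([relevant, extraneous])

j, t, worth = best_split(X, y)
print(j)   # 0 -- the extraneous input is never selected
```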

Because they can quickly identify inputs with strong target associations, recursive partitioning methods are ideally suited to the role of initial predictive modeling methodology.

The task that motivates predictive modeling in this course has been outlined in Section 1.1. Lapsing donors have been identified by basic business rules. Some of these donors will be subsequently ignored; some will continue to be solicited for donation. A data set describing the donation response to a mailing (identified as 97NK) will be used to make this decision.

The simplest approach to this problem involves estimating donation propensity from the 97NK data. Individuals with the highest probability of response are selected for continued solicitation. Those with the lowest probability of response are ignored in the future. For now, the amount of response enters into the solicitation decision after the propensity to donate is estimated.

The unknown target for this model is a binary variable, TARGET_B, that indicates donation response to the 97NK mailing. Other variables in the training data provide supplemental facts about each individual.


Defining Modeling Metadata

Metadata, while not yet defined in any popular dictionary, means data about data sets. Some metadata, such as field names, are stored with the data. Other metadata, such as how a particular variable in a data set should be used in a predictive model, must be manually specified. Defining modeling metadata is the process of establishing relevant facts about the data set prior to model construction.

Select the Variables tab of the Input Data Source window.

To construct a model predicting donation propensity, two metadata fields must be defined:

Model Role defines the function of each variable in a predictive model. Typical values for Model Role include input, target, id, and rejected.

Measurement distinguishes continuous scale variables from categorical variables. Continuous variables have Measurement set to interval. Categorical variables have Measurement set to unary, binary, ordinal, or nominal depending on the number and type of values assumed by the variable.

Enterprise Miner assigns default values for both Model Role and Measurement. These values are determined from the metadata sample, a sample of data created when the raw modeling data is initially accessed. A 2,000-record metadata sample determines the default values for Model Role and Measurement as follows:

Model Role   Measurement   Metadata sample contents
----------   -----------   ------------------------
input        interval      numeric variable with more than 10 distinct non-missing values
input        ordinal       numeric variable with 3 to 10 distinct non-missing values (or numeric variable with certain SAS formats)
input        binary        variable with 2 distinct non-missing values
input        nominal       character variable (or numeric variable with certain SAS formats) having more than 3 distinct non-missing values
id           nominal       character variable with number of distinct values greater than 90% of the metadata sample size
rejected     unary         variable with fewer than 2 distinct values
rejected     interval      numeric date or datetime variable
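The default-assignment rules in the table can be captured in a rough sketch. This is simplified, illustrative Python, not Enterprise Miner internals: it omits the SAS-format special cases, and the function name is invented for the example:

```python
def default_metadata(is_numeric, distinct, sample_size, is_date=False):
    """Rough sketch of the default Model Role / Measurement assignment rules."""
    if distinct < 2:
        return ("rejected", "unary")          # fewer than 2 distinct values
    if is_numeric and is_date:
        return ("rejected", "interval")       # date or datetime variable
    if distinct == 2:
        return ("input", "binary")
    if is_numeric:
        # more than 10 distinct values -> interval; 3 to 10 -> ordinal
        return ("input", "interval") if distinct > 10 else ("input", "ordinal")
    # character variables
    if distinct > 0.9 * sample_size:
        return ("id", "nominal")              # nearly unique values look like an ID
    return ("input", "nominal")

print(default_metadata(True, 50, 2000))      # ('input', 'interval')
print(default_metadata(False, 1950, 2000))   # ('id', 'nominal')
print(default_metadata(True, 5, 2000))       # ('input', 'ordinal')
```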

To build a predictive model, one or more variables must be identified as the target. To build a response propensity model, change the Model Role of TARGET_B to target.

1. Right-click the Model Role field for TARGET_B.

2. Select Set Model Role from the pop-up menu.

3. Select target.

Other changes need to be made to the initial metadata assignments. For example, the donation propensity model does not use TARGET_D. To build a meaningful predictive model, change the following additional metadata fields:

Variable                 Metadata field   Value
--------                 --------------   -----
TARGET_D                 Model Role       rejected
INCOME_GROUP             Measurement      interval
FREQUENCY_STATUS_97NK    Measurement      interval

Occasionally, the 2000-record metadata sample fails to detect more than 10 distinct non-missing values on certain fields. This results in Enterprise Miner setting Measurement for the field to ordinal instead of interval. To obtain results consistent with these notes, any field with Measurement set to ordinal should be changed to interval.

This completes the initial assignment of the metadata for the donation propensity model. Close the Input Data Source window.

1. Select the Input Data Source window's close box.


The Save Changes prompt window opens.

2. Select Yes.

The Input Data Source window closes and the Enterprise Miner interface is selected.


Building a Predictive Model

The metadata in the Input Data Source node identifies 47 inputs for predicting donation propensity. Not all of these inputs will be needed to build a successful predictive model.

In the absence of prior experience with the donation data, your first modeling goal is to identify which of the 47 inputs are most strongly related to donation propensity. Given its ability to ignore extraneous inputs, a recursive partitioning model is ideally suited to this task.

The Tree tool is the primary recursive partitioning method for Enterprise Miner. It enables you to construct several types of decision tree models. To use the Tree tool:

1. Scroll the Tools tab in the Project Navigator to show the Model tools group.

2. Drag and drop the Tree tool into the Diagram Workspace, to the right of the Input Data Source node.

Two tasks have been defined in the PVA Analysis diagram: reading the raw data and building a decision tree. For Enterprise Miner to actually perform these tasks, you must specify the order in which to do them. Order is established by connecting the nodes with arrows to form a process flow diagram.

Draw an arrow from the Input Data Source node to the Tree node. This instructs Enterprise Miner to perform the Input Data Source node task first and then perform the Tree node task. To draw the arrow:

1. Click in the Diagram Workspace away from both nodes. This causes Enterprise Miner to deselect any selected node.

2. Move the cursor next to the Input Data Source node until it changes to a crosshair.

3. Click and drag a line to the Tree node.


4. Click the Diagram Workspace away from both nodes.

The diagram appears as follows:

If no line forms, click away from the nodes and try again. Adding connection arrows can require a little practice.

Now that Enterprise Miner knows the order in which to process the nodes, the next task is to actually invoke the process.

1. Right-click the Tree node. A menu of node options appears.

2. Select Run.

Enterprise Miner begins to build a predictive model. Progress through the process flow diagram is indicated by a green square. Upon successful completion of the tree model, a dialog box opens.

3. Select Yes. The Results-Tree window opens.

The Results-Tree window summarizes the results of the recursive partitioning model fit by Enterprise Miner. It is partitioned into four components.

1. The Summary Table (upper left) describes how well the Tree model predicts the individual levels of TARGET_B. From left to right, the columns contain the type of summary, the actual values of TARGET_B, the quantity (counts or percentages) predicted into each target level, and the sum of the quantity columns.

2. The Tree Ring Diagram (upper right) describes how the training data is partitioned. The center of the diagram indicates the entire data set. Concentric rings indicate how the cases are partitioned versus increasing model complexity.


The color indicates the accuracy of predictions within a partition (red=high accuracy, yellow=low accuracy).

3. The Assessment Table (lower left) tabulates model complexity and accuracy.

4. The Assessment Plot (lower right) plots the Assessment Table.

As model complexity increases (here indicated by the number of leaves), the accuracy of the model also increases. The chosen model complexity, 25 leaves, has a prediction accuracy of 75.59%. This can be read either from the Assessment Table or the Assessment Plot. The Summary Table shows that of the 4,843 cases with TARGET_B=1, the model correctly predicts 279. Likewise, of the 14,529 cases with TARGET_B=0, the model correctly predicts 14,364. Clearly the model is much more successful at predicting TARGET_B=0 than TARGET_B=1. Understanding why requires looking at the model in more detail.

The Tree tool in Enterprise Miner takes its name from the usual presentation of recursive partitioning models.

1. From the SAS menu bar, select View → Tree. The Tree Diagram window opens.

The tree diagram summarizes the recursive partitioning of the data. Each box is called a node. The top box, or root node, shows donations (TARGET_B=1) in 25%, or 4,843, of the cases.

The first partition (of the entire training data) separates the most enthusiastic donors from the rest. In the enthusiastic PEP_STAR=1 group, 29.5% of the cases made donations to the 97NK campaign. In the less enthusiastic PEP_STAR=0 group, 20.4% of the cases made donations.

Further partitioning into subgroups occurs. The tree structure presentation shows the inputs and values used to make the partitions as well as the proportion of donations within each subgroup.

2. To examine some of the extremes in donation propensity, scroll to the lower-left corner of the tree.


A node in a decision tree that is not partitioned is called a terminal node or leaf. The proportion of cases in each target level in a terminal node provides the expected value of the target in a recursive partitioning model.

The terminal node at the lower left shows donations in more than 85% of the cases. This proportion is used as the predicted value of TARGET_B. The model has identified a subgroup with an almost 20-fold increase in the odds of donation versus the entire training data. Unfortunately, only 30 cases fall into the subgroup.

3. Scroll to the center of the tree.

Here you find a node showing no donation in about 85% of the cases. This much larger group of cases shows a 50% decrease in the odds of donation compared to the entire training data.
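The odds comparisons in the last two paragraphs are easy to check directly, with odds defined as p / (1 − p):

```python
def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

baseline = odds(0.25)   # entire training data: 25% donors
high = odds(0.85)       # lower-left terminal node: 85% donors
low = odds(0.15)        # center node: donations in only 15% of cases

print(round(high / baseline, 1))   # 17.0 -- an "almost 20-fold" increase
print(round(low / baseline, 2))    # 0.53 -- roughly a 50% decrease
```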

Perhaps surprisingly, this low-donation-probability node is not in fact a terminal node. By default, Enterprise Miner’s Tree tool only displays three levels of partitioning. This is indicated in the title bar of the Tree Diagram window.

1. To see the entire tree structure, from the SAS menu bar, select View → Tree Options…. The Set Tree Depth Options window opens.


Although it is difficult to read, the Tree depth down field is set to 3.

2. Change the Tree depth down field to 6.

3. Select OK. The Set Tree Depth Options window closes and the Tree Diagram window is updated to display the entire tree structure.

Detailed inspection of the entire tree structure reveals subgroups with even greater extremes of donation propensity than those already discussed. It is tempting to concoct stories to explain these extremes based on the partition rules found in the tree. While these stories would be true, they would apply only to the training data used by Enterprise Miner to build this particular Tree model. In general, they would not generalize to the entire population of potential donors.


Tuning a Predictive Model

The previous discussion reveals a flaw in the present modeling approach. Without a set of validation data, it is difficult to assess which of the partitions are meaningful and which are training-data-inspired fantasies. Fortunately, Enterprise Miner provides a convenient way to generate a validation data set.

1. Close the Tree Diagram window and the Results-Tree window. A dialog box opens.

2. Select Yes. This saves the adjustment to the tree depth made above. The PVA Analysis window opens.

3. Drag a Data Partition tool onto the diagram and drop it on the arrow connecting the Input Data Source node to the Tree node. The arrow splits and the diagram appears as below.

Dragging and dropping on an existing arrow installs a new node in the diagram only if the arrow is exactly vertical or horizontal. If not, you need to delete the existing arrow and draw two new arrows as before.

The Data Partition node splits a raw data set into components for training, tuning, comparing, and testing predictive models. For this modeling exercise, you need to adjust some of the node's default settings.

1. Double-click the Data Partition node. The Data Partition window opens.


The fields on the right determine the percentages of the raw data to be used for training, validation, and testing (final performance evaluation). A separate data set will be used later for testing. Therefore, only training and validation data sets are required.

In this exercise, half the data is used for training and half for validation.

2. Type 50 in the Train and Validation fields.

3. Type 0 in the Test field.

4. Close the Data Partition window and save the changes. The PVA Analysis window opens.

The diagram is ready to be run again. Instead of using the entire raw data set for training, Enterprise Miner now reserves half the data for model tuning and assessment.

1. Right-click the Tree node and select Run. Because no changes were made to the Input Data Source node, processing will commence at the Data Partition node.

2. View the results when the run is complete. The Results-Tree window opens.

Several changes to the Results-Tree window are apparent. The Summary Table now includes a column describing the data source for each row (TRAIN or VALID). The Assessment Table includes accuracy information for both the training and the validation data. The Assessment Plot includes separate lines for training data (blue) and validation data (red).

The results shown in the Assessment Plot are troublesome. The accuracy on the validation data is uniformly higher than that of the training data. This is certainly counterintuitive.

The cause is a subtle flaw in the way in which the cases were partitioned into training and validation sets. The proportion of donors in the two sets is different. To most easily see this, view the tree representation of the model.

3. From the SAS menu bar, select View → Tree. The Tree Diagram window opens.


In the training data, 25.3% of the cases are donors. In the validation data, the proportion is 24.7%. This difference skews the accuracy statistics. Correcting for this requires a slight adjustment to the way in which the raw data is partitioned.

1. Close the Tree Diagram window and the Results-Tree window.

2. Open the Data Partition node.

3. Change the method to stratified. The Stratification tab in the Data Partition window is ungrayed.

4. Select the Stratification tab. A list of the variables with categorical measurement scales appears.

Selecting a stratification variable forces Enterprise Miner to balance category levels across the training, validation, and test sets.

5. Set the status for TARGET_B to use. The training and validation data sets now contain a similar proportion of donors.

6. Close the Data Partition window and save changes to the Data Partition settings.
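The effect of stratified partitioning can be sketched outside Enterprise Miner. The example below is illustrative Python (the function name is invented): the split is performed separately within each target level, so both halves inherit the same donor proportion:

```python
import numpy as np

def stratified_halves(y, rng):
    """Split case indices in half, separately within each target level."""
    train, valid = [], []
    for level in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == level))
        half = len(idx) // 2
        train.extend(idx[:half])
        valid.extend(idx[half:])
    return np.array(train), np.array(valid)

y = np.array([1] * 250 + [0] * 750)   # ~25% donors, mirroring TARGET_B
tr, va = stratified_halves(y, np.random.default_rng(1))

print(y[tr].mean(), y[va].mean())   # 0.25 0.25 -- matched donor proportions
```

A simple random 50/50 split, by contrast, leaves the two proportions to chance, which is exactly the flaw observed above.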

With the partitioning flaw corrected, it is safe to rebuild the Tree model.

1. Right-click the Tree node and select Run.

2. View the results when the run is complete.


The Assessment Plot now shows typical behavior. As model complexity increases, performance improves on both training and validation data and then diverges. The simplest model that maximizes accuracy on the validation data has seven leaves.

3. Select View Tree.

Inspection of the seven terminal nodes reveals donation proportions ranging from 16% to more than 80% in the training data. With one (minor) exception, similar donation propensities are also observed in the validation data. The model isolates donors from non-donors and shows accuracy improvement (compared to no model) on both training and validation data.

The astute student, however, may pose two objections to the present state of affairs:

• Even the least generous subgroup seems to have an unrealistically high donation propensity. With such a high donation probability, it may not be worth building a model: simply solicit everyone!

• While it is true that the selected 7-leaf model has higher accuracy than the simplest 1-leaf model, in absolute terms this increase is minimal. From the Assessment Table, the validation accuracies of the 1-leaf and 7-leaf models equal, respectively, 75% and 75.13%.

Both objections are reasonable. The first is simply an artifact from the commonly used predictive modeling practice called separate sampling. The second correctly observes that predictive accuracy is not necessarily the best measure of a model’s worth.


1.4 Adjusting Predictions

Separate Sampling

Benefits:
• Helps detect rare target levels
• Speeds processing

Risks:
• Biases predictions (correctable)
• Increases prediction variability

In many predictive modeling problems, the target level of interest occurs rarely relative to other target levels. For example, in a data set of 100,000 cases, only 1,000 may contain an interesting value of the target. A widespread predictive modeling practice in this situation creates a training data set with cases sampled separately from each target level. When the number of interesting and rare cases is fixed, all such cases are selected. Then, separately, a simple random sample of common cases is added. The size of the common sample should be at least as large as the rare sample and is frequently several times larger. The basic idea is that training data with three or four times as many common cases as rare cases produces a model just as predictive as training data with 30 to 40 times as many common cases.

Often this practice can help predictive models better detect the rare levels, especially when the total number of target levels exceeds two. When separately sampled, more weight is given to the rare cases during model construction, increasing a model’s sensitivity to the rare level. It also places fewer demands on the computer used to build the models, speeding model processing.

Unfortunately, the predictions made by any model fit with separately sampled training data are biased. As seen on the next slide, this is easily corrected. More troublesome is the increased variability of the models built from the separately sampled data. The more flexible the model, the worse this problem can be. On the other hand, it may be unwise to use highly flexible models in the presence of a rare target level; the effective sample size is much closer to the number of cases with the rare level than it is to the total number of available cases. Flexible models built from actually or effectively small training samples typically show poor generalization.


Adjusting Predictions


Model predictions are easily adjusted to compensate for separate sampling. The only requirement is prior knowledge of the proportions of the target levels in the population from which the cases are drawn.

Within a given target level, each case in the training data corresponds to a certain number of cases in the population. Predictions about the population can be obtained from predictions based on the training data by adjusting for this correspondence.

For example, consider a target with two levels, 0 and 1. The proportion of cases with target level 1 in the actual population can be obtained from the proportion of cases with target level 1 in the training sample using the formula

p1 = (π1 p̃1 / ρ1) / (π0 p̃0 / ρ0 + π1 p̃1 / ρ1)

where

p1 is the population proportion of level 1.

p̃0 and p̃1 are the training proportions of levels 0 and 1, respectively.

π0 and π1 are the overall proportions of level 0 and 1 in the population (called the population prior).

ρ0 and ρ1 are the overall proportions of level 0 and 1 in the training data.
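Applied numerically, the adjustment looks like this. This is a minimal sketch; Enterprise Miner applies the correction automatically once priors are specified in the target profile.

```python
def adjust_for_priors(p1_train, pi0, pi1, rho0, rho1):
    """Map a training-based probability of level 1 back to the
    population scale:
        p1 = (pi1*p1~/rho1) / (pi0*p0~/rho0 + pi1*p1~/rho1)."""
    p0_train = 1.0 - p1_train
    num = pi1 * p1_train / rho1
    return num / (pi0 * p0_train / rho0 + num)

# A case scored at the 25% training base rate maps back to the 5%
# population prior (priors 0.05/0.95, training mix 0.25/0.75).
p1 = adjust_for_priors(0.25, pi0=0.95, pi1=0.05, rho0=0.75, rho1=0.25)
```

A case scored above the training base rate maps to a population probability above the prior, so the ranking of cases is preserved by the adjustment.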


Specifying Population Priors

The 97NK raw data has a 25% overall donor proportion. This was achieved by separately sampling the 95,412-case donor population data. First, all 4,843 cases (about 5%) with TARGET_B=1 were selected. Then, for each case with TARGET_B=1, three cases with TARGET_B=0 were randomly chosen from the population data. This resulted in a raw analysis data set with 19,372 cases.

The probability estimates in the decision tree were based on the separately sampled training data. Given prior knowledge of the overall population donor proportion, Enterprise Miner can adjust these estimates to reflect the true population.

Specification of this prior knowledge occurs in the Input Data Source node.

1. View the Diagram Workspace by closing any open results window.

2. Open the Input Data Source node.

3. Select the Variables tab.

4. Right-click on TARGET_B in the Name column and select Edit Target Profile…. A dialog box opens.

5. Select Yes. The Target Profiles for TARGET_B window opens.

Enterprise Miner uses target profiles to store target metadata, such as the population, or prior, target level proportions. A target profile is keyed to a data set and a target variable name. Edit the prior information for TARGET_B in the PVA_RAW_DATA.

1. Select the Prior tab.


Currently None (no prior) is selected.

2. Right-click in the white rectangle below the word *None and select Add from the pop-up menu. A new, editable prior called Prior Vector is added to the list of available priors.

3. Select Prior Vector. Right-click and select Set to use from the pop-up menu. The selection asterisk * moves from None to Prior Vector.

4. Type PVA Prior Vector in the Name field and press Enter. The name in the prior vector list changes.

When typing in an Enterprise Miner text field, it is always a good idea to press Enter after typing. Failure to do so sometimes results in Enterprise Miner ignoring the information you typed.

5. Change the prior probability for TARGET_B=1 from 0.25 to 0.05.

6. Change the prior probability for TARGET_B=0 from 0.75 to 0.95.

The completed changes should appear as shown.

7. Close the Target Profiles for TARGET_B window and save the changes.

8. Close the Input Data Source window and save the changes.

Enterprise Miner now adjusts all model predictions to conform to the specified prior. You should expect to see donation proportions on the order of 5% instead of 25%. This will have some surprising consequences.

1. Run the Tree node and view the results.


The simplest tree with the highest validation accuracy is simply the root node. Inspection of the Summary Table reveals that the target level 0 has been predicted for all cases. From the Assessment Table, such a prediction results in an accuracy of 95%.

2. Open the Tree Diagram window.

The root node shows the proportion of donors in the training and validation data. The proportion conforms to the prior probabilities defined in the target profiler. Note that both the proportions and the counts have been adjusted to match the population.

3. Adjust the location of the Tree Diagram window so that it is possible to view it and the Tree – Results window simultaneously.

4. Select the second row of the Assessment Table in the Tree – Results window. The Tree Diagram window updates to show a two-leaf tree.


The data is split on the input RECENT_RESPONSE_COUNT. The donation proportion in the right branch is nearly twice that of the left branch.

5. Select the third row of the Assessment Table in the Tree – Results window. The Tree Diagram window updates to show a three-leaf tree. (Scroll to view the added leaves).

The cases with high RECENT_RESPONSE_COUNT are split according to LAST_GIFT_AMT. Cases with LAST_GIFT_AMT < $9.50 have a donation proportion around 9%.

The donor proportions exhibited in the tree are considerably lower than those before prior adjustment. These proportions, however, now accurately reflect the true donation proportions in the population of PVA contributors.

For each level of complexity, the tree models appear to have the same level of accuracy. Are none of the models better at predicting response than a model that predicts no donors? Clearly the two- and three-leaf trees find subgroups with different donation proportions. Yet, the donation proportion in all of these subgroups is so small that the most accurate prediction is always no donation.

The problem here lies not with the Tree models, but with the mechanism used to assess them. Clearly, one donation can pay for many solicitations. Assessing a model on accuracy alone fails to incorporate this fact. To properly tune a predictive model requires attention to the decisions that will be made using the model and their associated profitability.


1.5 Making Optimal Decisions

[Slide: Decisions → Profits — a table assigns a profit P_dl to each combination of target level and decision; the expected profit E(Profit) of each decision is computed and the larger is chosen.]

Predictive models are most valuable when they are used to decide a course of action. This requires quantifying the decision consequences in terms of a profit or cost.

Given the true value of the target for a case, it is usually straightforward to evaluate the profit resulting from a decision. For example, soliciting a certain donor who always contributes $5 results in a net profit of $5 minus the package cost. Ignoring this individual results in zero profit. Similarly, soliciting a donor who always ignores such offers results in a net loss equal to the package cost. Because of the certainty of these two cases, deciding the best course of action is obvious.

Of course, predictions are typically far from certain. Deciding the best course of action for a case scored by a predictive model requires calculating the expected profit of each decision alternative. Assuming a constant profit P_dl for each target level l, the expected profit for decision alternative d is given by E(Profit_d) = Σ_l p_l P_dl, where p_l is the probability of target level l. The optimal decision corresponds to the highest expected profit.
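The expected-profit calculation can be sketched as follows. The numbers are hypothetical: the $4.32 entry is the $5 donation minus the $0.68 package cost from the example above.

```python
def best_decision(probs, profit):
    """probs: {level: probability}.  profit: {decision: {level: profit}}.
    Compute E(Profit_d) = sum over levels of p_l * P_dl, and return
    the decision with the largest expected profit."""
    expected = {d: sum(probs[lev] * row[lev] for lev in probs)
                for d, row in profit.items()}
    return max(expected, key=expected.get), expected

profit = {"solicit": {1: 4.32, 0: -0.68}, "ignore": {1: 0.0, 0: 0.0}}
decision, expected = best_decision({1: 0.2, 0: 0.8}, profit)
# E(solicit) = 0.2*4.32 - 0.8*0.68 = 0.32, which exceeds E(ignore) = 0.
```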


[Slide: Making Optimal Decisions — the expected profits of the two decisions are plotted as lines in E(Profit) vs. p over [0,1]; their crossing point is marked as the Decision Threshold.]

With two target levels and the simple profit structure considered here, the expected profits vary linearly with target level probability. At some point on the range [0,1], the expected profits are equal. The probability associated with this point is known as the decision threshold.
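The threshold can be found by setting the two expected-profit lines equal and solving for p, as in this sketch of the two-decision, two-level structure described above:

```python
def decision_threshold(p_d1, p_d0, p_e1, p_e0):
    """Probability of level 1 at which decisions d and e break even:
    solve p*P_d1 + (1-p)*P_d0 = p*P_e1 + (1-p)*P_e0 for p."""
    return (p_e0 - p_d0) / ((p_d1 - p_d0) - (p_e1 - p_e0))

# Accuracy rule (profit 1 for a correct decision, 0 otherwise):
# the threshold falls at 0.5.
theta = decision_threshold(1.0, 0.0, 0.0, 1.0)
```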

[Slide: Overall Average Profit — each cell of the profit matrix is paired with a count n of validation cases, and Average Profit = Σ n_dl P_dl / N.]

The worth of a predictive model can be appraised by calculating the overall average profit on a set of validation data. Using the profit structure defined above, the overall average profit equals Σ n_dl P_dl / N, where n_dl is the number of cases subject to decision d and having level l, and N is the total number of cases in the validation data.
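As a small worked example, with invented counts for 1,000 validation cases and the solicit/ignore profit entries used in this section:

```python
def average_profit(counts, profit):
    """counts[d][lev]: number of validation cases given decision d
    whose true target level is lev.  Returns sum(n_dl * P_dl) / N."""
    total = sum(counts[d][lev] * profit[d][lev]
                for d in counts for lev in counts[d])
    n_cases = sum(counts[d][lev] for d in counts for lev in counts[d])
    return total / n_cases

# Hypothetical counts for a solicit/ignore rule on 1,000 cases.
counts = {"solicit": {1: 40, 0: 360}, "ignore": {1: 10, 0: 590}}
profit = {"solicit": {1: 14.62, 0: -0.68}, "ignore": {1: 0.0, 0: 0.0}}
avg = average_profit(counts, profit)  # (40*14.62 - 360*0.68) / 1000
```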


[Slide: Example: Accuracy Rule — profit 1 for a correct decision and 0 otherwise; the expected profits are p and 1 - p, and the larger is chosen.]

Accuracy, the most obvious measure of a model’s worth, is a special case of the general profit structure defined on the previous slide. Correct decisions are rewarded with a 1-unit profit. Incorrect decisions receive zero profit. The expected profit for a decision simply equals the probability of the corresponding target level. Thus, the best decision for each case corresponds to the most likely target value.

[Slide: Example: Accuracy Rule Profit — with the accuracy profit structure, Average Profit = (n_00 + n_11) / N, the proportion of correctly decided cases.]

The overall average profit is the total number of correctly decided cases divided by the total number of cases. This is the definition of accuracy. So, maximizing overall average profit with this simple profit structure is identical to maximizing accuracy.


[Slide: Example: Extreme Decision Rules — the entire distribution of predicted target values lies below the 0.5 threshold, so every case receives the same decision and Average Profit = n / N = π, the prior proportion.]

Note that, for the accuracy profit structure, the decision threshold occurs at 0.5. As you saw in the previous demonstration, such a large threshold may result in the same decision for all cases, especially when the probability of one target level is small. This is an example of an extreme decision rule. The overall average profit for extreme decision rules is completely determined by the prior target level probabilities, so the predictive model contributes no information about the association between the inputs and target. This is true even if the model is correctly specified and correctly estimated. The problem lies not with the model, but with the decision rule. In short, when one target level is rare, predictive accuracy is an inappropriate model appraisal method.

[Slide: Example: Conforming Decision Rules — with profits 3 (solicit a responder), -1 (solicit a non-responder), and 0 (ignore), the expected profits 3p₁ - p₀ and 0 are compared; the resulting threshold falls within the distribution of predicted target values.]

For predictive models to be interesting and useful, the decision threshold should be similar in value to the predicted probabilities for the primary target level. When this is the case, the profit structure defines a conforming decision rule. With a conforming decision rule, each decision alternative pertains to some cases.


When an accurate profit structure is not known, it is better to evaluate a model with a conforming decision rule than a potentially extreme decision rule like accuracy. A diagonal profit matrix with 1/πl, the inverse of the prior proportion for each target level, on the main diagonal usually assigns some cases to each decision alternative. For a two-level target, this diagonal profit matrix yields a decision threshold equal to the population prior.
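For a two-level target, the claim that the diagonal matrix diag(1/π0, 1/π1) puts the decision threshold at the prior is easy to verify numerically:

```python
def diagonal_threshold(pi1):
    """Threshold implied by profits 1/pi1 (decide 1, true level 1)
    and 1/pi0 (decide 0, true level 0), with zeros elsewhere:
    solve p/pi1 = (1 - p)/pi0 for p."""
    pi0 = 1.0 - pi1
    return (1.0 / pi0) / (1.0 / pi0 + 1.0 / pi1)

theta = diagonal_threshold(0.05)  # equals the prior, 0.05
```

Algebraically, p/π1 = (1 - p)/π0 gives p·π0 = (1 - p)·π1, so p = π1: the threshold always equals the population prior of the primary level.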

[Slide: Example: Conforming Rule Profit — with the 3/-1/0 profit structure, Average Profit = (3n - n) / N, where the two counts are the solicited responders and solicited non-responders; only solicited cases contribute.]

For a conforming decision rule, some cases are assigned to each decision alternative. When one of the columns of the profit matrix is entirely zeros, only the cases with predicted probabilities in excess of the profit threshold contribute to the overall average profit. The remaining cases are effectively ignored.

[Slide: Example: Extreme Decision Rules — with profits 99 (solicit a responder), -1 (solicit a non-responder), and 0 (ignore), the expected profits 99p₁ - p₀ and 0 are compared; the threshold lies below the entire distribution of predicted target values.]

It is possible to have an extreme decision rule even when the decision profits are correctly specified. For example, if the average donation amount is much larger than the package cost, the resulting decision rule may also be extreme. In this situation, a single donation pays for many solicitations. The best decision is to solicit everyone!


[Slide: Example: Extreme Rule Profit — every case is solicited, so Average Profit = (99n₁ - n₀) / N = 99π₁ - π₀, a function of the target priors alone.]

With extreme decision rules, the utility of any predictive model is limited. The decision threshold is so small that all cases have a predicted probability in excess of the threshold. The overall average profit is determined entirely by the prior probabilities of the target.

The profit structure for a model should be defined when the analytic objective is defined. By carefully considering the profit consequences of modeling decisions and comparing the resulting decision thresholds to the prior target level proportions, you can identify an extreme decision rule before any modeling occurs. A large discrepancy between the estimated decision threshold and the target level priors may suggest there is little reason for building a model: the model will not affect the optimal decision for your entire modeling population.


Examining Response Amount

To properly tune and assess the decision tree model (or any predictive model), you must correctly define a decision profit matrix. In the PVA model, two decisions are possible: solicit for donation and ignore. The decision to solicit results in a profit equal to the donation amount less the package cost ($0.68) for TARGET_B=1 and a loss equal to $0.68 for TARGET_B=0. The decision to ignore results in zero profit, regardless of the true target value.

The variable TARGET_D records the donation amounts for those who responded to the 97NK campaign. Unfortunately, the value of TARGET_D will be unknown when deciding the appropriate action for a case. Like donation propensity, however, it can be predicted from the training data. Because the current focus is the donation propensity model, construction of a sophisticated donation amount model is deferred to the course Advanced Predictive Modeling Using Enterprise Miner. For now, the simplest of donation amount models will suffice: the expected or average value of TARGET_D where TARGET_B=1.

In the course of building a predictive model, you must frequently consult the training data for guidance. Here you would like to estimate the expected value of TARGET_D. The most convenient way to explore the training data in Enterprise Miner is with the Insight tool.

1. Close the Tree Diagram window and Results-Tree window.

2. Drag-and-drop an Insight tool below the Data Partition node.

3. Drag an arrow from the Data Partition node to the Insight node.

4. Double-click the Insight node. The Insight Settings window opens.


5. Select the Entire data set button.

Be cautious when selecting the Entire data set option. By default, Insight provides a 2000-case sample of the training data. You have just changed this default. When run, the Insight node reads the entire training data set into the memory of the client computer. When the training data set is large or is not already on the client machine, this can result in significant processing delays.

6. Select the toolbar button. This prepares the training data for viewing with Insight.

7. Select Yes to view the results. A spreadsheet view of the training data opens.

Insight provides a wealth of features for data exploration. Here you will use it to explore the distribution of TARGET_D.

1. Select TARGET_D in the spreadsheet.

2. From the SAS menu bar, select Analyze → Distribution (Y). An Insight Distribution window opens.


The Distribution window contains a thorough summarization of the TARGET_D distribution. At the top is an outlier box plot, useful for identifying common and uncommon values of TARGET_D. Next comes a histogram, followed by a table of moments, and finally a table of quantiles.

The outlier box plot shows several cases with unusually high donation amounts (the black dots to the right of the green bar). The Moments table shows a mean donation amount of $15.30 in the training data. The Quantiles table shows a median donation amount of $13. The difference in these two measures of centrality is consistent with the skewed distribution displayed in the histogram (and the large value of the skewness statistic in the moments table). While the amount $13 is more representative of a typical donation, the average donation of $15.30 is used in the profit matrix, shown below. This is consistent with the definition of expected profit used in Enterprise Miner.

                         Decision
                     solicit    ignore
  TARGET_B    1       14.62       0
              0       -0.68       0

The value of $14.62 is the expected donation amount $15.30 minus the $0.68 package cost. If 14.62p1 – 0.68p0 > 0, then the most profitable decision is solicit, where p1 and p0 = 1 – p1 are the model-predicted probabilities of response. You can solve this equation for p1 to obtain the decision threshold θ = 0.0444. Any case with a donation probability in excess of 0.0444 receives a solicitation. This is close to the prior donation proportion of 0.05, so the defined decision rule seems to be conforming.
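The threshold arithmetic can be double-checked in a couple of lines:

```python
# Break-even: 14.62*p1 - 0.68*(1 - p1) = 0, i.e. 15.30*p1 = 0.68,
# so the threshold is the package cost over the expected donation.
expected_donation = 15.30
package_cost = 0.68
theta = package_cost / expected_donation  # about 0.0444
conforming = theta < 0.05  # below the 5% prior: a conforming rule
```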


Defining a Profit Matrix

Because the profit matrix, like the prior vector, affects many nodes in Enterprise Miner, its definition occurs in the Input Data Source node.

1. Close the Insight data table and the Insight Settings window.

2. Open the Input Data Source node.

3. Select the Variables tab.

4. Right-click on the TARGET_B variable and select Edit Target Profile. The Target Profiles window opens.

5. Select the Assessment Information tab.

6. Right-click in the white rectangle below the word Default loss and select Add from the pop-up menu. A new, editable profit matrix called Profit Matrix is added to the list of available assessment measures.

7. Select Profit Matrix. Right-click and select Set to use from the pop-up menu. The selection asterisk * is moved from Profit vector to Profit Matrix.

8. Type PVA Profit Matrix in the Name field and press Enter. The name in the assessment measures list changes.


The currently defined profit matrix corresponds to the accuracy rule. To change this to a conforming profit matrix, make the following changes:

1. Type 14.62 in the upper-left cell of the profit matrix.

2. Type –0.68 in the lower-left corner of the profit matrix.

3. Type 0 in the lower-right corner of the profit matrix. The completed changes should appear as shown below.

4. Close the Target Profiles window and save the changes.

5. Close the Input Data Source window and save the changes.


Training a Profitable Tree Model

With an appropriate profit matrix defined, refit the tree model.

1. Right-click the Tree node and select Run.

2. View the modeling results. The Results-Tree window opens.

The Assessment Table and Assessment Plot now display overall average profit instead of accuracy. The selected 11-leaf tree model has an overall average profit of $0.1674 per case on the training data and $0.1355 on the validation data. While the overall average profits seem small, you must remember that PVA has over 3.5 million names in its database. The $0.1355 overall average profit translates to nearly $500,000 when applied to these cases.

3. View the Tree Diagram.

The Tree Diagram window, adjusted slightly to better show results, shows the first partition occurring on RECENT_RESPONSE_COUNT. For those cases with fewer than three recent responses, the proportion of donors is less than


or equal to 4.0% on both training and validation data. Profit and decision calculations are presented at the bottom of the node. Taking decision 1, solicit, results in an expected loss of $0.1173 and $0.0754 on the training and validation data respectively. Taking decision 0, ignore, results in 0 profit. Although there is no profit to be gained, the best decision for individuals in the node is ignore. Similarly, the best decision for the right node with three or more recent responses is solicit. It should be noted that both nodes are subpartitioned, so the best decision in each node serves only to illustrate the profit-based decision concept.

4. Scroll the tree to find high profit nodes and no-profit nodes.

5. When you are done exploring the tree model, close the Tree Diagram and Tree-Results windows.


1.6 Parametric Prediction

[Slide: Parametric Models — training data determine the relation E(Y | X=x) = g(x;w); the generalized linear model special case g⁻¹(E(Y | X=x)) = w0 + w1x1 + … + wpxp is illustrated.]

Nearest neighbor and recursive partitioning models are both very general predictive modeling techniques that make few structural assumptions about the relationship between inputs and target. In contrast, parametric models, an alternative class of techniques, often make very strong assumptions about this relationship. This limits their susceptibility to the curse of dimensionality.

In a parametric model, the expected value of the target Y is related to the inputs x=(x1, x2,…, xp) via the relation E(Y | X=x) = g(x,w), where w=(w0, w1,…,wd) is a vector of parameters. Generally, the number and values of elements in the parameter vector modulate the model complexity.

A simple parametric modeling form restricts variation of the target’s expected value to a single direction in the input space. This direction is defined by the vector w. This modeling form, often written g-1(E(Y | X=x)) = w0 + w1x1 + w2x2 + … + wpxp is called a generalized linear model. Specifying a link function, g-1, a distribution for Y and likely values for w given the training data determines the entire model.


[Slide: Logistic Regression Models — the logit link, logit(p) = log(p/(1-p)), equates the log odds to w0 + w1x1 + … + wpxp; the S-shaped plot of p against logit(p) is shown.]

The primary purpose of the link function is to match the expected target value range to the range of w0 + w1x1 + w2x2 + … + wpxp, (-∞, ∞). For example, the range of the expected value of a binary target is [0,1]. An extremely useful link function for binary targets is the logit function, g-1(p) = log (p/(1–p)). Because the expected value of a binary target is P(Y=1), or simply p, the ratio p/(1-p) is simply the odds of Y=1. In words, the logit equates the log odds of Y=1 to a linear combination of the inputs.

A generalized linear model with a binary target and a logit link is called a logistic regression model. It assumes the odds change monotonically in the direction defined by w. Because the odds change in a single direction over the entire input space, the decision boundary for standard logistic regression models is a (hyper-)plane perpendicular to w.
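Scoring with a logit link can be sketched in a few lines. The weights below are hypothetical, not a fitted PVA model:

```python
import math

def logit(p):
    """Link function: log odds of p."""
    return math.log(p / (1.0 - p))

def predict(w, x):
    """Invert the logit link to recover p from the linear predictor
    w0 + w1*x1 + ... + wp*xp."""
    eta = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-eta))

# A zero linear predictor means log odds 0, i.e. even odds, p = 0.5.
p = predict([0.0, 1.0, -1.0], [2.0, 2.0])
```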

[Slide: Changing the Odds — replacing x1 with x1+1 changes log(p/(1-p)) from w0 + w1x1 + … + wpxp to w1 + w0 + w1x1 + … + wpxp, so the odds change by the factor exp(w1), the odds ratio.]

The simple structure of the logistic regression model readily lends itself to interpretation. A unit change in an input xi changes the log odds by an amount equal to the corresponding parameter wi. Exponentiating shows the unit change in xi changes the odds by a factor exp(wi). Factor exp(wi) is called the odds ratio because


it equals the ratio of the new odds after a unit change in xi to the original odds before a unit change in xi.
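Numerically, with hypothetical parameters w0 = -2 and w1 = 0.4, a unit change in x1 multiplies the odds by exp(0.4) no matter where the change starts:

```python
import math

w0, w1 = -2.0, 0.4  # hypothetical logistic regression parameters

def odds(x1):
    """Odds of Y=1 at input x1: exponentiating the log odds
    w0 + w1*x1 gives exp(w0 + w1*x1)."""
    return math.exp(w0 + w1 * x1)

ratio = odds(3.0) / odds(2.0)  # the odds ratio, exp(w1)
```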


Building a Logistic Regression Model

The principles behind non-parametric and parametric predictive models are different. You will now see that the results provided by the models are also different.

1. Add a Regression node to the process flow diagram. Place it beneath the Tree node.

2. Draw an arrow from the Data Partition node to the Regression node.

3. Run the diagram from the Regression node and view the results. The Results – Regression window opens.

The default plot shows the relative significance of the model parameter estimates. You return to this window later.

4. Select the Output tab. The top of the model output report is displayed.


The Output report provides complete information about the regression model just fit. The two numbers at the bottom of the window are the most conspicuous.

The Number of Model Parameters equals 115. A standard regression model contains one parameter for each input. The training data contains only 50 variables, including the two potential targets and the customer ID variable. Where are the extra parameters coming from?

While the training data contains only 47 inputs, some of the inputs are categorical. Encoding a categorical variable in a parametric model requires the creation of indicator or dummy variables. Fully encoding a categorical variable with L levels requires L–1 indicator variables.
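The L-1 encoding can be sketched as follows. This uses simple reference-level coding for illustration; Enterprise Miner's actual coding scheme may differ.

```python
def dummy_encode(value, levels):
    """Encode one categorical value as L-1 indicators, treating the
    last level as the all-zeros reference category."""
    return [1 if value == lev else 0 for lev in levels[:-1]]

levels = ["A", "B", "C"]          # L = 3 levels -> 2 indicators
a = dummy_encode("A", levels)     # [1, 0]
c = dummy_encode("C", levels)     # [0, 0], the reference level
```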

A Number of Model Parameters other than 115 indicates a misspecification of one or more input measurement scales. You should open the Input Data Source node and change the Measurement of any ordinal input to interval.

More surprising than the Number of Model Parameters is the reported Number of Observations. The training data contains more than 9,000 cases, yet the model estimation process recognizes only 3,539 of this total. Where are the missing cases?


Managing Missing Values

To solve the mystery of the missing values, look again at the training data using the Insight node.

1. Close the Results – Regression window.

2. Run the Insight node and view the results.

3. Scroll the window to view the DONOR_AGE column.

Only about 75% of the rows for DONOR_AGE contain measurements. The rest contain missing values.

The parametric models in Enterprise Miner use a case for model estimation only if it is complete (that is, it has no missing values in the model inputs). Only 3,539 of the cases in the training data are complete. There are several ways to proceed:

• Do nothing. If there are very few cases with missing values, this is a viable option. The difficulty with this approach comes when the model must predict a new case containing a missing value. Omitting the missing term from the parametric equation usually produces an extremely biased prediction.

• Impute a synthetic value for the missing value. For example, if an interval input contains a missing value, replace the missing value with the mean of the non-missing values for the input. This eliminates the incomplete case problem, but modifies the input’s distribution. This can bias the model predictions.

Making the missing value imputation process part of the modeling process allays the modified distribution concern. Any modifications made to the training data are also made to the validation data and the remainder of the modeling population. A model trained with the modified training data will not be biased if the same modifications are made to any other data set the model may encounter (and the data has a similar pattern of missing values).


• Create a missing indicator for each input in the data set. Cases often contain missing values for a reason. If the reason for the missing value is in some way related to the target variable, useful predictive information is lost. The missing indicator is 1 when the corresponding input is missing and 0 otherwise. Each missing indicator becomes an input to the model. This allows modeling of the association between the target and a missing value on an input.
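The impute-and-flag strategy for an interval input can be sketched in plain Python, with None standing in for a missing value. This is an illustration of the idea, not the Replacement node's implementation:

```python
def impute_with_indicator(values):
    """Replace missing values (None) with the mean of the observed
    values; return the imputed column and a 0/1 missing indicator."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    imputed = [mean if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator

# Hypothetical DONOR_AGE column with two missing entries.
ages, flags = impute_with_indicator([40.0, None, 60.0, None])
```

The indicator column preserves the fact that a value was missing, so any association between missingness and the target remains available to the model.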

To address missing values in the 97NK model, impute synthetic data values and create missing value indicators.

1. Insert a Replacement node between the Data Partition node and the Regression node.

2. Open the Replacement node. The Replacement window opens.

3. Select Create imputed indicator variables. The Replacement node adds 47 missing indicators to the training data.

4. Select the menu button for Role and select Input from the list.

5. Close the Replacement window.

The defaults of the Replacement node
• (for interval inputs) replace any missing values with the mean of the non-missing values
• (for categorical inputs) replace any missing values with the most frequent category.

1. Run the Replacement node and view the results. The Results – Replacement window opens.


2. Scroll the window until the DONOR_AGE column is visible. A gray background indicates replaced values.

Any predictive modeling node following the Replacement node uses this modified training data. Identical modifications have been made to the validation data.

3. Scroll the window to view the column labeled Imputed indicator for DONOR_AGE.

A total of 47 new missing indicator variables have been added to the training data set. Any predictive modeling node following the Replacement node has these new inputs available for model construction.

Only those inputs with missing values have indicators with the model role set to input in subsequent nodes. The remainder have the model role set to rejected.

4. Close the Results – Replacement window.

With all missing values imputed, the entire training data set is available for building the logistic regression model.

1. Run the Regression node and view the results.

2. Select the Output tab.


The number of parameters has increased from 115 to 119. (Four missing indicators have been added to the model.) The Number of Observations has increased from 3,539 to 9,685, the total number of cases in the training data.

3. Select the Statistics tab and scroll to the bottom of the table.

The row labeled Average Profit for TARGET_B rates the performance of the model on the Training and Validation data. The average profit per case is $0.1938 on the training data and $0.1501 on the validation data. This is an increase from $0.1674 and $0.1355, respectively, calculated for the Tree model.

4. Select the Output tab again and scroll to the Type III Analysis of Effects.

The Type III Analysis tests the statistical significance of adding the indicated input to a model already containing other listed inputs. Roughly speaking, a value near 0 in the Pr > Chi-Square column indicates a significant input; a value near 1 indicates an extraneous input.


Many of the Pr > Chi-Square values are closer to 1 than they are to 0. This is evidence that the model contains many extraneous inputs. Inclusion of extraneous inputs can lead to overfitting and reduced performance due to the curse of dimensionality. It is desirable to tune the model to include only relevant inputs.


1.7 Tuning a Parametric Model

[Slide: Forward Selection. Input p-values plotted by step against the entry cutoff, with training and validation profit curves.]

Parametric models are tuned by varying the number and values of model parameters. For logistic regression models, choosing the number of parameters is equivalent to choosing the number of model inputs. Thus, optimally tuning a logistic regression model requires selecting an optimal subset of the available inputs and supplying reasonable estimates of their corresponding parameters.

One way to find the optimal set of inputs is to simply try every combination. Unfortunately, the number of models to consider using this approach increases exponentially in the number of available inputs. Such an exhaustive search is impractical for realistic prediction problems.

An alternative to the exhaustive search is to restrict the search to a sequence of improving models. While this may not find the single best model, it is commonly used to find models with good predictive performance. The Regression node in Enterprise Miner provides three sequential selection methods.

Forward selection creates a sequence of models of increasing complexity. The sequence starts with the baseline model, a model predicting the overall average target value for all cases. The algorithm searches the set of one-input models and selects the model that most improves upon the baseline model. It then searches the set of two-input models that contain the input selected in the previous step and selects the model showing the most significant improvement. By adding a new input to those selected in the previous step, a nested sequence of increasingly complex models is generated. The sequence terminates when no significant improvement can be made.

Improvement is quantified by the usual statistical measure of significance, the p-value. Adding terms in this nested fashion always increases a model's overall fit statistic. By calculating the change in the fit statistic and assuming the change conforms to a chi-squared distribution, a significance probability, or p-value, can be calculated. A large change in the fit statistic (corresponding to a large chi-squared value) is unlikely to be due to chance. Therefore, a small p-value indicates a significant improvement. When no p-value is below a predetermined entry cutoff, the forward selection procedure terminates.

Validation profit determines the best model in the forward selected sequence. For large training data sets, this is often different from the last model in the sequence.
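The forward-selection control flow described above can be sketched in Python (illustrative only; the input names are toy stand-ins, and the p-value function is a placeholder for the chi-squared test on the change in model fit that the real procedure computes):

```python
def forward_select(candidates, p_value_of_adding, entry_cutoff=0.05):
    """Greedily add the input whose addition has the smallest p-value,
    stopping when no p-value falls below the entry cutoff. Returns the
    nested sequence of models, starting from the baseline (no inputs)."""
    selected = []                      # the baseline model has no inputs
    sequence = [list(selected)]
    remaining = list(candidates)
    while remaining:
        # p-value for adding each remaining input to the current model
        scores = {x: p_value_of_adding(selected, x) for x in remaining}
        best = min(scores, key=scores.get)
        if scores[best] >= entry_cutoff:
            break                      # no significant improvement: terminate
        selected.append(best)
        remaining.remove(best)
        sequence.append(list(selected))
    return sequence

# Toy p-values standing in for the chi-squared significance tests:
toy_p = {"FREQ_97NK": 0.0001, "PEP_STAR": 0.001, "AGE": 0.40}
seq = forward_select(toy_p, lambda model, x: toy_p[x])
# seq -> [[], ['FREQ_97NK'], ['FREQ_97NK', 'PEP_STAR']]  (AGE never enters)
```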

[Slide: Backward Selection. Input p-values plotted by step against the stay cutoff, with training and validation profit curves.]

In contrast to forward selection, backward selection creates a sequence of models of decreasing complexity. The sequence starts with a saturated model, a model that contains all available inputs and, therefore, has the highest possible fit statistic. Inputs are sequentially removed from the model. At each step, the input chosen for removal least reduces the overall model fit statistic. This is equivalent to removing the input with the highest p-value. The sequence terminates when all remaining inputs have a p-value in excess of the predetermined stay cutoff.

As with the forward selection method, validation profit determines the best model in the backward selected sequence.

[Slide: Stepwise Selection. Input p-values plotted by step against the entry and stay cutoffs, with training and validation profit curves.]


Stepwise selection combines elements from both the forward and backward selection procedures. The method begins like the forward procedure, sequentially adding the input with the smallest p-value below the entry cutoff. After each input is added, however, the algorithm re-evaluates the statistical significance of all included inputs. If the p-value of any of the included inputs exceeds the stay cutoff, the input is removed from the model and returned to the pool of inputs available for inclusion in a subsequent step. The process terminates when all inputs available for inclusion in the model have p-values in excess of the entry cutoff and all inputs already included in the model have p-values below the stay cutoff.

Once more, validation profit determines the best model in the stepwise selected sequence.
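The add-then-re-evaluate loop can be sketched as follows (again illustrative Python with toy p-values, not the DMREG implementation; production implementations also guard against add/remove cycles, which this sketch handles only with a step limit):

```python
def stepwise_select(candidates, p_add, p_stay,
                    entry_cutoff=0.05, stay_cutoff=0.05, max_steps=100):
    """Forward step: add the most significant remaining input.
    Backward step: drop any included input no longer significant,
    returning it to the candidate pool."""
    selected, remaining = [], list(candidates)
    for _ in range(max_steps):
        if not remaining:
            return selected
        best = min(remaining, key=lambda x: p_add(selected, x))
        if p_add(selected, best) >= entry_cutoff:
            return selected            # nothing left to add: terminate
        selected.append(best)
        remaining.remove(best)
        for x in list(selected):       # re-evaluate every included input
            if p_stay(selected, x) > stay_cutoff:
                selected.remove(x)
                remaining.append(x)
    return selected

# Toy scenario: C is significant on its own but becomes extraneous once B
# enters, so C is added early and later removed.
toy_add = {"A": 0.001, "B": 0.010, "C": 0.004, "D": 0.600}
def p_value(model, x):
    if x == "C" and "B" in model:
        return 0.200                   # C is redundant given B
    return toy_add[x]

final = stepwise_select(["A", "B", "C", "D"], p_value, p_value)
# final -> ['A', 'B']
```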


Implementing Stepwise Selection

Implementing a sequential selection method requires a minor change to the Regression node settings.

1. Close the Results – Regression window.

2. Double-click to open the Regression node. The Linear and Logistic Regression window opens.

3. Select the Selection Method tab. The Linear and Logistic Regression window displays sequential selection options.

4. Change the method to stepwise.

5. Close the Linear and Logistic Regression window.

6. Select Yes to save the changes. The Save Model As window opens.

7. Type Stepwise as the model name.

8. Select OK. The PVA Project window opens.

The Regression node is now configured to use stepwise selection to choose inputs for the model.

1. Run the Regression node and view the results.


The default display of the Results – Regression window is a plot summarizing the statistical significance of inputs selected by the stepwise procedure. The model is seen to include nine parameters corresponding to eight inputs. By clicking the bars in the plot, you can identify the associated parameter. The height of the bar indicates parameter significance, and the color of the bar indicates direction of increasing effect. Red indicates an increasing effect; this means that the greater the value of the corresponding input, the more probable a donation. Thus, donation probability increases with increasing 97NK frequency status, income group, median home value, months since first gift, and recent card response proportion. Conversely, donation probability decreases with increasing months since last gift, non-pep-star status, and recent average gift amount.

2. Select the Statistics tab and scroll the display to the bottom.

The average training profit for the stepwise-selected model is lower than it was for the default saturated model. The average validation profit, however, is slightly higher. Better still, this is achieved with a nine-parameter model instead of the 119-parameter saturated model.

3. Select the Output tab and scroll past the general model information.

The stepwise procedure starts with Step 0, an intercept-only regression model. The value of the intercept parameter is chosen so that the model predicts the overall target mean for every case. The parameter estimate and the training data target measurements are combined in an objective function, which is determined by the link function and the error distribution of the target. The value of the objective function for the intercept-only model is compared to the values obtained in subsequent steps for more complex models. A large decrease in the objective function for the more complex model indicates a significantly better model.

Stepwise Selection Procedure

Step 0. Intercept entered:

   The DMREG Procedure
   Newton-Raphson Ridge Optimization Without Parameter Scaling

   Parameter Estimates                 1

   Optimization Start
   Active Constraints                  0
   Objective Function       5445.9412054
   Max Abs Gradient Element 8.082868E-12

   Optimization Results
   Iterations                          0
   Function Calls                      3
   Hessian Calls                       1
   Active Constraints                  0
   Objective Function       5445.9412054
   Max Abs Gradient Element 8.082868E-12
   Ridge                               0
   Actual Over Pred Change             0

   ABSGCONV convergence criterion satisfied.

   Testing Global Null Hypothesis BETA=0

               Intercept   Intercept and
   Criterion   Only        Covariates      Chi-Square for Covariates
   -2 LOG L    10891.882   10891.882       0.000 with 0 DF (p= . )

   Analysis of Maximum Likelihood Estimates

                              Standard   Wald         Pr >
   Parameter   DF   Estimate  Error      Chi-Square   Chi-Square
   Intercept    1   -1.0987   0.0235     2192.14      <.0001

Step 1 adds one input to the intercept only model. The input and corresponding parameter are chosen to produce the largest decrease in the objective function. To estimate the values of the model parameters, the modeling algorithm makes an initial guess for their values. The initial guess is combined with the training data measurements in the objective function. Based on statistical theory, the objective function is assumed to take its minimum value at the correct estimate for the parameters. The algorithm decides whether changing the values of the initial parameter estimates can decrease the value of objective function. If so, the parameter estimates are changed to decrease the value of the objective function and the process iterates. The algorithm continues iterating until changes in the parameter estimates fail to substantially decrease the value of the objective function.
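The iterative fitting just described can be seen in miniature with a Newton-Raphson sketch for the intercept-only model of Step 0 (illustrative Python, not the DMREG optimizer). With 2,421 donors among 9,685 training cases, the fitted intercept should approach logit(0.25) = log(2421/7264) ≈ -1.0987, matching the Step 0 estimate in the listing.

```python
import math

def fit_intercept(y, tol=1e-10, max_iter=50):
    """Newton-Raphson for the intercept of a logistic model:
    update b by gradient/Hessian of the log-likelihood until the
    step size falls below tolerance (the convergence criterion)."""
    b = 0.0                                # initial guess
    n = len(y)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + math.exp(-b))     # predicted probability
        gradient = sum(y) - n * p          # d(logL)/db
        hessian = -n * p * (1.0 - p)       # d2(logL)/db2
        step = gradient / hessian
        b -= step                          # Newton update
        if abs(step) < tol:
            break
    return b

y = [1] * 2421 + [0] * 7264                # 25% donors, as in the course data
b = fit_intercept(y)                       # b converges to about -1.0987
```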


Step 1. Effect FREQUENCY_STATUS_97NK entered:

   The DMREG Procedure
   Newton-Raphson Ridge Optimization Without Parameter Scaling

   Parameter Estimates                 2

   Optimization Start
   Active Constraints                  0
   Objective Function       5445.9412054
   Max Abs Gradient Element 443.49853726

                                      Actual    Max Abs              Actual
        Rest  Func  Act  Objective    Obj Fun   Gradient             Over Pred
   Iter arts  Calls Con  Function     Change    Element      Ridge   Change
      1    0     2    0  5349         97.4037   48.2426          0   0.967
      2    0     3    0  5348         0.7691    0.2798           0   1.004
      3    0     4    0  5348         0.000027  9.993E-6         0   1.000

   Optimization Results
   Iterations                          3
   Function Calls                      6
   Hessian Calls                       5
   Active Constraints                  0
   Objective Function       5347.7683165
   Max Abs Gradient Element 9.9932396E-6
   Ridge                               0
   Actual Over Pred Change  1.0000203684

   GCONV convergence criterion satisfied.

The output next compares the model fit in step 1 with the model fit in step 0. The objective functions of both models are multiplied by two and differenced. The difference is assumed to have a chi-square distribution with 1 degree of freedom. The hypothesis that the two models are identical is tested. A large value for the chi-square statistic makes this hypothesis unlikely.

   Testing Global Null Hypothesis BETA=0

               Intercept   Intercept and
   Criterion   Only        Covariates      Chi-Square for Covariates
   -2 LOG L    10891.882   10695.537       196.346 with 1 DF (p<.0001)
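The likelihood-ratio test above is easy to reproduce by hand: difference the -2 log L values and convert the resulting chi-square statistic (1 degree of freedom) to a p-value. For 1 df, P(chi2 > x) = erfc(sqrt(x/2)), so no statistics library is needed (a sketch, not course code):

```python
import math

neg2_loglik_step0 = 10891.882   # intercept only
neg2_loglik_step1 = 10695.537   # intercept + FREQUENCY_STATUS_97NK

# Chi-square statistic: about 196.35, matching the listing up to rounding.
chi_square = neg2_loglik_step0 - neg2_loglik_step1

# Tail probability for a chi-square with 1 df, via the identity
# P(chi2_1 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2.0))
# p_value is vanishingly small, far below any entry cutoff,
# so the input enters the model.
```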


Next, the output summarizes an analysis of the statistical significance of individual model effects. For the one-input model, this is similar to the global significance test above.

   Type III Analysis of Effects

                                      Wald         Pr >
   Effect                    DF       Chi-Square   Chi-Square
   FREQUENCY_STATUS_97NK      1       197.6350     <.0001

Finally, an analysis of individual parameter estimates is made. The standardized estimates and the odds ratios merit special attention.

   Analysis of Maximum Likelihood Estimates

                                           Standard   Wald         Pr >
   Parameter               DF   Estimate   Error      Chi-Square   Chi-Square
   Intercept                1   -1.7070    0.0509     1126.09      <.0001
   FREQUENCY_STATUS_97NK    1    0.2928    0.0208      197.64      <.0001

                            Standardized
   Parameter                Estimate       exp(Est)
   Intercept                .              0.181
   FREQUENCY_STATUS_97NK    0.177519       1.340

   Odds Ratio Estimates

   Input                    Odds Ratio
   FREQUENCY_STATUS_97NK    1.340

The standardized estimates present the effect of the input on the log-odds of donation. The values are standardized to be independent of the input’s unit of measure. This provides a means of ranking the importance of inputs in the model.

The odds ratio estimates indicate by what factor the odds of donation increase for each unit change in the associated input. Combined with knowledge of the range of the input, this provides an excellent way to judge the practical (as opposed to the statistical) importance of an input in the model.
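Reading the listing above, exp(estimate) converts a logistic-regression coefficient into an odds ratio: each one-unit increase in FREQUENCY_STATUS_97NK multiplies the odds of donation by about 1.34 (a quick illustrative check, using the estimates from the Step 1 listing):

```python
import math

estimate = 0.2928                 # slope from the Step 1 listing
intercept = -1.7070               # intercept from the Step 1 listing

odds_ratio = math.exp(estimate)   # about 1.340, as reported

def odds(x):
    """Odds of donation at input value x under the one-input model."""
    return math.exp(intercept + estimate * x)

# The ratio of odds at adjacent input values equals the odds ratio.
ratio = odds(4.0) / odds(3.0)
```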

The stepwise selection process continues for 10 steps. After the 10th step, neither adding nor removing inputs from the model significantly changes the model fit statistic. At this point the output window provides a summary of the stepwise procedure. The summary shows the step in which each input was added and the statistical significance of each input in the final 10-input model.


NOTE: No (additional) effects met the 0.05 significance level for entry into the model.

   The DMREG Procedure

   Summary of Stepwise Procedure

         Effect                          Number   Score        Wald         Pr >
   Step  Entered                    DF   In       Chi-Square   Chi-Square   Chi-Square
      1  FREQUENCY_STATUS_97NK       1    1       201.5        .            <.0001
      2  PEP_STAR                    1    2        45.5465     .            <.0001
      3  INCOME_GROUP                1    3        37.1625     .            <.0001
      4  MONTHS_SINCE_LAST_GIFT      1    4        23.7966     .            <.0001
      5  MEDIAN_HOME_VALUE           1    5        16.2189     .            <.0001
      6  MONTHS_SINCE_FIRST_GIFT     1    6         9.8463     .            0.0017
      7  RECENT_CARD_RESPONSE_PROP   1    7        10.1530     .            0.0014
      8  RECENT_AVG_GIFT_AMT         1    8         7.9078     .            0.0049
      9  M_INCOME                    1    9         6.9136     .            0.0086
     10  DONOR_AGE                   1   10         6.0774     .            0.0137

Perhaps surprisingly, the 10-input model is not the model selected by the regression node as the best predictor of the target. The selected model, based on the CHOOSE=VDECDATA criterion, is the model trained in Step 8. It consists of the following effects:

   Intercept
   FREQUENCY_STATUS_97NK
   INCOME_GROUP
   MEDIAN_HOME_VALUE
   MONTHS_SINCE_FIRST_GIFT
   MONTHS_SINCE_LAST_GIFT
   PEP_STAR
   RECENT_AVG_GIFT_AMT
   RECENT_CARD_RESPONSE_PROP

For convenience, the output from step 8 is repeated. An excerpt from the analysis of individual parameter estimates is shown below.

   Analysis of Maximum Likelihood Estimates

                                Standardized
   Parameter                    Estimate       exp(Est)
   Intercept                    .              0.202
   FREQUENCY_STATUS_97NK        0.108179       1.195
   INCOME_GROUP                 0.069488       1.080
   MEDIAN_HOME_VALUE            0.055810       1.000
   MONTHS_SINCE_FIRST_GIFT      0.062665       1.003
   MONTHS_SINCE_LAST_GIFT       -0.063314      0.972
   PEP_STAR 0                   .              0.900
   RECENT_AVG_GIFT_AMT          -0.048745      0.992
   RECENT_CARD_RESPONSE_PROP    0.047769       1.592

The parameter with the largest standardized estimate is the 97NK frequency status, followed by the income group, months since last gift, and months since first gift.

The odds ratio estimates show that a unit change in RECENT_CARD_RESPONSE_PROP produces the largest change in the donation odds. Yet, this input had the smallest standardized estimate. This occurs because the range of the input is [0,1], so a change of one full unit spans the input's entire range.


   Odds Ratio Estimates

   Input                        Odds Ratio
   FREQUENCY_STATUS_97NK        1.195
   INCOME_GROUP                 1.080
   MEDIAN_HOME_VALUE            1.000
   MONTHS_SINCE_FIRST_GIFT      1.003
   MONTHS_SINCE_LAST_GIFT       0.972
   PEP_STAR 0 vs 1              0.811
   RECENT_AVG_GIFT_AMT          0.992
   RECENT_CARD_RESPONSE_PROP    1.592

The output window ends with a misclassification table.

   Table of F_TARGET_B by I_TARGET_B

   F_TARGET_B(From: TARGET_B)
   I_TARGET_B(Into: TARGET_B)

   Frequency
   Percent
   Row Pct
   Col Pct        0        Total
   ---------+---------+
      0     |   7264  |    7264
            |  75.00  |   75.00
            | 100.00  |
            |  75.00  |
   ---------+---------+
      1     |   2421  |    2421
            |  25.00  |   25.00
            | 100.00  |
            |  25.00  |
   ---------+---------+
   Total        9685       9685
               100.00     100.00

The table assumes a 50% decision threshold and is, therefore, of little interest.


1.8 Comparing Predictive Models

[Slide: Gains Charts (Validation Data). Average target value plotted by decile; the decile boundaries correspond to predicted probabilities .73 (10%), .66 (20%), .60 (30%), .54 (40%), .48 (50%), .40 (60%), .34 (70%), .24 (80%), .15 (90%), and .07 (100%).]

Comparing average profit or loss on a validation data set provides one way of comparing predictive models. Another method, a commonly used graphical technique called a gains chart, examines predictive performance independent of profit considerations. The technique partitions the validation data into deciles based on predicted probability. The average value of the target is plotted versus decile percentage (10% indicating the top 10% of predicted probabilities, 20% indicating the second highest 10% of predicted probabilities, and so on). The average target value in each decile is compared to the overall average target value. If the model is predicting well, the initial deciles (corresponding to the highest predicted probabilities) should show a high average target value, whereas the final deciles (corresponding to the lowest predicted probabilities) should show a low average target value.

Each decile corresponds to a range of predicted values. If the model is producing correct (unbiased) predictions, the average value of the target in each decile should, on the average, fall within this range of predicted values.
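The decile construction described above amounts to sorting validation cases by predicted probability, cutting the sorted list into ten equal groups, and averaging the target within each group (an illustrative sketch on toy data, not the course's validation set):

```python
def gains_by_decile(predictions, targets, n_bins=10):
    """Average target value within each decile of predicted probability,
    from the highest-scored decile to the lowest."""
    ranked = sorted(zip(predictions, targets), key=lambda pt: -pt[0])
    size = len(ranked) // n_bins
    return [
        sum(t for _, t in ranked[i * size:(i + 1) * size]) / size
        for i in range(n_bins)
    ]

# Toy data: 100 cases with scores 0.00 .. 0.99, target concentrated
# among the highest scores, so the first deciles show high averages.
preds = [i / 100 for i in range(100)]
targets = [1 if p > 0.75 else 0 for p in preds]
gains = gains_by_decile(preds, targets)
# gains -> [1.0, 1.0, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```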


[Slide: Cumulative Gains Charts (Validation Data). Average target value plotted against selection depth, with depths of 10%, 50%, and 100% marked.]

In many applications, an action is taken on cases with the highest predicted target value. A cumulative gains chart plots the average target value in the validation data versus selection depth (the proportion of cases above a given threshold). Such a chart shows the expected gain in average target value obtained by selecting cases with high predicted target values. For example, selecting the top 20% of the cases (based on predicted target value) results in a subset of data with more than 80% of the cases having the primary target level. This is 1.6 times the overall average proportion of cases with the primary target level. In general, this ratio of average target value at given depth to overall average target value is known as lift. The vertical axis of a cumulative gains chart can be described in terms of lift instead of average target value. When this is the case, a cumulative gains chart is often called a lift chart.

For a fixed depth, a model with greater lift is preferred to one with lesser lift. On the average, increasing the depth decreases the lift. The rate change for two models, however, may be different. While one model may have a higher lift at one depth, a second model may have a higher lift at another depth. The cumulative gains chart illustrates this tradeoff.
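The lift calculation just described (average target value among the selected cases, divided by the overall average) can be sketched as follows, again on toy data:

```python
def cumulative_lift(predictions, targets, depths=(0.1, 0.2, 0.5, 1.0)):
    """Lift at each selection depth: average target value among the
    top fraction of scored cases, divided by the overall average."""
    ranked = [t for _, t in sorted(zip(predictions, targets),
                                   key=lambda pt: -pt[0])]
    overall = sum(ranked) / len(ranked)
    out = {}
    for d in depths:
        k = max(1, int(len(ranked) * d))
        gain = sum(ranked[:k]) / k        # average target value at depth d
        out[d] = gain / overall           # lift at depth d
    return out

# Toy data with a 50% overall primary-level proportion:
preds = [i / 100 for i in range(100)]
targets = [1 if p >= 0.5 else 0 for p in preds]
lift = cumulative_lift(preds, targets)
# lift[0.2] == 2.0: the top 20% are all primary-level, twice the baseline
# lift[1.0] == 1.0: at full depth, every model has lift 1
```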


[Slide: Sensitivity Charts (Validation Data). Sensitivity plotted against selection depth, with the decile probability cutoffs .73 through .07 marked.]

While a cumulative gains chart shows the tradeoff of lift versus depth, it does not provide a complete description of model performance. In addition to knowing the gain or lift at some depth, an analyst may want to know the proportion of cases of a particular target level correctly decided at this depth. For the primary target level, this proportion is known as sensitivity.

A sensitivity chart plots sensitivity versus depth. In this case, it shows the proportion of cases with the primary target level whose predicted target value exceeds fixed thresholds. Assuming each fixed threshold represents a decision threshold, this proportion is equal to sensitivity.

For a fixed depth, a model with greater sensitivity is preferred to one with lesser sensitivity. By its definition, sensitivity always increases with depth. However, as with lift, the rate increase of sensitivity for two models may be different. While one model may have a higher sensitivity at one depth, a second model may have a higher sensitivity at another depth. The sensitivity chart illustrates this tradeoff. A model that provides no lift has, on the average, a sensitivity equal to selection depth.

To summarize a model’s overall performance, analysts can examine the total area under the sensitivity curve. Models with the greatest area under the sensitivity curve have highest average sensitivity across all decision thresholds.
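Sensitivity at a given depth, as defined above, is the fraction of all primary-level cases captured among the selected top fraction (an illustrative sketch on toy data):

```python
def sensitivity_at_depth(predictions, targets, depth):
    """Proportion of all primary-level cases captured within the
    top `depth` fraction of scored cases."""
    ranked = [t for _, t in sorted(zip(predictions, targets),
                                   key=lambda pt: -pt[0])]
    k = int(len(ranked) * depth)
    total_primary = sum(ranked)
    return sum(ranked[:k]) / total_primary

preds = [i / 100 for i in range(100)]
targets = [1 if p >= 0.5 else 0 for p in preds]
s = sensitivity_at_depth(preds, targets, 0.3)
# the top 30% contains 30 of the 50 primary-level cases: sensitivity 0.6
```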


[Slide: Adjusted Gains Charts (Validation Data). The same decile chart after the prior adjustment; the decile probability cutoffs .73 through .07 become .087, .066, .047, .041, .029, .020, .015, .010, and .007.]

[Slide: Adjusted Cumulative Gains Charts (Validation Data). The same prior adjustment applied to the cumulative gains chart.]

Separate sampling complicates the construction of the preceding assessment charts. Adjustments must be made not only to the predicted probabilities, but also to the average target proportion and the depth calculations. Enterprise Miner handles the adjustments automatically as long as a prior vector has been specified.


[Slide: Adjusted Sensitivity Charts (Validation Data). Sensitivity plotted against prior-adjusted depth, with the adjusted probability cutoffs .087 through .007 marked.]

Because calculating sensitivity involves only the primary target level, its value is not affected by separate sampling. However, because a sensitivity chart plots depth on the horizontal axis, the chart’s appearance is affected. For example, before adjustment, sensitivity at a depth of 20% is around 0.3. After adjustment, the sensitivity at the same depth is more than 0.5.

Because the sensitivity chart's depth axis is affected by separate sampling, so too is the total area under the sensitivity curve. It can be shown that the maximum possible area under the sensitivity curve (for large validation samples) is 1 − π₁/2, where π₁ is the overall average proportion of cases with the primary target level. Smaller overall average proportions increase the possible area under the curve. This limits the utility of the area statistic as a universal measure of model performance. For example, having 68% of the plot area under a sensitivity curve indicates mediocre predictive performance for π₁ = 0.05, but it is close to the maximum possible for π₁ = 0.5.


[Slide: ROC Charts (Validation Data). Sensitivity plotted against 1−specificity.]

A slight modification to the sensitivity chart provides a chart whose appearance is invariant to separate sampling. A receiver operating characteristic (ROC) chart plots sensitivity versus 1–specificity. Specificity is the proportion of secondary target level cases correctly decided at a given decision threshold. Thus, 1–specificity is the proportion of secondary target level cases incorrectly decided at a given decision threshold. Because sensitivity and specificity are computed separately within each target level, the overall shape of the plot itself is unaffected by separate sampling methods.

The area under the ROC curve, like the area under the sensitivity curve, provides a way to assess overall model performance. Unlike the sensitivity curve, however, the total area does not depend on the overall average proportion of the primary target level. Thus, the area, sometimes referred to as a c-statistic, has become a universal measure of overall predictive model performance for binary target models. A perfect model that completely separates cases with distinct target levels has c=1. A model with no lift has c=0.5.
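The c-statistic can be computed without drawing the curve at all: it equals the proportion of (primary, secondary) case pairs in which the primary case receives the higher predicted probability, with ties counting one half (an illustrative sketch of this well-known pairwise equivalence, not course code):

```python
def c_statistic(predictions, targets):
    """Area under the ROC curve via the pairwise-comparison form:
    fraction of primary/secondary pairs ranked correctly."""
    primary = [p for p, t in zip(predictions, targets) if t == 1]
    secondary = [p for p, t in zip(predictions, targets) if t == 0]
    wins = 0.0
    for p1 in primary:
        for p0 in secondary:
            if p1 > p0:
                wins += 1.0
            elif p1 == p0:
                wins += 0.5
    return wins / (len(primary) * len(secondary))

# A model that completely separates the target levels has c = 1;
# a model with no lift has c = 0.5.
perfect = c_statistic([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
useless = c_statistic([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
```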

The area under a sensitivity curve can be obtained from the c-statistic via the formula

Area = π₀·c + ½(π₁ + 1/N)

where π₁ is the overall average proportion of the primary target level, π₀ = 1 − π₁, and N is the overall sample size of the validation data.


[Slide: Cumulative Gains and Profit. For a profit structure with profit P₁₁ for soliciting a primary-level case, P₁₀ for soliciting a secondary-level case, and 0 for doing nothing: Average Profit = ((P₁₁ − P₁₀)·p̂₁ + P₁₀)·depth.]

Cumulative gains charts are often used to compare models separately from profit considerations. When one of the decision alternatives is to simply “do nothing,” a useful connection between profit and gains can be established.

Consider the expression for overall average profit for the profit structure shown in the slide:

Average Profit = (P₁₁·n₁₁ + P₁₀·n₁₀) / N.

Simple algebraic manipulation yields

Average Profit = [(P₁₁ − P₁₀)·p̂₁ + P₁₀]·depth

where p̂₁ = n₁₁/n·₁ is the gain, or average value of the target, at depth = n·₁/N. This expression shows that, for a fixed depth, the model with the highest gain or lift also has the highest overall average profit.

It is common practice for the selection depth to be mandated prior to modeling. In such a case, the most profitable model at the mandated selection depth is identically the model with the highest lift at that depth.
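The profit identity is simple to evaluate directly. The sketch below uses the .95/−.05 profit structure shown on the slide; the gain and depth values are hypothetical, chosen only to illustrate that at a fixed depth the higher-gain model is also the more profitable one:

```python
def average_profit(gain, depth, p11=0.95, p10=-0.05):
    """Overall average profit: ((P11 - P10) * gain + P10) * depth."""
    return ((p11 - p10) * gain + p10) * depth

# Two hypothetical models compared at a mandated 20% depth:
model_a = average_profit(gain=0.80, depth=0.2)   # (1.00*0.80 - 0.05)*0.2
model_b = average_profit(gain=0.60, depth=0.2)   # (1.00*0.60 - 0.05)*0.2
# model_a exceeds model_b: higher gain, higher profit at the same depth
```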


[Slide: Cumulative Gains and Profit. With P₁₁ = .95 and P₁₀ = −.05, Average Profit = ((.95 + .05)·p̂₁ − .05)·depth = p̂₁·depth − .05·depth, plotted against depth.]

Although it is a common practice, mandating the selection depth prior to modeling is not necessarily a good practice. Inspection of the equation for overall average profit shows that it is proportional to the area spanned by two rectangles. One rectangle, with base equal to depth and height proportional to gain, corresponds to the profit realized by making a correct decision. The other rectangle, with base equal to depth and height constant, corresponds to the loss incurred by making an incorrect decision. Overall average profit is determined by the (nonlinear) trade-off in area between these two rectangles.

One danger in mandating a selection depth is seen in the overall average profit plot above. Profit is maximized at a depth of 30%. Mandating a selection depth of 20% reduces overall average profit by nearly 25%. On the other hand, mandating a selection depth of 60% adds a large number of cases with no profit increase. While this might not seem as dire as underestimating the depth, doubling the number of selected cases also increases the variability in the overall average profit.


Comparing Predictive Models

In this demonstration, the two models built thus far are compared using the assessment charts discussed above.

1. Close the Results – Regression window if it is open.

2. Add an Assessment node to the diagram as shown.

3. Open the Assessment node. The Assessment Tool window opens.

The window lists the names of the models feeding into the Assessment node in addition to various fit statistics.

Much of the information displayed in the Assessment node is pre-computed during the model fitting process. Therefore, it is unnecessary to run the node to view results.

4. Select both the Tree and the Regression models.

5. Select the Draw Lift Chart tool. The Lift Chart window opens.


The Lift Chart window displays gains charts, cumulative gains charts (the default), and sensitivity charts. There is a profit chart option, but it is incompatible with the method used to tune the predictive models and should be avoided.

1. Select the Non-Cumulative button. A gains chart is displayed.

The regression model shows a 10% donation proportion in the top decile, almost twice the overall average donation (baseline) rate of 5%. Thus, for the top decile, the regression model shows a lift of 2. In the bottom decile, the regression model shows a donation proportion of about 3%. This is about two-thirds the baseline rate.

2. Select the Lift Value button. This changes the vertical scale of the gains chart from percent response to lift. The values of the lift described above are confirmed on the vertical axis.


The tree model shows a slightly lower lift in the top decile and a slightly higher lift in the last decile. This shows the model to have slightly less discriminatory power than the regression model, but certainly more than the baseline model.

3. Select the Cumulative button. This returns to the initial view of a cumulative gains chart. However, the vertical axis is now presented in terms of lift.

The regression model is seen to have uniformly higher lift than the tree model for all depths (of course, all models have the same lift for depth=100%). The regression model concentrates more donors in the top deciles and more non-donors in the bottom deciles than the tree. This effect (seen in the gains chart) accounts for the higher lift seen in the cumulative gains charts.

4. Select the %Captured Response button. A sensitivity chart is displayed.


The sensitivity chart shows the percent of all donors expected or captured in a solicitation up to the specified depth. For example, soliciting to a depth of 30% captures half of all donors.

The regression model exhibits uniformly higher sensitivity and therefore betters the tree model for all decision thresholds. If a selection depth were mandated in advance (for example, solicit the top 50% of all individuals), you could be confident that the best model to select the cases would be the regression model.

Of course, mandating a selection depth does not necessarily maximize profit. For a given profit structure and predictive model, there exists an optimal depth for solicitation. How can you discover this optimal depth?

5. Select the View Lift Data tool. The VIEWTABLE: work.liftdata window opens.

The SAS data set WORK.LIFTDATA contains all the information used to generate the assessment charts (and more). The Modeling Tool column links the model with the presented data.

6. Scroll the table vertically to the 10 rows associated with the regression model, and then scroll horizontally to the column labeled Average Profit Per Observation.

The name of this column is misleading. The quantity presented is the average profit per case within the decile. Mathematically, this can be expressed as

E(Profitᵢ) = (14.62·n₁₁ᵢ − 0.68·n₀₁ᵢ) / Nᵢ

where i indicates the decile number, n₁₁ᵢ and n₀₁ᵢ indicate the number of donors and non-donors, respectively, in decile i with predicted probability in excess of the decision threshold, and Nᵢ is the number of cases in decile i. In other words, this is an estimate of the average profit (per case) just for cases in the ith decile.

The seventh decile’s average profit per case equals zero. No cases in the decile have a predicted probability in excess of decision threshold. The sixth decile’s average profit per case, however, is non-zero. Therefore, the decision for some of the cases in the sixth decile is to solicit. The sixth decile contains cases with predicted values from the fiftieth to the sixtieth percentile. Therefore, the theoretically optimal depth is between 50% and 60%.
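The per-decile calculation above can be sketched numerically. The $14.62 and $0.68 values are the course's profit structure; the decile counts below are hypothetical:

```python
# Average profit per case within a single decile, per the formula above.
# The profit values come from the course; the counts are made up.
def decile_avg_profit(n_donors, n_nondonors, n_cases):
    """n_donors / n_nondonors: cases above the decision threshold."""
    return (14.62 * n_donors - 0.68 * n_nondonors) / n_cases

# A hypothetical decile of 1,000 cases: 60 donors and 600 non-donors
# have predicted probabilities above the decision threshold.
print(round(decile_avg_profit(60, 600, 1000), 4))  # 0.4692
```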


Correcting Assessment Profits (Optional)

The profits presented in the Assessment node do not match the average profit definitions used in the modeling nodes. Determining the profit consequence of making sub-optimal decisions requires some hand calculation. For example, what is the overall profit consequence of mandating solicitation to a particular depth?

To calculate this quantity you could simply refer to the formula presented in the last slide of this section:

Average Profit = [(P11 - P10)·p1 + P10]·depth.

Some analysts use the SAS System’s data export facilities to bring the assessment data into a spreadsheet and examine the average profit consequences versus decision depth.
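The formula can be sketched in a few lines, under the assumption that P11 and P10 are the profits for soliciting a donor and a non-donor, respectively, and p1 is the proportion of donors among the solicited cases:

```python
# Overall average profit per case when soliciting to a given depth,
# per the formula above. Assumed interpretation: P11 = profit per
# solicited donor, P10 = profit per solicited non-donor,
# p1 = donor proportion among the solicited cases.
def overall_avg_profit(p1, depth, P11=14.62, P10=-0.68):
    return ((P11 - P10) * p1 + P10) * depth

# Soliciting the top 50% of cases with a 10% donor rate among them:
print(round(overall_avg_profit(0.10, 0.50), 3))  # 0.425
```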

This section provides a program that modifies the stored assessment data within Enterprise Miner, avoiding the extra steps required for exporting the data. Using the SAS Code tool in Enterprise Miner enables you to encapsulate this code in the analysis diagram.

1. Add a SAS Code node to the diagram and connect it to the Assessment node as shown.

The SAS Code node enables you to add ordinary SAS code to your process flow diagram, customizing Enterprise Miner analyses.

2. Open the SAS Code node.


The SAS Code window allows you to type or load ordinary SAS programs. These programs will run when the SAS Code node is run in a process flow.

The node settings are divided across six tabs in the SAS Code window.

The Data tab identifies the data sets currently defined for training, validation, testing, and scoring.

The Variables tab displays metadata for the variables in the training data set.


The Macros tab shows the names of macro variables available in the SAS Code node. The macro variables refer to data sets and variables in the nodes immediately preceding the SAS Code node. The Description column explains the contents of each macro variable.

The Program tab shows the code that will be executed by the SAS Code node.

The Exports tab lists the data sets available to successor nodes.


The Notes tab provides a facility for documenting node activities.

3. Select the Program tab.

4. Select File → Import File from the SAS menu bar.

5. Select and open the SAS program file Adjust assessment profits.sas. The program file should be located in the same directory as the PVA_RAW_DATA data set. The program is loaded into the SAS Code window.

The program calls a SAS Component Language (SCL) module that adjusts the profits presented in the Assessment node. The details are omitted, but the basic idea of the program is

1. read the Enterprise Miner metadata

2. locate the pre-computed assessment information for each model connected to the assessment node

3. modify the profit and cumulative profit fields to correctly compute overall average profit for each decile and each selection depth.

The program is designed to work with binary targets, and it accepts the most general 2x2 profit matrix. This profit matrix is specified on the first two lines of the program and has default values corresponding to those used in the course.
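For illustration, the general 2x2 profit matrix can be pictured as follows. The nonzero values are the course defaults; the dictionary layout is purely illustrative, not the program's actual representation:

```python
# A 2x2 profit matrix: rows are decisions, columns are actual outcomes.
# The 14.62 / -0.68 values are the course defaults; the zeros for the
# "ignore" decision complete the general 2x2 form.
profit_matrix = {
    ("solicit", "donor"): 14.62,
    ("solicit", "non-donor"): -0.68,
    ("ignore", "donor"): 0.0,
    ("ignore", "non-donor"): 0.0,
}

def profit(decision, outcome):
    """Look up the profit for one decision/outcome pair."""
    return profit_matrix[(decision, outcome)]

print(profit("solicit", "non-donor"))  # -0.68
```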

For the program to run correctly, a results data set must be defined.


1. Select the Exports tab. The Exports tab defines the data generated by the SAS Code node and passed to other nodes in the diagram.

2. Uncheck Pass imported data sets to successors.

3. Select the Add button. The Add export data set/view window opens. This window lets you define the role of the exported data.

4. Select the pop-up menu, select Results, and then select OK. A results data set definition is added to the Enterprise Miner metadata.

5. Close the SAS Code node and save the changes.

6. Run the SAS Code node. You need not view the results.

The SAS Code node modifies the Cumulative Profit Per Observation and the Average Profit Per Observation using the supplied profit information to be consistent with standard usage.

1. Open the Assessment node and view a lift chart as before.

2. Select the Profit button. A plot of overall average profit versus depth is displayed.


As is typical with conforming profit matrices, the overall average profit for both models increases with depth, reaches a maximum, and then diminishes. The plot shows the regression model achieving its maximum profit somewhat below the theoretical optimum found earlier to lie between 50% and 60%. From a practical point of view, soliciting anywhere from 30% to 60% of the cases gives similar profits using the regression model. A mandated cutoff of, say, 20%, however, would reduce profits by about 25%.

The baseline model shows the overall average profit for soliciting, at random, the indicated percentage of cases. In this case, soliciting a random selection of donors will result in a positive profit. Ignoring individuals who would, on the average, yield a positive profit is a sub-optimal strategy. Thus the maximum overall average profit is realized when all cases are solicited.

3. Select the Non-Cumulative button. The plot displays the decile-to-decile change in overall average profit.

Unlike the original plot, this overall average is relative to the entire population, not to the individual decile. The value plotted represents each decile's contribution to the overall average shown in the cumulative profit plot.

Many of the intermediate deciles contain a mixture of donors and non-donors that results in little additional profit (or loss). This gives the regression model near-maximum profit across a wide range of depths and offers some flexibility in deciding the final solicitation depth. If a secondary goal of the analysis, after maximizing profit, is to stay in contact with as many potential donors as possible, you can solicit more than the theoretical optimum with apparently little overall profit consequence. On the other hand, if a secondary goal is to minimize up-front solicitation costs, reducing the size of the solicitation somewhat from the theoretical optimum likewise has little overall profit consequence.

While this flat maximum may seem to be desirable, it is also consistent with a predictive model that is unable to sharply separate the donors from the non-donors. The lack of sharp separation may be due to either an inadequate model or a basic inseparability in the target level. In Enterprise Miner, the best way to identify an inadequate model is to try many types of models and see which one has the best generalization.


1.8 Deploying a Predictive Model

Deploying a Predictive Model

(Slide: scoring code from the predictive model converts input measurements into an expected target value.)

After training and comparing predictive models, one model is selected to represent the association between the inputs and the target. Once selected, this model must be put to use. A scoring recipe generated from the fitted model and applied to suitable input measurements accomplishes this deployment.

Deployment Options

(Slide: the two deployment options, a scoring code module and a scored data set.)

Enterprise Miner offers two options for model deployment: scoring code modules and scored data sets.

Scoring code modules are used to generate predicted target values in environments outside of Enterprise Miner. Release 4.1 of Enterprise Miner can create scoring code in the SAS and C programming languages. Future versions will include additional languages. The SAS language code can be embedded directly in a SAS application to generate predictions. The C language code must be compiled. The C code should compile with any C compiler that supports the ISO/IEC 9899 International Standard for Programming Languages -- C.

Invoking the SAS language scoring code from within Enterprise Miner achieves the second deployment option: scored data sets. Using an Input Data Source node, you identify a data set with the required input variables. Enterprise Miner generates predictions using the scoring recipe prescribed by the selected model. Using a SAS Code node, the resulting scored data set can be copied from the Enterprise Miner environment to another SAS data set, a flat file, a SAS data warehouse, or a specified DBMS.

A copy of the scored data set is stored in the private intermediate data repository of Enterprise Miner. If the data set to be scored is very large, you should consider scoring the data outside the Enterprise Miner environment.


Creating a Scored Data Set

Enterprise Miner creates scored data sets using a combination of Input Data Source, Score, and SAS Code nodes.

1. Add Input Data Source, Score, and SAS Code nodes as shown below.

Connecting the Score node to the Regression node selects the Regression model for deployment.

2. Open the Input Data Source node.

3. Select PVA_SCORE_DATA from the CRSSAMP library.

PVA_SCORE_DATA contains 96,367 cases and 48 variables. The cases are independent of those in PVA_RAW_DATA, which was used to train the models. The data set also lacks TARGET_B and TARGET_D.

4. Change the role of the data set to score.

5. Close and save changes to the Input Data Source node.

By default, the Score node is inactive. You must configure the node to score incoming data sets.

1. Open the Score node. The Score window opens with the Settings tab selected.


2. Select Apply training data score code to score data set. The Score node recognizes any incoming data set with the role of SCORE and applies the scoring recipe to each case in the data set.

3. Select the submit button on the toolbar to create the scored data set.

4. Select the Log tab to review the scoring process. The log contains several items of interest.

The SCORING REPORT, excerpted below, notes differences between the training/validation data sets and the scoring data set. In this case, none of the reported differences have consequences for the generation of valid predictions.

========================================================================
NOTE: SCORING REPORT
NOTE: Training variable not found in score data: TARGET_D
NOTE: Training variable has different level in score data: INCOME_GROUP = interval
NOTE: Training variable has different level in score data: FREQUENCY_STATUS_97NK = interval
NOTE: Training variable not found in score data: M_MONTHS
NOTE: Training variable not found in score data: M_DONOR_
NOTE: Training variable not found in score data: M_INCOME
NOTE: Training variable not found in score data: M_MOR_HI
NOTE: Training variable not found in score data: M_WEALTH

The actual application of the scoring recipe follows the scoring report. The code shown has several differences from the raw scoring code initially found in the node. These differences are discussed in the next demonstration, describing the creation of a scoring module.

Scroll to the end of the Log window. The log confirms the creation of a scored data set.

NOTE: There were 96367 observations read from the data set EMDATA.VIEW_2ZI.
NOTE: The data set EMDATA.SD_X8VVR has 96367 observations and 108 variables.
NOTE: DATA statement used:
      real time           15.82 seconds
      cpu time            2.72 seconds


The scored data set is stored in the private intermediate data repository of Enterprise Miner. The final task is to place the scored data in a user-defined location.

1. Close and save changes to the Score node.

2. Open the SAS Code node connected to the Score node.

3. Select the Data tab. The Data tab contains information about the various data sets passed into the SAS Code node.

4. Select the Score button.

5. Select the Properties… button. The Data set details window opens.

6. Select the Table View tab. The table shows the original variables in the data set, those created by the data replacement process, and the variables resulting from the application of the regression model.

Many of the columns resulting from application of the regression are missing. For most of these, there is no way to compute their values in the absence of actual target values. An exception is the WARNINGS column. It will be missing unless problems were encountered during the application of the scoring code. Possible warning codes include M for missing (un-imputed) input, U for unrecognized input category, and P for invalid posterior (predicted) probability. Cases with non-missing WARNINGS can be investigated to determine the cause of the warning.
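A small sketch of how those warning codes might be decoded downstream. The helper function is hypothetical; the code meanings are the ones listed above:

```python
# Hypothetical decoder for the _WARN_ codes described above.
WARN_CODES = {
    "M": "missing (un-imputed) input",
    "U": "unrecognized input category",
    "P": "invalid posterior (predicted) probability",
}

def decode_warnings(warn):
    """Return a readable reason for each known code in a _WARN_ value."""
    return [WARN_CODES[c] for c in warn.strip() if c in WARN_CODES]

print(decode_warnings("M "))  # ['missing (un-imputed) input']
```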

The scoring code creates a variable labeled Decision: TARGET_B. This variable corresponds to the decision defined in the profit matrix. Selecting cases with the decision variable equal to 1 selects all the cases for which the optimal decision is solicit.

Finally, for categorical targets, the scoring code creates a variable for each target level. The value reported for each case is the predicted value of the specified target level.

By default, the table view presents the variable labels created by the scoring code. Shortly, you will need the variable names.

7. Uncheck the Variable labels checkbox. The column headers now contain the names of the variables rather than their labels.

Note the names of the predicted value variables. They are formed by placing the characters P_ before the target variable name and the target level being predicted after the target variable name. Similarly, the name of the decision variable is formed by placing D_ before the target variable name and an underscore after it.
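The naming convention can be sketched as follows; the helper functions are purely illustrative:

```python
# Sketch of the scored-variable naming convention described above,
# for a binary target TARGET_B with levels 0 and 1.
def predicted_name(target, level):
    return f"P_{target}{level}"   # P_ prefix, target level suffix

def decision_name(target):
    return f"D_{target}_"         # D_ prefix, trailing underscore

print(predicted_name("TARGET_B", "1"))  # P_TARGET_B1
print(decision_name("TARGET_B"))        # D_TARGET_B_
```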

8. Close the Data set details window.

9. Select the Program tab.

10. Select File → Import File.

11. Select the program create score view.sas.

libname MYLIB 'path';
data MYLIB.MY_SCORE_NAME / view=MYLIB.MY_SCORE_NAME;
   set &_SCORE;
   keep CONTROL_NUMBER _WARN_ D_TARGET_B_ P_TARGET_B1;
run;

The program creates a view of the scored data set, named MY_SCORE_NAME, in the directory specified by path. The SAS macro variable &_SCORE points to the data set within the private data repository of Enterprise Miner. The KEEP statement specifies the columns to be kept.

The creation of a view avoids duplication of the scored data.


A simple modification to the above program selects only those records for which the optimal decision is solicit.

data MYLIB.MY_SCORE_NAME;
   set &_SCORE;
   if D_TARGET_B_='1';
   keep CONTROL_NUMBER _WARN_ P_TARGET_B1;
run;

Similarly, to examine cases with warnings, use the following variation:

data MYLIB.MY_SCORE_NAME;
   set &_SCORE;
   if _warn_ ne '';
   keep CONTROL_NUMBER _WARN_ P_TARGET_B1;
run;


Exporting Results to an ODBC Data Source (Optional)

By installing SAS/ACCESS software and slightly modifying the above program, you can write directly to many popular DBMS tables.

1. From the Program tab of the SAS Code node, select File → Import File.

2. Select export score data via ODBC.sas. The Program tab displays the following program.

libname MYLIB ODBC dsn='dsn name';

proc datasets library=MYLIB;
   delete MY_SCORE_NAME;
run;
quit;

data MYLIB.MY_SCORE_NAME;
   set &_SCORE;
   keep CONTROL_NUMBER _WARN_ D_TARGET_B_ P_TARGET_B1;
run;

This program is very similar to create score view.sas. The LIBNAME statement specifies an ODBC-compliant data source named dsn name. The DATASETS procedure deletes the table MY_SCORE_NAME in dsn name (if it happens to exist; otherwise, the DATASETS procedure has no effect). The DATA step code has been modified to create a table rather than a view.

To test the above program, you need to create an ODBC-compliant data source. You can create an ODBC-compliant data source via Windows’ Control Panel. The details are omitted.


Creating a Scoring Code Module

Scoring code modules are created by adding a Scoring tool to a model node in a process flow diagram. This demonstration shows how to create a SAS language scoring code module. Creating a C language scoring code module is similar.

1. Open the Score node.

2. Select the Score Code tab. A list of raw score code objects from predecessor nodes appears on the left.

3. Double-click the Regression raw score code object.

The raw scoring code is usually unsuitable for deployment: it needs rearranging and augmentation to run properly. Saving the raw scoring code applies the changes required to produce an operational scoring module.

4. Right-click the regression model and select Save… from the pop-up menu. The Score code file window opens, prompting for a name for the rearranged and augmented scoring code.

5. Type Saved regression source file and select OK. The Score window updates to show a list of saved source files.

The main differences between the raw code and the saved scoring code module occur at the beginning of the code. The saved scoring code begins with the definition of macro functions. These macro functions are used to standardize variable names later in the code. In addition to the macro definitions, any preprocessing of the data that must occur before the main scoring code is moved here from inside the raw code.


/*--------------------------------------------------------------*/
/* ENTERPRISE MINER: BEGIN SCORE CODE                            */
/*--------------------------------------------------------------*/
%macro DMNORLEN; 16 %mend DMNORLEN;

%macro DMNORMCP(in,out);
   &out=substr(left(&in),1,min(%dmnorlen,length(left(&in))));
   &out=upcase(&out);
%mend DMNORMCP;

%macro DMNORMIP(in);
   &in=left(&in);
   &in=substr(&in,1,min(%dmnorlen,length(&in)));
   &in=upcase(&in);
%mend DMNORMIP;

Next comes the main DATA step invocation of the scoring module. All calculations of predicted values occur after this line of code.

DATA &_PREDICT;
   SET &_SCORE;

The DATA step requires definition of two macro variables: &_SCORE and &_PREDICT. These macro variables contain the names of the data set to be scored and the data set to be created, respectively.

* CODE_CLEAN *;
* Code substitution: ARRAY RGDRF->A7307;
* Code substitution: ARRAY RGDRU->A6048;
* Code substitution: GOTO RGDR1->G9076;
* Code substitution: ARRAY RGDEMA->A04090;
* Code substitution: GOTO RGDEEX->G69237;
* Code substitution: ARRAY RGDEBE->A10802;
*;

The raw score code includes ambiguous references to arrays and program labels. These references are made unique by a code-cleaning algorithm applied to the raw score code. The Code Clean comments define the substitutions made to the original array names and program labels.

The remainder of the scoring code follows the Code Clean comments. Each node in a diagram path generates a code block. The Input Data Source node and the Data Partition node do not produce scoring code, so their code blocks are empty. The code block for the Replacement node provides instructions for replacing the missing values in the input variables. It also creates missing value indicator variables.

You can simplify the scoring code by deleting replacement code for variables not used in the regression model.

The final code block corresponds to the regression model. The code starts with the generation of dummy values for any categorical variables used in the model. Next, all inputs are checked for missing values. If a missing value is found in an input used by the model, the _WARN_ variable is set to M and the predicted value of the target variable is set to the target's overall average value. Next comes the actual calculation of the predicted target value using the formula generated by the regression model. From the predicted target value, residuals are calculated. The model predicted probabilities are adjusted to account for priors. The decision variable and other assessment variables are computed. The RUN and QUIT statements finish the scoring module.

The displayed source code is now suitable for deployment in base SAS software. However, it is stored as a SAS catalog entry in the private data repository of Enterprise Miner. To save the program as a simple text file in a known location, complete the following steps:

1. Select File → Save As… from the menu bar.

2. Browse to a suitable location, enter a file name (for example, PVA_REGRESSION), and select OK.

The scoring code module is now stored as a SAS program.

A simple three-line program is sufficient to create a scored data set named WORK.PVASCORE from the raw data called CRSSAMP.PVA_SCORE_DATA.

%let _predict=WORK.PVASCORE;
%let _score=CRSSAMP.PVA_SCORE_DATA;
%include "path\PVA_REGRESSION";

DROP= and KEEP= options can be added to the _PREDICT macro variable statement to limit the amount of information placed in the prediction data set.


1.9 Summarizing the Analysis

Process Flow Diagram Summary: Input Data Source
• Select training data
• Define metadata
• Set prior vector
• Set profit matrix

In this chapter, you created a diagram that can be used as a template for many predictive modeling tasks. As you built the diagram, many changes were made to the default settings of Enterprise Miner tools. These slides summarize the changes made to each node.

Section 1.2 describes the data selection process. Section 1.3 shows how to define a diagram’s metadata. The prior vector is set in Section 1.4. Setting a profit matrix is detailed in Section 1.5.

Process Flow Diagram Summary: Data Partition
• Set partition sizes
• Stratify on target

Setting the data partition node is described in Section 1.3.


Process Flow Diagram Summary: Tree
• Adjust viewer depth

Adjusting the viewer depth of the Tree node is described in Section 1.3. The effects of priors and profits on the Tree algorithm are shown in Sections 1.4 and 1.5. Details of the Tree tool's algorithms and their use are discussed in Chapter 2.

Process Flow Diagram Summary: Insight
• Use entire data set
• View distribution

Section 1.5 introduces the Insight tool for data visualization. Additional uses of the tool are found in Chapter 2.


Process Flow Diagram Summary: Replacement
• Indicate imputation

Adding imputation indicators is shown in Section 1.6. Additional replacement methods are discussed in Chapter 2.

Process Flow Diagram Summary: Regression
• Set input selection

Section 1.6 introduces the Regression tool. Tuning a regression model by input selection is detailed in Section 1.7. Additional options for the Regression tool are discussed in Chapter 3.


Process Flow Diagram Summary: Assessment
• View lift plot
• View lift data
• View modified profit

Process Flow Diagram Summary: SAS Code I
• Import SAS code
• Add results data set
• Don't pass imports

The Assessment node and modifications to profit data are described in Section 1.8.


Process Flow Diagram Summary: Input Score Data
• Select score data
• Set score role

Process Flow Diagram Summary: Score
• Set score action
• Save source code
• Export to text file


Process Flow Diagram Summary: SAS Code II
• Define library
• Export scored data
• Check warnings

Creating a scored data set and a SAS scoring module is discussed in Section 1.8.


Reporting Results

Enterprise Miner provides a means to summarize your modeling work in a Web document. The Reporter tool captures node settings and enables colleagues to share modeling results.

1. Add a Reporter node to the diagram as shown.

2. Run the diagram from the Reporter node. When complete, you are asked whether you want to open the report.

3. Select Open…. Enterprise Miner attempts to open the HTML report file in your default Web browser.


The report provides an image of the diagram (as it appears in the Diagram Workspace) at the time the report was generated. It captures node settings and analysis results. When connected to a Score node, a hyperlink to the saved score code is provided. This provides a mechanism for distributing and preserving scoring code modules.

By default the report files are stored in the project directory. You can change the default directory by opening the Reporter node and selecting the Options tab.


Chapter 2 Flexible Parametric Models

2.1 Defining Flexible Regression Models...........................................................................2-3

2.2 Constructing Neural Networks....................................................................................2-14

2.3 Deconstructing Neural Networks................................................................................2-25



2.1 Defining Flexible Regression Models

Standard Logistic Regression Models

log( p / (1 - p) ) = w0 + w01·x1 + w02·x2

(Slide: scatter plot of the training data.)

A standard logistic regression model assumes that logit(p) is a linear combination of the inputs. Consequently, logit(p) increases in a direction specified by the model weights. The decision boundary for such a model is a plane perpendicular to this direction of increase. This is an extremely restrictive assumption that nevertheless works remarkably well in practice. Even when the assumption is wrong, such a model may give useful predictions.
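The linear decision boundary can be sketched numerically. The weights below are illustrative, not fitted values from the course data:

```python
import math

# Illustrative weights for logit(p) = w0 + w01*x1 + w02*x2.
w0, w01, w02 = -1.0, 2.0, -0.5

def p_hat(x1, x2):
    """Predicted probability from the standard logistic model."""
    logit = w0 + w01 * x1 + w02 * x2
    return 1.0 / (1.0 + math.exp(-logit))

# The p = 0.5 decision boundary is the set of points where the logit
# is zero: a line (in general, a plane) perpendicular to (w01, w02).
print(round(p_hat(0.5, 0.0), 2))  # 0.5, a point on the boundary
```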

Polynomial Logistic Regression Models

log( p / (1 - p) ) = w0 + w01·x1 + w02·x2 + w11·x1·x1 + w22·x2·x2 + w12·x1·x2

(Slide: scatter plot of the training data.)

However, an incorrectly specified model never generalizes as well as a correctly specified one. If the association between the inputs and the target is not a linear combination of the inputs, you want to reflect this in your predictive model. This is the goal of flexible parametric modeling.


One of the simplest flexible parametric approaches involves enhancing a standard regression model with nonlinear and interaction terms to create a polynomial regression model.

In polynomial regression, a typical nonlinear modeling term is the square of an input, for example x1· x1. A typical interaction term is a product of two inputs, x1· x2. Adding all two-way combinations of inputs yields a quadratic regression model. Quadratic regression models are much more flexible than standard regression models. The flexibility comes at a price: with p inputs in the model, there are p·(p + 1)/2 two-way input combinations. This tends to rapidly overwhelm regression modeling procedures.
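The growth in term count described above can be checked directly: with p inputs there are p squared terms plus p(p - 1)/2 distinct cross products, for p(p + 1)/2 two-way combinations in total.

```python
# Count of two-way input combinations in a quadratic regression model:
# p squared terms plus p*(p-1)/2 cross products = p*(p+1)/2.
def quadratic_terms(p):
    return p * (p + 1) // 2

for p in (2, 10, 50):
    print(p, quadratic_terms(p))  # 2 -> 3, 10 -> 55, 50 -> 1275
```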


Defining Nonlinearities and Interactions

Standard logistic regression models assume a linear and additive relationship between the inputs and the logit of the target. While this assumption suffices for many modeling scenarios, ignoring existing nonlinearities and interactions reduces model performance. This demonstration shows how to modify a standard model to account for known nonlinearities.

First, store your work from Chapter 1. You use it later in Chapter 3.

1. Open the PVA Analysis diagram.

2. Select File → Save diagram as… from the SAS menu. The Save Diagram As… window opens.

3. Type PVA Analysis Chapter 1 and select OK. Enterprise Miner copies each element of the original PVA Analysis into the new diagram.

4. Reopen the PVA Analysis diagram.

A copy of the PVA Project is stored in the EM Project directory. You can open this project and use the completed analysis from Chapter 1 as a basis for the subsequent work in this chapter. However, be advised that this backup copy of the PVA Analysis must be re-run to restore the project's intermediate data.

Modify the diagram for additional modeling. Certain elements of the analysis from Chapter 1 are unnecessary here.

1. Right-click on the nodes indicated below and select Delete from the pop-up menu.

2. Connect another Regression node to the Replacement node.

3. Type the name Polynomial Regression below the new node.


To build a polynomial regression model, you must explicitly define the non-linearities and interactions for use in the model. This is done through the interaction builder.

1. Open the Polynomial Regression node.

2. Select Tools → Interaction builder… from the SAS menu. The Interaction Builder window opens.

On the left is a list of all available inputs. On the right is a list of terms already in the model. Suppose you believe the relationship between the 97NK frequency status and the logit of the target is both non-linear and affected by the length of time since last donation. You can model this as follows.

1. Select FREQUENCY_STATUS_97NK from the Input Variables list.

2. Select the Polynomial button. The square of FREQUENCY_STATUS_97NK is added to the bottom of the Terms in Model list.


3. Select FREQUENCY_STATUS_97NK from the Input Variables list.

4. While pressing the Ctrl key, select MONTHS_SINCE_LAST_GIFT from the Input Variables list.

5. Select the Cross button. The interaction between the selected variables is added to the model and listed at the bottom of the Terms in Model list.

6. Select OK to close the Interaction Builder window.

The Variables tab of the Regression node lists the newly added modeling terms at the bottom of the variables list. The added terms have the Model Role of interaction, a role unique to the Regression node.

Presently, all inputs will be used in the fitted regression model. As seen in Chapter 1, such a model exhibits poor generalization. An input selection method remedies this problem.

1. Select the Selection Method tab.

2. Select the stepwise method.

You are now ready to fit and evaluate the polynomial model.

1. Close and save changes to the Linear and Logistic Regression window.

2. Name the model Polynomial when prompted.

3. Run the Polynomial Regression node and view the results.


4. Select the Statistics tab. The overall average profit on the validation data is about a quarter of a cent higher than the standard regression model. This is not an impressive difference.

5. Select the Output tab and scroll to the final model summary. The summary shows that the interaction term was included in the final model but the quadratic term was not.

The effect of the interaction term on the model is summarized in the exp(Est) table. Every additional month since the last gift reduces the effect of the frequency status input by less than 2%.

Parameter                                        exp(Est)
frequency_status_97nk                               1.588
frequency_status_97nk*months_since_last_gift        0.983
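The combined multiplicative effect can be checked directly from the two exp(Est) values. A sketch (not Enterprise Miner output): the odds multiplier for one unit of frequency status, m months after the last gift, is 1.588 × 0.983^m, so each additional month shrinks the effect by 1.7%, consistent with "less than 2%" above.

```python
def freq_status_odds_multiplier(m):
    """Odds multiplier for one unit of frequency status, m months
    after the last gift, using the exp(Est) values from the table."""
    return 1.588 * 0.983 ** m

# Each additional month multiplies the effect by 0.983 (a 1.7% reduction)
ratio = freq_status_odds_multiplier(13) / freq_status_odds_multiplier(12)
print(round(ratio, 3))  # 0.983
```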

6. Close the Results-Regression window.

Comparing the models in the Assessment node also shows the models to be quite similar.

1. Connect the Polynomial Regression node to the Assessment node. To improve the diagram’s appearance, the Assessment node and the SAS Code node have been moved down slightly.

2. Run the diagram from the Assessment node and view the results.

3. Select the three models listed and create a lift chart.

4. Select Format Model Name from the SAS menu to distinguish the two regression models.


Both the gains chart (%Response), shown above, and the sensitivity chart (%Captured Response) show little difference in the two regression models.

By running the SAS Code node attached to the Assessment node, you can correct the profit calculations and see the very slight profit advantage for the polynomial regression model.

In general, it is difficult to realize substantial profit gain by guessing at appropriate interaction and non-linear terms. Having domain knowledge and studying empirical logit plots helps, but large data sets often defeat these methods.


Flexing Standard Regression Models

While it is difficult to guess at useful interaction terms in a regression model, it is also difficult to discover them automatically. For example, in a 50-input data set, there are more than 1,200 possible interaction terms. With so many combinations to choose from, most sequential procedures select many spurious, training-data-dependent terms for the final model. In short, the curse of dimensionality allows too many input combinations in a polynomial regression model.
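The count of candidate two-way interactions is just "n choose 2"; for 50 inputs that gives 1,225, consistent with the "more than 1,200" figure above:

```python
from math import comb

n_inputs = 50
n_interactions = comb(n_inputs, 2)   # unordered pairs of distinct inputs
print(n_interactions)  # 1225
```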

Situations like this are usually addressed by one of two approaches. Either restrictions are placed on the model search to make it more tractable or the modeling method is abandoned in favor of another technique. This demonstration illustrates one of many possible restrictions on the model search. The standard regression model fit in the previous chapter is expanded to include all second order polynomial terms. Another sequential selection method is applied to select useful polynomial terms.

By reducing the number of inputs used to create the interactions, the process once more becomes tractable. Unfortunately, this process uses only those inputs found important in the first regression. There is a good chance that many interactions will be missed. On the other hand, increasing the flexibility of a standard regression model may improve predictive performance.

1. Delete the connection between the Replacement node and the Polynomial Regression node.

2. Draw a connection from the Regression node to the Polynomial Regression node.

3. Open the Polynomial Regression node. Most of the inputs in the Variables tab have a Model Role of rejected and a Status of don’t use.


This results from the stepwise selection used in the first Regression node. Only inputs selected by the procedure have their Model Roles preserved. The rest have their Model Roles set to rejected.

1. Open the Interaction Builder window. The terms in the model are the same as the terms in the original Regression model except for the two terms added in the previous demonstration.

2. Select and remove the interaction and quadratic terms previously added.

3. Select the first eight inputs (INCOME_GROUP through MONTHS_SINCE_FIRST_GIFT) in the Input Variables list.

4. Select the Expand button. All 28 possible two-way interactions are added to the model.

5. Select the first eight inputs again.

6. While pressing the Ctrl key, deselect PEP_STAR. The Polynomial button activates.

7. Select the Polynomial button. Seven squared terms are added to the model.

8. Select OK to close the Interaction Builder window.

There are 43 potential inputs for use in the model. A subset of these will be selected using the stepwise process configured in the previous demonstration.

1. Run the Polynomial Regression node and view the results.

2. Select the Statistics tab. The overall average profit on the validation data is more than half a cent (almost 4%) higher than the standard regression model. While not earth shattering, it is an improvement.

3. Select the Output tab and scroll to the end of the report. All but one of the terms added to the model are interactions. Because some of the inputs are involved in more than one interaction, interpretation of the model has become extremely difficult.

This is a typical property of flexible models. Increasing the capacity of a model to capture complex input/target associations reduces interpretability of the results.


Analysis of Maximum Likelihood Estimates

                                          Standard   Wald        Pr >        Stndzd
Parameter                   DF  Estimate  Error      Chi-Square  Chi-Square  Estimate
Intercept                   1   -1.7412   0.0831     438.54      <.0001      .
frequency_status_97nk       1   0.5072    0.0517     96.28       <.0001      0.307553
frequency_status_97nk*
  months_since_last_gift    1   -0.0202   0.00289    48.91       <.0001      .
income_group*
  months_since_first_gift   1   0.000776  0.000154   25.44       <.0001      .
median_home_value*
  months_since_last_gift    1   6.851E-6  1.292E-6   28.10       <.0001      .
months_since_first_gift*
  pep_star 0                1   0.00133   0.000560   5.63        0.0176      .
months_since_first_gift*
  recent_avg_gift_amt       1   -0.00013  0.000036   13.19       0.0003      .
months_since_first_gift*
  recent_card_response_prop 1   0.00908   0.00189    23.17       <.0001      .
recent_avg_gift_amt*
  pep_star 0                1   -0.0114   0.00250    20.75       <.0001      .

1. Close the Regression node and examine the Assessment node plots.

The Polynomial Regression model shown has a slightly higher lift between 40% and 50%, the most likely mailing depths.

2. Close the Regression node and run the SAS Code node, if desired.


As expected, the overall average profit is seen to be slightly higher for the Polynomial Regression model. Such a difference may not be large enough to justify the increase in model complexity resulting from the inclusion of interaction terms.


2.2 Constructing Neural Networks

Neural Network Model

log( p / (1 - p) ) = w00 + w01 H1 + w02 H2 + w03 H3

tanh-1( H1 ) = w10 + w11 x1 + w12 x2
tanh-1( H2 ) = w20 + w21 x1 + w22 x2
tanh-1( H3 ) = w30 + w31 x1 + w32 x2

(Slide figure: training data plotted against the inputs x1 and x2, with an inset showing the tanh(x) activation curve, which ranges from -1 to 1 around x = 0.)

With their exotic-sounding name, neural network models (formally, multi-layer perceptrons) are often regarded as a mysterious and powerful predictive modeling technique. The most typical form of the model is, in fact, a natural extension of a regression model.

A neural network can be thought of as a generalized linear model on a set of derived inputs. These derived inputs are themselves a generalized linear model on the original inputs. The usual link for the derived input’s model is inverse hyperbolic tangent, a shift and rescaling of the logit function.

What makes neural networks interesting is their ability to approximate virtually any continuous association between the inputs and the target. You simply need to specify the correct number of derived inputs.
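The three-hidden-unit network described here can be written out directly as a forward pass: each derived input H is tanh of a linear combination of the inputs, and the output is a logistic function of a linear combination of the H's. A minimal NumPy sketch with arbitrary illustrative weights:

```python
import numpy as np

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass of a multi-layer perceptron with tanh hidden
    units and a logistic output, matching the network equations."""
    H = np.tanh(b_hidden + W_hidden @ x)   # derived inputs H1, H2, H3
    logit = b_out + w_out @ H              # log( p / (1 - p) )
    return 1.0 / (1.0 + np.exp(-logit))    # p

# Two inputs (x1, x2), three hidden units; arbitrary small weights
rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.1, size=(3, 2))
b_hidden = rng.normal(scale=0.1, size=3)
w_out = rng.normal(scale=0.1, size=3)

p = forward(np.array([1.0, -2.0]), W_hidden, b_hidden, w_out, 0.0)
print(0.0 < p < 1.0)  # True: the output is always a valid probability
```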

Neural Network Diagram

log( p / (1 - p) ) = w00 + w01 H1 + w02 H2 + w03 H3

tanh-1( H1 ) = w10 + w11 x1 + w12 x2
tanh-1( H2 ) = w20 + w21 x1 + w22 x2
tanh-1( H3 ) = w30 + w31 x1 + w32 x2

(Slide figure: a network diagram with the inputs x1 and x2 on the left, the hidden layer units H1, H2, and H3 in the middle, and the target p on the right. Each element of the diagram corresponds to a term in the network equations.)


Multi-layer perceptron models were originally inspired by neurophysiology and the interconnections between neurons. The basic model form arranges neurons in layers. The first layer, called the input layer, connects to a layer of neurons called a hidden layer, which, in turn, connects to a final layer called the target, or output, layer. The structure of a multi-layer perceptron lends itself to a graphical representation called a network diagram. Each element in the diagram has a counterpart in the network equation.

Neural Network Training

(Slide figure: the network equations and training data, together with a plot of the objective function value versus training iteration, 0 through 70.)

As with all parametric models, the fundamental task with a fixed model structure is to find a set of parameter estimates that approximate the association between the inputs and the expected value of the target. This is done iteratively.

The model parameters are given random initial values, and predictions of the target are computed. These predictions are compared to the actual values of the target via an objective function. The actual objective function depends on the assumed distribution of the target, but conceptually the goal is to minimize the difference between the actual and predicted values of the target. An easy-to-understand example of an objective function is the mean squared error (MSE) given by

MSE( ŵ ) = (1/N) Σ ( yi − ŷi )²   (sum over the training cases)

where

N is the number of training cases.

yi is the target value of the ith case.

ŷi is the predicted target value.

ŵ is the current estimate of the model parameters.

Training proceeds by updating the parameter estimates in a manner that decreases the value of the objective function.
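The update cycle described above can be sketched with plain gradient descent on the MSE for a toy linear model (illustrative only; Enterprise Miner uses more sophisticated optimization techniques):

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error of the predictions X @ w against targets y."""
    return np.mean((y - X @ w) ** 2)

# Toy training data and a small random starting point for the parameters
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
w = rng.normal(scale=0.05, size=3)            # random initial values

for _ in range(50):
    grad = -2.0 / len(y) * X.T @ (y - X @ w)  # gradient of the MSE
    w = w - 0.1 * grad                        # step that decreases the objective

print(mse(w, X, y) < mse(np.zeros(3), X, y))  # True: training reduced the error
```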


Neural Network Training Convergence

(Slide figure: as on the previous slide, the objective function value plotted versus training iteration, 0 through 70, leveling off as training converges.)

Training concludes when small changes in the parameter values no longer decrease the value of the objective function. The network is said to have reached a local minimum in the objective.

Training Overgeneralization

(Slide figure: the objective function plotted versus training iteration, 0 through 70, for both the training and the validation data, with the two curves diverging as training proceeds.)

A small value for the objective function, when calculated on training data, need not imply a small value for the function on validation data. Typically, improvement on the objective function is observed on both the training and the validation data over the first few iterations of the training process. At convergence, however, the model is likely to be highly overgeneralized and the values of the objective function computed on training and validation data may be quite different.


Neural Network Final Model

(Slide figure: overall average profit plotted versus training iteration, 0 through 70, with the final model taken from the iteration of maximum validation profit.)

To compensate for overgeneralization, the overall average profit, computed on validation data, is examined. The final parameter estimates for the model are taken from the training iteration with the maximum validation profit.
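This selection rule is a form of early stopping: keep the parameter estimates from whichever iteration scored best on the validation data. A schematic sketch (the profit values are a hypothetical stand-in for the training history):

```python
def best_iteration(validation_profits):
    """Return the training iteration whose parameter snapshot
    maximizes overall average profit on the validation data."""
    best_iter, best_profit = 0, float("-inf")
    for i, profit in enumerate(validation_profits):
        if profit > best_profit:      # remember the best snapshot so far
            best_iter, best_profit = i, profit
    return best_iter

# Illustrative profit curve: improves early, then overgeneralization sets in
profits = [0.10, 0.14, 0.16, 0.15, 0.13, 0.11]
print(best_iteration(profits))  # 2
```

In practice the parameter estimates themselves, not just the iteration number, are saved at each step so the best snapshot can be restored.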


Constructing a Neural Network

A small improvement in overall average profit was observed when shifting from a standard regression model to a second order polynomial regression model. Neural network models offer even more flexibility than polynomial regressions. Will they improve predicted profit even more?

1. Connect a Neural Network node to the Replacement node.

2. Open the Neural Network node. The Neural Network window opens showing a list of inputs for use in the model.

3. Select the General tab.

The most important element of the General tab is the Advanced user interface checkbox. It controls the way in which neural network models are defined. For the demonstrations in this section, the network models are defined using the Basic user interface.

4. Select the Basic tab. The Basic user interface controls the type of neural network fit, its complexity, and basic properties of the parameter estimation or training process.


5. Select the Network architecture pop-up control. The Set Network Architecture window opens.

The number of hidden neurons determines model complexity. By default this number is determined by both the number of observations in the data and the estimated amount of noise.

With a minimal understanding of the basic structure of neural network models, it is usually possible to manually choose a network complexity that outperforms the automated defaults.

For example, a network with a single hidden unit (the apparent default here) is largely the same as a standard regression model. A minimum of three hidden units is required before substantial differences from a second-order polynomial regression model are possible.

6. Select the Hidden neurons pop-up control and then select Set number… from the pop-up menu. This option enables you to specify the number of hidden units used in the network.

7. Type 3 in the Set number… field.


The Direct connections and Network architecture fields enable you to modify the standard neural network architecture by defining a skip layer network and other network types. Neither of these options will be used in this course.

8. Select OK to close the Set Network Architecture window.

9. Close the Neural Network window, save the changes, and name the model Standard.

10. Run the Neural Network node. Unlike most other Enterprise Miner tools, the Neural Network node plots training progress in a separate window. This shows how the value of the objective function changes for each step in the optimization process. Lines are plotted for both the training and the validation data.

In typical fashion, the lines are initially close together and then begin to diverge. As training progresses, the neural network begins to model features only present in the training data and not in the validation data. This characterizes an overgeneralized model.

1. View the training results when complete. The Results-Neural Network window opens.

The default view of the training window is the Tables tab. At the top of the list in the tables window is the overall average profit for training and validation data. The values are widely disparate. This provides even more evidence of model overgeneralization.

2. Select the Plot tab.


The plot shows the value of the objective function versus the iteration step. It is the same plot produced during training, but it appears here with much improved scaling. With each iteration, the current model parameter estimates are used to evaluate model performance on the training and the validation data. Lower values for the objective function are better.

Note that for each step, the objective function improves on the training data (after all, it is the quantity being optimized). However, the objective function evaluated on the validation data improves only briefly at the start of the optimization process. It then diverges rapidly from the training line and steadily climbs until the end of the optimization.

Curiously, the indicated iteration does not seem to correspond to the smallest value of the objective function evaluated on the training data. This observation is correct. While the optimization process seeks to minimize the objective function, the actual model selection process is based on expected profit for the validation data. You can reconfigure the plot to show the expected profit versus iteration.

3. Right-click in the plot area.

4. Select Profit from the pop-up menu. The Plot tab is updated to show overall average profit versus iteration for the training and validation data sets.

Note the maximum validation profit corresponds to the selected iteration. Both the training and validation profit values also correspond to the values reported in the assessment table examined initially.


The primary reason for poor generalization is the use of too many inputs in the neural network model. The fact that this model can achieve a level of performance equal to that of the standard regression model speaks to the ability of a neural network to filter out irrelevant and redundant inputs. Nevertheless, by filtering out these inputs from the model, you may improve model performance.


Filtering Network Inputs

To improve the performance of regression models, an input selection method like the stepwise method is commonly employed. While heuristic selection procedures like stepwise exist for neural networks, the computational costs of their implementation tax even the fastest computers. These procedures, therefore, are not part of Enterprise Miner.

The input selection problem is similar to that for the polynomial regression model in Section 2.1. There, the problem was addressed by using the inputs selected by the standard regression model as inputs for the polynomial regression model.

A similar approach is used here. However, given a neural network's ability to model complex associations between the inputs and the target, you may also want to try another modeling method for selecting inputs. A decision tree is another commonly used input selection model. In this section, you compare the generalization of the two input selection methods.

1. Disconnect the Neural Network node from the Replacement node.

2. Move the Polynomial Regression model over to make room for the Neural Network.

3. Connect the Neural Network node to the Regression node as shown.

4. Open the Neural Network node. Only the inputs selected by the Regression node have a Model Role of input and a Status set to use.

5. Run the Neural Network node and view the results.

The Tables tab shows performance similar to that of the Polynomial Regression model. This is not exceptionally surprising given that both models are flexible and use the same inputs.

1. Connect the Neural Network node to the Assessment node.

2. Open the Assessment node and examine the gains and sensitivity charts. The charts show a very similar response profile for neural network, stepwise regression, and polynomial regression models.


Now try the same process on the inputs selected by the decision tree.

1. Connect a Replacement node to the Tree node. A neural network still needs complete cases, even if a tree model is selecting the inputs.

2. Connect another Neural Network node to the Replacement node.

3. Label the node Tree Neural Network.

4. Open the Tree Neural Network node and configure a Multi-layer Perceptron network with three hidden units.

5. Run the Tree Neural Network node and view the results. The validation profit is more than one cent less than the original Neural Network model. The primary reason for this deficiency is the linear and (almost) additive association between the important interval inputs and the target. Properly specified regression models are better attuned to such an association. If the drivers of the prediction model were categorical variables or interaction terms, the Tree Neural Network model would be much more competitive with the original Neural Network.

In Chapter 3, you see how to better configure the Tree node for selecting neural network inputs.


2.3 Deconstructing Neural Networks

While neural networks have been introduced as a generalization of standard regression models, practitioners often use them as a type of ‘predictive algorithm.’ As is discussed in Chapter 3, predictive algorithms, as opposed to predictive models, assume little about the underlying structure of the model. They are designed to be extremely flexible and, without proper tuning, always overgeneralize the training data. Their success depends on restricting their flexibility to match the prediction problem at hand.

This section illustrates how Neural Network models can be used as predictive algorithms and how you can guard against overgeneralization.


Building a High Capacity Neural Network

To start, construct a neural network with a seemingly unreasonable number of parameters. This should result in a highly overgeneralized model.

1. Connect another Neural Network node to the Regression node.

2. Change the name of the added node to Big Neural Network.

3. Open the Big Neural Network.

4. Select the General tab.

5. Select the Advanced user interface checkbox. The Basic tab deactivates and the Advanced tab activates.

6. Select the Advanced tab. The Neural Network window shows a schematic of the network diagram. The cyan pentagons on the left represent input variables. The blue square in the middle represents a hidden layer with three hidden units. The yellow pentagon on the right represents the target variable.

In general, the Advanced tab allows for finer control of the network architecture and training process. In the next steps, you use this control to define a network with 40 hidden units and select a non-default optimization method.

Change the number of hidden units from the default of 3 to 40.


1. Double-click the blue square representing the hidden layer at the center of the diagram. The Node properties window opens.

2. Select the Hidden tab. The Hidden tab controls the number of neurons in the hidden layer and facts pertaining to all the nodes in the layer.

3. Type 40 in the Number of neurons field.

4. Select the Initial tab. The Initial tab controls the initialization of the model parameters. This predictive algorithm approach to neural networks requires small starting values for the model parameters, typically between 0.01 and 0.1. The reason is discussed in the next demonstration.


5. Type 0.05 in the Scale field. This is the scale of the initialization for the bias terms of the neurons.

6. Select the Altitude button. The initialization information for the weight parameters appears.

7. Type 0.05 in the Scale field.

8. Select OK to close the Node properties window. The diagram shows the hidden layer with 40 neurons.
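One intuition for why small starting values matter (a side check, not part of the course steps): for small arguments, tanh(x) ≈ x, so a network initialized near zero begins as an almost linear, regression-like model and only bends into non-linearity as training grows the weights.

```python
import math

# Near zero, tanh is essentially the identity function, so small
# initial weights keep each hidden unit in its linear range.
for x in (0.01, 0.05, 0.1):
    print(x, round(math.tanh(x), 6), abs(math.tanh(x) - x) < 1e-3)
```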

Now change the training method from the default to the Double Dogleg method.

1. Select the Train subtab. The Neural Network window displays the default training method selected.

2. Deselect the Default Settings checkbox. With Default Settings unchecked, it is possible to change the training technique.

3. Select the Double Dogleg Training Technique. The Double Dogleg technique combines the default training method (for the current network architecture) with the gradient descent method. It also tends to take more steps to converge to a minimum training error than the default.

4. Close the Neural Network window and save the changes.

5. Name the model BigNeural.

6. Run the Big Neural Network node and view the results.

The validation overall average profit of $0.1623 is slightly higher than all the other methods tried thus far. Note that this “model” has over 400 parameters, whereas the Polynomial Regression has 9. You would expect with so many parameters that the model would badly overgeneralize. This, however, has not happened.
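The parameter count follows from the architecture: each of the 40 hidden units has one weight per input plus a bias, and the output unit has one weight per hidden unit plus a bias. Assuming, for illustration, that 8 inputs were passed on by the Regression node:

```python
def mlp_parameter_count(n_inputs, n_hidden):
    """Parameters in a single-hidden-layer perceptron with one output."""
    hidden = n_hidden * (n_inputs + 1)   # weights plus a bias per hidden unit
    output = n_hidden + 1                # output weights plus the output bias
    return hidden + output

# With 8 inputs (an assumption) and 40 hidden units: 40*9 + 41 = 401
print(mlp_parameter_count(8, 40))  # 401
```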


Taming Overgeneralizations

The large number of parameters in neural network models tends to make them prone to overgeneralization. Enterprise Miner, by default, takes steps to guard against this problem. Sometimes, however, generalization can be improved by adjusting some of the defaults.

This demonstration illustrates how the overgeneralization precautions work. Therefore, the focus (temporarily) shifts from building the best possible model to understanding the details of network optimization.

Set up the neural network demonstration as follows.

1. Connect a Control Point node to the Replacement node.

The Control Point node simply serves as a junction for other nodes. It is used here to simplify the appearance of the diagram.

2. Connect a Neural Network node to the Control Point.

3. Change the name of the node just added to Two Input Network.

4. Open the Two Input Network node.

5. Set the Status of all inputs to don’t use.


6. Set the Status of LIFETIME_CARD_PROM and LIFETIME_PROM back to use.

This creates a neural network with exactly two inputs, which allows you to see how the predicted values change as a function of the inputs.

7. Select the Basic tab and define a Multi-layer Perceptron model with 4 hidden neurons.

8. Select the Output tab.

9. Select Process or Score: Training, Validation, and Test. This option appends predicted values to the data sets passed out of the Two Input Network node.

10. Close the Neural Network window and save the changes. Name the model VisiNeural.

Now examine the modeling results.

1. Run the Two Input Network node and view the results.

2. Select the Plot tab.


The Plot tab shows the value of the objective function versus training iteration. Parameter estimates were taken from the first training iteration.

3. Right-click in the plot and select Profit from the pop-up menu.

The Profit plot shows that the network attains maximum validation profit on the first iteration. Notice that the training profit continues to increase past this point.

What is the consequence of not using the first iteration for model parameter estimates?

1. Select one of the two plotted lines at the last training iteration. The black line moves to this point.


2. Right-click in the plot area and select Set network at… → Set network at selected iteration.

3. Select Tools → Score from the SAS menu.

4. Select Yes to score with the current settings.

5. Select OK when Enterprise Miner finishes scoring.

The Two Input Network has scored the training and validation data with parameter estimates taken from the last training iteration. With only two inputs, it is possible to actually see what the predictions look like.

1. Close the Results window.

2. Connect an Insight node to the Two Input Network node.

3. Run the Insight node and view the results. An Insight data table opens with a random sample of 2,000 cases from the training data. This should be sufficient to make a plot of the predictions.

4. Select Analyze → Rotating Plot ( Z Y X ) from the SAS menu. The Rotating Plot ( Z Y X ) window opens.

5. Select the first variable in the variable list, P_TARGET_B1. This is the predicted value of the TARGET_B produced by the Two Input Network model.

6. Select the Y button.

7. Select LIFETIME_CARD_PROM and then select the Z button.

8. Select LIFETIME_PROM and then select the X button.

9. Select the Output button. The Rotating Plot ( Z Y X ) window opens.


10. Select At Minima under Axes: and select OK. The second Rotating Plot ( Z Y X ) window closes.

11. Select OK in the original Rotating Plot ( Z Y X ) window. A rotating plot showing the Two Input Network predictions versus LIFETIME_CARD_PROM and LIFETIME_PROM appears. You can rotate the plot by dragging the pointer in the corners of the plot.

The plot shows a complex association between the inputs and the predicted target values, somewhat reminiscent of a mountain pass.

It is informative to contrast the appearance of neural network predictions with those of a standard logistic regression model.

1. Close the Rotating Plot window and the Insight data table.

2. Connect a Regression node to the Control Point node.

3. Change the name of the node just added to Two Input Regression.

4. Open the Two Input Regression node.


5. Reject all inputs except LIFETIME_CARD_PROM and LIFETIME_PROM.

6. Set the Process or Score: Training, Validation, and Test option in the Output tab.

7. Close and save changes to the Two Input Regression node. Name the model VisiReg.

Set up the Insight node to view the Two Input Regression model.

1. Disconnect the Insight node from the Two Input Network node and connect it to the Two Input Regression node.

2. Run the Insight node and view the results. This automatically runs the Two Input Regression node as well.

3. Construct a Rotating Plot ( Z Y X ) as before, using the variables P_TARGET_B1, LIFETIME_CARD_PROM, and LIFETIME_PROM.

The modeled association is much simpler with the Two Input Regression model. Instead of a mountain pass, the association appears as a smooth rise.

Which model is "correct"? While this question is impossible to answer definitively, you can assess the performance of the two models on a set of validation data.

1. Connect an Assessment node to both Two Input models.

2. Copy and paste the SAS Code node used to adjust the assessment profit.

3. Connect the copied SAS Code node to the Assessment node.


4. Run the SAS Code node. Do not view the results.

5. Open the Assessment node.

6. Select the models and draw a lift chart.

For most depths, the Neural model has smaller %Response than the Regression model.

7. Select the Profit button. The adjusted profit calculations show that, in general, the neural model’s overall average profit is lower than the regression model’s.

8. Close the Lift Chart windows.


You can correctly argue that the comparison was unfair. The parameter estimates from the neural network model were intentionally adjusted to correspond to the final step of training, rather than the step where validation profit was maximized. This was done to illustrate consequences of overtraining a network model.

1. Open the Two Input Network Results window.

2. Select the Plot tab.

3. Change the plot to display Profit.

4. Click on the point of maximum validation profit.

5. Right-click in the plot area and select Set network at…, then Set network at selected iteration.

6. Select Tools Score from the SAS menu.

7. Select Yes to score with the current settings.

8. Select OK.

9. Close the Results window.

The parameter estimates for the neural network model now correspond to the iteration with maximum validation profit. How has this affected the appearance of the model?

1. Disconnect the Insight node from the Two Input Regression node and connect it once more to the Two Input Network node.


For clarity, the positions of the Insight node and the Assessment/SAS Code nodes have been reversed.

2. Run the Insight node and produce the predicted values plot as before.

The Two Input Neural predictions look very similar to the Two Input Regression predictions.

3. Close the Insight windows.

4. Run the SAS Code node attached to the Assessment node. Do not view the results.

5. Open the Assessment node and draw gains and profit charts.


The models produce nearly identical results.

While the Two Input Network model found a complex input/target association, most of this association was an artifact of the training data. By monitoring the model performance on a set of validation data, it was possible to stop the training once overgeneralization appeared. In this way, the Two Input Network model correctly mimicked the behavior of the simple Two Input Regression model.

Enterprise Miner automatically implements this technique, called stopped training, to restrict the flexibility of neural network models. But will this technique always work?

1. Open the Two Input Network node.

2. Change the number of hidden neurons to 15.

3. Close the Two Input Network node and save the changes.

4. Run the Two Input Network node and view the results.

5. Select the Plot tab and plot Profit versus Iteration.

Stopped training takes the values of the model parameters from the second iteration.

1. Run the Insight node and view the results.

2. Plot the predicted target values versus LIFETIME_CARD_PROM and LIFETIME_PROM.


The Two Input Network once more assumes a mountain-pass-like appearance. This time, however, the appearance occurs even after stopped training.

1. Close the Insight windows.

2. Run the SAS Code node connected to the Assessment node. Do not view the results.

3. Open the Assessment node and draw a Profit chart.

For a majority of depths, the Two Input Network model yields lower overall average profits than the Two Input Regression model. Unfortunately, this is true even after stopped training.

The Two Input Network model appears to be hopelessly overparameterized. It tries to use 61 parameters to model an association adequately modeled by 3. Will overparameterized neural network models always do worse than simpler models?

Surprisingly, the answer is not necessarily. The problem here is caused more by an overambitious optimization algorithm than by an overparameterized model.

The rapid changes in predicted target values characteristic of overgeneralized predictive models result from large weight values multiplying the inputs (think partial derivatives).
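To see why, consider a single tanh hidden unit. This short sketch (illustrative Python, not Enterprise Miner code) shows that the slope of the unit’s output grows in proportion to the weight on its input:

```python
import math

def hidden_unit_slope(weight, x=0.0):
    """Slope (partial derivative) of a tanh hidden unit h(x) = tanh(weight * x):
    d/dx tanh(weight * x) = weight * (1 - tanh(weight * x) ** 2)."""
    return weight * (1.0 - math.tanh(weight * x) ** 2)

# A small weight (for example 0.05) gives a gently varying surface;
# a large weight (for example 10) gives a near step function at the origin.
```

This is why the remedy described below combines small initial weights with small training steps.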

1. Open the Two Input Network results window and select the Plot tab.


2. Right-click in the plot area and select Weights from the pop-up menu.

3. Select the first eight weights connecting the hidden neurons to the target.

You can only select a maximum of eight weights to plot at a time.

The weight values start small and become large in a single iteration. Large weights are good for modeling fine-grained associations unique to the training data; they are not good for producing a model that captures broad trends and generalizes well.

You can remedy this problem by initializing the weights to small values and making small changes to their values as training progresses.

1. Close the Results window.

2. Open the Two Input Network node.

3. Select the General tab and activate the Advanced user interface.

4. Select the Advanced tab.

The first task is to initialize the weights to small values. You did this earlier when fitting the Big Neural Network model.

1. Double-click the hidden layer.

2. Select the Initial tab.

3. Change the Scale field to 0.05 for both Bias and Altitude.

4. Close the Node Properties window.

Now change the optimization procedure to Double Dogleg. The default procedure, Levenberg-Marquardt (for networks with fewer than 100 parameters), takes giant steps during training. The result is a network that is overgeneralized after the first iteration and even more so as training progresses. The Double Dogleg procedure takes much smaller steps, allowing the network more time to find parameter values that yield good generalization.

1. Select the Train subtab.

2. Uncheck the Default Settings control.


3. Select Double Dogleg as the Training Technique.

4. Close the Two Input Network node and save the changes.

5. Run the Two Input Network node and view the results.

6. View the Profit plot.

The optimal validation profit now occurs before the model reaches optimal profit on the training data. There is hope for the model to generalize well.

7. Plot the same weights as before.

The weights are bounded between –1 and +1. On the selected iteration, the weights are still quite small. This is consistent with a slowly varying model.

1. Close the Results window.

2. Produce the usual Rotating Plot in Insight.


The predicted values again rise slowly with respect to the inputs.

1. Close the Insight windows.

2. Run the SAS Code node to modify the Assessment data.

3. Open the Assessment node and draw a gains chart and a profit chart.

Again, the models are virtually indistinguishable.

By using small starting values for the weights and choosing a slow optimization process, good results are achievable even for overparameterized neural networks.


Chapter 3 Predictive Algorithms

3.1 Constructing Trees.........................................................................................................3-3

3.2 Constructing Trees.......................................................................................................3-23

3.3 Applying Decision Trees..............................................................................................3-28


3.1 Constructing Trees

Recursive partitioning models, commonly called decision trees after the form in which the results are presented, have become one of the most widely used predictive modeling tools. Tree models may not yield the largest generalization profit, but they are invaluable for improving the performance and aiding the understanding of other predictive models.

Unlike parametric models, decision trees do not assume a particular structure for the association between the inputs and the target. This allows them to detect complex input and target relationships missed by inflexible parametric models. It also allows them, if not carefully tuned, to overgeneralize from the training data and find complex input and target associations that do not really exist.

Trees are the primary example of a class of predictive modeling tools designated predictive algorithms. Predictive algorithms are a motley assembly of often ad hoc techniques with intractable statistical properties. Their use is justified by their empirical success. In addition to decision trees, other common examples of predictive algorithms are nearest neighbor methods, naïve Bayes models, support vector machines, over-specified neural networks, and non-parametric smoothing methods.

Tree Algorithm Parameters (default settings)

Maximum Branches: 2

Split Worth Criterion: Adjusted Chi-Sq. Logworth

Stopping Options: Logworth Threshold, Depth Adjustment, Max. Depth, Min. Leaf Size

Pruning Method: Average Profit, Best Leaf

Missing Value Method: branch maximizing split logworth

The behavior of the tree algorithm in Enterprise Miner is governed by many parameters that can be roughly divided into five groups:

the number of subpartitions to create at each partitioning opportunity

the metric used to compare different partitions

the rules used to stop the partitioning process

the method used to tune the tree model

the method used to treat missing values.


The defaults for these parameters generally yield good results for initial prediction. As discussed in later sections, varying some of the parameters may improve results for auxiliary uses of tree models.

Tree Algorithm: Calculate Logworth (logworth of each candidate split point of input x1, threshold 0.7)

Understanding the default algorithm in Enterprise Miner for building trees enables you to better use the Tree tool and interpret your results. The description presented here assumes a binary target, but the algorithm for interval targets is similar. The algorithm for categorical targets with more than two levels is more complicated and is not discussed.

The first part of the algorithm is called the split search. The split search starts by selecting an input for partitioning the available data. If the measurement scale of the selected input is interval, each unique value serves as a potential split point for the data. If the input is categorical, the average value of the target is taken within each level of the input. These averages serve the same role as unique interval input values in the discussion that follows.

For a selected input and fixed split point, two groups of cases are generated. Cases with input values less than the split point are said to branch left. Cases with input values greater than the split point are said to branch right. This, combined with the target levels, forms a 2x2 contingency table with columns specifying branch direction (left or right) and rows specifying target value (0 or 1). A Pearson chi-squared statistic is used to quantify the independence of counts in the table’s columns. Large values of the chi-squared statistic suggest that the proportion of 0’s and 1’s in the left branch differs from the proportion in the right branch. A large difference in target level proportions indicates a good split.

Because the Pearson chi-squared statistic may be applied to the case of multi-way splits and multi-level targets, the statistic is converted to a probability value or p-value. The p-value indicates the likelihood of obtaining the observed value of the statistic assuming identical target proportions in each branch direction. For large data sets, these p-values can be very close to 0. For this reason, the quality of a split is reported by logworth = -log10(chi-squared p-value).
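As an illustration (a stdlib-only Python sketch, not code from the course), the logworth of a single candidate split can be computed from the 2x2 table; the table counts below are hypothetical:

```python
import math

def logworth_2x2(n00, n01, n10, n11):
    """Logworth of a split: -log10 of the Pearson chi-squared p-value
    for the 2x2 table with rows = target value (0/1) and
    columns = branch direction (left/right)."""
    table = [[n00, n01], [n10, n11]]
    total = n00 + n01 + n10 + n11
    rows = [n00 + n01, n10 + n11]
    cols = [n00 + n10, n01 + n11]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-squared upper-tail probability
    # equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return -math.log10(p_value)

# Identical target proportions in both branches give a logworth near 0;
# very different proportions give a logworth well above the default
# threshold of about 0.7 (a p-value of 0.20).
```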


Logworth is calculated for every split point of an input. At least one logworth must exceed a threshold for a split to occur with that input. By default this threshold corresponds to a chi-squared p-value of 0.20 or a logworth of approximately 0.7.

Tree Algorithm: Filter Partitions

The Tree algorithm settings disallow certain partitions of the data. Settings, such as the minimum number of observations required for a split search and the minimum number of observations in a leaf, force a minimum number of cases in a split partition. This reduces the number of potential partitions for each input in the split search.

Tree Algorithm: Adjust Logworth (Kass adjustment)

When calculating the independence of columns in a contingency table, it is possible to obtain significant (large) values of the chi-squared statistic even when there are no differences in the target level proportions between split branches. As the number of possible split points increases, the likelihood of this occurring also increases. In this way, an input with a multitude of unique input values has a greater chance of accidentally having a large logworth than an input with only a few distinct input values.

Statisticians face a similar problem when combining the results from multiple statistical tests. As the number of tests increases, the chance of a false positive result likewise increases. To maintain overall confidence in the statistical findings, statisticians inflate the p-values of each test by a factor equal to the number of tests being conducted. If each inflated p-value shows a significant result, then the significance of the overall results is assured. This type of p-value adjustment is known as a Bonferroni correction.

Because each split point corresponds to a statistical test, Bonferroni corrections are automatically applied to the logworth calculations for an input. These corrections, called Kass adjustments after the inventor of the default Tree algorithm used in Enterprise Miner, penalize inputs with many split points. Multiplying p-values by a constant is equivalent to subtracting a constant from the logworth. The constant relates to the number of split points generated by the input. The adjustment allows a fairer comparison of inputs with many levels and inputs with few levels later in the split search algorithm.
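In logworth terms the Bonferroni correction is a simple subtraction; a sketch (illustrative Python, with hypothetical values):

```python
import math

def kass_adjusted_logworth(raw_logworth, n_split_points):
    """Multiplying a p-value by the number of candidate split points
    (Bonferroni) is the same as subtracting log10 of that number
    from the logworth (Kass adjustment)."""
    return raw_logworth - math.log10(n_split_points)

# An input with 1,000 candidate split points loses 3 logworth units;
# an input with only 10 candidate split points loses just 1.
```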

The adjustment also increases the chances of an input’s logworth not exceeding the threshold.

Tree Algorithm: Partition Missings (adjusted logworths with missing values placed in the left branch and in the right branch)

For inputs with missing values, two sets of adjusted logworths are actually generated. The two sets are calculated by including the missing values in the left branch and right branch, respectively.


Tree Algorithm: Find Best Split for Input

The best split for an input is the split that yields the highest logworth. Because the logworth calculations also account for missing input values, the tree algorithm optimally accounts for inputs with missing values.

Tree Algorithm: Repeat for Other Inputs

The partitioning process is repeated for every input in the training data. Inputs whose adjusted logworth fails to exceed the threshold are excluded from consideration.


Tree Algorithm: Compare Best Splits

After determining the best split for every input, the tree algorithm compares each best split’s corresponding logworth. The split with the highest adjusted logworth is deemed best.

Tree Algorithm: Partition with Best Split

The training data is partitioned using the best split. The expected values of the target and profit are calculated within each leaf. This determines the optimal decision for the leaf.
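The optimal decision in a leaf can be sketched as follows (illustrative Python; the decision names and profit values are hypothetical, not from the course data):

```python
def leaf_decision(p_target1, profit_matrix):
    """Pick the decision with the highest expected profit in a leaf.
    p_target1 is the leaf's proportion of target = 1 cases;
    profit_matrix[decision][target] is the profit for that outcome
    (all names and values here are hypothetical)."""
    expected = {
        decision: p_target1 * profits[1] + (1 - p_target1) * profits[0]
        for decision, profits in profit_matrix.items()
    }
    return max(expected, key=expected.get)

# A leaf with many responders favors soliciting; a leaf with very few
# responders favors ignoring.
```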


Tree Algorithm: Repeat within Partitions

Tree Algorithm: Calculate Logworths

The split search continues within each leaf. Logworths are calculated and adjusted as before.


Tree Algorithm: Adjust for Split Depth (threshold raised from 0.7 to 1.0)

Because the significance of secondary and subsequent splits depends on the significance of the previous splits, the algorithm once more faces a multiple comparison problem. To compensate for this, the algorithm increases the threshold by an amount related to the number of splits above the current split. For binary splits, the threshold is increased by log10(2)·d ≈ 0.3·d, where d is the depth of the split on the decision tree.

By increasing the threshold for each depth (or equivalently decreasing the logworths), the Tree algorithm makes it increasingly easy for an input’s splits to be excluded from consideration.
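The depth adjustment can be sketched as follows (illustrative Python; the 0.7 base threshold is the chi-squared default described earlier):

```python
import math

def depth_adjusted_threshold(base_threshold=0.7, depth=0):
    """Logworth threshold for a binary split at a given depth:
    the base threshold (chi-squared p-value of 0.20) plus log10(2),
    roughly 0.3, for each level above the current split."""
    return base_threshold + math.log10(2) * depth

# depth 0 -> 0.7; depth 1 -> about 1.0, as in the split-search example.
```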

Tree Algorithm: Find Best Split for Input


Tree Algorithm: Repeat for Other Inputs

Tree Algorithm: Compare Best Splits

The best split using each input is identified, and the splits are compared as before.


Tree Algorithm: Partition with Best Split

The data is partitioned according to the best split. The process repeats in each leaf until there are no more allowed splits whose adjusted logworth exceeds the depth-adjusted thresholds. This completes the split search portion of the tree algorithm.

Tree Algorithm: Construct Maximal Tree

The resulting partition of the input space is known as the maximal tree. Development of the maximal tree was based exclusively on statistical measures of split worth on the training data. It is likely that the maximal tree will fail to generalize well on an independent set of validation data.

The second part of the Tree algorithm, called pruning, attempts to improve generalization by removing unnecessary or poorly performing splits. Pruning generates a sequence of trees starting with the maximal tree and decreasing to the root tree (a tree with one leaf). Each successive step of the pruning sequence removes one more split from the maximal tree than the previous step.


Tree Algorithm: Prune Maximal Tree (Splits to Remove: 1; Option 1: ∆Profit = 0; Option 2: ∆Profit = 0)

Tree Algorithm: Prune Smallest Loss (Splits Removed: 1; ∆Profit = 0)


The first pruning step eliminates a single split from the maximal tree. The change in overall average profit caused by the removal of a given split is calculated. The split that least changes the overall average profit of the Tree model is removed.
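A sketch of this selection rule (illustrative Python; the split labels and profit changes are hypothetical):

```python
def split_to_prune(delta_profits):
    """delta_profits maps each candidate split to the change in overall
    average profit if that split is removed (hypothetical labels and
    values). The split whose removal changes profit least, in absolute
    terms, is the one pruned."""
    return min(delta_profits, key=lambda split: abs(delta_profits[split]))
```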

Tree Algorithm: Prune Maximal Tree (Splits to Remove: 2; Option 1: ∆Profit = 0; Option 2: ∆Profit = 0)

Tree Algorithm: Prune Smallest Loss (Splits Removed: 2; ∆Profit = 0)

The second pruning step eliminates two splits from the maximal tree. Because splits are removed from the maximal tree, it is possible that the tree obtained from the second pruning step will not be a subtree of the tree obtained in the first pruning step.

Once more, the splits removed are those that change the overall average profit of the Tree model by the smallest amount.

Tree Algorithm: Prune Maximal Tree (Splits to Remove: 3; Option 1: ∆Profit = 0; Option 2: ∆Profit = -0.0326)

Tree Algorithm: Prune Smallest Loss (Splits Removed: 3; ∆Profit = 0)

Tree Algorithm: Prune Smallest Loss (Splits Removed: 4; ∆Profit = -0.0326)


Tree Algorithm: Prune Smallest Loss (Splits Removed: 5; ∆Profit = -0.2464)

The process continues until only the root of the tree remains. Because there is only one way to generate a two-leaf and one-leaf tree from the maximal tree, no comparisons are necessary.

Tree Algorithm: Select Optimal Tree (training and validation profit versus number of leaves, 1 to 6)

The smallest tree with the highest validation profit is chosen from the trees generated in the pruning process as the final tree model.
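A sketch of the final selection (illustrative Python; the leaf counts and validation profits are hypothetical):

```python
def select_optimal_tree(pruning_sequence):
    """pruning_sequence is a list of (n_leaves, validation_profit)
    pairs, one per tree in the pruning sequence. The final model is
    the smallest tree achieving the highest validation profit."""
    best_profit = max(profit for _, profit in pruning_sequence)
    return min(leaves for leaves, profit in pruning_sequence
               if profit == best_profit)

# If the four-leaf tree merely ties the three-leaf tree on validation
# profit, the simpler three-leaf tree is selected.
```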


Tree Variations: Multiway Splits

Trades height for width

Complicates split search

Uses heuristic shortcuts

Enterprise Miner allows for a multitude of variations on the default Tree algorithm. The first involves the use of multiway splits instead of binary splits. This option is invoked by changing the Maximum number of branches from a node field in the Basic tab of the Tree window.

Theoretically, there is no clear advantage in doing this. Any multiway split can be obtained using a sequence of binary splits. The primary change is cosmetic. Trees with multiway splits tend to be wider than trees with only binary splits.

The inclusion of multiway splits complicates the split search algorithm. A simple linear search becomes a search whose complexity increases geometrically in the number of splits allowed from a leaf. To combat this complexity explosion, the Tree tool in Enterprise Miner may resort to heuristic search strategies. These strategies are invoked if the number of possible splits exceeds the number specified in the Advanced tab’s Maximum tries in an exhaustive split search field (5,000 by default).


Tree Variations: Split Worth Criteria

Yields similar splits

Grows enormous trees

Favors inputs with many levels

In addition to changing the number of splits, you can also change how the splits are evaluated in the split search phase of the Tree algorithm. For categorical targets, Enterprise Miner offers three separate split worth criteria. Changing from the default Chi-squared criterion typically yields similar splits if the number of distinct levels in each input is similar. If not, the other split worth criteria tend to favor inputs with more levels due to the multiple comparison problem discussed above. You can also cause the chi-square method to favor inputs with more levels by turning off the Kass adjustments.

Because Gini reduction and Entropy reduction criteria lack the significance threshold feature of the Chi-square criterion, they tend to grow enormous trees. Pruning and selecting a tree complexity based on validation profit limits this problem to some extent.
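For reference, the Gini and entropy impurity measures for a binary target, and the impurity reduction used as split worth, can be sketched as follows (illustrative Python, not Enterprise Miner code):

```python
import math

def gini(p):
    """Gini impurity of a node whose proportion of target = 1 cases is p."""
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def entropy(p):
    """Entropy (in bits) of a node whose proportion of target = 1 cases is p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def split_worth(impurity, n_left, p_left, n_right, p_right):
    """Impurity-reduction split worth: parent impurity minus the
    size-weighted average impurity of the two branches. Note there is
    no significance threshold, unlike the chi-squared criterion."""
    n = n_left + n_right
    p_parent = (n_left * p_left + n_right * p_right) / n
    return (impurity(p_parent)
            - (n_left / n) * impurity(p_left)
            - (n_right / n) * impurity(p_right))
```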


Tree Variations: Stopping Rules

Avoids orphan nodes

Controls sensitivity

Grows large trees

The family of adjustments you will modify most often when building trees is the set of rules that limit the growth of the tree. Changing the minimum number of observations required for a split search and the minimum number of observations in a leaf prevents the creation of leaves with only one or a handful of cases. Changing the significance level and the maximum depth allows for larger trees that may be more sensitive to complex input and target associations. The growth of the tree is still limited by the depth adjustment made to the threshold. If you want really big trees and insist on using the chi-square split worth criterion, deselect the Depth option in the Advanced tab.


Tree Variations: Pruning and Missing Values

Controls sensitivity

Helps input selection?

The final set of Tree algorithm options pertains to the pruning and missing value methods.

By changing the Model assessment measure in the Advanced tab from Average profit to Total leaf impurity (Gini index), you can construct what is known as a class probability tree. Theoretically, class probability trees are pruned to minimize the mean squared prediction error. It can be shown that this, in turn, minimizes the imprecision of the tree. Analysts sometimes use this model assessment measure to select inputs for a flexible predictive model such as neural networks.

You can deactivate pruning entirely by setting the Sub-tree field to The most leaves. You may want to do this if you use the tree for variable selection, as in Section 3.2, or if you want to combine tree models into an ensemble.

The default method for missing values is to place them in the leaf maximizing the logworth of the selected split. Another way to handle missing values is the construction of surrogate splitting rules. At each split in the tree, inputs are evaluated on their ability to mimic the selected or primary split. The surrogate splits are rated on their agreement with the primary split. Agreement is the fraction of cases ending up in the same branch using the surrogate split as with the primary split. If a surrogate split has high agreement with the primary split, it may be used in place of the primary split when the input involved in the primary split is missing. You can specify how many surrogates you want the tree algorithm to keep.
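Agreement is straightforward to compute; a sketch (illustrative Python, with hypothetical branch assignments):

```python
def agreement(primary_branches, surrogate_branches):
    """Fraction of cases that the surrogate rule sends to the same
    branch ('L' or 'R') as the primary rule."""
    matches = sum(p == s for p, s in zip(primary_branches, surrogate_branches))
    return matches / len(primary_branches)
```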

Surrogate rules may also aid in the task of input selection.


3.2 Constructing Trees

Thus far, using decision trees has been a completely automated process. After defining the model parameters, the tree models simply grew by themselves.

In this demonstration, you closely control the construction of a tree model through interactive training. Interactive training allows you to manually specify the splits in a tree model. This enables you to customize the tree model to incorporate business rules into the model, prune unwanted nodes, and investigate the effect of alternative split inputs.


Interactively Constructing a Tree

Interactive tree construction can be used to define some or all of a tree model. In this demonstration, you manually define the first split and allow the automated tree algorithm to build the rest of the tree.

1. Run the original Tree node and view the results.

2. View the Tree diagram.

3. Select the root node and view the competing splits.

The initial split is on the input RECENT_RESPONSE_COUNT with a logworth of 26.915. A close contender is FREQUENCY_STATUS_97NK with a logworth of 25.358.

The frequency input was featured strongly in the Regression model. It also has long been used in direct marketing circles as a good predictor of response. Suppose you would like to use it in your predictive model as the first split instead of the response count input. You can do so using interactive training.

1. Close the Competing Splits, Tree Diagram, and Results-Tree windows.

2. Connect a Tree node to the Data Partition node and label the node Interactive Tree.

3. Right-click the Interactive Tree node and select Interactive… from the pop-up menu. The Interactive Training window opens.


4. Select View Tree from the SAS menu. The Tree Diagram window opens, showing the currently defined tree (a root node).

5. Right-click the root node and select Create Rule. The Create Rule window opens, listing the maximum Kass adjusted logworths for each input in the training data.

6. Select the FREQUENCY_STATUS_97NK row.

7. Select the Modify Rule button. The window shows the definition of the optimal split on the frequency status input.


8. Select OK to close the Splitting Rule window.

9. Select OK in the Create Rule window to accept the frequency status split, and close the Create Rule window. The Tree Diagram window is updated to show the selected split.

Cases with a 97NK frequency status greater than 2 are nearly twice as likely to respond as cases with a 97NK frequency status less than or equal to 2.

Use this single split as the starting point for an algorithmically generated tree.

1. Close the Tree Diagram window and the Interactive Training window. You are confronted with a question that is difficult to answer.

2. The correct answer is Yes. This takes the split you just defined and uses it as the starting point for a tree model.

3. Run the Interactive Tree node and view the results. The selected tree has a validation overall average profit slightly higher than that of the automatically generated tree.

4. View the tree diagram to confirm the initial split is with the FREQUENCY_STATUS_97NK input.


The root node split determines much of a tree’s behavior. If there are several competing root node splits with similar logworths, it is useful to examine the effect of selecting alternatives. By doing so, you may discover a more compact description of the association under investigation or a description that more closely matches domain intuition.


3.3 Applying Decision Trees

Tree Applications

Missing value imputation

Categorical input consolidation

Variable selection

Surrogate models


Tree models play important supporting roles in predictive modeling. They provide an excellent way to impute the missing value of an input conditioned on non-missing inputs.

They can be used to

• group the levels of a categorical input based on the value of the target

• select inputs for other flexible modeling methods like neural networks

• explain the predictions of other modeling techniques.


Imputing Missing Values with Trees

In the models built to this point, missing values have been replaced by the overall mean of the input. This approach fails to take advantage of correlations existing within the inputs themselves, which may provide a better estimate of a missing input’s actual value. In this demonstration, you see how to use a tree model to predict the value of an input conditioned on other inputs in the data set.

Why not use another modeling technique to do imputation? The answer lies in another question: what do you do if, in order to predict the value of one input, you need the value of another input that also happens to be missing? For tree-based imputation schemes, this is not an issue. The tree models simply rely on their built-in missing value methods and produce an estimate for the original missing value.
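A one-split caricature of the idea (illustrative Python; the split point and branch means are hypothetical):

```python
def tree_impute(value, other_input, split_point, left_mean, right_mean, root_mean):
    """Impute a missing value with a one-split 'stump' conditioned on
    another input: use the mean of the branch the case falls into.
    If the conditioning input is itself missing, the stump's built-in
    missing value handling applies (here, fall back to the root mean).
    All split points and means are hypothetical."""
    if value is not None:
        return value
    if other_input is None:
        return root_mean
    return left_mean if other_input < split_point else right_mean
```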

If you have a data set with 50 inputs, the proposition of building 50 separate tree models, one to predict the missing value of each input, seems like a daunting task. Fortunately, Enterprise Miner makes this process easy.

1. Open the saved diagram, PVA Analysis Chapter 1.

Modify the diagram for use in this chapter. As in Chapter 2, certain elements of the analysis from Chapter 1 are unnecessary here.

2. Delete the indicated nodes.

Tree-based imputation is handled through the Replacement node.

1. Open the Replacement node and select the Imputation Methods subtab.

Page 200: Predictive Modelling Using E-Miner

3-30 Chapter 3 Predictive Algorithms

The Imputation Methods subtab controls the imputation method used for all inputs. To override this default and choose a different method for an individual input, select the Interval Variables and Class Variables tabs.

By default, missing values for interval inputs are replaced by the corresponding input means, and missing values for categorical inputs are replaced by the corresponding input modes (most frequent value).

2. Change the Interval Variables Method to tree imputation.

3. Change the Class Variables Method to tree imputation.

When run, each input in the training data takes a turn as the target variable for a tree model. When an input has a missing value, Enterprise Miner uses the corresponding tree to replace the unknown value of the input.

By default, the Replace node uses a random sample of 2,000 cases to build the imputation tree models. This is done to speed the imputation modeling process. If you have sufficient computer resources or a relatively small number of inputs, you can override these defaults and use the entire training data set to build the imputation models. Select the Data tab and the Training subtab, then select Imputation Based on: Entire data set.

1. Close the Replacement window and save the changes.

2. Run the Replacement node and view the results.

Page 201: Predictive Modelling Using E-Miner

3.3 Applying Decision Trees 3-31

Inspection of the Results window’s Table View tab shows DONOR_AGE imputed with at least three distinct values.

3. Select the Code tab.

The recipe for tree imputation of each input is included as part of the scoring code for the entire model.

Does improved prediction of missing values improve prediction of the target?

1. Close the Results window.

2. Run the Regression node and view the results.

The regression model using tree-imputed data has a slightly lower overall average profit for the validation data. In general, standard regression models are insensitive to changes in the imputation method. This may not be true for more flexible modeling methods.

Page 202: Predictive Modelling Using E-Miner

3-32 Chapter 3 Predictive Algorithms

Consolidating Categorical Inputs

Categorical inputs pose a major problem for parametric predictive models such as regressions and neural networks. Because each categorical level must be coded by an indicator variable, a single input can account for more model parameters than all other inputs combined.

Decision trees, on the other hand, thrive on categorical inputs. They can easily group the distinct levels of the categorical variable together and produce good predictions.

This demonstration shows how to use a tree model to group categorical input levels and create useful inputs for regression and neural network models.

1. Connect a Tree node to the Replacement node as shown. Label the node Consolidation Tree.

2. Open the Consolidation Tree node and change the status of all inputs to don’t use.

The categorical input CLUSTER_CODE has more than 50 distinct levels. With so many distinct levels, its usefulness as an input in a regression or neural network model is limited. Use a tree model to group these levels based on their association with TARGET_B and create a new model input. This input can be used in place of CLUSTER_CODE in a regression or other model.
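The grouping idea itself can be sketched outside the Tree node. The snippet below sorts the levels of a categorical input by their event rate and searches for the binary grouping that best separates a binary target; the level names, counts, and SSE criterion are illustrative, not Enterprise Miner's exact algorithm:

```python
# Hedged sketch of level consolidation: sort levels by target event
# rate, then choose the cut in that ordering that minimizes total
# within-group squared error of the 0/1 target.
from collections import defaultdict

def consolidate(pairs):
    """pairs: (level, target) with target in {0, 1}; returns two groups."""
    counts = defaultdict(lambda: [0, 0])        # level -> [events, total]
    for level, y in pairs:
        counts[level][0] += y
        counts[level][1] += 1
    # Sort levels by event rate; for a binary target an optimal binary
    # grouping never interleaves levels in this ordering.
    levels = sorted(counts, key=lambda l: counts[l][0] / counts[l][1])
    best = None                                  # (sse, cut index)
    for cut in range(1, len(levels)):
        sse = 0.0
        for g in (levels[:cut], levels[cut:]):
            e = sum(counts[l][0] for l in g)
            n = sum(counts[l][1] for l in g)
            p = e / n
            sse += e * (1 - p) ** 2 + (n - e) * p ** 2
        if best is None or sse < best[0]:
            best = (sse, cut)
    cut = best[1]
    return set(levels[:cut]), set(levels[cut:])

pairs = ([("C01", 0)] * 9 + [("C01", 1)] * 1 +
         [("C02", 0)] * 8 + [("C02", 1)] * 2 +
         [("C03", 0)] * 3 + [("C03", 1)] * 7 +
         [("C04", 0)] * 2 + [("C04", 1)] * 8)
print(consolidate(pairs))
```

Sorting by event rate first is what makes the search cheap: only L - 1 cuts need to be examined instead of all 2^(L-1) - 1 two-way partitions.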

3. Set the status of CLUSTER_CODE to use.

4. Run the Consolidation Tree node and view the results.

Page 203: Predictive Modelling Using E-Miner

3.3 Applying Decision Trees 3-33

Disappointingly, the tree algorithm found no significant splits. The primary reason for this is the Kass adjustment to logworth discussed in the previous section. The adjustment penalizes the logworth of potential CLUSTER_CODE splits by an amount equal to the log of the number of partitions of the CLUSTER_CODE levels into two groups, or log10(2^(L-1) - 1). With 54 distinct levels, the penalty is quite large.

It is also quite unnecessary. The penalty avoids favoring inputs with many possible splits. Here you are building a tree with only one input. It is impossible to favor this input over others because there are no other inputs.
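The size of the penalty quoted above is easy to verify directly:

```python
# Quick check of the Kass penalty. An L-level categorical input can be
# partitioned into two groups in 2**(L - 1) - 1 ways, and the adjustment
# subtracts the base-10 log of that count from the split's logworth.
import math

def kass_penalty(levels):
    return math.log10(2 ** (levels - 1) - 1)

print(kass_penalty(54))   # CLUSTER_CODE: roughly 16 logworth units
print(kass_penalty(2))    # a binary input pays no penalty: 0.0
```

A penalty of almost 16 logworth units is equivalent to multiplying the split's p-value by nearly 10^16, which explains why no CLUSTER_CODE split survived.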

1. Open the Consolidation Tree node and select the Advanced tab.

2. Deselect the Kass p-value adjustment.

3. Run the Consolidation Tree model and view the results.

Page 204: Predictive Modelling Using E-Miner

3-34 Chapter 3 Predictive Algorithms

The selected partition groups the 54 CLUSTER_CODES into two groups.

To use the grouped values of CLUSTER_CODE in a subsequent model, you must add the predicted values to the training and validation data.

1. Close the Results window and once more open the Consolidation Tree node.

2. Select the Score tab and then select Process or Score: Training, Validation, and Test.

3. Select the Variables subtab.

4. Deselect all checkboxes except Leaf identification variable.

5. Close the Tree Model window and save the changes.

6. Run the Consolidation Tree node. You need not view the results.

The Tree node adds a variable called _NODE_ to the training data. To use this variable in a subsequent analysis, you must change its Model Role to input. This is done using a Data Set Attributes tool.

1. Add a Data Set Attributes node to the diagram as shown.

Page 205: Predictive Modelling Using E-Miner

3.3 Applying Decision Trees 3-35

2. Open the Data Set Attributes node. The Data Set Attributes window opens.

The Data tab lists three data sets exported from the Consolidation Tree node. The first is the Outtree data set generated by the SAS procedures underlying the Tree node. The second and third are the training and validation data sets.

3. Select the training data set (second from the top) and select the Variables tab.

The Variables tab displays the current metadata settings for the training data. You can change these settings by right-clicking in one of the white columns.

4. Scroll the variables list to show the variable called _NODE_.

Page 206: Predictive Modelling Using E-Miner

3-36 Chapter 3 Predictive Algorithms

The Consolidation Tree model assigns each case to a leaf or node. The _NODE_ variable identifies this leaf. You can use this variable as a consolidation of the original CLUSTER_CODE input.

By default, Enterprise Miner assigns a Model Role of group to the _NODE_ variable. You must change its role to input.

5. Right-click on the Model Role column for _NODE_ and select Set new model role → input.

6. Similarly, change the model role of CLUSTER_CODE to rejected.

7. Close the Data Set Attributes window.

Now see whether the newly created input is useful enough to be selected in the regression model.

1. Connect a Regression node to the Data Set Attributes node. Label the node Consolidation Regression.

2. Open the Consolidation Regression node and verify the input _NODE_ has been added to the variables list.

3. Select the Selection Method tab and select the stepwise method.

4. Close the Linear and Logistic Regression window and save the changes. Name the model Consolidate.

5. Run the Regression node and view the results.

6. Note that the overall average profit on the validation data is higher than that of the standard regression model.

Page 207: Predictive Modelling Using E-Miner

3.3 Applying Decision Trees 3-37

7. Select the Output tab and scroll to the bottom of the report.

Not only is _NODE_ selected as an input, but cases in the left branch of the Consolidation Tree (node 2) are 21% less likely to respond than cases in the right branch (node 3).

Page 208: Predictive Modelling Using E-Miner

3-38 Chapter 3 Predictive Algorithms

Selecting Inputs with Tree Models

Trees can be used to select inputs for flexible predictive models. They have an advantage over using a standard regression model for the same task when the inputs’ relationship to the target is nonlinear or nonadditive. While this is probably not the case in this demonstration, the selected inputs still provide the support required to build a reasonably good neural network model.

1. Connect a Tree node to the Data Set Attributes node. Name the node Selection Tree.

While you can use the Tree node with default settings to select inputs, this tends to select too few inputs for a subsequent model. Two changes to the Tree defaults will result in more inputs being selected. Generally, when using trees to select inputs for neural network models, it is better to err on the side of too many inputs rather than too few. The changes to the defaults act independently. You can experiment to discover which method generalizes best with your data.
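As a simplified stand-in for what the Selection Tree does, the sketch below ranks candidate inputs by the variance reduction of their best single split on the target. Enterprise Miner's importance statistic aggregates such reductions over the whole tree (and its surrogates); the data and input names here are synthetic:

```python
# Hedged miniature of tree-based input selection: score each candidate
# input by the variance reduction of its best binary split on the
# target, then keep the top scorers for a downstream model.

def best_split_gain(x, y):
    """Variance reduction of the best binary split of y along x."""
    def sse(v):
        m = sum(v) / len(v)
        return sum((a - m) ** 2 for a in v)
    pairs = sorted(zip(x, y))
    gain = 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no split between equal x values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        gain = max(gain, sse(y) - sse(left) - sse(right))
    return gain

y = [0, 0, 0, 1, 1, 1]                    # binary target
inputs = {
    "recent_count": [1, 2, 1, 5, 6, 7],   # cleanly separates the target
    "noise":        [3, 9, 4, 3, 8, 5],   # unrelated to the target
}
ranked = sorted(inputs, key=lambda k: best_split_gain(inputs[k], y),
                reverse=True)
print(ranked)
```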

2. Open the Selection Tree node.

3. Select the Basic tab.

4. Change the number of the surrogate rules saved in a node to 1.

This change allows inclusion of surrogate splits in the variable selection process. By construction, surrogate inputs are correlated with the selected split input. While it is usually bad practice to include redundant inputs in predictive models, neural networks can tolerate some degree of input redundancy. The advantage of including surrogates in the variable selection is that inputs that do not appear explicitly in the tree, but are still important predictors of the target, can be included.

5. Select the Advanced tab.

6. Change the Sub Tree to The most leaves.

Page 209: Predictive Modelling Using E-Miner

3.3 Applying Decision Trees 3-39

With this setting, the tree algorithm does not attempt to prune the tree. Like adding surrogate splits to the variable selection process, this tends to add (possibly irrelevant) inputs to the selection list. By limiting their flexibility, neural networks can cope with some degree of irrelevancy in the input space.

7. Close the Tree Model window and save the changes.

8. Run the Selection Tree node and view the results.

Select the Score tab and the Variable Selection subtab. The Results window shows the selected inputs, their importance statistic, and the number of times the input appears in the tree. Inputs that appear in 0 Rules are those that only serve as surrogates.

The list includes many more inputs than were originally selected by the Regression node. How well do these predictors work in a model?

1. Connect a Neural Network node to the Selection Tree node.

Page 210: Predictive Modelling Using E-Miner

3-40 Chapter 3 Predictive Algorithms

2. Open the Neural Network node.

3. Create a multi-layer perceptron model with four hidden neurons.

4. Close the Neural Network node and save the changes.

5. Run the Neural Network node and view the results.

The validation overall average profit is the highest observed thus far. By trying different architectures, it is possible to further increase the validation profit.

Page 211: Predictive Modelling Using E-Miner

3.3 Applying Decision Trees 3-41

Decision Segment Definition

The usual criticism of neural networks and similar flexible models is the difficulty in understanding the predictions. This criticism stems from the complex parameterizations found in the model. While it is true that little insight may be gained by analyzing the actual parameters of the model, much may be gained by analyzing the resulting predictions and decisions.

In this demonstration, a decision tree is used to isolate cases with sufficiently high predicted probabilities (as calculated by a neural network model) to warrant solicitation. In this way, the characteristics of likely donors can be understood even if the model estimating the likelihood of donation is inscrutable.
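The idea can be sketched with any opaque scorer. Below, a made-up scoring function stands in for the fitted neural network, and a search over one-input rules finds the most faithful simple description of its solicit/ignore decisions; all names, coefficients, and thresholds are invented:

```python
# Hedged sketch of a surrogate description: collect the solicit (1) /
# ignore (0) decisions of an opaque scorer and search for the single
# most faithful one-input rule that reproduces them.
import math

def opaque_decision(recent, gifts):
    # Stand-in black box: a nonlinear score, decision = score > 0.5.
    score = 1 / (1 + math.exp(-(0.9 * recent + 0.1 * gifts - 2.95)))
    return int(score > 0.5)

cases = [(r, g) for r in range(8) for g in range(8)]
decisions = [opaque_decision(r, g) for r, g in cases]

best = None   # (fraction of decisions reproduced, readable rule)
for feat, name in ((0, "recent"), (1, "gifts")):
    for t in range(8):
        pred = [int(c[feat] > t) for c in cases]
        agree = sum(p == d for p, d in zip(pred, decisions)) / len(cases)
        if best is None or agree > best[0]:
            best = (agree, f"solicit if {name} > {t}")
print(best)
```

The winning rule reproduces about 95% of the black box's decisions; the Description Tree in this section plays the same role, with a full tree instead of a single rule.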

1. Open the Neural Network node and select the Output tab.

2. Select Process or Score: Training, Validation, and Test.

3. Close and save changes to the node.

4. Run the Neural Network node. You need not view the results.

The predictions and decisions made by the Neural Network model have been added to the training and validation data sets. You now build a decision tree to describe cases for which the decision is to solicit. While not intended as a substitute for the original model, the description supplies insight into characteristics associated with likely responders.

1. Connect a Data Set Attributes node to the Neural Network node.

2. Open the Data Set Attributes node and select the training data set.

3. Select the Variables tab.

4. Set the Model Role of TARGET_B to rejected.

5. Set the Model Role of D_TARGET_B_ to target.

Page 212: Predictive Modelling Using E-Miner

3-42 Chapter 3 Predictive Algorithms

6. Close and save changes to the node.

Now build the description tree.

1. Connect a Tree node to the Data Set Attributes node as shown. Rename the Tree node Description Tree.

2. Run the Tree node and view the results.

The default model performance measure, accuracy, is presented in the assessment table. Because the target variable’s value is deterministic (it comes from the equation that defines the neural network), the accuracy on training and validation will be virtually identical.

Of the 9,685 decisions made on the training data, the Tree model correctly reproduces 8,471, an accuracy of about 87%.

1. View the tree diagram. The initial split occurs on RECENT_RESPONSE_COUNT. Around 82% of individuals who have recently responded to three or more solicitations will receive a solicitation using the Neural Network model. Additional splits refine these rules.

You can use the Define Colors option to emphasize the most frequent decision in a node.

2. Select Tools → Define colors from the menu bar.

Page 213: Predictive Modelling Using E-Miner

3.3 Applying Decision Trees 3-43

3. Select Proportion of a target value.

4. Select 1 from the Select a target value list.

5. Select OK.

Nodes that have a majority of solicit decisions are colored green. Nodes that have a majority of ignore decisions are colored red. Nodes with a mixture of decisions are colored yellow.

The selected Tree model has 32 leaves, which is somewhat large for interpretation purposes.

Change the number of leaves to 19. The accuracy has been reduced by less than 1%, but the tree is simpler.

• The largest segment solicited corresponds to the largest green wedge. Overall, the segment contains approximately 20% of the cases in the target population, and more than 98.5% of these individuals are solicited. Cases in this segment have more than two recent responses, a 97NK frequency status greater than 2, and PEP star donor status.

• The largest segment not solicited corresponds to the largest red wedge. Overall, the segment contains approximately 28% of the cases in the target population, and more than 90% of these individuals are ignored. Cases in this segment have fewer than two recent responses, median home value less than 1,628, and no PEP star donor status.

These two rules account for nearly half of the decisions made by the neural network. Additional rules account for more of the target population.

You should experiment with accuracy versus tree size trade-offs (and other tree options) to achieve a description of the Neural Network model that is both understandable and accurate.

Page 214: Predictive Modelling Using E-Miner

3-44 Chapter 3 Predictive Algorithms

Page 215: Predictive Modelling Using E-Miner

Appendix A Exercises

A.1 Introduction to Predictive Modeling ............................................................................ A-3

A.2 Flexible Parametric Models .......................................................................................... A-9

A.3 Predictive Algorithms.................................................................................................. A-13

Page 216: Predictive Modelling Using E-Miner

A-2 Appendix A Exercises

Page 217: Predictive Modelling Using E-Miner

A.1 Introduction to Predictive Modeling A-3

A.1 Introduction to Predictive Modeling

A bank seeks to increase sales of a variable annuity product. To do this, the bank will send product offers to existing banking customers. To maximize profits, however, the bank wants to be selective about whom it targets. This selectivity will be achieved by constructing a predictive model.

To achieve this analytic objective, an analysis data set was assembled. The data set contains 10,619 records and 48 variables, assembled from several source tables within the bank’s data warehouse. The source tables include the customer master table, the transaction detail table, the product detail table, and a third-party demographic overlay table. The variables describe each customer’s demographics and usage of other banking products prior to acquisition of the variable annuity. Two of the variables are nominally scaled; the remainder are binary or interval. A summary of the interval and binary variables in the analysis data set is provided by the MEANS procedure.

The MEANS Procedure

Variable  Label                         N    N Miss          Mean    Minimum      Maximum
-----------------------------------------------------------------------------------------
ACCTAGE   Age of Oldest Account      9941       678     6.0103511       0.30        56.30
DDA       Checking Account          10619         0     0.8148602       0            1
DDABAL    Checking Balance          10619         0       2182.28    -399.53    259734.26
DEP       Checking Deposits         10619         0     2.1306149       0           28
DEPAMT    Amount Deposited          10619         0       2226.04       0       484893.67
CASHBK    Number Cash Back          10619         0     0.0154440       0            2
CHECKS    Number of Checks          10619         0     4.2642433       0           49
DIRDEP    Direct Deposit            10619         0     0.2925888       0            1
NSF       Number Insufficient Fund  10619         0     0.0840945       0            1
NSFAMT    Amount NSF                10619         0     2.2192005       0          321.10
PHONE     Number Telephone Banking   9286      1333     0.3877881       0           15
TELLER    Teller Visits             10619         0     1.3919390       0           27
SAV       Saving Account            10619         0     0.4699124       0            1
SAVBAL    Saving Balance            10619         0       3215.08       0       609587.72
ATM       ATM                       10619         0     0.6022224       0            1
ATMAMT    ATM Withdrawal Amount     10619         0       1205.71       0       127403.36
POS       Number Point of Sale       9286      1333     1.0474908       0           43
POSAMT    Amount Point of Sale       9286      1333    48.5870687       0         2933.83
CD        Certificate of Deposit    10619         0     0.1230813       0            1
CDBAL     CD Balance                10619         0       2441.60       0       613600.00
IRA       Retirement Account        10619         0     0.0574442       0            1
IRABAL    IRA Balance               10619         0   639.0896930       0       415656.63
LOC       Line of Credit            10619         0     0.0637536       0            1
LOCBAL    Line of Credit Balance    10619         0       1213.49    -613.00    367098.20
INV       Investment                 9286      1333     0.0318759       0            1
INVBAL    Investment Balance         9286      1333       1013.93       0      1002678.08
ILS       Installment Loan          10619         0     0.0512289       0            1
ILSBAL    Loan Balance              10619         0   538.7629523       0        29162.79
MM        Money Market              10619         0     0.1208212       0            1
MMBAL     Money Market Balance      10619         0       1996.89       0       107028.55
MMCRED    Money Market Credits      10619         0     0.0564083       0            5
MTG       Mortgage                  10619         0     0.0489688       0            1
MTGBAL    Mortgage Balance          10619         0       7514.95       0      1628532.38
CC        Credit Card                9286      1333     0.4802929       0            1
CCBAL     Credit Card Balance        9286      1333       9254.36   -1903.99   1576808.43
CCPURC    Credit Card Purchases      9286      1333     0.1515184       0            4
SDB       Safety Deposit Box        10619         0     0.1128166       0            1
INCOME    Income                     8683      1936    40.6260509       0          233
HMOWN     Owns Home                  8774      1845     0.5410303       0            1
LORES     Length of Residence        8683      1936     6.9982725       1           19.50
HMVAL     Home Value                 8683      1936   110.9008407      69          625
AGE       Age                        8478      2141    47.7059448      16           94
CRSCORE   Credit Score              10373       246   665.9655837     509          807
MOVED     Recent Address Change     10619         0     0.0267445       0            1
INAREA    Local Address             10619         0     0.9623317       0            1

Page 218: Predictive Modelling Using E-Miner

A-4 Appendix A Exercises

About half of the variables have some missing values. Many of the variables, especially those relating to monetary amounts, have an extremely large range and highly skewed distribution.

A summary of the nominal variables and the target variable (INS) is provided by the FREQ procedure.

Insurance Product

                           Cumulative   Cumulative
INS    Frequency  Percent   Frequency      Percent
--------------------------------------------------
0          6959     65.53        6959        65.53
1          3660     34.47       10619       100.00

Branch of Bank

                              Cumulative   Cumulative
BRANCH    Frequency  Percent   Frequency      Percent
-----------------------------------------------------
B1             922      8.68         922         8.68
B10             98      0.92        1020         9.61
B11             74      0.70        1094        10.30
B12            178      1.68        1272        11.98
B13            184      1.73        1456        13.71
B14            336      3.16        1792        16.88
B15            708      6.67        2500        23.54
B16            494      4.65        2994        28.19
B17            259      2.44        3253        30.63
B18            196      1.85        3449        32.48
B19             93      0.88        3542        33.36
B2            1744     16.42        5286        49.78
B3             920      8.66        6206        58.44
B4            1876     17.67        8082        76.11
B5             932      8.78        9014        84.89
B6             480      4.52        9494        89.41
B7             476      4.48        9970        93.89
B8             461      4.34       10431        98.23
B9             188      1.77       10619       100.00

Area Classification

                           Cumulative   Cumulative
RES    Frequency  Percent   Frequency      Percent
--------------------------------------------------
R          2672     25.16        2672        25.16
S          3753     35.34        6425        60.50
U          4194     39.50       10619       100.00

The BRANCH variable, a nominal input with 19 distinct levels, indicates the branch in which the customer’s initial account was opened. The RES variable, a nominal input with three distinct levels, classifies the customer’s primary residence as rural, suburban, or urban.

Page 219: Predictive Modelling Using E-Miner

A.1 Introduction to Predictive Modeling A-5

The target variable for this analysis, INS, indicates acquisition of the variable annuity over a fixed period of time. While the overall acquisition rate is about 2%, the acquisition rate in the raw analysis data is more than 34%. This reflects the separate sampling used to generate the raw data.
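Because of this separate sampling, probabilities estimated from the raw data must be corrected back to the population prior before profits are assessed. The sketch below applies the standard prior-correction formula with the rates quoted above (34.47% in the sample, about 2% in the population):

```python
# Hedged sketch of the prior adjustment implied by separate sampling:
# rescale an event probability estimated on the oversampled data to
# the true population rate. The formula is the standard correction;
# the rates are the ones quoted in the text.

def adjust_to_prior(p_sample, rho1, pi1):
    """Rescale an event probability from sample prior rho1 to population
    prior pi1."""
    odds1 = p_sample * pi1 / rho1
    odds0 = (1 - p_sample) * (1 - pi1) / (1 - rho1)
    return odds1 / (odds1 + odds0)

# A case scored at 50% in the sample is really only about 3.7% likely
# to respond in the population.
print(adjust_to_prior(0.5, rho1=0.3447, pi1=0.02))
```

Specifying a prior vector in the Input Data Source node (as in exercise 3e below) is what lets Enterprise Miner make this adjustment for you.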

The bank expects to realize an average short-term revenue of about $100 from each customer who purchases the annuity product. It is expected to cost the bank about $4 per solicitation (an initial mail solicitation with a telephone follow-up) to carry out the campaign.
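These two figures fix the economics of the campaign, which can be checked directly:

```python
# Worked check of the profit structure: about $100 expected revenue per
# responder and about $4 cost per solicitation, so soliciting pays off
# only when the response probability exceeds 4/100.
revenue, cost = 100.0, 4.0
breakeven = cost / revenue
print(breakeven)                       # 0.04

for p in (0.02, 0.05):                 # overall rate vs. a targeted segment
    print(p, p * revenue - cost)       # expected profit per solicitation
```

The breakeven response probability of 4% is above the 2% overall rate, so soliciting everyone loses money; the model must find segments whose response probability clears 4%.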

Page 220: Predictive Modelling Using E-Miner

A-6 Appendix A Exercises

Exercises

The following exercises relate to material in Chapter 1. The relevant section in the course notes is indicated for each question.

1. Establish the profit structure for the analytic objective defined above.

a. Which profit matrix conforms to the modeling objectives discussed above? [Section 1.5]

             Decision                Decision
              1     0                 1     0
Actual  1   100     0    Actual  1    1     0
        0     4     0            0    0     1
           (a)                      (b)

             Decision                Decision
              1     0                 1     0
Actual  1   996     0    Actual  1   96     0
        0    -4     0            0   -4     0
           (c)                      (d)

b. Given the overall response rate of 2%, would any of the profit matrices above be nonconforming? [Section 1.5]

2. Construct a simple predictive modeling diagram in Enterprise Miner.

a. Create a new project called My Exercises. [Section 1.2]

b. Rename the default diagram Insurance. [Section 1.2]

c. Construct the following process flow diagram.

3. Configure the Input Data Source node.

a. Select the Insurance data set. [Section 1.2]

b. Change the Measurement of all ordinal variables to interval. [Section 1.3]

c. Change the Model Role of INS to target. [Section 1.3]

d. Create a Profit Matrix, as selected above, for the INS target variable. [Section 1.5]

e. Create a Prior Vector for the INS target variable. [Section 1.4]

f. Close and save changes to the node.

Page 221: Predictive Modelling Using E-Miner

A.1 Introduction to Predictive Modeling A-7

4. Configure the Data Partition node.

a. Partition half the data for Training and half for Validation. [Section 1.3]

b. Stratify on the target variable, INS. [Section 1.3]

c. Close and save changes to the node.

5. Run the Tree node and view the results.

a. How many leaves does the selected tree have? [Section 1.3]

b. What is the validation profit for the selected tree? [Section 1.3]

6. View and interpret the tree diagram.

a. How many leaves are visible on the tree diagram? Does this match the number indicated in the Results window? [Section 1.3]

b. Change the viewing depth to display the entire tree. How many leaves are visible in the tree diagram now? [Section 1.3]

c. What are the rules defining the most profitable segment? How large is this segment?

d. What are the rules defining the least profitable segment? How large is this segment?

e. How many cases in the validation data have a decision equal to 1? What percentage of the total validation cases does this represent?

f. Close the Results window and save the changes.

7. Build a logistic regression model.

a. Add a Replacement and Regression node to the diagram as shown.

b. Why is the Replacement node required to build a logistic regression model? [Section 1.6]

c. Configure the Replacement node to add missing indicators. [Section 1.6]

d. Run the Regression node and view the results.

e. What is the overall average validation profit for the regression model? Is this higher or lower than the overall average validation profit for the tree model? [Section 1.6]

8. Modify the logistic regression model to perform stepwise variable selection. Re-run the node (this may take a few minutes) and view the results. [Section 1.7]

Page 222: Predictive Modelling Using E-Miner

A-8 Appendix A Exercises

a. How does the overall average validation profit compare to the original logistic regression model that included every input?

b. What inputs were selected in the final model?

9. Compare the tree and regression models using an Assessment node.

a. Add an Assessment node to the diagram as shown.

b. What is the response proportion in the second decile as determined by both models? [Section 1.8]

c. What fraction of all individuals who actually opened a variable annuity account are found in the first two deciles of the regression model? [Section 1.8]

d. Based on profit considerations, what fraction of customers should be solicited using the regression model? [Section 1.8]

10. Score a new data set using the scoring recipe of the Regression model.

a. Add Input Data Source, Score and Code nodes as shown below.

b. In the newly added Input Data Source node, select the BANK_CUSTOMERS data set. Set the role of the data set to Score. [Section 1.9]

c. Configure the Score node to apply the score code to the score data set. [Section 1.9]

d. Open the SAS Code node and create a SAS view of the scored data. Be sure to modify variable names to correspond to the INSURANCE data set. [Section 1.9]

Page 223: Predictive Modelling Using E-Miner

A.2 Flexible Parametric Models A-9

A.2 Flexible Parametric Models

The following exercises relate to material in Chapter 2. The relevant section in the course notes is indicated for each question.

Page 224: Predictive Modelling Using E-Miner

A-10 Appendix A Exercises

Exercises

Open the My Exercises project and the Insurance diagram completed for Chapter 1. The following exercises continue the Chapter 1 analysis.

A copy of the Exercise Project is stored in the EM Project directory. If you choose, you may open this project and use the completed analysis for the Chapter 1 exercises as a basis for the subsequent work in this chapter. Be advised, however, that the Insurance diagram in this project must be re-run to restore the diagram’s intermediate data.

Delete the nodes associated with scoring new data as shown below:

11. Connect a Regression node to the existing Regression node from the Chapter 1 exercises. Name the new Regression node Flexible Regression.

12. Open the Flexible Regression node and verify the rejected status of those inputs not selected in the Regression node.

Page 225: Predictive Modelling Using E-Miner

A.2 Flexible Parametric Models A-11

If the Linear and Logistic Regression window does not look like the above, close the Flexible Regression node and run the Regression node built for the exercises of Chapter 1.

13. Domain knowledge suggests that certain non-linearities and interactions affect the relationships between the inputs and the target.

a. Use the interaction builder to add a quadratic modeling term DDABAL*DDABAL to the list of terms in the model. [Section 2.1]

b. Customers who have a certain investment product (INV=1) are more likely to obtain the variable annuity account. However, this dependence depends on factors relating to the sophistication of the banking customer. In addition to the non-linear term added above, add the following interactions to the list of terms in the model. [Section 2.1]

INV*MM INV*MMBAL INV*TELLER INV*ATMAMT INV*ACCTAGE

c. Close the interaction builder and run the logistic regression model. When complete, view the modeling results.

14. Has the overall average validation profit changed as compared to the standard regression model fit in the Chapter 1 exercises?

15. Modify the Flexible Regression node to use all two-way interactions and second order polynomial factors of inputs selected by the standard logistic regression model. [Section 2.1] IMPORTANT: To have the regression procedure finish in a reasonable amount of time, set the status of all interactions involving the BRANCH input to don’t use.

a. How does the overall average profit of the Flexible Regression model compare to that of previous models?

b. Can you make any conclusions about the importance of interactions and non-linearities in this modeling exercise?

16. Add a Neural Network node to the diagram as shown.

Page 226: Predictive Modelling Using E-Miner

A-12 Appendix A Exercises

a. Configure a multilayer perceptron network with 5 hidden units. [Section 2.2]

b. Run the Neural Network node and view the results.

c. How does the overall average profit of the Neural Network compare to that of previous models?

17. Add a second neural network node to the diagram as shown. Name the new neural network node Stopped Training Network.

a. Using the Advanced user interface, configure a multi-layer perceptron network with 40 hidden units, initial weights scaled by 0.05, and training via the Double-Dogleg technique. [Section 2.3]

b. Run the Stopped Training Network node. Stop the training when validation error appears to increase.

c. View the results. How does the overall average profit of the Stopped Training Network compare to that of previous models?

Page 227: Predictive Modelling Using E-Miner

A.3 Predictive Algorithms A-13

A.3 Predictive Algorithms

The following exercises relate to material in Chapter 3. The relevant section in the course notes is indicated for each question.

Page 228: Predictive Modelling Using E-Miner

A-14 Appendix A Exercises

Exercises

Open the My Exercises project and the Insurance diagram completed for Chapters 1 and 2.

A copy of the Exercise Project is stored in the EM Project directory. If you choose, you can open this project and use the completed analysis for the Chapter 1 exercises as a basis for the subsequent work in this chapter. Be advised, however, that the Insurance diagram in this project must be re-run to restore the diagram’s intermediate data.

Explore the effects of changing the tree algorithm’s parameters. Add an additional Tree node to the diagram as shown. Label the node Tree Options.

18. What are the consequences of the following modifications to the Tree algorithm?

a. Change the Maximum number of branches from a node (Basic tab) to 5. What is the effect on profit and tree topology?

b. Change the Splitting Criterion to Gini reduction (Basic tab) while leaving the Maximum number of branches from a node at 5. In this case, the validation profit for this tree will be higher than the default settings. Is the topology simpler or more complex than the initial tree?

c. Change the Splitting Criterion back to Chi-square test (Basic tab). Select the Advanced tab and uncheck the Kass and Depth adjustments. Run the Tree node and compare to the previous exercises.

19. Use a decision tree to reduce the degrees of freedom associated with the input BRANCH. Add a decision tree to the Insurance diagram as shown. Label the node Consolidation Tree.

Page 229: Predictive Modelling Using E-Miner

A.3 Predictive Algorithms A-15

a. Configure the Consolidation Tree to consolidate BRANCH. [Section 3.3]

b. Run the node and view the results. Using the settings of Section 3.3, all branch levels are consolidated into one node. Why?

c. While viewing the Tree-Results window, select View → Assessment → Total leaf impurity (Gini index) from the menu bar. Using the total leaf impurity is a reasonable alternative to using Profit for grouping categorical levels.

d. How does this affect the results of the consolidation?

20. Add a Data Set Attributes node and a Regression node to the diagram. Label the Regression node Consolidation Regression.

a. Configure the Data Set Attributes node and the Consolidation Regression node to use the Consolidation Tree’s grouping of the BRANCH input in lieu of the original BRANCH input. [Section 3.3]

b. Run the Consolidation Regression node. How does the overall average profit compare to the original Regression node?

Page 230: Predictive Modelling Using E-Miner

A-16 Appendix A Exercises

21. Use a Tree node to select inputs for a Neural Network. Connect a Tree node and a Neural Network node to the diagram as shown. Label the Tree node Selection Tree and the Neural Network node TS Neural Network.

a. Configure the Selection Tree node for input selection. [Section 3.3]

b. Use the selected inputs in a Neural Network with 6 hidden units. How does the profit of the TS Neural Network compare to other models?