
ENTERPRISE BUSINESS INTELLIGENCE (BUS5EBI)

SAS ENTERPRISE MINER ASSIGNMENT

DATA SET: CREDIT

DATE: 26/05/2014

BY: MONIKA THEILIG - 17656167


Table of Contents:

1) Introduction: Business Objective
2) Assumptions
3) Data Dictionary
4) Project Bank Ozz – Analysis
5) Unusual Data Values
6) Cluster and Association Analysis
7) Predictive Modelling
8) Regression
9) Neural Network
10) Model Comparison
11) Business Implication
12) Bibliography


Introduction

Task 1: Business Objective

Bank Ozz is trying to distinguish between the quality of its credit customers. Until now it has been offering the same interest rates to every credit customer. It would like to offer customers with a good credit history and a good credit rating incentives to take out more loans, or more incentives to bank with Bank Ozz. The bank has realised that a good credit score makes its customers attractive to any lender, and this potential threat of customers going elsewhere for better rates has led Bank Ozz to commission this research. The bank wants to see which customers have opened trade lines in the last 24 months, and which of these customers have had no bankruptcy indicators and no public derogatories. In addition, the bank wants to see which of these customers have the highest Total High Credit across all trades, in other words the highest loan amounts.

The objective is to offer these customers a better interest rate together with additional credit facilities.

Assumptions

We have assumed that the data set CREDIT applies to a bank, that Trade Lines and Credit All Trades are defined as loans in our analysis, that the data set identifies the bank's customer lending information, and that ID refers to a specific bank customer.

Task 2 – Data Analysis and Definition


DATA DICTIONARY

Name          | Full Label                                   | Model Role | Measurement Level | Description
ID            | Customer ID                                  | Input      | Nominal           | Identification number of the customer
BanruptcyInd  | Bankruptcy Indicator                         | Target     | Binary            | Indicates whether the customer has been bankrupt or not
TLCnt03       | Number of Trade Lines opened in 3 months     | Input      | Nominal           | Number of credit lines the customer has opened in the last 3 months
TLCnt12       | Number of Trade Lines opened in 12 months    | Input      | Nominal           | Number of credit lines the customer has opened in the last 12 months
TLCnt24       | Number of Trade Lines opened in 24 months    | Input      | Nominal           | Number of credit lines the customer has opened in the last 24 months
TLBadDerogCnt | Number of bad debts plus public derogatories | Input      | Nominal           | Number of bad debts and public derogatories/judgements the customer has against their name
TLMaxSum      | Total High Credit all Trade Lines            | Target     | Nominal           | Total amount of debt the customer has with Bank Ozz

Project Bank Ozz Analysis


We, the bank's business analysts, created a new project called Bank Ozz Gr18 in the SAS Enterprise Miner Workstation. We created a new library and a new data source called "Credit".

We analysed the data set and found 30 variables and 3000 line items. For the purposes of our business objective we decided to reject 24 of the variables and use the 6 variables listed in the Data Dictionary above. These variables are sufficient to provide the information necessary to answer all the questions presented by our client.

The measurement level of the input variables is defined as Nominal, which means the variable is numeric but the order is not important. We needed to change three of the variable levels from Interval to Nominal (as shown by the arrows). The Bankruptcy Indicator remains Binary and will play a decisive role in our analysis.
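As a rough illustration of what the change of measurement level means, the sketch below uses plain Python/pandas rather than Enterprise Miner, with an invented toy series: a numeric count is treated as an unordered category instead of a quantity.

```python
# Rough pandas analogue of changing a measurement level from Interval to Nominal:
# the count becomes a label whose order carries no meaning, rather than a quantity.
import pandas as pd

tl_cnt_24 = pd.Series([0, 1, 3, 1, 2], name="TLCnt24")    # toy values, not the real data
tl_cnt_24_nominal = tl_cnt_24.astype("category")          # now treated as unordered levels
print(tl_cnt_24_nominal.dtype)                            # category
```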

2.1 Unusual Data Values

1. Are there any unusual data values in any of your assigned input variables? Support your answer with an appropriate argument.


We found some unusual values in several variables: these appeared as dots rather than numbers (a dot is how SAS represents a missing numeric value). A dot cannot be used in the data analysis because it is not measurable or quantifiable, so we need to deal with these unusual "dot" values by either replacing them or filtering them out.

2. List two possible strategies to handle cases with unusual values before attaching your desired analysis node. Explain the possible scenarios in which those strategies are appropriate.

Two possible strategies to handle unusual values are the Replacement and Filter tools in SAS Enterprise Miner.

Replacement tool - incorrect values can be replaced by more appropriate values, so there is no loss of data. This is appropriate when the affected records must be kept in the analysis.

Filter tool - unwanted records can be excluded from the analysis, so there is a loss of data. This is appropriate when the affected records are not relevant to the business objective.

3. Are there missing values in any of the input variables?

We ran the StatExplore node on the data set and found that data was missing in the TLMaxSum variable. We decided to use the Replacement tool for these values, so there is no loss of data and we can still include all of these customers in our analysis. After changing the missing values to 1 (a negligible value) we re-ran the Replacement node and found that 55 values were replaced.

We then ran the Filter node for the customers that had missing and unusual values in the TLMaxSum column. We decided to filter them out because we are only interested in customers that currently have loans with our bank. We are still left with 2730 observations out of a possible 3000.
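As a rough illustration of the two strategies outside Enterprise Miner, the pandas sketch below builds a tiny stand-in for the CREDIT extract (the values and column subset are invented) and applies both replacement and filtering.

```python
# Minimal pandas sketch of the two strategies; the toy frame stands in for the CREDIT data.
import pandas as pd

credit = pd.DataFrame({"ID": [1, 2, 3, 4], "TLMaxSum": ["12000", ".", "8500", "."]})
# The unusual "dot" entries become NaN once the column is coerced to numbers.
credit["TLMaxSum"] = pd.to_numeric(credit["TLMaxSum"], errors="coerce")

# Strategy 1: replacement -- every record is kept, a value is substituted for the missing ones.
replaced = credit.assign(TLMaxSum=credit["TLMaxSum"].fillna(1))   # 1 as a negligible placeholder

# Strategy 2: filtering -- the affected records are excluded, accepting the loss of data.
filtered = credit.dropna(subset=["TLMaxSum"])
print(len(credit), len(filtered))   # total observations vs. observations kept after filtering
```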

4. If you assigned a variable a rejected role, why?

We assigned 24 variables a Rejected role because they are not needed in the data set for the purpose of our analysis or to meet the stated objective.

Task 3: Cluster and Association Analysis


5. What would happen if you do not standardise your inputs?

If inputs are not standardized, input variables with large ranges will play a dominant role in clustering.


Variables with large variances tend to have more effect on the resulting clusters than variables with small variances. If all variables are measured in the same units (for example, dollars), then standardization might not be necessary. Otherwise, some form of standardization is recommended.

Usually standardization of the data is needed; the statistical distance (Mahalanobis distance) is preferred. Standardization of the data is needed if the range or scale of one variable is much larger or different from the range of others. This distance also compensates for inter-correlation among the variables. Often one sums across the within-groups sum-of-products matrices to obtain a pooled covariance matrix for use in statistical distance.
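To make the effect of standardization concrete, the following sketch (scikit-learn k-means on invented data, not the Cluster node) clusters the same two inputs with and without scaling; without it, the large-range dollar variable dominates the Euclidean distances.

```python
# Illustrative sketch: unstandardized ranges dominate k-means distances.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(50_000, 20_000, 300),        # e.g. a dollar amount with a huge range
    rng.integers(0, 5, 300).astype(float),  # e.g. a small count variable
])

raw_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
std_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X)
)
# Without scaling the clusters are driven almost entirely by the first column;
# after scaling both inputs contribute to the distance calculation.
```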

6. Using the results of the Segment Profile node, interpret the characteristics of the first three biggest clusters.

Unsupervised classification is also known as clustering. This form of segmentation attempts to group data based on similarities in the input variables, so that the data can be represented by a small number of clusters. One of the most common methods used in clustering is the k-means algorithm, which is the primary clustering tool in SAS Enterprise Miner. We selected Standardization for the Internal Standardization property of the Cluster node, which standardizes all of the input variables.

The Results - Cluster window contains four embedded windows. The Segment Plot window attempts to show the distribution of each input variable by cluster. The Mean Statistics window lists various descriptive statistics by cluster. The Segment Size window shows a pie chart describing the size of each cluster formed. The Output window shows the output of the various SAS procedures run by the Cluster node.

We had to select most of our variables in order to get at least three meaningful clusters. Choosing meaningful inputs makes cluster interpretation easier. Low correlation between input variables produces more stable clusters. Class input variables have a tendency to dominate cluster formation. Low kurtosis and skewness reduce the chance of creating small outlier clusters.

The data used in this analysis is a random sample of 3000 bank customers. The response (or target) variable BanruptcyInd indicates whether a customer became bankrupt during the performance period of 24 months. Of the three biggest clusters, cluster 3 showed a 47% bankruptcy rate, the second cluster a 27% bankruptcy rate and the third cluster a 22% bankruptcy rate. Variables with little relationship to the target were excluded from the analysis.

The cluster node found 4 clusters in the CREDIT data set.
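A segment profile of this kind can be reproduced outside Enterprise Miner along the lines of the sketch below, which assumes a data frame holding the BanruptcyInd target and the cluster labels (the toy values here are invented).

```python
# Sketch of profiling clusters by size and bankruptcy rate with pandas.
import pandas as pd

credit = pd.DataFrame({
    "BanruptcyInd": [0, 1, 0, 1, 0, 0, 1, 0],   # toy 0/1 target values
    "cluster":      [3, 3, 2, 2, 1, 1, 3, 4],   # toy cluster labels
})
profile = (
    credit.groupby("cluster")["BanruptcyInd"]
    .agg(size="count", bankruptcy_rate="mean")
    .sort_values("bankruptcy_rate", ascending=False)
)
print(profile)   # segments ordered from highest to lowest share of bankrupt customers
```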


7. Why was cluster analysis chosen?

Cluster analysis was chosen because we are trying to find the cluster of bank customers with the highest bankruptcy rates within the last 24 months. The variables are all customer-level values; there are no transactions in the data.

Transaction data would be necessary for association analysis, because association analysis looks for items or products that occur together in transactions. For our purposes we are trying to find clusters of customers with high bankruptcy rates, to exclude them from our better interest rate offering, while identifying customers with no bankruptcy indicators to offer them better interest rates and possibly more credit.

This is why we chose cluster analysis, along with the obvious fact that we cannot use association analysis because we do not have transaction data in our set. Regression analysis and decision trees will still be our best tools for our binary target indicator of whether a customer has had a bankruptcy or not.


Association Analysis

Our data set does not contain any transactions, so it is not suitable for market basket or association analysis. In fact, the data set would not allow us to run an association analysis; a screenshot of the error message is shown below.


Task 4: Predictive Modelling

Decision Trees


The optimal tree generated by the Decision Tree node is shown in the bottom left-hand corner. The tree generated is optimised for decisions about which customers have bankruptcy indicators and which do not.

To estimate the probability of a customer defaulting on payments and being declared bankrupt, we would have to generate a tree that is optimised for probability estimates. We would do this by using Average Squared Error as the assessment measure.
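A scikit-learn sketch of this idea is shown below, with synthetic data standing in for the CREDIT partition rather than the Decision Tree node itself: candidate tree sizes are compared by the average squared error of their predicted probabilities on validation data.

```python
# Hedged sketch: choose the tree size by average squared error (Brier score)
# of the predicted probabilities on validation data. Synthetic data only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=6, flip_y=0.1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=0)

best_ase, best_tree = np.inf, None
for leaves in range(2, 21):                                   # candidate tree sizes
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=0)
    tree.fit(X_train, y_train)
    ase = brier_score_loss(y_valid, tree.predict_proba(X_valid)[:, 1])
    if ase < best_ase:
        best_ase, best_tree = ase, tree
print(best_tree.get_n_leaves(), round(best_ase, 4))
```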


4.1 Questions

1. Why was the Target Variable assigned that Variable Role?

BanruptcyInd was assigned the Target role because it is the binary indicator that tells us whether a customer has gone into bankruptcy or not. This yes/no outcome is exactly what our objective asks about.

2. How many leaves are there in the optimal tree created in step (iv)? Which variable was used for the first split and explain why this variable was chosen over others?

There are 8 leaves in the optimal tree, and the Average Squared Error is 0.1098. The first split was made on the input that best separates customers on the BanruptcyInd target, since this indicator is the basis on which all other decisions are made.


3. How many leaves are there in the optimal tree created in step (ix)?

There are 12 leaves in the optimal tree. The Average Squared Error on the training data is 0.0953 and on the validation data it is 0.1127.


4. Which of the decision tree models appear to be better?

a) Based on Average Squared Error on training data?
b) Based on Average Squared Error on validation data?

a) On the training data, the maximal 12-leaf tree generates a lower misclassification rate than the earlier trees, and its training Average Squared Error of 0.0953 is noticeably lower than the 0.1098 of the first decision tree, so based on training data the maximal tree is preferred for assigning predictions to cases. b) On the validation data, however, its Average Squared Error rises to 0.1127, the highest of the trees, so based on validation data it is the least optimal.
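The contrast between (a) and (b) is the usual overfitting pattern. A minimal scikit-learn sketch of it, on synthetic data rather than the CREDIT partition, is shown below: growing the tree keeps lowering the training Average Squared Error while the validation error stops improving or rises.

```python
# Minimal sketch of the train-vs-validation contrast: a larger tree lowers the
# training ASE but can raise the validation ASE (overfitting). Synthetic data only.
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=6, flip_y=0.15, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=1)

for leaves in (8, 12, 200):
    tree = DecisionTreeClassifier(max_leaf_nodes=leaves, random_state=1).fit(X_tr, y_tr)
    ase_train = brier_score_loss(y_tr, tree.predict_proba(X_tr)[:, 1])
    ase_valid = brier_score_loss(y_va, tree.predict_proba(X_va)[:, 1])
    print(f"{leaves:3d} leaves  train ASE={ase_train:.4f}  validation ASE={ase_valid:.4f}")
```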

5. Regression

After running the StatExplore node, the results show that several values are missing in the selected variables (as per the table above). The Impute node replaces any missing values with the most frequent category for categorical inputs; these are acceptable default values.


6) In preparation for regression, is any missing value imputation needed? If yes, should you do this imputation before generating the decision tree models?

By imputing, you create a synthetic value for the missing values. If an interval input contains a missing value, it is replaced with the mean of the non-missing values for that input. This eliminates the incomplete-case problem but modifies the input's distribution. Imputation is needed before the regression, because regression ignores cases with missing inputs. Decision trees, by contrast, can use missing values directly in their splitting rules, so the imputation does not strictly have to be done before generating the decision tree models; imputing beforehand would alter the trees themselves.
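A minimal sketch of the imputation step is shown below, assuming scikit-learn's SimpleImputer as a stand-in for the Impute node; the column names and values are invented for illustration.

```python
# Minimal imputation sketch: mean for an interval input, most frequent level for a
# categorical input (stand-in for the Impute node; toy values only).
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

credit = pd.DataFrame({
    "TLMaxSum":  [12000.0, np.nan, 8500.0, 23000.0],   # interval input with a missing value
    "HomeOwner": ["Y", "N", np.nan, "Y"],               # hypothetical categorical input
})

credit["TLMaxSum"] = SimpleImputer(strategy="mean").fit_transform(credit[["TLMaxSum"]]).ravel()
credit["HomeOwner"] = SimpleImputer(strategy="most_frequent").fit_transform(credit[["HomeOwner"]]).ravel()
print(credit)
```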

7) Which variables are included in the final regression model generated? List the variables in descending order of importance to the model.


The initial lines of the Output window summarize the roles of the variables used (or not used) by the Regression node. We had to use all the variables in the regression analysis to get the best-fitting model. When we initially eliminated most of the variables, the Data Partition node could not achieve a good fit between the training and validation data and the regression run produced an error. After trying several more variables we obtained a good regression analysis, while still using the bankruptcy indicator as the target. The fitted model has 27 inputs that predict a binary target. The variables in descending order of importance are:

1)
2)
3)

8) Which variables are included in the final regression model generated in the last step?

The Stepwise Selection Summary below shows the step at which each input was added and the statistical significance of each input in the model. The default selection criterion selects the optimal model (please see the stepwise diagram below). A rough code sketch of this kind of forward selection follows the variable list.

The variables in descending order of importance are:

1)
2)
3)
4)
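A rough way to reproduce a forward stepwise-style selection outside Enterprise Miner is sketched below, using scikit-learn's SequentialFeatureSelector around a logistic regression; the synthetic data stands in for the CREDIT inputs and the number of features to keep is an arbitrary assumption.

```python
# Sketch of forward selection for a binary target; a stand-in for the Regression
# node's stepwise selection, not a reproduction of it.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=3000, n_features=27, n_informative=5, random_state=0)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=5, direction="forward"
)
selector.fit(X, y)
print(selector.get_support(indices=True))   # indices of the inputs retained by the search
```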


9) Based on Average Squared Error on the validation data, which of the two regression models generated appears to be better?

Based on Average Squared Error on the validation data alone, the two regression models were identical: both had an Average Squared Error of 0.118474.

6. Neural Networks


10. How many weights does the neural network model generated in step (xvii) include?

The model contains 280 weights, which makes it a large model.
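For intuition about where a weight count like this comes from, the sketch below fits a small scikit-learn network on synthetic data and counts its weights and bias terms; the layer sizes are assumptions, not the Neural Network node's actual architecture.

```python
# Sketch: count the parameters of a small multilayer perceptron.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=3000, n_features=27, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(3,), max_iter=500, random_state=0).fit(X, y)

n_weights = sum(w.size for w in net.coefs_) + sum(b.size for b in net.intercepts_)
print(n_weights)   # connection weights plus bias terms, analogous to the node's count
```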

11) Examine the validation Average Squared Error of the neural network model. How does it compare to the two decision tree models and the regression model generated after applying log transformations?

The Average Squared Error on the validation data using the neural network model is 0.076643. It is lower than for the regression models, which had an Average Squared Error of 0.118474. It is also lower than the decision tree validation Average Squared Error of 0.1127.

12) Examine the results of the Model Comparison node. Of the predictive models compared, which model has been selected by the Model Comparison node? Based on what selection criteria has this model been selected?

We ran the Model Comparison node. The model selected as the best fit is the Neural Network model, based on the misclassification rate on the validation data. The output screen below shows the fit statistics as evidence.


13) Change the default values of the Model Comparison node properties so that it selects the model having the least average squared error on the validation data. Run the Model Comparison node again. Which model has been selected now?

After changing the selection statistic to the smallest Average Squared Error and re-running the Model Comparison node, we found that the Regression model outperformed the other models.


14) Why are the models compared?

We want to see how a Decision tree model compares to Regression analysis and in turn how a Neural Network model compares to Regression Analysis. We also want to see how these compare to analysing clusters in our data set.

The models are compared to find the best-fitting model for answering the questions posed by our objective. Given a selection statistic, the Model Comparison node runs through all the candidate models to find the one that fits best. Many selection statistics could be used, depending on the objective criterion, including Average Squared Error, misclassification rate, Mean Squared Error, ROC index, cumulative captured response and more.
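A hedged sketch of this comparison step is shown below: several fitted models are scored on the same validation data and a champion is chosen by either statistic. It uses scikit-learn stand-ins for the Enterprise Miner nodes and synthetic data, so the numbers it prints are not the assignment's results.

```python
# Compare candidate models on validation data by average squared error and
# misclassification rate (scikit-learn stand-ins; synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, zero_one_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_leaf_nodes=12, random_state=0),
    "regression": LogisticRegression(max_iter=1000),
    "neural network": MLPClassifier(hidden_layer_sizes=(3,), max_iter=500, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    ase = brier_score_loss(y_va, model.predict_proba(X_va)[:, 1])   # average squared error
    mis = zero_one_loss(y_va, model.predict(X_va))                  # misclassification rate
    print(f"{name:14s}  ASE={ase:.4f}  misclassification={mis:.4f}")
```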


BUSINESS IMPLICATION

1) From the outcome of your analysis of the data set and the business case you have come up with, what can you deduce, recommend and conclude?

We have run several models to analyse our data set: cluster analysis, decision trees, regression analysis and a neural network. We looked at several variables, in different combinations and model roles, to see which combination and which model would produce optimal results. These models needed to give us the best results, especially when comparing validation data and training data in equal splits. They also needed to reduce skewness and redundancy in our data through standardisation and through imputing, filtering or replacing incorrect or missing values. We have maintained BanruptcyInd as our binary target variable throughout this analysis.

We needed to analyse the variables based on the clients that had taken out loans over the various periods in the last 2 years. We had to pair these with the clients that had the highest loans (trade lines) with our bank and then see which of these did not have any bankruptcy indicators against their names.

For our purposes we found that the decision tree answered our questions the fastest while providing quick and easy visual analysis. It did not have the lowest Average Squared Error, but the error was low enough to be satisfactory, and the speed and ease of interpretation made it the practical choice: it allowed us to give the bank a quick and inexpensive answer to its questions.

The decision tree also showed an almost perfect split and use of the Training and Validation data. This shows that our data set has integrity.


Our results provided the bank with the following information (cross-checked in the short sketch after the list):

Customers with bankruptcy indicators: 15.3436 % (validation data)
Customers without bankruptcy indicators: 84.6564 % (validation data)
Customers with the highest credit (loan) amounts in dollars who have taken out loans in the last 2 years without a bankruptcy indicator: 84.6564 % (validation data), 1269 customers out of our data set (of a total of 3000)
Customers with the highest credit (loan) amounts in dollars who have taken out loans in the last 2 years with a bankruptcy indicator: 15.3436 % (validation data), 230 customers out of our data set (of a total of 3000)
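The reported percentages can be cross-checked from the customer counts themselves; the short sketch below does the arithmetic, using the counts quoted above.

```python
# Cross-check of the reported validation-data split using the counts quoted above.
with_indicator, without_indicator = 230, 1269
total = with_indicator + without_indicator
print(round(100 * with_indicator / total, 4))     # ≈ 15.3436 % with a bankruptcy indicator
print(round(100 * without_indicator / total, 4))  # ≈ 84.6564 % without a bankruptcy indicator
```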

The individual names of these customers can be obtained from the customer ID by drilling down to the bank's data source systems. This information will allow the bank to offer these clients a better interest rate and grow its business by offering them more loans, which will increase revenue and therefore profits.


2) What are the business implications that can be drawn from the process of building and comparing these models, and has this practice helped resolve the business issue? Why or why not?

"Time is precious and so are your models" - SAS philosophy. Organizations spend a significant amount of time, often too much, operationalizing models. The more time you can spend on analytics, and the less time on deployment headaches, the better chance you have to address core business challenges (Jonathan Wexler, Wayne Thompson, and Kristen Aponte, SAS Institute Inc.).

After you have developed one or more suitable models, you can create a model package that can be shared with fellow decision makers, business managers and IT staff. SAS Model Manager enables you to validate the performance of models and then promote them to a production environment. You can compare the models, and thus compare the input, output and target variable attributes for each of the variables used to score the champion model that best answers the objective at hand. From a business point of view, the champion model provides a solid answer to the objective at hand in the quickest and most cost-effective way.

This assignment has shown us that several models could be used for our analysis. It also provided an optimal solution based on a very low Average Squared Error or misclassification rate. More variables had to be used to get optimal results in some of the models. The process of analysing the models has definitely helped us understand the complexities of the various algorithms that the models use to make optimal choices.

We feel that comparing the models has definitely helped us solve the business issue at hand. We were able to select the most optimal model quickly and with relative certainty that our decision would fit the bank's objective and could be implemented in a live environment. Bank Ozz was quickly informed of the percentages and numbers of clients from its database that had the largest loan amounts taken out over the last 24 months, and could immediately see which of these clients had a bankruptcy indicator against them and which did not. This will allow the bank to offer these clients a better interest rate and grow its business by offering them more loans, which will increase revenue and thereby profit for the bank.


BIBLIOGRAPHY:

1) SAS Enterprise Miner Manual and Help Guides
2) Website: http://support.sas.com/resources/papers/proceedings13/086-2013.pdf
3) Enterprise Business Intelligence SAS Enterprise Miner 13.1 Workbooks
4) Data Mining Using SAS Enterprise Miner – A Case Study
5) Getting Started with SAS Enterprise Miner 13.1
