Simplifyd analytics process-document_v1

SIMPLIFYD ANALYTICS Predictive Analytics: Process details Sep 27, 2011

Description

Detailed insight into the analytical steps required for generating reliable insights from analysis: Univariate, Bivariate and Multivariate analysis, OLS & Logistic models, etc. Put together to train friends and mentees. Based on personal learning/research, contains no proprietary information, and makes no claim of 100% accuracy. Every institution/organization/team uses its own steps/methodologies, so please use whichever is relevant for you and treat this as training material only.

Transcript of Simplifyd analytics process-document_v1

Page 1: Simplifyd analytics process-document_v1

SIMPLIFYD ANALYTICS
Predictive Analytics: Process details

Sep 27, 2011

Page 2: Simplifyd analytics process-document_v1

METHODOLOGY OVERVIEW
DATA COLLECTION
DATA PREPARATION
MODELING & ANALYSIS
PERFORMANCE DIAGNOSTICS

BEST PRACTICES

REFERENCES

Intended for Knowledge Sharing only

CONTENTS

Page 3: Simplifyd analytics process-document_v1


ANALYTICAL PROCESS OVERVIEW

Business Problem Characterization

Data Consolidation

Data Treatment

Modeling & Analysis*

Recommendations & Implementation

Strategy

● TRANSFORMATION: Conversion of field formats from other types to numeric
● MISSING VALUE TREATMENTS: Imputation of missing values based on mean, etc.
● CAPPING TREATMENTS: Capping of extreme and nonsensical values
● NORMALIZATION: Normalization of all the variables to remove the effect of their distributions on subsequent analytical steps
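As a rough illustration (the deck itself works in SAS; this Python sketch and its cap bounds are my own assumptions), the treatments above for a single numeric column might look like:

```python
import statistics

def treat_column(values, cap_low=None, cap_high=None):
    """Mean-impute missing (None) entries, cap extremes, then z-score normalize."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    filled = [mean if v is None else v for v in values]      # missing value treatment
    if cap_low is not None or cap_high is not None:          # capping treatment
        lo = cap_low if cap_low is not None else min(filled)
        hi = cap_high if cap_high is not None else max(filled)
        filled = [min(max(v, lo), hi) for v in filled]
    mu, sd = statistics.mean(filled), statistics.pstdev(filled)
    return [(v - mu) / sd for v in filled]                   # normalization
```
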

● The relationship between the variable of interest and its drivers has to be established with significant confidence and stability through mathematical modeling techniques like Regression, Decision Trees, etc.

● Based on the understanding of the relationships between the event of interest and its drivers, suitable business strategies can be developed to address the business problem.

● TRANSLATION OF BUSINESS PROBLEM INTO A STATISTICAL FRAMEWORK: Decision on the analytical technique, data processing and final outcomes
● HYPOTHESIZE the predictor variables' relationships with the dependent variable**

Note:
* Modeling & Analysis is generally preceded by a Clustering phase, where all observations are grouped into homogeneous clusters (similar in characteristics within, and dissimilar from the other clusters) to remove exogenous errors in findings.
** The DEPENDENT VARIABLE has to be defined keeping in mind the business objective, data availability and forecast period.

● RECONCILIATION OF DATA FROM VARIOUS SOURCES into an Analysis Master Dataset

Page 4: Simplifyd analytics process-document_v1


CONTENTS

METHODOLOGY OVERVIEW
DATA COLLECTION
DATA PREPARATION
MODELING & ANALYSIS
PERFORMANCE DIAGNOSTICS

BEST PRACTICES

REFERENCES

Page 5: Simplifyd analytics process-document_v1


Data Specification Document

•Hypothesized predictor variables necessary for solving the business problem

•Availability of Data in the various data sources

•Form of the data

Data Integration Plan

•Reconciliation of the data from various sources into one single analysis master dataset

•The Data Integration (DI) Report describes the presence of data across the merged tables

Data Gap Analysis

•Information that was critical as per the hypothesis but not available in the data sources is listed here, so that it can be captured in future

A thorough understanding of the data sources is essential to plan the extraction in the fastest and most efficient way….

….The DI Report assumes significance in the later stages of Data Preparation, where information missing because of data unavailability has a particular meaning and so should not be imputed like other missing values.

DATA COLLECTION

Master Dataset

[Diagram: data from NBPCLV, Acxiom, Warehouse Data, Analytics Tables and Bureau Data (Customer, Payments, Click-Stream, Cards) is reconciled into the Master Dataset, which feeds the subsequent Data Treatments and Analytical Steps]

Page 6: Simplifyd analytics process-document_v1


CONTENTS

METHODOLOGY OVERVIEW
DATA COLLECTION
DATA PREPARATION
MODELING & ANALYSIS
PERFORMANCE DIAGNOSTICS

BEST PRACTICES

REFERENCES

Page 7: Simplifyd analytics process-document_v1


….Variable reduction is a very critical step in arriving at the best predictors for the subsequent analytical steps

DATA PREPARATION

Univariate Analysis Bivariate Analysis Variable Reduction

Certain Thumb Rules,

%Missing <= 5: Single-value imputation

5 < %Missing <= 20: Bivariate-based value imputation

20 < %Missing <= 40: Imputation based on modeling with other independent variables

%Missing > 40: Drop the variable
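These thumb rules can be written down as a simple dispatcher; a sketch, where the cutoffs are the slide's and the returned labels are my paraphrase:

```python
def imputation_strategy(pct_missing):
    """Map a variable's %missing to the deck's thumb-rule treatment."""
    if pct_missing <= 5:
        return "single-value imputation"
    if pct_missing <= 20:
        return "bivariate-based imputation"
    if pct_missing <= 40:
        return "model-based imputation using other independents"
    return "drop the variable"
```
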

•Removal of extreme and nonsensical values to achieve better distribution in the variables

•Variable transformation (log, exp, etc. forms) depending on the degree of relationship observed in bivariate plots.

• Dummy/binning variable creation depending on nature of relationship

•Selection/dropping of variables based on the strength of relationship, by trend and/or significance of the chi-square test.

•Redundancy checks and removal using indicators like VIF, CI and factor loadings.

•Business sense should also be used in selecting variables for modeling

Data preparation begins with data distribution studies needed for missing and capping treatments; followed by data sanity checks on groups of variables….

Missing Treatment

Capping Treatment

Variable Transformations

Selection/Dropping

Multi-collinearity Checks

PCA/FA/Varclus

•Selection/dropping of variables based on their factor loadings on the significant PCs/Factors


Page 8: Simplifyd analytics process-document_v1


Capping treatment is another critical treatment, where nonsensical and extreme observations are removed to achieve stability in parameter estimates….

….It should always precede Missing treatment, so that the imputed values for missing observations follow better distributions

DATA PREPARATION

Capping treatment has to consider:
i. Distribution: if it is a categorical variable then it should not be capped, etc.
ii. Niche characteristics: if the outlier values describe a niche group of customers who have outliers in other variables also, then they should not be capped
iii. Business information: certain nonsense values signify something (like Missing, etc.); they should be capped to a value nearest the most sensible end values, but kept outside that range so that the actual information is not lost

CAPPING TREATMENT

Back to Dataprep

Capping treatment is necessary to remove the following two types of incidents:
i. Outliers: extreme observations in the dependent variable, leading to high residuals in predictions
ii. Influential observations: outliers on the independent-variable side, leading to unstable/wrong parameter estimates
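A percentile-based capping (winsorizing) sketch in Python; the 1st/99th percentile cut points are an illustrative assumption, not a rule from the deck:

```python
def cap(values, lower_pct=1, upper_pct=99):
    """Cap extremes at chosen percentiles (winsorizing) to stabilize estimates."""
    s = sorted(values)
    n = len(s)
    def pct(p):
        # nearest-rank percentile, kept deliberately simple
        return s[max(0, min(n - 1, round(p / 100 * (n - 1))))]
    lo, hi = pct(lower_pct), pct(upper_pct)
    return [min(max(v, lo), hi) for v in values]
```
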

[Scatter plot: Amount Transacted $ vs. Count of Transactions, with Outliers and Influential Observations marked]

Page 9: Simplifyd analytics process-document_v1


Missing treatment is inevitable since the entire record is deleted if a certain variable has missing information….

….It’s also the most complex treatment, as each variable has to be treated differently based on its meaning, missing content and data integrity issues

DATA PREPARATION

No. Pyoffflg Prin0105 Loanamt Term Fixed Agnsttr Bbctrad Nummortt Rvoptbal Numminq Numminq3

1 0 2324.9 19900 360 1 21 282 1 282 0 0

2 0 3796.5 22100 240 0 6 6911 1 33978 1 1

3 1 12523.2 42000 360 1 1 36350 . 36732 1 1

4 0 5190.9 21760 349 1 42 885 1 911 0 0

5 1 53.6 18000 360 1 5 8851 1 9506 0 0

6 0 1256.9 15500 360 . 13 409 1 760 0 0

7 0 4403.3 25150 900 1 3 21417 5 23579 3 1

8 0 3137.2 17800 240 1 4 4528 2 5967 1 0

9 0 4256.5 9999999 360 1 9 18179 47 130683 4 1

10 0 6442.4 31200 360 1 34 33177 1 0 2 0

Missing observations

Unrealistic values

Missing value imputation has to be done based on:
i. Meaning of the variable: e.g., if a flag, it can take either 1 or 0 depending on the coding
ii. Distribution: if it is a continuous variable like Amount, etc. with lower missing content, then mean, etc.
iii. If missing due to a merging issue: it depends on whether the record was available in a particular table or not; e.g., if it is not present in the Restrictions table then a missing "freq_restrictions" can take the value '0', whereas if it was present in the Restrictions table but is still missing then it should take the median value
iv. Correlation with other predictors: the missing value in a variable can depend on other variables in the dataset; e.g., if "amount_received" is missing, the imputed amount depends on the size of the merchant, the average amount received in prior months, the type of products sold, the industry average, etc.
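Point iii can be sketched as a merge-aware imputation rule, using the slide's illustrative "freq_restrictions" field (the helper name is mine):

```python
import statistics

def impute_freq_restrictions(present_in_table, value, observed_values):
    """Merge-aware imputation: absent from the Restrictions table means a true 0;
    present but missing takes the median of observed values."""
    if value is not None:
        return value
    if not present_in_table:
        return 0
    return statistics.median(observed_values)
```
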

MISSING VALUE TREATMENT

Back to Dataprep

Page 10: Simplifyd analytics process-document_v1

[Chart: Mean Txn Amt $ vs. Mean MOB]


Bivariate analysis explores the nature and degree of relationship between the independent and dependent variables ….

….and is necessary to achieve stable and accurate predictions apart from arriving at the correct recommendations

DATA PREPARATION
BIVARIATE ANALYSIS

Back to Dataprep

Dep Var = f(Indep Var, Log(Indep Var), Sin(Indep Var),….)

Significant estimate with large magnitude

Insignificant estimate

Transformations required

Bivariate Chart Analysis- Mean dep var value vs. Class

Dummy Creation for certain classes

Variable dropping if no trend or relationship

[Bivariate chart: Mean Txn Amt $ vs. Mean Count of Restrictions, annotated "Dummy = (count_rest<=2)" and "No relationship"]
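The bivariate chart analysis and the dummy creation it motivates can be sketched as follows; the (count_rest <= 2) cut comes from the slide, the helper names are mine:

```python
from collections import defaultdict

def mean_by_class(pairs):
    """Mean dependent-variable value per class of a predictor (the chart's data)."""
    acc = defaultdict(lambda: [0.0, 0])
    for cls, y in pairs:
        acc[cls][0] += y
        acc[cls][1] += 1
    return {cls: total / n for cls, (total, n) in acc.items()}

def dummy_low_restrictions(count_rest):
    """Dummy variable like the slide's (count_rest <= 2), created after
    eyeballing the bivariate chart."""
    return 1 if count_rest <= 2 else 0
```
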

Page 11: Simplifyd analytics process-document_v1


Multivariate analysis helps remove interrelationships between the predictors, to achieve stable and correct estimates at the individual variable level, which is necessary for correct strategy creation….

….the variance/correlation based reductions are not mutually exclusive and might be applied judgmentally in different sequences to achieve the best set of predictors

DATA PREPARATION
MULTIVARIATE ANALYSIS

Back to Dataprep

Inter-correlations amongst Predictors

Linear relations Common Variances

Collinearity removal based on VIF and CI values

Total Variances

Factor Analysis Principal Component Analysis

Significant Predictors with eigenvalues > 1, or which capture 70% of the variance

Variables should be grouped as per the information that they capture and reductions are performed at the group level

Factor loadings are used to decompose the significant Factors/PCs to variable level

Page 12: Simplifyd analytics process-document_v1


CONTENTS

METHODOLOGY OVERVIEW
DATA COLLECTION
DATA PREPARATION
MODELING & ANALYSIS
PERFORMANCE DIAGNOSTICS

BEST PRACTICES

REFERENCES

Page 13: Simplifyd analytics process-document_v1


Business need defines the nature of the dependent variable and the analysis time windows in which the predictors are observed and the performance is measured….

Note: * Population sizes and business dynamics have to be taken into account while deciding the analysis time windows and the form of the dependent variable

Analysis windows*

[Timeline: Observation Window, then Performance Window, followed by an Out-of-Time Validation Window]

Dependent Variable captures the behavior of interest and it can be
o Continuous or categorical
o Raw or transformed (log, growth), etc.
and the statistical technique used for analysis depends on the type and form of this variable

Observation Window: the time window in which the various predictors are observed
Performance Window: the time window in which the dependent variable is defined
Out-of-Time Validation Window: the time window in which model performance and stability are checked

Definition of Dependent Variable

MODELING & ANALYSIS
DEFINITION OF DEPENDENT VARIABLE

Page 14: Simplifyd analytics process-document_v1


Every finding from analysis has to be validated for reliability and accuracy across samples of data….

MODELING & ANALYSIS
A NOTE ON SAMPLING

Define the Population

Determine the Sampling Frame

Select Sampling Technique(s)

Determine the Sample Size

Execute the Sampling Process

SAMPLING TECHNIQUES

SIMPLE RANDOM SAMPLING

STRATIFIED SAMPLING

All records are randomly assigned a selection probability between 0 and 1.
STRENGTHS: Easily understood and implemented
WEAKNESSES: Lower precision and no assurance of representativeness

All records are assigned to a particular sub-population, whose proportion is to be maintained in the final samples; SRS is used to select records from the sub-populations.
STRENGTHS: Increases representativeness
WEAKNESSES: Not effective for very large/small strata

….Nature of the business problem and population decides the sampling technique and sizes
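Stratified sampling as described (SRS within each sub-population) can be sketched as follows; the sampling fraction, key function and seed are placeholders:

```python
import random

def stratified_sample(records, stratum_of, frac, seed=0):
    """Simple random sampling within each stratum, so each sub-population
    keeps roughly its original proportion in the final sample."""
    rng = random.Random(seed)
    by_stratum = {}
    for r in records:
        by_stratum.setdefault(stratum_of(r), []).append(r)
    sample = []
    for members in by_stratum.values():
        k = max(1, round(len(members) * frac))
        sample.extend(rng.sample(members, k))
    return sample
```
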

Page 15: Simplifyd analytics process-document_v1


Segmentation of customers into homogeneous groups, similar within the clusters and different from those in other clusters, based on a set of behavioral characteristics….

MODELING & ANALYSIS
SOME TIDBITS ABOUT CLUSTERING

….Identifies the structural breaks in the data, on either side of which the characteristics are fundamentally different, and hence is necessary to arrive at the real relationship of predictors with dependent variable

Most used methods of clustering:

Hierarchical Clustering: assigns observations to a cluster progressively, one at a time, based on a distance measure.

Advantages: Good for small datasets, as the algorithm finds the number of clusters.

Limitations: It fails with large datasets as a result of memory issues.

K-means Clustering: a set of cluster origins is selected at random; all the remaining records are then assigned to one of them based on a distance measure.

Advantages: Simplicity and speed.

Limitations: It does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result is a global minimum of variance.
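A toy 1-D k-means illustrating the point: the outcome depends on the initial center selection (seeded here for reproducibility), and only a local variance minimum is guaranteed:

```python
import random
import statistics

def kmeans_1d(points, k, iters=20, seed=0):
    """Tiny 1-D k-means; the result depends on the initial centers,
    which is exactly the instability the slide warns about."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [statistics.mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)
```
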

Page 16: Simplifyd analytics process-document_v1


Modeling is the establishment of a relationship between the variable of interest and its various predictors and hence the technique depends on the distribution of the dependent variable, the business problem and data quality and quantity available for modeling..….

MODELING & ANALYSIS
MODELING TECHNIQUES

….Findings of a stable and accurate model elucidates the degree and nature of the drivers of the dependent variable and thus defines the strategy to be taken for solving the business problem.

Final Analysis Dataset

Non-Parametric: does not depend on the distribution of the dependent variable
Parametric: depends on the distribution of the dependent variable

Sl.No.  Target Variable Distribution   Modeling Approach     Model Output
1       Continuous                     OLS Regression        A typical model: Y = f(X) = f(X1, X2, …, Xn)
2       Nominal                        Logistic Regression
3       Categorical positive values    Poisson/Gamma
4       Unidentified                   Decision Trees        Segments with increasing proportion of dependent variable

Page 17: Simplifyd analytics process-document_v1


Form of fitting function (how are they mathematically related?): y = α + β1X1 + β2X2

Predicted = Mean + relationship with Predictor 1 × Predictor 1 + relationship with Predictor 2 × Predictor 2

Assumptions for the modeling: residuals are independent, normally distributed with mean 0, and have uniform variance throughout
What is OLS? Ordinary Least Squares (explained variance, R-square, is maximized)
Type of Predicted (dependent) Variable: continuous (-∞ to +∞)
Business Question: What loan amount take-off can we expect from a customer?
SAS procedure: Proc Reg
Performance Diagnostics (indicators of a good model):
•R-square (0 to 1): How well does the model explain the variance in the predicted variable?
•RMSE (Root Mean Square Error): size of the average difference between predicted and actual:
RMSE = sqrt of summation of (actual value - predicted value)^2 / (count of obs)
•Significance of parameter estimates: probability of the null hypothesis (no relationship) is <0.001
•Sign of parameter estimates: should be intuitive and/or repeated in the validation sample
•Model validation: model should be stable on both in-time and out-of-time validation samples
•Rank ordering: predicts high values when actuals are high, and vice versa
•AIC/SIC: parsimony (or efficiency): minimum predictors, maximum prediction; compare across models

GENERALIZED LINEAR MODELS
OLS REGRESSION (LINEAR)

Page 18: Simplifyd analytics process-document_v1


GENERALIZED LINEAR MODELS
OLS REGRESSION (LINEAR) - SAMPLE MODEL OUTPUT

The REG Procedure

Model: MODEL1

Dependent Variable: censor_po

Number of Observations Read 40162

Number of Observations Used 40162

Analysis of Variance

Source           DF     Sum of Squares   Mean Square   F Value   Pr > F
Model            12     610.91533        50.90961      219.02    <.0001
Error            40149  9332.36401       0.23244
Corrected Total  40161  9943.27934

Root MSE 0.48212 R-Square 0.0614

Dependent Mean 0.5492 Adj R-Sq 0.0612

Coeff Var 87.78642

Parameter Estimates

Variable                       DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
Intercept                      1   1.24953             0.20693         6.04     <.0001    0
APPLICATION_PRIM_CB_SCR_NBR    1   -0.000216           0.00028377      -0.76    0.4465    1.0205
log_APPL_ADV_RATIO             1   -0.1166             0.0117          -9.96    <.0001    1.09417
log_APPL_PYMT_TO_INCOME_RATIO  1   -0.01966            0.00517         -3.8     0.0001    1.17587

Collinearity Diagnostics

Number  Eigenvalue  Condition Index  Proportion of Variation:
                                     Intercept   APPLICATION_PRIM_CB_SCR_NBR  log_APPL_ADV_RATIO  log_APPL_PYMT_TO_INCOME_RATIO
1       8.3631      1                0.00000188  0.00000202                   0.00002708          0.00057815
2       1.01345     2.87264          8.65E-09    8.73E-09                     1.04E-07            5.68E-06
3       0.96895     2.93787          2.42E-11    5.60E-14                     1.68E-09            0.0000019
8       0.22138     6.14626          0.00000754  0.00000817                   0.00009252          0.00396
9       0.20341     6.41212          0.00001611  0.00001745                   0.00020511          0.01911
10      0.05087     12.82208         0.00000322  0.00000279                   0.00011988          0.26143
11      0.02578     18.01153         0.00082432  0.00088072                   0.00992             0.68574
12      0.00137     78.10783         0.01375     0.01859                      0.96941             0.02085
13      0.00007104  343.097          0.98539     0.98048                      0.02008             0.00000173

Page 19: Simplifyd analytics process-document_v1


What is Logistic? Predicts log odds (event/non-event): Log(odds) = α + β1X1 + β2X2

Predicted probability of event = e^(α + β1X1 + β2X2) / (1 + e^(α + β1X1 + β2X2))
Predicted probability of non-event = 1 / (1 + e^(α + β1X1 + β2X2))

-> Therefore, the total probability (event + non-event) at the observation level is 1
Type of Predicted (dependent) Variable: binary (1/0); one is the event, the other is the 'reference'
Business Question: What is the probability of a customer defaulting?
SAS procedure: Proc Logistic (with various link functions)
Performance Diagnostics (indicators of a good model):
•Concordance/Discordance: if all event/non-event observations were paired, in how many instances (%) is the actual event observation given the higher probability
•Significance of parameter estimates: probability of the null hypothesis (no relationship) is <0.001
•Sign of parameter estimates: should be intuitive and/or repeated in the validation sample
•Model validation: model should be stable on both in-time and out-of-time validation samples
•Rank ordering: predicts high values when actuals are high, and vice versa
•Gains Chart (KS statistic): highest probabilities should be assigned to actual events
•AIC: parsimony (or efficiency): minimum predictors, maximum prediction; compare across models

Note: * Hosmer-Lemeshow is also a good diagnostic, but it fails when the model sample size is large

GENERALIZED LINEAR MODELS
LOGISTIC REGRESSION
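The probability formula and the concordance diagnostic above can be sketched as follows; in practice the coefficients come from Proc Logistic, and these helper names are illustrative:

```python
import math

def event_probability(alpha, betas, xs):
    """P(event) = e^(a + b.x) / (1 + e^(a + b.x)), per the slide's formula."""
    z = alpha + sum(b * x for b, x in zip(betas, xs))
    return math.exp(z) / (1.0 + math.exp(z))

def concordance(event_probs, nonevent_probs):
    """Fraction of event/non-event pairs where the event got the higher probability."""
    pairs = concordant = tied = 0
    for pe in event_probs:
        for pn in nonevent_probs:
            pairs += 1
            if pe > pn:
                concordant += 1
            elif pe == pn:
                tied += 1
    return concordant / pairs, tied / pairs
```
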

Page 20: Simplifyd analytics process-document_v1


GENERALIZED LINEAR MODELS
LOGISTIC REGRESSION - SAMPLE MODEL OUTPUT

The LOGISTIC Procedure

Model Information

Data Set MODOUT.TU60_VAL_FICO_690_719_EXP

Response Variable outcome

Number of Response Levels 3

Model generalized logit

Optimization Technique Fisher's scoring

Number of Observations Read 607592

Number of Observations Used 607592

Response Profile

Ordered Value  outcome  Total Frequency
1              0        597504
2              1        9432

Logits modeled use outcome=0 as the reference category.

Model Fit Statistics

Criterion  Intercept Only  Intercept & Covariates
AIC        107549.99       106661.99
SC         107572.63       106956.24
-2 Log L   107545.99       106609.99

Testing Global Null Hypothesis: BETA=0

Test Chi-Square DF Pr > ChiSq

Likelihood Ratio 935.9990 24 <.0001

Score 902.4392 24 <.0001

Wald 892.8763 24 <.0001

Page 21: Simplifyd analytics process-document_v1


GENERALIZED LINEAR MODELS
LOGISTIC REGRESSION - SAMPLE MODEL OUTPUT contd…

Type 3 Analysis of Effects

Effect                DF  Wald Chi-Square  Pr > ChiSq
APPLICATION_PRIM_CB_  2   14.5230          0.0007
log_APPL_ADV_RATIO    2   126.6605         <.0001
log_APPL_PYMT_TO_INC  2   83.5880          <.0001

Analysis of Maximum Likelihood Estimates

Parameter             DF  Development Model Estimate  Validation Model Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept             1   1.1321                      -0.4085                    0.8909          0.2102           0.6466
APPLICATION_PRIM_CB_  1   -0.00349                    -0.00220                   0.00122         3.2494           0.0715
log_APPL_ADV_RATIO    1   -0.3934                     -0.2839                    0.0485          34.2834          <.0001
log_APPL_PYMT_TO_INC  1   -0.1206                     -0.0900                    0.0221          16.5920          <.0001

Odds Ratio Estimates

Effect outcome Point Estimate 95% Wald Confidence Limits

APPLICATION_PRIM_CB_ 1 0.998 0.995 1.000

log_APPL_ADV_RATIO 1 0.753 0.685 0.828

log_APPL_PYMT_TO_INC 1 0.914 0.875 0.954

Percent Concordant 65.9 Somers' D 0.338

Percent Discordant 32.1 Gamma 0.345

Percent Tied 2.0 Tau-a 0.074

Pairs 1806529536 c 0.669

The higher the percent concordant, the better the model

Page 22: Simplifyd analytics process-document_v1


GENERALIZED LINEAR MODELS
LOGISTIC REGRESSION - RANK ORDERING OUTPUT contd…

predgroup  obs  minpred  maxpred  avgpred  totact  avgact  cumact  predrank  cumpct  actrank  KS

1 12551 0.2069 1 0.275689 3632 0.289379 3632 1 22.51984 1 14.5573

2 13077 0.172384 0.206895 0.190708 2565 0.196146 6197 2 38.42386 2 21.07661

3 12932 0.163982 0.172383 0.165289 2179 0.168497 8376 3 51.93452 3 24.98741

4 12696 0.118382 0.163978 0.142257 1727 0.136027 10103 4 62.64261 4 25.9028

5 12814 0.096125 0.118381 0.105572 1360 0.106134 11463 5 71.07515 5 24.10965

6 12814 0.086392 0.096124 0.091463 1151 0.089824 12614 6 78.21181 6 20.83402

7 12814 0.077738 0.086391 0.081861 1061 0.0828 13675 7 84.79043 7 16.92002

8 11344 0.07317 0.077737 0.075261 811 0.071492 14486 8 89.81895 8 12.54508

9 14284 0.069614 0.073168 0.072034 894 0.062588 15380 9 95.3621 9 6.134163

10 12814 0.03382 0.069613 0.060393 748 0.058374 16128 10 100 10 0
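The KS column above is the running gap between the cumulative share of events and non-events by decile; a sketch of that computation (decile counts are assumed sorted from highest to lowest predicted probability):

```python
def ks_from_deciles(decile_events, decile_nonevents):
    """KS = max gap (in points) between cumulative %events and %non-events
    across score deciles, highest-probability decile first."""
    tot_e, tot_n = sum(decile_events), sum(decile_nonevents)
    cum_e = cum_n = ks = 0.0
    for e, n in zip(decile_events, decile_nonevents):
        cum_e += e / tot_e
        cum_n += n / tot_n
        ks = max(ks, (cum_e - cum_n) * 100)
    return ks
```
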

Page 23: Simplifyd analytics process-document_v1


[Gains chart: Responders captured (%) vs. Population (%), comparing Model capturing against Random capturing]

CAPTURING OF THE MODEL

The column "cumpct" in the rank-ordering output indicates the percentage of responders captured up to the given decile.

The model captures about 22.5% of responders in the first decile and about 71.07% of the responders in the top 5 deciles.

The higher the capturing in the initial deciles, the better the model performance

GENERALIZED LINEAR MODELS
LOGISTIC REGRESSION - GAINS CHART contd…

Page 24: Simplifyd analytics process-document_v1


CRITERIA FOR FINE-TUNING

Fine-tuning is based on applying the model to both the development and validation samples. The following criteria are considered for fine-tuning the model.

Fine Tuning

Rank Ordering

Coefficient Stability

Concordance

Highest KS

Goodness-of-fit

Validation

Capturing

Page 25: Simplifyd analytics process-document_v1


RECAP

Phase I
UNDERSTAND THE BUSINESS PROBLEM
TRANSLATE THE BUSINESS PROBLEM INTO A STATISTICAL PROBLEM BASED ON IBCVM FRAMEWORK
- Decide on the number of models and identify the dependent variables for each model
- Identify the statistical method suitable for each predictive model: OLS Regression, Logistic Regression, etc.
- Hypothesize predictor variables
PREPARE DATA SPECIFICATIONS & GET DATA

Phase II
RAW DATA split into DEVELOPMENT SAMPLE (sub-sample of raw data) and VALIDATION SAMPLE (sub-sample of raw data)
UNIVARIATE ANALYSIS
- Treatment of outliers
BIVARIATE ANALYSIS
- Treatment of missing values
- Variable transformations
MULTIVARIATE ANALYSIS
- Removal of multicollinearity
- Removal of insignificant variables
MODEL DEVELOPMENT
- OLS / Logistic Regression
- Fine tuning
Model validation on the VALIDATION SAMPLE (out of time); refinement based on client feedback

Phase III
MODEL IMPLEMENTATION
- Prepare scoring code
- Track model performance at regular intervals
- Redevelop/rebuild models on a need basis

Page 26: Simplifyd analytics process-document_v1


REMAINING SLIDES

PENDING SLIDES:
- OTHER TESTS (t-tests, ANOVA, chi-square, etc.)
- PITFALLS IN STATISTICS
- SPURIOUS CORRELATION
- ENDOGENOUS & EXOGENOUS ERRORS
- ACCURACY vs. RANKING
- CAUSATION vs. CORRELATION
- POPULATION STABILITY INDEX

OTHER THINGS TO BE ADDED:
- BEST PRACTICES DOCUMENT
- SAS & EXCEL MACROS
- REFERENCES
- SAMPLE DATA, CODE, OUTPUT
- CHEAT SHEET