Post on 04-Jul-2015
SIMPLIFYD ANALYTICS - Predictive Analytics: Process Details
Sep 27, 2011
CONTENTS
METHODOLOGY
•OVERVIEW
•DATA COLLECTION
•DATA PREPARATION
•MODELING & ANALYSIS
•PERFORMANCE DIAGNOSTICS
BEST PRACTICES
REFERENCES

Intended for Knowledge Sharing only
ANALYTICAL PROCESS OVERVIEW
Business Problem Characterization → Data Consolidation → Data Treatment → Modeling & Analysis* → Recommendations & Implementation Strategy
● TRANSFORMATION: Conversion of field formats from other types to numeric
● MISSING VALUE TREATMENT: Imputation of missing values based on mean, etc.
● CAPPING TREATMENT: Capping of extreme and nonsensical values
● NORMALIZATION of all the variables, to remove the effect of variable distributions on subsequent analytical steps
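These four treatments can be sketched end to end. A minimal illustration in Python; the field name, cap value and imputation choice here are hypothetical, not prescribed by the deck:

```python
from statistics import mean, stdev

# Hypothetical raw records: "amt" arrives as text, with a missing and an extreme value.
raw = [{"amt": "100"}, {"amt": "110"}, {"amt": None}, {"amt": "9999999"}, {"amt": "90"}]

# TRANSFORMATION: convert the field format to numeric.
vals = [float(r["amt"]) if r["amt"] is not None else None for r in raw]

# CAPPING TREATMENT first (the deck notes it should precede missing treatment),
# at an assumed ceiling of 1000.
CAP = 1000.0
vals = [min(v, CAP) if v is not None else None for v in vals]

# MISSING VALUE TREATMENT: impute the missing value with the mean of observed values.
fill = mean(v for v in vals if v is not None)
vals = [v if v is not None else fill for v in vals]

# NORMALIZATION: z-score, removing the effect of the variable's distribution.
mu, sd = mean(vals), stdev(vals)
normalized = [(v - mu) / sd for v in vals]
```

Because capping runs first, the imputed mean is computed from capped values, which is exactly why the deck insists on that ordering.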
● The relationship between the variable of interest and its drivers has to be established with significant confidence and stability through mathematical modeling techniques like regression, decision trees, etc.
● Based on the understanding of the relationships between the events of interest and their drivers, suitable business strategies can be developed to address the business problem.
● TRANSLATION OF BUSINESS PROBLEM INTO A STATISTICAL FRAMEWORK: Decision on the analytical technique, data processing and final outcomes
● HYPOTHESIZE the predictor variables' relationships with the dependent variable**
Note:
* Modeling & Analysis is generally preceded by a Clustering phase, where all observations are grouped into homogeneous clusters (similar in characteristics within, and dissimilar from the other clusters) to remove exogenous errors in the findings.
** The DEPENDENT VARIABLE has to be defined keeping in mind the business objective, data availability and forecast period.
● RECONCILIATION OF DATA FROM VARIOUS SOURCES into an Analysis Master Dataset
Data Specification Document
•Hypothesized predictor variables necessary for solving the business problem
•Availability of Data in the various data sources
•Form of the data
Data Integration Plan
•Reconciliation of the data from various sources into one single analysis master dataset
•The Data Integration (DI) Report documents the presence of data across the merged tables
Data Gap Analysis
•Information that was critical per the hypothesis but not available in the data sources is listed here, so that it can be captured in the future
A thorough understanding of the data sources is essential to plan the extraction in the fastest and most efficient way….
….The DI Report assumes significance in the later stages of data preparation, where information that is missing because of data unavailability has a particular meaning and so should not be imputed like other missing values.
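The reconciliation step and a basic DI report can be sketched as follows; the table names, keys and fields are illustrative only, not the deck's actual sources:

```python
# Hypothetical source tables keyed by customer id (names are illustrative only).
payments = {1: {"num_pmts": 4}, 2: {"num_pmts": 7}}
bureau = {1: {"score": 710}, 3: {"score": 655}}

# RECONCILIATION: full outer join on customer id into one analysis master dataset.
ids = sorted(set(payments) | set(bureau))
master = {i: {**payments.get(i, {}), **bureau.get(i, {})} for i in ids}

# DI report: how many master records carry data from each merged table.
di_report = {
    "payments": sum(1 for i in ids if i in payments),
    "bureau": sum(1 for i in ids if i in bureau),
    "total_records": len(ids),
}
```

Customer 3 has bureau data but no payments record; the DI report flags exactly this kind of merge-driven gap, which later treatments must not impute as ordinary missing values.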
DATA COLLECTION
Master Dataset: reconciled from the NBPCLV, Acxiom, Warehouse Data, Analytics Tables and Bureau Data sources, covering Customer, Payments, Click-Stream and Cards data, and feeding the subsequent data treatments and analytical steps.
[Diagram: source systems flowing into the Master Dataset]
….Variable reduction is a very critical step for arriving at the best predictors for the subsequent analytical steps
DATA PREPARATION
Data preparation begins with data distribution studies needed for the missing and capping treatments, followed by data sanity checks on groups of variables….

Univariate Analysis
•Missing Treatment - certain thumb rules:
  %Missing <= 5: single-value imputation
  5 < %Missing <= 20: bivariate-based value imputation
  20 < %Missing <= 40: imputation based on modeling with other independent variables
  %Missing > 40: drop the variable
•Capping Treatment - removal of extreme and nonsensical values to achieve a better distribution in the variables

Bivariate Analysis
•Variable Transformations - log, exp, etc. forms, depending on the degree of relationship observed in bivariate plots
•Selection/Dropping - dummy/binning variable creation depending on the nature of the relationship; selection/dropping of variables based on the strength of relationship by trend and/or significance of a chi-square test

Variable Reduction
•Multi-collinearity Checks - redundancy checks and removal using indicators like VIF, CI and factor loadings
•PCA/FA/Varclus - selection/dropping based on the factor loadings of the variables on the significant PCs/Factors
•Business sense is also used in selecting variables for modeling
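The %missing thumb rules reduce to a simple dispatch. A sketch; the treatment labels are shorthand for the rules above, not prescribed implementations:

```python
def missing_treatment_rule(pct_missing):
    """Map a variable's % missing to a treatment, per the thumb rules above."""
    if pct_missing <= 5:
        return "single-value imputation"
    elif pct_missing <= 20:
        return "bivariate-based imputation"
    elif pct_missing <= 40:
        return "model-based imputation"
    return "drop variable"
```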
Intended for Knowledge Sharing
only8Intended for Knowledge Sharing only 8
Capping treatment is another critical treatment, where nonsensical and extreme observations are removed to achieve stability in parameter estimates….
….It should always precede missing treatment, so that the values imputed for missing observations follow better distributions
DATA PREPARATION
Capping treatment has to consider:
i. Distribution - if it's a categorical variable then it should not be capped, etc.
ii. Niche characteristics - if the outlier values describe a niche group of customers who are outliers in other variables too, then they should not be capped.
iii. Business information - certain nonsensical values signify something (like Missing, etc.); they should be capped to a value near the most sensible end of the range but kept outside it, so that the actual information is not lost.
CAPPING TREATMENT
Capping treatment is necessary to remove the following two types of incidents:
i. Outliers - extreme observations in the dependent variable, leading to high residuals in predictions
ii. Influential observations - outliers on the independent-variable side, leading to unstable/wrong parameter estimates
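One common way to cap such values is percentile-based winsorization. A sketch with assumed 1st/99th-percentile bounds (the deck leaves the exact bounds to judgment):

```python
def cap_at_percentile(values, lower=0.01, upper=0.99):
    """Winsorize: pull extreme observations back to assumed percentile bounds."""
    s = sorted(values)
    n = len(s)
    lo = s[round(lower * (n - 1))]
    hi = s[round(upper * (n - 1))]
    return [min(max(v, lo), hi) for v in values]

values = list(range(1, 101)) + [10_000]  # one extreme observation
capped = cap_at_percentile(values)
```

The single extreme point is pulled back to the 99th-percentile value instead of being dropped, preserving the record for the rest of the analysis.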
[Chart: Amount Transacted ($) vs. Count of Transactions, highlighting outliers and influential observations]
Missing treatment is inevitable, since an entire record is deleted if any variable in it has missing information….
….It's also the most complex treatment, as each variable has to be treated differently based on its meaning, missing content and data-integrity issues
DATA PREPARATION
No. Pyoffflg Prin0105 Loanamt Term Fixed Agnsttr Bbctrad Nummortt Rvoptbal Numminq Numminq3
1 0 2324.9 19900 360 1 21 282 1 282 0 0
2 0 3796.5 22100 240 0 6 6911 1 33978 1 1
3 1 12523.2 42000 360 1 1 36350 . 36732 1 1
4 0 5190.9 21760 349 1 42 885 1 911 0 0
5 1 53.6 18000 360 1 5 8851 1 9506 0 0
6 0 1256.9 15500 360 . 13 409 1 760 0 0
7 0 4403.3 25150 900 1 3 21417 5 23579 3 1
8 0 3137.2 17800 240 1 4 4528 2 5967 1 0
9 0 4256.5 9999999 360 1 9 18179 47 130683 4 1
10 0 6442.4 31200 360 1 34 33177 1 0 2 0
("." denotes a missing observation; 9999999 is an unrealistic value)
Missing-value imputation has to be done based on:
i. Meaning of the variable - e.g., if it is a flag, it can take either 1 or 0 depending on the coding.
ii. Distribution - if it's a continuous variable like Amount, etc., with lower missing content, then the mean, etc.
iii. If missing due to a merging issue - it depends on whether the record was available in a particular table; e.g., if it's not present in the Restrictions table then a missing "freq_restrictions" can take the value 0, whereas if it was present in the Restrictions table but still has a missing value, it should take the median value.
iv. Correlation with other predictors - the missing value in a variable can depend on other variables in the dataset; e.g., if "amount_received" is missing, the imputed amount depends on the size of the merchant, the average amount received in prior months, the type of products sold, the industry average, etc.
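Rules i and iii can be sketched on toy records; the field names and the 0/median choices mirror the freq_restrictions example above, and the flag's reference coding of 0 is an assumption:

```python
from statistics import median

# Hypothetical records after merging with a Restrictions table; None marks missing.
rows = [
    {"flag": None, "freq_restrictions": None, "in_restrictions_table": False},
    {"flag": 1,    "freq_restrictions": 3,    "in_restrictions_table": True},
    {"flag": 0,    "freq_restrictions": None, "in_restrictions_table": True},
    {"flag": 1,    "freq_restrictions": 5,    "in_restrictions_table": True},
]

# Rule i: a missing flag takes the reference coding (assumed to be 0 here).
for r in rows:
    if r["flag"] is None:
        r["flag"] = 0

# Rule iii: missing because the record never joined the table means "none" -> 0;
# missing despite being present in the table takes the median of observed values.
observed = [r["freq_restrictions"] for r in rows if r["freq_restrictions"] is not None]
med = median(observed)
for r in rows:
    if r["freq_restrictions"] is None:
        r["freq_restrictions"] = 0 if not r["in_restrictions_table"] else med
```

The two missing freq_restrictions values end up imputed differently, which is exactly the point of tracking the merge provenance in the DI report.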
MISSING VALUE TREATMENT
[Chart: Mean Txn Amt ($) vs. Mean MOB]
Bivariate analysis explores the nature and degree of relationship between the independent and dependent variables ….
….and is necessary to achieve stable and accurate predictions apart from arriving at the correct recommendations
DATA PREPARATION - BIVARIATE ANALYSIS
Dep Var = f(Indep Var, Log(Indep Var), Sin(Indep Var),….)
Significant estimate with large magnitude
Insignificant estimate
Transformations required
Bivariate Chart Analysis- Mean dep var value vs. Class
Dummy Creation for certain classes
Variable dropping if no trend or relationship
[Chart: Mean Txn Amt ($) vs. Mean Count of Restrictions - no relationship beyond count_rest <= 2, motivating Dummy = (count_rest <= 2)]
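The dummy-creation and transformation steps can be sketched on toy data; the cut-off mirrors the chart's Dummy = (count_rest <= 2) rule, and the amounts are invented:

```python
import math

# Hypothetical observations of (count_rest, mean txn amount).
obs = [(0, 42.0), (1, 38.5), (2, 40.2), (3, 11.0), (4, 9.7)]

# Dummy creation mirroring the chart's rule: Dummy = (count_rest <= 2).
dummies = [1 if count <= 2 else 0 for count, _ in obs]

# Variable transformation: a log form for a skewed predictor (log1p handles zero counts).
log_counts = [math.log1p(count) for count, _ in obs]
```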
Multivariate analysis helps remove interrelationships between the predictors, to achieve stable and correct estimates at the individual-variable level, which is necessary for correct strategy creation….
….the variance/correlation based reductions are not mutually exclusive and might be applied judgmentally in different sequences to achieve the best set of predictors
DATA PREPARATION - MULTIVARIATE ANALYSIS
Inter-correlations amongst predictors:
•Linear relations → collinearity removal based on VIF and CI values
•Common variances → Factor Analysis; total variances → Principal Component Analysis
•Significant predictors: factors/PCs with eigenvalues > 1, or which capture 70% of the variance
•Variables should be grouped as per the information they capture, with reductions performed at the group level
•Factor loadings are used to decompose the significant Factors/PCs to the variable level
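For the two-predictor case, VIF can be computed directly from the correlation, since the R² of one predictor regressed on the other is r². A pure-Python sketch with toy data (in general, VIF needs a full regression of each predictor on all the others):

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def vif_two_predictors(x1, x2):
    # With exactly two predictors, the R^2 of one regressed on the other is r^2.
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 6, 8, 11]   # nearly collinear with x1 -> inflated VIF
```

Near-collinear predictors blow the VIF up by two orders of magnitude, which is why one of the pair is dropped during redundancy checks.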
Business need defines the nature of the dependent variable and the analysis time windows: the window in which the predictors are observed and the window in which performance is measured….
Note: * Population sizes and business dynamics have to be taken into account while deciding the analysis time windows and the form of the dependent variable.
Analysis Windows*
•Observation Window - the time window in which the various predictors are observed
•Performance Window - the time window in which the dependent variable is defined
•Out-of-Time Validation Window - the time window in which model performance and stability are checked

Definition of Dependent Variable
The dependent variable captures the behavior of interest; it can be
•continuous or categorical
•raw or transformed (log, growth), etc.
and the statistical technique used for the analysis depends on the type and form of this variable.
MODELING & ANALYSIS - DEFINITION OF DEPENDENT VARIABLE
Every finding from the analysis has to be validated for reliability and accuracy across samples of data….
MODELING & ANALYSIS - A NOTE ON SAMPLING
Define the Population
Determine the Sampling Frame
Select Sampling Technique(s)
Determine the Sample Size
Execute the Sampling Process
SAMPLING TECHNIQUES
SIMPLE RANDOM SAMPLING
All records are randomly assigned a selection probability between 0 and 1.
STRENGTHS: Easily understood and implemented.
WEAKNESSES: Lower precision and no assurance of representativeness.

STRATIFIED SAMPLING
All records are assigned to a particular sub-population, the proportion of which is to be maintained in the final samples; SRS is then used to select records from each sub-population.
STRENGTHS: Increases representativeness.
WEAKNESSES: Not effective for very large/small strata.
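Both techniques can be sketched with the standard library; the population, strata and sample sizes are illustrative, and the seed is fixed only for reproducibility:

```python
import random

# Hypothetical population with an 80/20 split across two strata.
population = [{"id": i, "stratum": "A" if i < 80 else "B"} for i in range(100)]
rng = random.Random(42)  # seeded for reproducibility

# SIMPLE RANDOM SAMPLING: every record has an equal chance of selection.
srs = rng.sample(population, 10)

# STRATIFIED SAMPLING: sample within each stratum, preserving its proportion.
strata = {}
for rec in population:
    strata.setdefault(rec["stratum"], []).append(rec)
stratified = []
for name, members in sorted(strata.items()):
    k = round(len(members) / len(population) * 10)  # proportional allocation
    stratified.extend(rng.sample(members, k))
```

The stratified sample is guaranteed to reproduce the 80/20 mix; the SRS sample is not, which is the representativeness weakness noted above.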
….Nature of the business problem and population decides the sampling technique and sizes
Segmentation of customers into homogeneous groups, identical within the clusters and different from those in other clusters, based on a set of behavioral characteristics….
MODELING & ANALYSIS - SOME TIDBITS ABOUT CLUSTERING
….Identifies the structural breaks in the data, on either side of which the characteristics are fundamentally different, and hence is necessary to arrive at the real relationship of predictors with dependent variable
Most used methods of clustering:
Hierarchical Clustering - assigns observations to clusters progressively, one at a time, based on a distance measure.
Advantages: Good for small datasets, as the algorithm finds the number of clusters.
Limitations: Fails with large datasets as a result of memory issues.
K-means Clustering - a chosen number of cluster origins are selected at random; all remaining records are then assigned to one of them based on a distance measure.
Advantages: Simplicity and speed.
Limitations: Does not yield the same result with each run, since the resulting clusters depend on the initial random assignments; it minimizes intra-cluster variance but does not guarantee a global minimum of variance.
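A minimal 1-D k-means illustrates the assignment/update loop; the initial centroids are fixed here, and it is exactly this choice that makes real k-means run-dependent:

```python
def kmeans_1d(points, centroids, iters=10):
    """Plain k-means on 1-D points with given initial centroids (illustrative only)."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[j].append(p)
        # Update step: move each centroid to its cluster mean.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 12.0])
```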
Modeling is the establishment of a relationship between the variable of interest and its various predictors; the technique therefore depends on the distribution of the dependent variable, the business problem, and the quality and quantity of data available for modeling….
MODELING & ANALYSIS - MODELING TECHNIQUES
….The findings of a stable and accurate model elucidate the degree and nature of the drivers of the dependent variable, and thus define the strategy to be taken for solving the business problem.
The final analysis dataset feeds one of two families of techniques:
•Parametric: depends on the distribution of the dependent variable
•Non-Parametric: does not depend on the distribution of the dependent variable

Sl.No.  Target Variable Distribution   Modeling Approach     Model Output
1       Continuous                     OLS Regression        A typical model: Y = f(X) = f(X1, X2, .., Xn)
2       Nominal                        Logistic Regression   (as above)
3       Counts/positive values         Poisson/Gamma         (as above)
4       Unidentified                   Decision Trees        Segments with increasing proportion of the dependent variable
What is OLS? Ordinary Least Squares (the explained variance, R², is maximized)
Form of fitting function (how are they mathematically related?): y = α + β1X1 + β2X2
Predicted = Mean + (relationship with Predictor 1 × Predictor 1) + (relationship with Predictor 2 × Predictor 2)
Assumptions for the modeling: residuals are independent, normally distributed with mean 0, and of uniform variance throughout
Type of predicted (dependent) variable: continuous (-∞ to +∞)
Business question: What loan-amount take-off can we expect from a customer?
SAS procedure: Proc Reg
Performance diagnostics (indicators of a good model):
•R-square (0 to 1): how well the model explains the variance in the predicted variable
•MSE (Mean Square Error): average squared difference between predicted and actual; MSE = Σ(actual - predicted)² / (count of obs), and RMSE is its square root
•Significance of parameter estimates: the probability of the null hypothesis (no relationship) is < 0.001
•Sign of parameter estimates: should be intuitive and repeated in the validation sample
•Model validation: the model should be stable on both in-time and out-of-time validation samples
•Rank ordering: predicts high values when actuals are high, and vice versa
•AIC/SIC: parsimony (efficiency): minimum predictors, maximum prediction; compared across models
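The fitting function and the R²/MSE diagnostics can be illustrated with closed-form simple OLS on toy data (one predictor, invented values):

```python
from statistics import mean

def ols_fit(x, y):
    """Closed-form simple OLS for y = alpha + beta * x."""
    mx, my = mean(x), mean(y)
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    alpha = my - beta * mx
    return alpha, beta

x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.0, 9.2]
alpha, beta = ols_fit(x, y)
pred = [alpha + beta * xi for xi in x]

# Diagnostics named in the list above.
ss_res = sum((a - p) ** 2 for a, p in zip(y, pred))
ss_tot = sum((a - mean(y)) ** 2 for a in y)
r_square = 1 - ss_res / ss_tot
mse = ss_res / len(y)
```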
GENERALIZED LINEAR MODELS - OLS REGRESSION (LINEAR)
GENERALIZED LINEAR MODELS - OLS REGRESSION (LINEAR): SAMPLE MODEL OUTPUT
The REG Procedure
Model: MODEL1
Dependent Variable: censor_po
Number of Observations Read 40162
Number of Observations Used 40162
Analysis of Variance
Source            DF      Sum of Squares   Mean Square   F Value   Pr > F
Model             12      610.91533        50.90961      219.02    <.0001
Error             40149   9332.36401      0.23244
Corrected Total   40161   9943.27934

Root MSE         0.48212    R-Square   0.0614
Dependent Mean   0.5492     Adj R-Sq   0.0612
Coeff Var        87.78642
Parameter Estimates
Variable                        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept                       1    1.24953              0.20693          6.04      <.0001     0
APPLICATION_PRIM_CB_SCR_NBR     1    -0.000216            0.00028377       -0.76     0.4465     1.0205
log_APPL_ADV_RATIO              1    -0.1166              0.0117           -9.96     <.0001     1.09417
log_APPL_PYMT_TO_INCOME_RATIO   1    -0.01966             0.00517          -3.8      0.0001     1.17587
Collinearity Diagnostics
Number   Eigenvalue   Condition Index   Proportion of Variation:
                                        Intercept    APPLICATION_PRIM_CB_SCR_NBR   log_APPL_ADV_RATIO   log_APPL_PYMT_TO_INCOME_RATIO
1        8.3631       1                 0.00000188   0.00000202                    0.00002708           0.00057815
2        1.01345      2.87264           8.65E-09     8.73E-09                      1.04E-07             5.68E-06
3        0.96895      2.93787           2.42E-11     5.60E-14                      1.68E-09             0.0000019
8        0.22138      6.14626           0.00000754   0.00000817                    0.00009252           0.00396
9        0.20341      6.41212           0.00001611   0.00001745                    0.00020511           0.01911
10       0.05087      12.82208          0.00000322   0.00000279                    0.00011988           0.26143
11       0.02578      18.01153          0.00082432   0.00088072                    0.00992              0.68574
12       0.00137      78.10783          0.01375      0.01859                       0.96941              0.02085
13       0.00007104   343.097           0.98539      0.98048                       0.02008              0.00000173
What is Logistic? Predicts the log odds (event/non-event): Log(odds) = α + β1X1 + β2X2
Predicted probability of event = e^(α + β1X1 + β2X2) / (1 + e^(α + β1X1 + β2X2))
Predicted probability of non-event = 1 / (1 + e^(α + β1X1 + β2X2))
Therefore, the total probability (event + non-event) at an observation level is 1.
Type of predicted (dependent) variable: binary (1/0); one value is the event, the other the 'reference'
Business question: What is the probability of a customer defaulting?
SAS procedure: Proc Logistic (with various link functions)
Performance diagnostics (indicators of a good model):
•Concordance/discordance: if all observations were paired randomly, in what percentage of pairs is the actual event observation given the higher probability
•Significance of parameter estimates: the probability of the null hypothesis (no relationship) is < 0.001
•Sign of parameter estimates: should be intuitive and repeated in the validation sample
•Model validation: the model should be stable on both in-time and out-of-time validation samples
•Rank ordering: predicts high values when actuals are high, and vice versa
•Gains chart (KS statistic): the highest probabilities should be assigned to actual events
•AIC: parsimony (efficiency): minimum predictors, maximum prediction; compared across models
Note: The Hosmer-Lemeshow test is also good, but fails when the model sample size is large.
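The event/non-event probability formulas above can be checked directly; the coefficients here are hypothetical, not the deck's fitted model:

```python
import math

def logistic_probs(alpha, betas, xs):
    """Event/non-event probabilities from the log-odds form above."""
    z = alpha + sum(b * x for b, x in zip(betas, xs))
    p_event = math.exp(z) / (1 + math.exp(z))
    p_nonevent = 1 / (1 + math.exp(z))
    return p_event, p_nonevent

# Hypothetical coefficients and predictor values.
p_event, p_nonevent = logistic_probs(alpha=-1.0, betas=[0.5, -0.25], xs=[2.0, 4.0])
```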
GENERALIZED LINEAR MODELS - LOGISTIC REGRESSION
GENERALIZED LINEAR MODELS - LOGISTIC REGRESSION: SAMPLE MODEL OUTPUT
The LOGISTIC Procedure
Model Information
Data Set MODOUT.TU60_VAL_FICO_690_719_EXP
Response Variable outcome
Number of Response Levels 3
Model generalized logit
Optimization Technique Fisher's scoring
Number of Observations Read 607592
Number of Observations Used 607592
Response Profile
Ordered Value   outcome   Total Frequency
1               0         597504
2               1         9432

Logits modeled use outcome=0 as the reference category.

Model Fit Statistics
Criterion   Intercept Only   Intercept & Covariates
AIC         107549.99        106661.99
SC          107572.63        106956.24
-2 Log L    107545.99        106609.99
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 935.9990 24 <.0001
Score 902.4392 24 <.0001
Wald 892.8763 24 <.0001
GENERALIZED LINEAR MODELS - LOGISTIC REGRESSION: SAMPLE MODEL OUTPUT (contd.)
Type 3 Analysis of Effects
Effect                 DF   Wald Chi-Square   Pr > ChiSq
APPLICATION_PRIM_CB_   2    14.5230           0.0007
log_APPL_ADV_RATIO     2    126.6605          <.0001
log_APPL_PYMT_TO_INC   2    83.5880           <.0001
Analysis of Maximum Likelihood Estimates
Parameter              DF   Development Model Estimate   Validation Model Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept              1    1.1321                       -0.4085                     0.8909           0.2102            0.6466
APPLICATION_PRIM_CB_   1    -0.00349                     -0.00220                    0.00122          3.2494            0.0715
log_APPL_ADV_RATIO     1    -0.3934                      -0.2839                     0.0485           34.2834           <.0001
log_APPL_PYMT_TO_INC   1    -0.1206                      -0.0900                     0.0221           16.5920           <.0001
Odds Ratio Estimates
Effect outcome Point Estimate 95% Wald Confidence Limits
APPLICATION_PRIM_CB_ 1 0.998 0.995 1.000
log_APPL_ADV_RATIO 1 0.753 0.685 0.828
log_APPL_PYMT_TO_INC 1 0.914 0.875 0.954
Percent Concordant 65.9 Somers' D 0.338
Percent Discordant 32.1 Gamma 0.345
Percent Tied 2.0 Tau-a 0.074
Pairs 1806529536 c 0.669
The higher the percent concordant, the better the model.
GENERALIZED LINEAR MODELS - LOGISTIC REGRESSION: RANK-ORDERING OUTPUT (contd.)
predgroup   obs   minpred   maxpred   avgpred   totact   avgact   cumact   predrank   cumpct   actrank   KS
1 12551 0.2069 1 0.275689 3632 0.289379 3632 1 22.51984 1 14.5573
2 13077 0.172384 0.206895 0.190708 2565 0.196146 6197 2 38.42386 2 21.07661
3 12932 0.163982 0.172383 0.165289 2179 0.168497 8376 3 51.93452 3 24.98741
4 12696 0.118382 0.163978 0.142257 1727 0.136027 10103 4 62.64261 4 25.9028
5 12814 0.096125 0.118381 0.105572 1360 0.106134 11463 5 71.07515 5 24.10965
6 12814 0.086392 0.096124 0.091463 1151 0.089824 12614 6 78.21181 6 20.83402
7 12814 0.077738 0.086391 0.081861 1061 0.0828 13675 7 84.79043 7 16.92002
8 11344 0.07317 0.077737 0.075261 811 0.071492 14486 8 89.81895 8 12.54508
9 14284 0.069614 0.073168 0.072034 894 0.062588 15380 9 95.3621 9 6.134163
10 12814 0.03382 0.069613 0.060393 748 0.058374 16128 10 100 10 0
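The KS column can be reproduced at record level by tracking the cumulative separation of events and non-events down the predicted ranking; the scored records here are toy data, not the table above:

```python
# Hypothetical scored records: (predicted probability, actual event flag).
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
          (0.4, 0), (0.3, 1), (0.2, 0), (0.1, 0), (0.05, 0)]
scored.sort(key=lambda t: t[0], reverse=True)  # rank by predicted probability

events = sum(flag for _, flag in scored)
nonevents = len(scored) - events

# KS = maximum separation between cumulative %events and cumulative %non-events.
cum_e = cum_n = 0
ks = 0.0
for _, flag in scored:
    cum_e += flag
    cum_n += 1 - flag
    ks = max(ks, abs(cum_e / events - cum_n / nonevents) * 100)
```

A model that front-loads actual events into the high-probability ranks produces a large KS, which is what the decile table's KS column summarizes.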
[Gains chart: Responders captured (%) vs. Population (%), comparing model capturing against random capturing]
CAPTURING OF THE MODEL
The column “cumpct” in the rank-ordering output indicates the cumulative percentage of responders captured up to the given decile. The model captures about 22.5% of responders in the first decile and about 71.1% of the responders in the top 5 deciles.
The higher the capturing in the initial deciles, the better the model performance.
GENERALIZED LINEAR MODELS - LOGISTIC REGRESSION: GAINS CHART (contd.)
CRITERIA FOR FINE-TUNING
Fine-tuning is based on applying the model to both the development and validation samples. The following criteria are considered when fine-tuning the model:
•Rank Ordering
•Coefficient Stability
•Concordance
•Highest KS
•Goodness-of-fit
•Validation
•Capturing
RECAP

Phase I
•UNDERSTAND THE BUSINESS PROBLEM
•TRANSLATE THE BUSINESS PROBLEM INTO A STATISTICAL PROBLEM BASED ON THE IBCVM FRAMEWORK
  - Decide on the number of models and identify the dependent variable for each model
  - Identify the statistical method suitable for each predictive model: OLS Regression, Logistic Regression, etc.
  - Hypothesize predictor variables
•PREPARE DATA SPECIFICATIONS & GET DATA

Phase II
•RAW DATA split into a DEVELOPMENT SAMPLE and a VALIDATION SAMPLE (sub-samples of raw data), plus an out-of-time VALIDATION SAMPLE
•UNIVARIATE ANALYSIS: treatment of outliers
•BIVARIATE ANALYSIS: treatment of missing values; variable transformations
•MULTIVARIATE ANALYSIS: removal of multicollinearity; removal of insignificant variables
•MODEL DEVELOPMENT: OLS/Logistic Regression; fine-tuning
•Model validation, and refinement based on client feedback

Phase III
•MODEL IMPLEMENTATION
  - Prepare scoring code
  - Track model performance at regular intervals
  - Redevelop/rebuild models on a need basis
REMAINING SLIDES

PENDING SLIDES:
•OTHER TESTS (t-tests, ANOVA, chi-square, etc.)
•PITFALLS IN STATISTICS
•SPURIOUS CORRELATION
•ENDOGENOUS & EXOGENOUS ERRORS
•ACCURACY vs. RANKING
•CAUSAL vs. CORRELATION
•POPULATION STABILITY INDEX

OTHER THINGS TO BE ADDED:
•BEST PRACTICES DOCUMENT
•SAS & EXCEL MACROS
•REFERENCES
•SAMPLE DATA, CODE, OUTPUT
•CHEAT SHEET