Periodic Model Monitoring Framework
for Accurate Demand Forecasting
CONTENTS
1 INTRODUCTION
2 PERFORMANCE METRICS AND APPLICABILITY TO DIFFERENT MODELS
3 MODEL MONITORING FRAMEWORK
3.1 DEVELOPING MODELS
3.2 PERIODIC MONITORING
3.2.1 Population Stability Index (PSI)
3.2.2 Receiver Operating Characteristic (ROC) and Lift Curve
3.2.3 Kolmogorov-Smirnov (KS) Test
3.2.4 Rank Order
4 MODEL MONITORING PLATFORMS
5 CONCLUSION
6 REFERENCES
1 Introduction
Response or propensity models are widely used by businesses to
identify customers who are likely to purchase a product. To help develop these models, techniques like logistic regression and
decision trees have been deployed across verticals such as banking, telecom, and retail. Advanced algorithms, such as
random forests, gradient boosting, and neural networks, are also used for classifying outcomes, such as responder vs non-responder and good vs bad. Since propensity models are
extremely important to a business, a lot of attention is paid to statistical validation of the selected model before it is deployed.
Most importantly, even after the model has been validated and found to be robust, periodic monitoring is imperative to ensure that the model continues to perform at peak efficiency over time. Ongoing monitoring is also required to determine whether changes in market conditions or business strategies demand adjustment, redevelopment, or replacement of the model.
Several metrics are used to monitor the model’s performance and its validity. This paper discusses the different ways of confirming
the model’s validity. In addition, it attempts to answer the question of when to retire a model.
The use of response models is very important in the retail business because of the need for accurate demand forecasts. For example, over-forecasting may lead to higher inventory costs
while under-forecasting may lead to an inability to meet demand, which results in a loss of sales revenue. Retailers generate
forecasts at multiple levels including item, store, and day/week levels. Usually, when forecast models are developed, performance
metrics such as mean absolute percentage error (MAPE) or root mean square error (RMSE) are used to measure their accuracy.
A valid and accurate response model is a fundamental
requirement of a successful business. However, it has been observed that, over time, forecasts begin to deviate from actual values more than they did when the model was originally developed. For instance, Google Flu Trends,
launched in 2008, is a classic example where a model was accurate in the initial years but started degrading over time,
especially between 2011 and 2013. The use of cross-sell models in banking is another example of a model that was accurate and
robust when it was first deployed to identify responders. It helped banks improve customer relationships and generate more business value. However, after 18 months, the same model was
no longer able to identify potential responders accurately.
In the case of Google Flu Trends, modifications in the Google
search algorithm meant that the data used for prediction had changed, producing unexpected results. The cross-sell models used in the banking sector were unable to remain reliable because
they could not keep up with the effect that socio-economic changes and technological advancements had on consumer
behavior.
In both cases, changes in external conditions led to deterioration
in the model’s performance. Do such situations sound familiar? This paper will attempt to address such challenges through a framework that can help track the performance of response
models and identify inefficiencies before they lead to negative outcomes.
2 Performance Metrics and Applicability to Different Models
Typically, performance metrics used to measure the accuracy and validity of a model depend on the type of model being tested. Table 1 provides guidelines for selecting performance
measurement metrics based on the type of model or business objective.
Table 1: Selecting Appropriate Performance Measurement Metrics

MODEL TYPE: Forecasting model, linear regression model, or any model with a continuous target variable
PERFORMANCE METRICS: Root mean squared error, mean absolute percentage error, root mean squared logarithmic error, mean absolute error, bias

MODEL TYPE: Classification models
PERFORMANCE METRICS: Sensitivity, specificity, area under curve, lift curve, Kolmogorov-Smirnov (KS) statistic
Note: A detailed discussion of all the performance metrics mentioned in Table 1 is beyond the scope of this paper. As such, this paper primarily focuses on the performance metrics applicable to classification models.
3 Model Monitoring Framework
3.1 Developing Models
The primary objective behind developing response models is to identify customers who have a high likelihood of responding to a cross-sell campaign or an event of interest. Developing a model
involves multiple stages, with extensive efforts directed towards maintaining accuracy and efficiency at every step in the process.
The stages of development are as follows:
1. Understanding the business problem and objective
2. Translating business objectives into analytical objectives
3. Identifying the data scope and time window
4. Data exploration and preparation
5. Data treatment and transformation
6. Variable reduction and feature engineering
7. Model training and validation
8. Model selection and interpretation
9. Business approval
In this process, stages 5-9 involve multiple iterations. Once the model is finalized and approved, it is deployed and scored periodically.
3.2 Periodic Monitoring
Since response models are developed and deployed for use cases such as cross-sell campaigns and customer retention, they need
to be assessed periodically to measure their effectiveness over time. Several performance metrics can be used to evaluate a model; they help answer critical questions such as:
• Is the model, which was developed previously, still usable?
• Has the model's performance deteriorated?
• Is now the right time to develop a challenger model?
• Can the model still separate outcomes, such as good vs bad or responder vs non-responder?
One way to evaluate the performance of the model is by adopting a framework which measures a few key statistics at periodic
intervals. The proposed framework uses the following criteria to examine the validity of a model:
• Population Stability Index (PSI)
• Receiver Operating Characteristic (ROC) and Lift Curve
• Kolmogorov-Smirnov (KS) statistic
• Rank ordering of responders
This framework is intended to be used specifically for cross-sell and upsell models, wherein campaign response data is available from multiple sources. With minor adjustments to the process, it can be used to evaluate churn and risk models too. The adjustments are recommended since target information is not sourced from campaign data in churn and risk models.
3.2.1 Population Stability Index (PSI)
PSI measures changes in the characteristics of the population over time. As models are based on historical datasets, it is
necessary to ensure that the characteristics of the present-day population are sufficiently similar to the historical population on which the model is based in order to accurately predict the
expected lift when used in a targeted campaign.
The formula for PSI is shown below, where $\mathrm{Expected}_i$ is the % of observations in the ith decile of the development dataset and $\mathrm{Actual}_i$ is the % of observations in the ith decile of the scoring dataset.
This formula is used to calculate the PSI for each of the 10 deciles, and the total PSI is the summation of the individual PSIs from
each decile. A higher PSI indicates greater shifts in population.
$$\mathrm{PSI} = \sum_{i=1}^{n} \left(\mathrm{Actual}_i - \mathrm{Expected}_i\right)\,\ln\!\left(\frac{\mathrm{Actual}_i}{\mathrm{Expected}_i}\right)$$
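Outside SAS Enterprise Miner, the same calculation could be sketched in Python as follows. This is a minimal illustration with simulated scores; the function name, bin handling, and data are all hypothetical, not the paper's implementation.

```python
import numpy as np

def population_stability_index(expected_scores, actual_scores, n_bins=10):
    """PSI between development (expected) and current scoring (actual) score distributions.

    Bins are deciles of the development score distribution; each term is
    (actual% - expected%) * ln(actual% / expected%), summed over all bins.
    """
    # Decile breakpoints derived from the development dataset
    breakpoints = np.quantile(expected_scores, np.linspace(0, 1, n_bins + 1))
    breakpoints[0] -= 1e-9   # widen the outer edges slightly so boundary scores are counted
    breakpoints[-1] += 1e-9

    expected_pct = np.histogram(expected_scores, bins=breakpoints)[0] / len(expected_scores)
    actual_pct = np.histogram(actual_scores, bins=breakpoints)[0] / len(actual_scores)

    # Guard against empty bins before taking the logarithm
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Hypothetical example with simulated propensity scores
rng = np.random.default_rng(42)
dev_scores = rng.beta(2, 5, size=10_000)    # development (expected) population
new_scores = rng.beta(2.2, 5, size=10_000)  # current scoring (actual) population
print(f"PSI: {population_stability_index(dev_scores, new_scores):.4f}")
```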
The first step in computing the PSI is to score the current population using SAS® Enterprise Miner™. Next, create the scoring table, following all the subset criteria and business decisions taken while developing the model, and ensure that the current scoring base is connected to the Score node in SAS Enterprise Miner. The Score node takes two inputs: the scoring table and the selected model. Once the scoring table is imported into SAS Enterprise Miner, its assigned data role should be "Score"; the role can be changed from the node properties, if required.
Table 2: Process of Calculating PSI
Table 2 represents an example of the process followed to calculate the PSI. The summarized value of the PSI across all the deciles is 0.0148, which indicates that there is no significant shift in the population and, therefore, no correction of the existing model is required. The next course of action is based on the PSI value, as described in Table 3.
Figure 1 illustrates how SAS Enterprise Miner can be used for scoring new data.
Figure 1: Using SAS Enterprise Miner for scoring new data
Table 3: Recommended actions based on the PSI value

PSI < 0.1 — INSIGNIFICANT CHANGE: No action required.
PSI 0.1-0.25 — MINOR SHIFT IN POPULATION: Investigate the characteristics of the important variables; the model can be used with minor adjustments.
PSI > 0.25 — MAJOR SHIFT IN POPULATION: A major investigation is required, or rebuild the model on the currently available data.
3.2.2 Receiver Operating Characteristic (ROC) and Lift Curve
3.2.2.1 Understanding ROC Curve
The ROC curve (Figure 2) is a graphical plot which illustrates the ability of a model to classify binary events, and the Area Under the Curve (AUC) is a measure of the quality of the classification model. The ROC curve is constructed by plotting the true positive rate against the false positive rate as the threshold used to classify binary events is varied.
The ROC curve displays sensitivity and specificity for the entire
range of cut-off values. As the cut-off decreases, more and more cases are allocated to Class 1. Hence, sensitivity increases, and specificity decreases. A random classifier has an AUC of 0.5, while
AUC for a perfect classifier is equal to 1. In practice, most classification models have an AUC between 0.5 and 1. An AUC of
0.8 means that 80% of the time, a randomly selected case from the group with the event = 1 has a score greater than that for a randomly chosen case from the group with the event = 0.
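As a hedged illustration of how the ROC curve and AUC described above might be computed outside SAS Enterprise Miner, the following Python sketch uses scikit-learn; the label and score arrays are placeholders, not real campaign data.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Placeholder arrays: 1 = responder, 0 = non-responder, plus the model's predicted probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_score = np.array([0.9, 0.3, 0.8, 0.6, 0.4, 0.2, 0.7, 0.5, 0.1, 0.65])

# True positive rate vs false positive rate across all threshold values
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"AUC: {auc:.3f}")  # 0.5 = random classifier, 1.0 = perfect classifier
```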
Figure 2: ROC Curve
Sensitivity and specificity are two additional parameters that need to be considered while evaluating model performance. They are defined as follows (a brief computation sketch appears after the definitions):
• Sensitivity, also known as the true positive rate, is the ratio of correctly classified positive cases to total actual positive cases.
• Specificity, also known as the true negative rate, is the ratio of correctly classified negative cases to total actual negative cases.
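A minimal Python sketch of these two ratios, computed from a confusion matrix; the placeholder labels, scores, and the 0.5 probability cut-off are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder labels and scores; the 0.5 cut-off below is an assumption, not a recommendation
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_score = np.array([0.9, 0.3, 0.8, 0.6, 0.4, 0.2, 0.7, 0.5, 0.1, 0.65])
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # correctly classified positives / all actual positives
specificity = tn / (tn + fp)  # correctly classified negatives / all actual negatives
print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}")
```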
3.2.2.2 Understanding the Lift Chart
A Lift Chart measures the effectiveness of a classification model and is calculated as the ratio between the results obtained with and without the model. It shows how much more likely you are to receive positive responses by contacting the customers suggested by the model rather than a random sample of customers. For example, the Lift Curve may tell us that by contacting only the top 10% of customers ranked by the model, we would reach 3.5 times as many respondents as we would without a model.
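One way decile-level lift could be computed is sketched below in Python with pandas. The data, column names, and response-generation logic are hypothetical; the calculation simply compares each decile's response rate to the overall response rate.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = rng.uniform(size=1000)                   # predicted probabilities from the model
responded = rng.binomial(1, 0.05 + 0.3 * scores)  # responses more likely at higher scores
df = pd.DataFrame({"score": scores, "responded": responded})

# Decile 1 holds the highest-scored customers
df["decile"] = pd.qcut(df["score"].rank(method="first", ascending=False), 10, labels=range(1, 11))

overall_rate = df["responded"].mean()
lift_by_decile = df.groupby("decile", observed=True)["responded"].mean() / overall_rate
print(lift_by_decile)  # lift > 1 in the top deciles indicates the model beats random targeting
```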
Table 4: Calculating the Lift Curve
3.2.2.3 Plotting the ROC and the Lift Curve Chart
It is possible to use actual responses from a product cross-sell campaign to measure the effectiveness of the model regularly.
First, create a binary variable for responders using the campaign responses for all the scored and targeted customers. Then, the dataset containing the final model variables and the actual responses from the campaign is used as test data in SAS Enterprise Miner to calculate the ROC, the Lift Curve, and other statistical metrics. This allows us to compare metrics across the training, validation, and test datasets. If these values are very close, the model is effective and performing well.
The process for combining the latest scored data with actual responses from a campaign and using it as test data is described below; a minimal data-preparation sketch follows the list. It is important to note that this is refreshed data and not the data originally used to develop the model. Test data can be created by:
1. Capturing the response data for the scored customers.
Include untargeted customers scored using the model.
Capture the data if they purchased the product without any
intervention.
2. Combining the scored data and corresponding actual
response data into one table. This table will have all the
significant independent variables used in the model scoring
and response data.
3. Importing this dataset as an additional data source into the
SAS Enterprise Miner and defining the role as raw or
training data.
4. Applying the exact transformation or variable treatment
which was applied on the training data during the model
development stage.
5. Adding a data partition node from the sampling tab of the
SAS Enterprise Miner, this time with the following settings
- train: 0%, validation: 0%, and test: 100%.
6. Adding a model comparison node after the selected model
node.
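The sketch below illustrates steps 1 and 2 above in Python with pandas: capturing responses and combining them with the scored data. The table and column names are hypothetical, and the SAS Enterprise Miner import, transformation, and partition steps (3-6) are not reproduced here.

```python
import pandas as pd

# Hypothetical scored base: customer key plus predicted probability for every scored customer
scored = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "predicted_probability": [0.82, 0.15, 0.64, 0.40],
    # ... the significant independent variables used in scoring would also be carried here
})

# Hypothetical campaign/purchase responses; untargeted purchasers can be captured here too
responses = pd.DataFrame({
    "customer_id": [101, 104],
    "responded": [1, 1],
})

# Combine scored data with actual responses; customers with no response record are non-responders
test_data = scored.merge(responses, on="customer_id", how="left")
test_data["responded"] = test_data["responded"].fillna(0).astype(int)

# This combined table would then be imported into SAS Enterprise Miner and partitioned as 100% test data
print(test_data)
```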
Figure 3: The ROC curve and Lift Curve produced on SAS Enterprise Miner
3.2.3 Kolmogorov-Smirnov (KS) Test
The KS statistic quantifies the distance between the empirical
distribution function of the sample and the cumulative distribution function of the reference distribution — or between the empirical distribution functions of two samples. It is a
measure of the degree of separation between the positive and negative distributions.
The KS-statistic is 100 if the model partitions the population into two separate groups, one containing all the events and the other containing all the non-events. The higher the value, the better the model is at separating event cases from non-event cases. The KS-statistic is calculated as the maximum difference between the cumulative event (%) and the cumulative non-event (%).
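A minimal Python sketch of this maximum-difference calculation, using the decile-based approach described next; the simulated data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"score": rng.uniform(size=1000)})
df["event"] = rng.binomial(1, 0.05 + 0.4 * df["score"])  # events more likely at high scores

# Sort by descending predicted probability and split into 10 deciles
df = df.sort_values("score", ascending=False).reset_index(drop=True)
df["decile"] = np.minimum(np.arange(len(df)) // (len(df) // 10), 9) + 1

grouped = df.groupby("decile").agg(events=("event", "sum"), total=("event", "size"))
grouped["non_events"] = grouped["total"] - grouped["events"]
grouped["cum_event_pct"] = 100 * grouped["events"].cumsum() / grouped["events"].sum()
grouped["cum_non_event_pct"] = 100 * grouped["non_events"].cumsum() / grouped["non_events"].sum()

ks_statistic = (grouped["cum_event_pct"] - grouped["cum_non_event_pct"]).max()
print(f"KS-statistic: {ks_statistic:.1f}")
```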
As shown in Table 5, there are two ways of calculating the KS-
statistic:
• Split scored population into deciles (10 parts) ordered by decreasing predicted probability value. Then compute the
cumulative % of events and non-events in each decile and the difference between them. The maximum value of this difference is the KS-statistic.
• Compute the KS two-sample test with PROC NPAR1WAY, which generates the difference metrics.
Table 5: Computing the KS-statistic value
Figure 4: The KS-statistic chart
3.2.4 Rank Order
Rank order is a simple concept wherein the event rate is observed
across deciles created using predicted probability values. The value of the event rate should be in decreasing order as we move
from the top deciles to the bottom deciles. In the model development phase, rank ordering of the event rate is observed on the training and validation datasets; if the event rate is not rank ordered there, the model is not considered the best candidate. A rank order table can be created using the current analytical table as well as the combined response and scoring datasets. The event rate should be rank ordered in the new dataset for the model to pass the performance criteria.
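A minimal Python sketch of this rank-order check: compute the event rate per predicted-probability decile and verify that it decreases from the top decile to the bottom. The simulated data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"score": rng.uniform(size=1000)})
df["event"] = rng.binomial(1, 0.05 + 0.4 * df["score"])  # events more likely at high scores

# Decile 1 = highest predicted probabilities
df["decile"] = pd.qcut(df["score"].rank(method="first", ascending=False), 10, labels=range(1, 11))

event_rate = df.groupby("decile", observed=True)["event"].mean()
print(event_rate)
print(f"Rank ordered: {event_rate.is_monotonic_decreasing}")
# A break in the top deciles (1-6) warrants investigation; a break in deciles 7-10 is less of a concern.
```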
Table 6: Rank order
Figure 5: Plotting rank order
It is important to note that there is a possibility of a break in the rank ordering of the event rate. If the break occurs towards the top deciles (1-6), there is a need to investigate; if it occurs towards the bottom deciles (7-10), it is not of significant concern.
4 Model Monitoring Platforms
In this paper, the model monitoring framework is illustrated using SAS Enterprise Miner, a market-leading predictive modeling and machine learning solution that enables data scientists to use advanced machine learning algorithms for business applications with ease. In addition, SAS Model Manager can also be used for periodic model monitoring and champion vs challenger model decision-making. RapidMiner and KNIME are a couple of other model development products that are currently available.
The proposed model monitoring framework can also be programmed, customized, and implemented using open-source technologies such as Python, Spark, and R. These ecosystems have
rich libraries containing machine learning and data manipulation packages developed by the user and developer communities.
Core Compete has deep expertise in SAS technologies. SAS-certified data scientists at Core Compete are helping numerous
organizations across industries realize their optimal business value by deploying predictive and machine learning models. Core Compete has developed and deployed machine learning models
for clients in verticals such as retail, banking and financial services, telecom, and healthcare — across geographies including
the US, UK, Middle East, and South Asia.
5 Conclusion
Periodic model monitoring is an important exercise that will ensure that models maintain their accuracy and robustness over an extended period. Given that market conditions and consumer
behavior are rapidly evolving, model monitoring has become an essential part of data-driven decision-making. Since model
development requires significant effort and engagement with decision-makers, accurate and timely decisions on the production model are vital. The framework proposed in this paper has been
illustrated using SAS Enterprise Miner but can easily be developed and deployed on any other platform (Python, R etc.) with alert
mechanisms.
By Lokendra Devangan
Lokendra is a Manager of Data Science for Core Compete
Contact Us
For more information, email Core Compete at [email protected]
6 References
1. Gainers and Losers in Gartner 2018 Magic Quadrant for Data Science and Machine Learning Platforms. https://www.kdnuggets.com/2018/02/gartner-2018-mq-data-science-machine-learning-changes.html
2. Getting Started with SAS Enterprise Miner 14.3. Copyright © 2017, SAS Institute Inc., Cary, NC, USA.
3. SAS® Model Manager 14.2: User's Guide. Copyright © 2016, SAS Institute Inc., Cary, NC, USA.
4. Lazer, D., Kennedy, R., King, G., and Vespignani, A. The Parable of Google Flu: Traps in Big Data Analysis. Science, 14 Mar 2014: Vol. 343, Issue 6176, pp. 1203-1205.
5. Google Flu Trends. https://en.wikipedia.org/wiki/Google_Flu_Trends