
Customer Lifetime Value Measurement using Machine Learning Techniques

Tarun Rathi

Mathematics and Computing

Department of Mathematics

Indian Institute of Technology (IIT), Kharagpur -721302

[email protected]

Project guide: Dr. V Ravi

Associate Professor, IDRBT

Institute for Development and Research in Banking Technology (IDRBT)

Road No. 1, Castle Hills, Masab Tank, Hyderabad – 500 057

http://www.idrbt.ac.in/

July 8, 2011


Certificate

Date: July 8, 2011

This is to certify that the project report entitled “Customer Lifetime Value Measurement using Machine Learning Techniques”, submitted by Mr. TARUN RATHI, 3rd-year student in the Department of Mathematics, enrolled in its 5-year integrated MSc course in Mathematics and Computing, Indian Institute of Technology, Kharagpur, is a record of bonafide work carried out by him under my guidance during the period May 6, 2011 to July 8, 2011 at the Institute for Development and Research in Banking Technology (IDRBT), Hyderabad.

The project work is a research study which has been successfully completed as per the set objectives. I found Mr. TARUN RATHI sincere and hardworking, with the capability and aptitude for independent research work.

I wish him every success in his life.

Dr. V Ravi

Associate Professor, IDRBT

Supervisor


Declaration by the candidate

I declare that the summer internship project report entitled “Customer Lifetime Value Measurement using Machine Learning Techniques” is my own work, conducted under the supervision of Dr. V Ravi at the Institute for Development and Research in Banking Technology, Hyderabad. I have put in 64 days of attendance with my supervisor at IDRBT and was awarded a project fellowship.

I further declare that, to the best of my knowledge, the report does not contain any part of any work which has been submitted for the award of any degree, either at this institute or at any other university, without proper citation.

Tarun Rathi

III yr. Undergraduate Student

Department of Mathematics

IIT Kharagpur

July 8, 2011


Acknowledgement

I would like to thank Mr. B. Sambamurthy, Director of IDRBT, for giving me this opportunity.

I gratefully acknowledge the guidance of Dr. V. Ravi, who helped me sort out all the problems in clarifying concepts and without whose support the project would not have reached its present state. I would also like to thank Mr. Naveen Nekuri for his guidance and sincere help in understanding important concepts and in the development of the WNN software.

Tarun Rathi

III yr. Undergraduate Student

Department of Mathematics

IIT Kharagpur

July 8, 2011


Abstract: Customer Lifetime Value (CLV) is an important metric in relationship marketing approaches. There have long been traditional techniques like Recency, Frequency and Monetary Value (RFM), Past Customer Value (PCV) and Share-of-Wallet (SOW) for segregating customers into good or bad, but these are not adequate, as they segment customers based only on their past contribution. CLV, on the other hand, estimates the future value of a customer over his or her entire lifetime, which means it takes into account the prospect of a bad customer becoming good in the future, and hence profitable for a company or organisation. In this paper, we review the various models and different techniques used in the measurement of CLV. Towards the end we compare various machine learning techniques, namely Classification and Regression Trees (CART), Support Vector Machines (SVM), SVM using SMO, Additive Regression, the K-Star method, Multilayer Perceptron (MLP) and Wavelet Neural Network (WNN), for the calculation of CLV.

Keywords: Customer Lifetime Value (CLV), RFM, Share-of-Wallet (SOW), Past Customer Value (PCV), machine learning techniques, data mining, Support Vector Machines, Sequential Minimal Optimization (SMO), Additive Regression, K-Star method, Artificial Neural Networks (ANN), Multilayer Perceptron (MLP), Wavelet Neural Network (WNN).


Contents

Certificate

Declaration by the candidate

Acknowledgement

Abstract

1. Introduction

2. Literature Review

2.1 Aggregate Approach

2.2 Individual Approach

2.3 Models and Techniques to calculate CLV

2.3.1 RFM Models

2.3.2 Computer Science and Stochastic Models

2.3.3 Growth/Diffusion Models

2.3.4 Econometric Models

2.3.5 Some other Modelling Approaches

3. Estimating Future Customer Value using Machine Learning Techniques

3.1 Data Description

3.2 Models and Software Used

3.2.1 SVM

3.2.2 Additive Regression and K-Star

3.2.3 MLP

3.2.4 WNN

3.2.5 CART

4. Results and Comparison of Models

5. Conclusion and Directions of future research

References


1. Introduction: Customer Lifetime Value has become a very important metric in Customer Relationship Management. Firms are increasingly relying on CLV to manage and measure their business. CLV is a disaggregate metric that can be used to identify customers who can be profitable in the future and hence to allocate resources accordingly (Kumar and Reinartz, 2006). Besides, the CLV of current and future customers is also a good measure of the overall value of a firm (Gupta, Lehmann and Stuart, 2004).

There have been other measures as well which are fairly good indicators of customer loyalty, like Recency, Frequency and Monetary Value (RFM), Past Customer Value (PCV) and Share-of-Wallet (SOW). The customers who are most recent and have a high frequency and total monetary contribution are said to be the best customers in this approach. However, it is possible that a star customer of today may not be the same tomorrow. Malthouse and Blattberg (2005) give examples of customers who are good at a certain point and not good later, and of a bad customer turning good after a change of job. Past Customer Value (PCV), on the other hand, calculates the total previous contribution of a customer adjusted for the time value of money. Again, PCV does not take into account the possibility of a customer being active in the future (V. Kumar, 2007). Share-of-Wallet is another metric of customer loyalty, which takes into account the brand preference of a customer: it measures the amount that a customer spends on a particular brand against other brands. However, it is not always possible to get the details of a customer's spending on other brands, which makes the calculation of SOW a difficult task. A common disadvantage these models share is the inability to look forward: they do not consider the prospect of a customer being active in the future. The calculation of the probability of a customer being active in the future is a very important part of CLV calculation, and it is what differentiates CLV from these traditional metrics of customer loyalty. It is very important for a firm to know whether a customer will continue his relationship with it in the future or not. CLV helps firms understand the future behaviour of a customer and thus enables them to allocate their resources accordingly.

Customer Lifetime Value is defined as the present value of all future profits obtained from a customer over his or her entire lifetime of relationship with the firm (Berger and Nasr, 1998). A very basic model to calculate the CLV of a customer is (V. Kumar, 2007):

$$\mathrm{CLV}_i = \sum_{t=1}^{T} \frac{\text{future contribution margin}_{it} - \text{future cost}_{it}}{(1+d)^{t}}$$

where i is the customer index, t is the time index, T is the number of time periods considered for estimating CLV and d is the discount rate.
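As a worked illustration, a minimal Python sketch of this computation; the margins, costs and discount rate below are invented numbers, not data from the report:

```python
# Present value of (margin - cost) over T future periods at discount rate d,
# following the basic CLV formula above.
def clv(margins, costs, d):
    return sum((m - c) / (1 + d) ** t
               for t, (m, c) in enumerate(zip(margins, costs), start=1))

# A customer with three projected half-yearly margins and servicing costs, d = 10%.
print(round(clv([120.0, 100.0, 90.0], [20.0, 15.0, 15.0], 0.10), 2))
```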


There are various models to calculate the CLV of a customer or a cohort of customers, depending on the amount of data available and the type of company. V. Kumar (2007) has described an individual-level approach and an aggregate-level approach to calculating CLV. He has linked CLV to Customer Equity (CE), which is nothing but the average CLV of a cohort of customers. Dwyer (1997) used a customer migration model to take into account the repeat purchase behaviour of customers. Various behaviour-based models like logit models and multivariate probit models have also been used (Donkers, Verhoef and Jong, 2007), as have models which take into account the relationship between various components of CLV, like customer acquisition and retention (Thomas, 2001). We will present some of the most used models to calculate CLV in the later part of the paper. Besides these, there are various techniques used to calculate CLV or the parameters needed to calculate it. Aeron, Kumar and Janakiraman (2010) have presented various parameters that may be useful in the calculation of CLV, including acquisition rate, retention rate, add-on selling rate, purchase probability, purchase amount, discount rate, referral rate and cost factor. However, all of these parameters may not be required in a single model. Various researchers have used different techniques to calculate these parameters: Hansotia and Wang (1997) used logistic regression, Malthouse and Blattberg (2005) used linear regression for predicting future cash flows, Dries and Van den Poel (2009) used quantile regression, and Haenlein et al. (2007) used CART and a Markov chain model to calculate CLV. An overview of the various data mining techniques used to calculate the parameters for CLV has been compiled by Aeron, Kumar and Janakiraman (2010). Besides these, many researchers also use models like Pareto/NBD, BG/NBD, MBG/NBD, CBG/NBD, probit, tobit, ARIMA, Support Vector Machines and Kohonen networks to calculate CLV. Malthouse (2009) presents a list of the methods used by academicians and researchers who participated in the Lifetime Value and Customer Equity Modelling Competition.

Most of the above-mentioned models are used either to calculate the variables used to predict CLV or to find a relationship between them. In our research, we have used several non-linear techniques, namely Classification and Regression Trees (CART), Support Vector Machines (SVM), SVM using SMO, Additive Regression, the K-Star method, Multilayer Perceptron (MLP) and Wavelet Neural Network (WNN), to calculate CLV; these take care of the relationship between the variables which act as inputs in the prediction of CLV. Further, we compare these techniques to find the best-fitted model for the dataset we used. Finally, we draw conclusions and discuss areas of future research.

2. Literature Review: Before going into the details of the various models of CLV, let us first look at the approaches designed for calculating it. They can broadly be classified in two ways: a) the aggregate approach and b) the individual approach.

2.1 Aggregate Approach: This approach revolves around calculating the Customer Equity (CE) of a firm. Customer Equity is nothing but the average CLV of a cohort of customers. Various researchers have devised different ways to calculate the CE of a firm. Gupta, Lehmann and Stuart


(2004) have calculated CE by summing the CLV of all customers and taking its average. Berger and Nasr (1998) calculated CLV from the lifetime value of a customer segment, taking into account the rate of retention and the average acquisition cost per customer:

$$\text{Avg. CLV} = \sum_{t=1}^{T} GC \cdot \frac{r^{t}}{(1+d)^{t}} - A$$

where GC is the expected gross contribution per period, r is the rate of retention, d is the discount rate and A is the average acquisition cost per customer.

Kumar and Reinartz (2006) gave a formula for calculating the retention rate of a customer segment as follows:

$$\text{Retention rate (\%)} = \frac{\text{No. of customers in segment buying in period } t}{\text{No. of customers in segment buying in period } (t-1)} \times 100$$

The retention rate can be projected as:

$$R_t = R_{max} \times \left(1 - e^{-rt}\right)$$

where R_t is the predicted retention rate for a given future period, R_max is the maximum attainable retention rate (given by the firm), and r is the coefficient of retention, calculated as:

$$r = \frac{1}{t}\left[\ln(R_{max}) - \ln(R_{max} - R_t)\right]$$

This model is good enough for calculating the CLV of a segment of customers over a small period of time; however, the fluctuation of the retention rate and of the gross contribution margin needs to be taken care of when projecting CLV over longer periods. Taking this into account, they proposed another model in which the profit function over time can be calculated separately. This model is given as:

$$\mathrm{CLV} = \sum_{t=1}^{T} \pi(t) \times \frac{r^{t}}{(1+d)^{t}}$$

where π(t) is the profit function over time, r is the retention rate and d is the discount rate.
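A small sketch of these two expressions; all inputs are illustrative assumptions, not numbers from the report:

```python
import math

def projected_retention(r_max, r_coeff, t):
    """R_t = R_max * (1 - e^(-r t)): predicted retention rate t periods ahead."""
    return r_max * (1 - math.exp(-r_coeff * t))

def segment_clv(profit, retention, d, T):
    """CLV = sum_{t=1}^{T} pi(t) * r^t / (1+d)^t, pi(t) being the profit function."""
    return sum(profit(t) * retention ** t / (1 + d) ** t for t in range(1, T + 1))

print(round(projected_retention(r_max=0.9, r_coeff=0.4, t=3), 3))
print(round(segment_clv(lambda t: 100.0, retention=0.8, d=0.1, T=10), 2))
```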

Blattberg, Getz and Thomas (2001) calculated average CLV, or CE, as the sum of the return on acquisition, the return on retention and the return on add-on selling across the entire customer base. They summarized the formula as:

$$CE(t) = \sum_{i=1}^{I}\Bigg[ N_{i,t}\,\alpha_{i,t}\,(S_{i,t}-c_{i,t}) - N_{i,t}\,B_{i,a,t} + N_{i,t}\,\alpha_{i,t}\sum_{k=1}^{\infty}\Bigg(\prod_{j=t+1}^{t+k}\rho_{i,j}\Bigg)\big(S_{i,t+k}-c_{i,t+k}-B_{i,r,t+k}-B_{i,AO,t+k}\big)\left(\frac{1}{1+d}\right)^{k}\Bigg]$$

where

CE(t) is the customer equity value for customers acquired at time t,

N_{i,t} is the number of potential customers at time t for segment i,

α_{i,t} is the acquisition probability at time t for segment i,

ρ_{i,t} is the retention probability at time t for a customer in segment i,

B_{i,a,t} is the marketing cost per prospect (N) for acquiring customers for segment i,

B_{i,r,t} is the marketing cost in time period t for retained customers in segment i,

B_{i,AO,t} is the marketing cost in time period t for add-on selling in segment i,

d is the discount rate,

S_{i,t} is the sales of the products/services offered by the firm at time t for segment i,

c_{i,t} is the cost of goods at time t for segment i,

I is the number of segments, i is the segment designation and t_0 is the initial time period.

Rust, Lemon and Zeithaml (2004) used a CLV model in which they considered the case where a customer switches between different brands. In using this model, however, one needs a customer base which provides information about previously purchased brands, the probability of purchasing different brands, etc. Here the CLV of customer i to brand j is given as:

$$CLV_{ij} = \sum_{t=0}^{T_{ij}} \frac{1}{(1+d_j)^{t/f_i}}\; v_{ijt}\, \pi_{ijt}\, p_{ijt}$$

where

T_{ij} is the number of purchases customer i makes during the specified time period,

d_j is firm j's discount rate,

f_i is the average number of purchases customer i makes in a unit time (e.g. per year),

v_{ijt} is customer i's expected purchase volume of brand j in purchase t,

π_{ijt} is the expected contribution margin per unit of brand j from customer i in purchase t, and

p_{ijt} is the probability that customer i buys brand j in purchase t.


The Customer Equity (CE) of firm j is then calculated as the mean CLV of all customers across

all firms multiplied by the total number of customers in the market across all brands.

2.2 Individual Approach: In this approach, CLV is calculated for an individual customer as the sum of cumulated cash flows, discounted using the weighted average cost of capital (WACC), over his or her entire lifetime (Kumar and George, 2007). The CLV in this case depends on the activity of the customer, or his expected number of purchases during the prediction period, and on his expected contribution margin. The basic formula for CLV in this approach is:

$$CLV_i = \sum_{t=1}^{T} P(\text{Active})_{it} \times \frac{GC_{it}}{(1+d)^{t}}$$

where GC_{it} is the gross contribution margin for customer i in period t and d is the discount rate.

This approach highlights the need to calculate the probability of a customer being active, or P(Active). There are various ways to calculate P(Active); V. Kumar (2007) calculates it as:

$$P(\text{Active}) = \left(\frac{T}{N}\right)^{n}$$

where n is the number of purchases in the observation period, T is the time elapsed between acquisition and the most recent purchase, and N is the time elapsed between acquisition and the period for which P(Active) needs to be calculated.
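For instance, with hypothetical numbers: a customer who made n = 4 purchases in the observation period, with the most recent purchase T = 20 months after acquisition, evaluated at N = 24 months, gives P(Active) = (20/24)^4 ≈ 0.48.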

This model, however, is quite simple. Several researchers have used statistically advanced methods to calculate P(Active) or the expected frequency of purchase. Most have also taken into account other factors like channel communication, recency of purchase, customer characteristics, switching costs and first contribution margin to make the predictions more accurate.

Venkatesan and Kumar (2004), in their approach to calculating CLV, predicted the customer's purchase frequency based on past purchases. The CLV function in this case is represented as:

$$CLV_i = \sum_{y=1}^{T_i} \frac{CM_{i,y}}{(1+r)^{\,y/\text{frequency}_i}} \;-\; \sum_{l=1}^{n} \sum_{m} \frac{c_{i,m,l} \times x_{i,m,l}}{(1+r)^{l}}$$

where

CLV_i is the lifetime value of customer i,

CM_{i,y} is the contribution margin from customer i in purchase occasion y,

r is the discount rate,

c_{i,m,l} is the unit marketing cost for customer i in channel m in year l,

x_{i,m,l} is the number of contacts to customer i in channel m in year l,

frequency_i is the predicted purchase frequency for customer i,

n is the number of years to forecast, and

T_i is the predicted number of purchases made by customer i until the end of the planning period.
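A minimal sketch of this computation, with invented margins, contact plans and costs (none of these numbers come from the report):

```python
# Sketch of the Venkatesan and Kumar (2004) CLV expression above.
def vk_clv(margins, frequency, contacts, unit_costs, r):
    """margins[y]: CM_{i,y} per purchase occasion; contacts[l][m] and
    unit_costs[l][m]: contacts and unit marketing cost in channel m, year l."""
    gross = sum(cm / (1 + r) ** ((y + 1) / frequency)
                for y, cm in enumerate(margins))
    marketing = sum(sum(c * x for c, x in zip(unit_costs[l], contacts[l]))
                    / (1 + r) ** (l + 1) for l in range(len(contacts)))
    return gross - marketing

# 5 predicted purchases at about 2 per year, 2 channels over 2 years, r = 15%.
print(round(vk_clv([60, 55, 50, 50, 45], 2.0,
                   [[4, 2], [3, 1]], [[1.5, 5.0], [1.5, 5.0]], 0.15), 2))
```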

Besides this, there have been various other models and techniques to calculate P(Active) or the expected frequency of purchase, including Pareto/NBD, BG/NBD, MBG/NBD, CBG/NBD, probit, tobit, the generalized gamma distribution and the log-normal distribution. Various researchers and academicians who participated in the 2008 DMEF CLV Modelling Competition used some of these models to calculate CLV. We will learn more about these in the next part of the paper, where we study the various models and techniques used by researchers to calculate the parameters of CLV or CLV itself.

As we have seen, there are various aggregate and disaggregate approaches to calculating CLV. The obvious question is which model to use. Kumar and George (2007) have given a detailed comparison of these models. They observed that an aggregate approach performs poorly in terms of time to implement and expected benefits, while a disaggregate approach has higher data requirements and more metrics to track. They also concluded that model selection should depend on the requirements of the firm and on which criteria it gives more importance to in comparison with others: one firm may consider the cost involved an important factor, while another may consider expected profits a major factor. Kumar and George (2007) have also proposed an integrated or hybrid approach to calculating CLV, in which an appropriate approach is adopted depending on the details available about a customer. If the firm's transaction data and firm-customer interaction data are available, then the individual approach of Venkatesan and Kumar (2004) is adopted. If this data is not available but segment-level data is, then the Blattberg, Getz and Thomas (2001) approach is adopted; and if size-of-wallet information on customers is not available but survey data is, then the Rust, Lemon and Zeithaml (2004) approach is adopted.


2.3 Models and Techniques to calculate CLV: There are various models to calculate CLV. Most calculate the parameters needed to measure CLV using different models and then combine them into a new method for calculating CLV. For example, Fader, Hardie and Lee (2005) captured recency and frequency in one model to calculate the expected number of purchases and built another model to calculate the monetary value. Reinartz, Thomas and Kumar (2005) captured customer acquisition and retention simultaneously. Gupta et al. (2006) have given a good review of CLV modelling; we will use some of their modelling categories in this paper, with more examples and discussion.

2.3.1 RFM Models: RFM models have been in use in direct marketing for more than 30 years. These models are the most common in industry because of their ease of use. They are based on three levels of information from customers, i.e. their recency, frequency and monetary contribution. Fader, Hardie and Lee (2005) have shown that the RFM variables can be used to build a CLV model and that they are sufficient statistics for their CLV model. We now briefly present two RFM-based models used to determine CLV.

Weighted RFM Model: Mahboubeh Khajvand and Mohammad Jafar Tarokh (2010) presented this model for estimating customer future value based on data from an Iranian bank. From the raw data they calculated the recency, frequency and monetary value of each customer. Using clustering techniques like K-means, they segmented the data into groups and calculated the CLV of each cluster using the following formula:

$$CLV_{cluster} = N_R \times W_R + N_F \times W_F + N_M \times W_M$$

where N_R, N_F and N_M are the cluster's recency, frequency and monetary values, and W_R, W_F and W_M are the weights of recency, frequency and monetary value, obtained by the AHP method based on expert judgment.
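A minimal sketch of this segment-and-score step, assuming scikit-learn; the RFM matrix and the weights below are invented for illustration, not the bank's data:

```python
import numpy as np
from sklearn.cluster import KMeans

rfm = np.array([[2.0, 10, 500.0],   # recency score, frequency, monetary value
                [15.0, 2, 80.0],
                [5.0, 6, 300.0],
                [18.0, 1, 40.0]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(rfm)

w_r, w_f, w_m = 0.2, 0.3, 0.5        # weights, e.g. elicited via AHP
for k in range(2):
    r, f, m = rfm[labels == k].mean(axis=0)
    print(f"cluster {k}: score = {w_r * r + w_f * f + w_m * m:.1f}")
```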

The key limitation of this modelling approach is that it is a scoring model rather than a CLV model: it divides customers into segments and calculates a score for each segment, but does not actually provide a dollar value for each customer. To overcome this, Mahboubeh Khajvand and Mohammad Jafar Tarokh (2010) proposed a multiplicative seasonal ARIMA (Auto-Regressive Integrated Moving Average) method to calculate CLV, which is a time-series prediction method. The multiplicative seasonal ARIMA(p,d,q)x(P,D,Q)s model, where

p = order of the autoregressive process,

d = order of the differencing operator,

q = order of the moving average process,


P = order of the seasonal autoregressive process,

D = order of the seasonal differencing operator,

Q = order of the seasonal moving average process,

can be represented by:

$$\phi(B)\,\Phi(B^{s})\,\nabla^{d}\,\nabla_{s}^{D}\,x_{t} = \theta(B)\,\Theta(B^{s})\,\varepsilon_{t}$$

where φ(B) is the autoregressive operator, θ(B) is the moving average operator, ∇^d is the d-fold differencing operator, used to change a nonstationary time series into a stationary one, Θ(B^s) is the seasonal moving average operator and ∇_s^D is the D-fold seasonal differencing operator.
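A minimal sketch of fitting such a model with statsmodels on a synthetic monthly series; the orders and the seasonality s = 12 are assumptions for illustration:

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
t = np.arange(48)
y = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 2, 48)

# ARIMA(p,d,q) x (P,D,Q)_s with illustrative orders and s = 12
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12)).fit(disp=False)
print(model.forecast(steps=6))   # predicted values for the next interval(s)
```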

The main limitation of this model was that it predicted the future value of customers in the

next interval only due to lack of data.

RFM and CLV using iso-value curves: Fader, Hardie and Lee (2005) proposed this model to calculate CLV. They showed that no information other than the RFM characteristics is required to formulate it. They used the "lost for good" approach, which means that customers who end their relationship with a firm never come back, and they assumed that M is independent of R and F. This suggests that the value per transaction can be factored out, so that the flow of future transactions can be forecast and then rescaled by a monetary value (a multiplier) to yield a dollar number for each customer; the number being rescaled is the discounted expected transactions (DET). The model is formulated as:

CLV = margin x revenue/transaction x DET

The calculation of DET is the most important part of this model. Fader, Hardie and Lee (2005) first calculated DET for a customer with observed behaviour (X = x, t_x, T), over a discrete horizon, as a sum of the form:

$$DET = \sum_{t} \frac{E[X_t \mid X = x, t_x, T]}{(1+d)^{t}}$$


Here, the numerator is the expected number of transactions in period t and d is the discount rate. However, according to Blattberg, Getz and Thomas (2001), this calculation of CLV has the following problems: a) we do not know the time horizon for projecting the sales, b) it is unclear what time periods to measure, and c) the expression ignores the specific timing of transactions. Hence Fader, Hardie and Lee used the Pareto/NBD model with a continuous-time formulation, instead of the discrete-time formulation, to compute DET (and thus CLV) over an infinite time horizon. The DET is then calculated as:

$$DET(\delta \mid r, \alpha, s, \beta, X = x, t_x, T) = \frac{\alpha^{r}\,\beta^{s}\,\delta^{s-1}\,\Gamma(r+x+1)\,\Psi\!\big(s, s; \delta(\beta+T)\big)}{\Gamma(r)\,(\alpha+T)^{r+x+1}\,L(r, \alpha, s, \beta \mid X = x, t_x, T)}$$

where r, α, s, β are the Pareto/NBD parameters, δ is the (continuously compounded) discount rate, Ψ(·) is the confluent hypergeometric function of the second kind, and L(·) is the Pareto/NBD likelihood function. They then added a general model of monetary value to arrive at a dollar value of CLV, assuming that a customer's individual transaction values vary around his or her average transaction value. After checking various distributions they found that the gamma distribution fitted their data best, and calculated the expected average transaction value for a customer with an average spend of m̄_x across x transactions as:

$$E(M \mid p, q, \gamma, \bar{m}_x, x) = \left(\frac{q-1}{px+q-1}\right)\frac{p\gamma}{q-1} + \left(\frac{px}{px+q-1}\right)\bar{m}_x$$

where p, q and γ are the parameters of their gamma-gamma spend model. This monetary value, multiplied by DET, gave the CLV of a customer.

Following this, various graphs called iso-value curves were drawn to identify customers with different purchase histories but similar CLVs: CLV-frequency, CLV-recency, CLV-frequency-recency, etc. The key limitations of this model are that it is based on a "non-contractual purchase" setting and that it is not immediately clear which distributions should be used for transaction incidence and transaction size.

2.3.2 Computer Science and Stochastic Models: These models are primarily based on data mining, machine learning, non-parametric statistics and other approaches that emphasize predictive ability. They include neural network models, projection-pursuit models, decision tree models and spline-based models (Generalized Additive Models (GAM), Classification and Regression Trees (CART), Support Vector Machines (SVM), etc.). Various researchers have used these techniques to calculate CLV. Haenlein et al. (2007) used a model based on CART and first-order Markov chains to calculate CLV. Working with data from a retail bank, they first entered various profitability drivers as predictor variables, together with a target variable, into a CART analysis to build a regression tree. This tree helped them cluster the customer base into a set of homogeneous sub-groups. They used these sub-groups as discrete states and estimated a transition matrix


describing movements between them, using Markov chains. To estimate the corresponding transition probabilities, they determined the state each customer belonged to at the beginning and end of a predefined time interval T, using the decision rules resulting from the CART analysis. In the final step, the CLV of each customer group was determined as the discounted sum of state-dependent contribution margins, weighted by the corresponding transition probabilities:

$$CLV_i = \sum_{t=0}^{T} \frac{P_t \times CM_i}{(1+d)^{t}}$$

where P_t is the probability of transition from one state to another, CM_i is the contribution margin for customer i and d is the discount rate.
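A minimal numeric sketch of this discounted Markov computation; the transition matrix, state margins, horizon and discount rate are all invented for illustration:

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],       # row: current state, column: next state
              [0.3, 0.5, 0.2],
              [0.1, 0.3, 0.6]])
cm = np.array([120.0, 60.0, 10.0])   # state-dependent contribution margins
d, T = 0.1, 8

state = np.array([1.0, 0.0, 0.0])    # customer currently in the best state
clv = 0.0
for t in range(1, T + 1):
    state = state @ P                # occupancy probabilities after t steps
    clv += (state @ cm) / (1 + d) ** t
print(round(clv, 2))
```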

Finally, the CLVs of the customer segments were studied to derive marketing strategies for each segment. This model, however, has some limitations. It was assumed that client behaviour follows a first-order Markov process, which disregards the behaviour of earlier periods, treating it as insignificant. It was also assumed that the transition matrix is stable and constant over time, which seems inappropriate for long-term forecasts, and the possibility of brand switching in customer behaviour is not taken into account.

Malthouse and Blattberg (2005) have used linear regression to calculate CLV. The CLV in this case is related to the predictor variables x_i through some regression function f as

$$g(CLV_i) = f(x_i) + \varepsilon_i$$

where the ε_i are independent random variables with mean 0 and error variance V(ε_i) = σ², and the invertible function g is a variance-stabilizing transformation. Various regression models can be considered for this function: a) linear regression with variance-stabilizing transformations estimated by ordinary least squares, b) linear regression estimated with iteratively re-weighted least squares (IRLS), and c) a feedforward neural network estimated using S-PLUS version 6.0.2. Methods like k-fold cross-validation are used to check the accuracy of the analysis.

Dries and Van den Poel (2009) have used quantile regression rather than linear regression to calculate CLV. It extends the mean regression model to conditional quantiles of the response variable, such as the median, and it provides insights into the effects of the covariates on the conditional CLV distribution that may be missed by the least squares method. For predicting the top x percent of customers, quantile regression is a better method than linear regression: the smaller the top segment of interest, the better the relative predictive performance.
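A minimal sketch contrasting a conditional-quantile fit with an ordinary least-squares fit, assuming statsmodels and synthetic right-skewed data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"freq": rng.poisson(5, 300).astype(float)})
df["clv"] = 40 * df["freq"] + rng.gamma(2.0, 50.0, 300)   # right-skewed target

q90 = smf.quantreg("clv ~ freq", df).fit(q=0.9)   # conditional 90th percentile
ols = smf.ols("clv ~ freq", df).fit()             # conditional mean, for contrast
print(q90.params, ols.params, sep="\n")
```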

Besides these, other data mining techniques like Decision Trees (DT), Artificial Neural Networks (ANN), Genetic Algorithms (GA), fuzzy logic and Support Vector Machines (SVM) are also in use, though mostly to calculate CLV metrics like customer churn, acquisition rate and customer targeting. Among DTs the most common are C4.5, CHAID, CART and SLIQ. ANNs have been used to capture non-linear patterns in data and can serve both classification and regression purposes, depending on the activation function.

Malthouse and Blattberg (2005) used ANN to predict future cash flows. Aeron and Kumar (2010) have described different approaches for using ANN. The first is the generalised stack approach of Hu and Tsoukalas (2003), where an ensemble method is used: the data is divided into three groups, the first with all situational variables, the second with all demographic variables and the third with both. The other is the hybrid GA/ANN approach of Kim and Street (2004) for customer targeting, where a GA searches the exponential space of features and passes one subset of features to the ANN; the ANN extracts predictive information from each subset, learns the patterns, is evaluated on a data set and returns metrics to the GA. ANN too is not without limitations. It cannot handle too many variables, so various other algorithms like GA, PCA (Principal Component Analysis) and logistic regression are used to select the variables to input to the ANN. There is no set rule for finding ANN parameters; selection of these parameters is a research area in itself. Besides all this, initial weights are decided randomly in ANN, which lengthens the time needed to reach the desired solution.

Genetic Algorithms (GA) are well suited to optimization problems, as they reach a global optimum with quick convergence, especially for high-dimensional problems. GAs have seen varied applications among CLV parameters, like multi-objective optimization (using the Genetic-Pareto algorithm), churn prediction, customer targeting, cross-selling and feature selection. A GA is either used to predict these parameters or to optimize the parameter selection of other techniques like ANN. Besides GA, fuzzy logic and Support Vector Machines also find applications in predicting churn and loyalty indices. There are many other techniques and models, like GAM (Generalized Additive Models), MARS (Multivariate Adaptive Regression Splines) and Support Vector Machines (SVM), which are used to predict or optimize the various inputs to CLV, like churn rate, logit and hazard functions, and classifications. Churn rate in itself is a very vast area of CRM which can be used as a parameter in the prediction of CLV and many other related models. There have been many worldwide competitions and tournaments in which academics and practitioners combine different models to get the best possible results. These approaches remain little known in the marketing literature and have a lot of scope for further research. The 2008 DMEF CLV Competition was one such competition, in which various researchers and academicians came together to compete on three tasks. Malthouse


(2009) has made a compilation of the various models presented in that competition.

2.3.3 Growth/Diffusion Models: These models focus on calculating the CLV of current and future customers. Forecasting the acquisition of future customers can be done in two ways. The first approach uses disaggregate customer data and builds models that predict the probability of acquiring a particular customer (Thomas, Blattberg and Fox, 2004). The other approach is to use aggregate data and diffusion or growth models to predict the number of customers a firm is likely to acquire in the future (Gupta, Lehmann and Stuart, 2004). The expression for forecasting the number of new customers at time t is:

$$n_t = \frac{\alpha}{1 + e^{-(\beta + \gamma t)}}$$

where α, β, γ are parameters of the customer growth curve. Using this, they estimated the CE of a firm (for a constant margin m and retention rate r) with an expression of the form:

$$CE = \sum_{k=0}^{\infty} \frac{n_k}{(1+i)^{k}}\left(m\,\frac{1+i}{1+i-r} - c\right)$$

where n_k is the number of newly acquired customers in cohort k, m is the margin, r is the retention rate, i is the discount rate, and c is the acquisition cost per customer.

Diffusion models can also be used to assess the value of a lost customer. For example, a bank that has recently adopted a new technology will have some customers who are reluctant to the change and will be lost; given the relative proportions of lost customers, the value of the average lost customer can then be estimated.

2.3.4 Econometric Models: Gupta et al. (2006) have given a good review of this type of model. We present it briefly here, with the example of a right-censored tobit model by Hansotia and Wang (1997). Econometric models study customer acquisition, retention and expansion (cross-selling or margin) and combine them to calculate CLV. Customer acquisition and customer retention are the key inputs for such models. Various models relate customer acquisition and retention and combine them into new ways of calculating CLV, for example the right-censored tobit model for CLV (Hansotia and Wang, 1997).


It has also been shown (Thomas, 2001) that ignoring the link between customer acquisition and retention may cause a 6-50% variation in these models: if we spend less money on acquisition, for example, the customers might walk away sooner. Retention models are broadly classified into two main categories: a) the first considers the "lost for good" approach and uses hazard models to predict the probability of customer defection; b) the second considers the "always a share" approach and typically uses Markov models. Hazard models, used to predict the probability of customer defection, are of two types: a) accelerated failure time (AFT) models (Kalbfleisch and Prentice, 1980) and b) proportional hazards (PH) models (Levinthal and Fichman, 1988). AFT models are of the form

$$\ln(t_j) = X_j\,\beta + \sigma\,\varepsilon_j$$

where t_j is the purchase duration for customer j and X_j are covariates. Different specifications of σ and ε lead to different models, such as the Weibull or generalized gamma model. PH models specify the hazard rate λ in terms of a baseline hazard λ_0(t) and covariates X as

$$\lambda(t; X) = \lambda_0(t)\,\exp(X\beta)$$

and different specifications again give the exponential, Weibull, Gompertz and other models.
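As an illustration, a proportional-hazards retention model of this kind can be fitted with the lifelines library; the customer fields below are synthetic stand-ins, not data from the report:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# lambda(t; X) = lambda0(t) exp(X beta) fitted on synthetic churn data.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "tenure": rng.exponential(24, 200),     # observed relationship duration
    "churned": rng.integers(0, 2, 200),     # 1 = defection observed
    "n_products": rng.integers(1, 5, 200),
})
cph = CoxPHFitter().fit(df, duration_col="tenure", event_col="churned")
cph.print_summary()                         # hazard ratios for the covariates
```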

Hansotia and Wang (1997) used a right-censored tobit model to calculate the lifetime value of customers, or LTV as it was then called. It is a regression model with right-censored observations and can be estimated by the method of maximum likelihood. The present value of a customer's revenue (PVR) for the qth customer receiving package j was modelled linearly as

$$PVR_{qj} = x_q'\,\beta_j + \varepsilon_{qj}$$

where x_q is the (K+1)-dimensional column vector of profile variables for the qth customer. The equation may also be estimated using the LIFEREG procedure in SAS. The likelihood function, which is the probability of observing the sample values, takes the standard censored-regression form

$$L = \prod_{i}\big[f(PVR_i)\big]^{S_i}\,\big[1 - F(c)\big]^{1-S_i}$$

where S_i = 1 if observation i is uncensored and 0 otherwise, f and F are the density and distribution function of PVR, and c is the censoring point.

Besides the four types of models presented in this paper, Gupta et al. (2006) have also mentioned a probability model; in our review it has been subsumed under the computer science and stochastic models. Gupta et al. (2006) make a few assumptions in their review of probability models, e.g. that the probability of a customer being "alive" can be characterized by various probability distributions, and they take into account the heterogeneity in dropout rates across customers. Various combinations of these assumptions result in models like Pareto/NBD, beta-binomial/beta-geometric (BG/BB) and Markov models. Gupta et al. (2006) have also mentioned persistence models, which have been used in some CLV contexts to study the impact of advertising, discounting and product quality on customer equity (Yoo and Hanssens, 2005) and to examine differences in CLV resulting from different customer acquisition methods (Villanueva, Yoo, and Hanssens, 2006).

2.3.5 Some other Modelling Approaches:

Donkers et al. (2007) have reviewed various CLV modelling approaches with respect to the insurance industry. These include a status quo model, a Tobit-II model, univariate and multivariate models, and duration models. They grouped these into two types. The first are relationship-level models, which focus on relationship length and total profit and build directly on the definition of CLV by Berger and Nasr (1998):

$$CLV_i = \sum_{t=1}^{T} \frac{\text{Profit}_{it}}{(1+d)^{t}}$$

where d is a predefined discount rate and Profit_it, for a multiservice industry, is defined as:

$$\text{Profit}_{it} = \sum_{j=1}^{J} \text{Serv}_{ijt} \times \text{Usage}_{ijt} \times \text{Margin}_{jt}$$

where J is the number of different services sold, Serv_ijt is a dummy indicating whether customer i purchases service j at time t, Usage_ijt is the amount of the service purchased, and Margin_jt is the average profit margin for service j. The second type are service-level models, which disaggregate a customer's profit into the contribution per service; CLV predictions are then obtained by predicting purchase behaviour at the service level and combining the results. An overview of the models as presented by Donkers et al. (2007) follows.


An overview of relationship-level models: the status quo model assumes profit simply remains constant over time; the profit regression model aims at predicting a customer's annual profit contribution; retention models are based on segmenting over RFM; the probit model is based on customer-specific retention probabilities, as is the bagging model; the duration model focuses on the customer's relationship duration; and the Tobit-II model separates out the effect of customer defection on profitability.

An overview of service-level models: these are explained as a choice-model approach and a duration-model approach. The choice-model approach has as its dependent variable the decision to purchase a service or not. The duration-model approach focuses on the duration of an existing relationship; it models only the ending of a relationship and not the starting of a new one.

The next part of the paper presents the machine learning approach we have used to calculate the future value of customers. The Northwind Traders sample database of Microsoft Access 2000 is adopted to demonstrate our approach. We have used Classification and Regression Trees (CART), Support Vector Machines (SVM), SVM using SMO, Additive Regression, the K-Star method, Multilayer Perceptron (MLP) and Wavelet Neural Network (WNN) to calculate the future value of customers. In the later part of the paper, we compare these models and suggest the best model for calculating CLV. We end the paper with results and a discussion of future developments in the area of CLV measurement.

3. Estimating Future Customer Value using Machine Learning Techniques:

There are various data mining techniques used in the fields of classification and regression. The choice of technique depends on the type of data available. In our case, we have used regression techniques to determine the future value of customers in the next prediction period. In the past, several researchers have used these techniques to determine the metrics of CLV, depending on the type of model and approach they adopted. Hansotia and Wang (1997) used CART and CHAID for customer acquisition, Kim and Street (2004) used ANN for customer targeting, and Au et al. (2003) used Genetic Algorithms (GA) for predicting customer churn. However, using these techniques to directly predict a customer's future value, and hence CLV, has not been done so far. Most previous approaches to measuring CLV have used two or more models to calculate either CLV or the relationship between the various parameters used to determine it. The approach we have adopted eliminates this step and allows the software to learn the relationship between the input variables and their weight in calculating CLV.

3.1 Data Description: A sample database of Microsoft Access 2000, the Northwind Traders database, is adopted to calculate the CLV of customers. The database contains 89 customers with a purchase period of two years, from 1st July 1994 till 30th June 1996. We divided this time frame into four equal half-years and calculated the frequency of purchase and the total monetary contribution in July - December 1994, January - June 1995, July - December 1995 and January - June 1996. We kept the observation period from July 1994 till December 1995 and predicted the expected contribution in the next period, i.e. January - June 1996.

Seven variables are used in total: six input or predictor variables and one target variable, the contribution margin in January - June 1996. The dataset is then divided into two parts, a) training and b) testing: we used 65 samples for training and the remaining 24 for testing.


Table 1: Description of variables

Type of variable | Variable name    | Variable description
Input variable   | Recency-dec95    | Recency as a score, counting July 1994 as 1 and December 1995 as 18
Input variable   | Total frequency  | Total number of purchases between July 1994 and December 1995
Input variable   | Total duration   | Total duration of observation, i.e. from July 1994 till December 1995
Input variable   | CM_july-dec94    | Contribution margin in the period July - December 1994
Input variable   | CM_jan-june95    | Contribution margin in the period January - June 1995
Input variable   | CM_july-dec95    | Contribution margin in the period July - December 1995
Target variable  | output           | Contribution margin in the period January - June 1996
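A minimal pandas sketch of how these variables could be derived from an order log; the file name and column names are assumptions for illustration, not the actual Northwind schema:

```python
import pandas as pd

# Order log assumed to have columns: customer_id, order_date, amount.
orders = pd.read_csv("northwind_orders.csv", parse_dates=["order_date"])

def cm(start, end):
    """Contribution margin (total amount) per customer in [start, end)."""
    win = orders[(orders.order_date >= start) & (orders.order_date < end)]
    return win.groupby("customer_id")["amount"].sum()

obs = orders[orders.order_date < "1996-01-01"]            # observation period
month = (obs.order_date.dt.year - 1994) * 12 + obs.order_date.dt.month - 6
g = month.groupby(obs.customer_id)                        # July 94 = 1 ... Dec 95 = 18
feats = pd.DataFrame({
    "recency_dec95": g.max(),
    "total_duration": g.max() - g.min(),
    "total_frequency": obs.groupby("customer_id").size(),
    "cm_july_dec94": cm("1994-07-01", "1995-01-01"),
    "cm_jan_june95": cm("1995-01-01", "1995-07-01"),
    "cm_july_dec95": cm("1995-07-01", "1996-01-01"),
    "output": cm("1996-01-01", "1996-07-01"),             # target: Jan - Jun 96
}).fillna(0)
train, test = feats.iloc[:65], feats.iloc[65:]            # 65 train / 24 test
```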

3.2 Models and Software Used: Knime 2.0.0, Salford Predictive Miner (SPM), NeuroShell 2 (Release 4.0) and software developed by Chauhan et al. (2009) at IDRBT, Hyderabad, for classification problems using DEWNN are used for the analysis. In Knime, we used Support Vector Machines (SVM), SVM using SMO, Additive Regression and the K-Star method to learn from the training dataset, and the Weka predictor for prediction on the testing dataset. In Salford Predictive Miner (SPM) we used CART to train on the dataset and applied the rules obtained from training to the testing dataset for prediction. The software developed at IDRBT, Hyderabad was used to train the data using a Wavelet Neural Network (WNN), applying the learned parameters to the test data to get the results, and NeuroShell was used for the MLP. A brief description of the techniques used for prediction of the target variable follows.

3.2.1 SVM: The SVM is a powerful learning algorithm based on recent advances in statistical learning theory (Vapnik, 1998). SVMs are learning systems that use a hypothesis space of linear functions in a high-dimensional space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory (Cristianini & Shawe-Taylor, 2000). SVMs have become one of the most popular tools for machine learning and data mining and can perform both classification and regression. The SVM uses a linear model to implement non-linear class boundaries by mapping input vectors non-linearly into a high-dimensional feature space using kernels. The training examples that


are closest to the maximum-margin hyperplane are called support vectors; all other training examples are irrelevant for defining the binary class boundaries. The support vectors are then used to construct an optimal linear separating hyperplane (in the case of pattern recognition) or a linear regression function (in the case of regression) in this feature space. The support vectors are conventionally determined by solving a quadratic programming (QP) problem. SVMs have the following advantages: (i) they are able to generalize well even when trained with a small number of examples, and (ii) they do not assume prior knowledge of the probability distribution of the underlying dataset. The SVM is simple enough to be analyzed mathematically; in fact, it may serve as a sound alternative combining the advantages of conventional statistical methods, which are more theory-driven and easy to analyze, and machine learning methods, which are more data-driven, distribution-free and robust. Recently, SVMs have been used in financial applications such as credit rating, time series prediction and insurance claim fraud detection (Vinaykumar et al., 2008).

In our research, we used two SVM learner models for prediction. First we used the SVM regression model as the learner function and the Weka predictor to get the results. We found a correlation coefficient of 0.8889 and a root relative squared error of 48.03%.

In the case of SMO (the sequential minimal optimization algorithm) for training a support vector regression model, we replaced the learner function with the SMOreg function. This implementation globally replaces all missing values and transforms nominal attributes into binary ones; it also normalizes all attributes by default. Here we found a correlation coefficient of 0.8884 and a root relative squared error of 47.98%.
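A minimal scikit-learn sketch of an analogous SVR step, standing in for the Knime/Weka learners; the placeholder data and hyperparameters below are assumptions, not the report's settings:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train, y_train = rng.random((65, 6)), rng.random(65)   # stand-ins for Table 1 data
X_test, y_test = rng.random((24, 6)), rng.random(24)

svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
svr.fit(X_train, y_train)
print(np.corrcoef(svr.predict(X_test), y_test)[0, 1])    # correlation coefficient
```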

3.2.2 Additive Regression and K-Star: Additive Regression is another classifier used in Weka that enhances the performance of a regression base classifier. Each iteration fits a model to the residuals left by the classifier on the previous iteration, and prediction is accomplished by adding the predictions of all classifiers. Reducing the shrinkage (learning rate) parameter helps prevent overfitting and has a smoothing effect, but increases the learning time. K-Star, on the other hand, is an instance-based classifier: the class of a test instance is based upon the classes of the training instances similar to it, as determined by some similarity function. It differs from other instance-based learners in that it uses an entropy-based distance function. The workflow is the same as for the SVM and SMO regression learners with the Weka predictor.

With Additive Regression we found a correlation coefficient of 0.895, a root mean squared error of 3062.19 and a root relative squared error of 44.36%. With K-Star we found a correlation coefficient of 0.9102, a root mean squared error of 3203.57 and a root relative squared error of 46.41%.
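Rough scikit-learn analogues of these two Weka learners, for illustration only: gradient boosting plays the role of Additive Regression (each stage fits the previous stage's residuals under a shrinkage parameter), and a Euclidean-distance k-nearest-neighbours regressor stands in for the instance-based K-Star, which actually uses an entropy-based distance:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_train, y_train = rng.random((65, 6)), rng.random(65)   # placeholder data
X_test, y_test = rng.random((24, 6)), rng.random(24)

for model in (GradientBoostingRegressor(learning_rate=0.1, n_estimators=200),
              KNeighborsRegressor(n_neighbors=5, weights="distance")):
    model.fit(X_train, y_train)
    print(type(model).__name__, round(model.score(X_test, y_test), 3))
```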


3.2.3 MLP: The Multilayer Perceptron (MLP) is one of the most common neural network structures; it is simple and effective and has found a home in a wide assortment of machine learning applications. An MLP is a network of nodes arranged in three layers: the input, hidden and output layers. The input and output layers serve as nodes to buffer the input and output of the model, respectively, and the hidden layer provides a means for input relations to be represented in the output. Before any data is passed to the network, the node weights are random, which makes the network much like a newborn's brain: developed but without knowledge. MLPs are feed-forward neural networks trained with the standard back-propagation algorithm. They are supervised networks, so they require a desired response to be trained; they learn how to transform input data into a desired response, so they are widely used for pattern classification and prediction. A multilayer perceptron is made up of several layers of neurons, each fully connected to the next. With one or two hidden layers, they can approximate virtually any input-output map, and they have been shown to yield accurate predictions in difficult problems (Rumelhart, Hinton, & Williams, 1986, chap. 8).

In our research, we used NeuroShell 2 (version 4.0) to obtain the results. For learning we set the learning rate to 0.5, the momentum rate to 0.1 and the scale function to linear [-1, 1] to get the best results. We found a root relative squared error of 43.8%, the lowest among all the methods used, as we will see later.
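A sketch of an MLP set up along the reported lines (learning rate 0.5, momentum 0.1, inputs scaled to [-1, 1]); since NeuroShell is a GUI tool, scikit-learn is used here as a stand-in and the data are placeholders:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X, y = rng.random((89, 6)), rng.random(89)                # placeholder data
X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)  # linear [-1, 1] scaling

mlp = MLPRegressor(hidden_layer_sizes=(10,), solver="sgd",
                   learning_rate_init=0.5, momentum=0.1, max_iter=2000)
mlp.fit(X[:65], y[:65])
rmse = np.sqrt(np.mean((mlp.predict(X[65:]) - y[65:]) ** 2))
print(round(rmse, 4))
```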

3.2.4 WNN: The word wavelet is due to Grossmann et al. (1984). Wavelets are a class of functions used to localize a given function in both space and scaling (http://mathworld.wolfram.com/wavelet.html). They have advantages over traditional Fourier methods in analyzing physical situations where the signal contains discontinuities and sharp spikes. Wavelets were developed independently in the fields of mathematics, quantum physics, electrical engineering and seismic geology. Interchanges between these fields during the last few years have led to many new wavelet applications such as image compression, radar and earthquake prediction.

A family of wavelets can be constructed from a function ψ(x), known as the "mother wavelet", which is confined to a finite interval. "Daughter wavelets" ψ_{a,b}(x) are then formed by translation (b) and dilation (a). Wavelets are especially useful for compressing image data. An individual wavelet is defined by

$$\psi_{a,b}(x) = |a|^{-1/2}\,\psi\!\left(\frac{x-b}{a}\right)$$

In the case of non-uniformly distributed training data, an efficient way of solving this problem is to learn at multiple resolutions. Wavelets, in addition to forming an orthogonal basis, are capable of explicitly representing the behaviour of a function at various


resolutions of input variables. Consequently, a wavelet network is first trained to learn the mapping at the coarsest resolution level; in subsequent stages, the network is trained to incorporate elements of the mapping at higher and higher resolutions. Such hierarchical, multi-resolution training has many attractive features for solving engineering problems, resulting in a more meaningful interpretation of the resulting mapping and more efficient training and adaptation of the network compared to conventional methods. Wavelet theory provides useful guidelines for the construction and initialization of networks, and consequently the training times are significantly reduced (http://www.ncl.ac.uk/pat/neural-networks.html).

Wavelet networks employ activation functions that are dilated and translated versions of a single function of the d input dimensions (Zhang, 1997). This function, called the "mother wavelet", is localized both in the space and frequency domains (Becerra, Galvao and Abou-Seads, 2005). The wavelet neural network (WNN) was proposed as a universal tool for functional approximation; it shows surprising effectiveness in solving the conventional problem of poor convergence, or even divergence, encountered in other kinds of neural networks, and it can dramatically increase convergence speed (Zhang et al., 2001).

The WNN consists of three layers: an input layer, a hidden layer and an output layer. Each layer is fully connected to the nodes in the next layer. The numbers of input and output nodes depend on the numbers of inputs and outputs present in the problem; the number of hidden nodes is a user-defined parameter depending on the problem. The WNN is implemented here with the Gaussian wavelet function.

The original training algorithm for training a WNN is as follows (Zhang et al., 2001):

1) Specify the number of hidden nodes required. Initialize randomly the dilation and

translation parameters and the weights for the connections between the input and

hidden layers and also between the hidden and the output layers.

2) The output value V^k of sample k, k = 1, 2, ..., np, is computed as follows:

V^k = \sum_{j=1}^{nhn} W_j \, f\!\left(\frac{\sum_{i=1}^{nin} w_{ij} x_i^k - b_j}{a_j}\right) (1)

where nin is the number of input nodes, nhn is the number of hidden nodes and np is the number of samples.

In (1), when f(t) is taken as the Morlet mother wavelet, it has the following form:

f(t) = \cos(1.75t)\exp(-t^2/2) (2)


When f(t) is taken as the Gaussian wavelet, it becomes

f(t) = \exp(-t^2) (3)

3) Reduce the prediction error by updating the parameters W_j, w_{ij}, a_j and b_j using the increments \Delta W_j, \Delta w_{ij}, \Delta a_j and \Delta b_j of formulas (4)-(7). Thus, in training the WNN, the gradient descent algorithm is employed:

\Delta W_j(t+1) = -\eta \frac{\partial E}{\partial W_j} + \alpha \Delta W_j(t) (4)

\Delta w_{ij}(t+1) = -\eta \frac{\partial E}{\partial w_{ij}} + \alpha \Delta w_{ij}(t) (5)

\Delta a_j(t+1) = -\eta \frac{\partial E}{\partial a_j} + \alpha \Delta a_j(t) (6)

\Delta b_j(t+1) = -\eta \frac{\partial E}{\partial b_j} + \alpha \Delta b_j(t) (7)

where the error function can be taken as

E = \left[\sum_{k=1}^{np} \left(\frac{V^k - \hat{V}^k}{\hat{V}^k}\right)^2\right]^{1/2} (8)

where \eta and \alpha are the learning and momentum rates respectively, and \hat{V}^k is the desired output for sample k.

4) Return to step (2); the process is continued until E satisfies the given error criterion, at which point the training of the WNN is complete.
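A minimal numpy sketch of steps 1-4 is given below. It is not the NeuroShell or DEWNN implementation: the toy data, network dimensions, learning rate and momentum rate are assumptions, and a central-difference gradient stands in for the analytic derivatives of E in (4)-(7).

    import numpy as np

    rng = np.random.default_rng(0)

    nin, nhn, nsamp = 3, 5, 24                 # illustrative dimensions
    X = rng.uniform(-1, 1, (nsamp, nin))       # inputs scaled to [-1, 1]
    V_hat = rng.uniform(1.0, 2.0, nsamp)       # desired outputs (nonzero)

    def unpack(p):
        # Flat parameter vector -> output weights W, input weights w,
        # dilations a and translations b.
        W = p[:nhn]
        w = p[nhn:nhn + nin * nhn].reshape(nin, nhn)
        a = p[nhn + nin * nhn:2 * nhn + nin * nhn]
        b = p[2 * nhn + nin * nhn:]
        return W, w, a, b

    def wnn_output(p, X):
        # Equation (1) with the Gaussian wavelet f(t) = exp(-t^2) of (3).
        W, w, a, b = unpack(p)
        t = (X @ w - b) / a
        return np.exp(-t ** 2) @ W

    def E(p):
        # Equation (8): relative root-squared error over all samples.
        V = wnn_output(p, X)
        return np.sqrt(np.sum(((V - V_hat) / V_hat) ** 2))

    def num_grad(f, p, eps=1e-6):
        # Central differences; a stand-in for the analytic dE/dtheta.
        g = np.zeros_like(p)
        for i in range(p.size):
            d = np.zeros_like(p)
            d[i] = eps
            g[i] = (f(p + d) - f(p - d)) / (2 * eps)
        return g

    # Step 1: random initialization; dilations a_j kept away from zero.
    params = rng.uniform(-1, 1, 3 * nhn + nin * nhn)
    params[nhn + nin * nhn:2 * nhn + nin * nhn] += 2.0

    eta, alpha = 0.05, 0.9                     # assumed learning/momentum rates
    delta = np.zeros_like(params)
    for epoch in range(500):                   # steps 2-4: iterate on E
        delta = -eta * num_grad(E, params) + alpha * delta    # (4)-(7)
        params += delta
        if E(params) < 1e-5:                   # given error criterion
            break
    print("final E =", E(params))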

Some problems exist in the original WNN, such as slow convergence, entrapment in local minima and oscillation (Pan et al., 2008); variants of the WNN trained with evolutionary algorithms have been proposed to resolve these problems.

In our research, we used software developed by Chauhan et al. (2009) for DEWNN (Differential Evolution trained Wavelet Neural Network). The software was originally built for classification purposes; we changed the code from classification to regression and used it on our problem. We set the weight factor to 0.95, the convergence criterion to 0.00001, the crossover factor to 0.95, the population size to 60, the number of hidden nodes to 20, the maximum weight to 102 and the minimum weight to -102 to find the optimum solution. We found the test set normalized root mean squared error to be 0.928441 and the root relative squared error to be 111.2%, which was the highest amongst all the results.
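For orientation, a minimal DE/rand/1/bin loop over a flat weight vector, using the parameter settings quoted above, might look like the sketch below; the fitness function is a placeholder where the WNN's prediction error would go, and the real DEWNN software adds considerably more machinery.

    import numpy as np

    rng = np.random.default_rng(1)

    def de_optimize(fitness, dim, pop_size=60, F=0.95, CR=0.95,
                    w_min=-102.0, w_max=102.0, tol=1e-5, max_gen=500):
        # Differential evolution: mutation, binomial crossover, greedy selection.
        pop = rng.uniform(w_min, w_max, (pop_size, dim))
        fit = np.array([fitness(x) for x in pop])
        for gen in range(max_gen):
            for i in range(pop_size):
                r1, r2, r3 = rng.choice(
                    [j for j in range(pop_size) if j != i], 3, replace=False)
                mutant = np.clip(pop[r1] + F * (pop[r2] - pop[r3]), w_min, w_max)
                cross = rng.random(dim) < CR
                cross[rng.integers(dim)] = True      # inherit at least one gene
                trial = np.where(cross, mutant, pop[i])
                f_trial = fitness(trial)
                if f_trial <= fit[i]:
                    pop[i], fit[i] = trial, f_trial
            if fit.min() < tol:                      # convergence criterion
                break
        return pop[fit.argmin()], fit.min()

    # Placeholder fitness: in DEWNN this would evaluate the WNN's error
    # for a candidate vector of dilations, translations and weights.
    best, best_err = de_optimize(lambda x: float(np.sum(x ** 2)), dim=10)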

3.2.5 CART : Decision trees form an integral part of 'machine learning', an important sub-discipline of artificial intelligence. Almost all decision tree algorithms are used for solving


classification problems. However, algorithms like CART solve regression problems also.

Decision tree algorithms induce a binary tree on a given training data, resulting in a set of

‘if–then’ rules. These rules can be used to solve the classification or regression problem.

CART (http://www.salford-systems.com) is a robust, easy-to-use decision tree tool that automatically sifts large, complex databases, searching for and isolating significant patterns and relationships. CART uses recursive partitioning, a combination of exhaustive searches

and intensive testing techniques to identify useful tree structures in the data. This

discovered knowledge is then used to generate a decision tree resulting in reliable, easy-to-

grasp predictive models in the form of 'if-then' rules. CART is powerful because it can deal with incomplete data and multiple types of features (floats, enumerated sets), both as input features and predicted features, and the trees it produces contain rules that are humanly readable. Decision trees contain a binary question (with a yes/no answer) about some feature

at each node in the tree. The leaves of the tree contain the best prediction based on the

training data. Decision lists are a reduced form of this where an answer to each question

leads directly to a leaf node. A tree's leaf node may be a single member of some class, a probability density function (over some discrete class), a predicted mean value for a continuous feature, or a Gaussian (mean and standard deviation) for a continuous value. The

key elements of a CART analysis are a set of rules for: (i) splitting each node in a tree, (ii)

deciding when a tree is complete; and (iii) assigning each terminal node to a class outcome

(or predicted value for regression).
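Since the tree in the next paragraph is grown with least absolute deviation, the splitting step at each node can be pictured as an exhaustive search that minimizes the summed absolute deviation from the node medians; the sketch below shows only that step, not Salford's full algorithm.

    import numpy as np

    def best_lad_split(X, y):
        # Exhaustive search over every feature and threshold for the binary
        # split that most reduces the sum of absolute deviations from the
        # node medians (the least-absolute-deviation splitting criterion).
        def lad(v):
            return float(np.sum(np.abs(v - np.median(v)))) if v.size else 0.0
        best_j, best_thr, best_cost = None, None, lad(y)
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j])[:-1]:      # candidate split points
                mask = X[:, j] <= thr
                cost = lad(y[mask]) + lad(y[~mask])
                if cost < best_cost:
                    best_j, best_thr, best_cost = j, thr, cost
        return best_j, best_thr, best_cost

    # e.g. best_lad_split(inputs, future_value) -> (feature index, threshold, cost)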

In our research, we used Salford Predictive Miner (SPM) to use CART for prediction

purposes. We trained the model using least absolute deviation on the training data. We

found that the root mean squared error was 3367.53 and the total number of nodes was 5,

however, on growing the tree nodes from 5 to 6, we found better results. The root mean

squared error changed to 3107.13 and the root relative squared error to 45.38%, which is very close to that of MLP. Figure 1 shows the plot of relative error vs. the number of nodes; we see that we got the optimum results on growing the tree from 5 nodes to 6.

Figure 1 : CART : Plot of relative error vs number of nodes

Figure 2 : CART : Plot of percent error vs. terminal nodes


It was also seen from the results that, when the optimum number of nodes was kept at 5, 19 out of 24 customers were put in node 1, 4 in node 3 and 1 in node 6. We also found that the root mean squared error was 2892.6 for the 19 customers in node 1, which is better than the overall error. However, the overall increase in error was caused by misclassification or a high error rate in splitting customers into nodes 4 and 6. On growing the optimum number of nodes to 6, we found that 14 customers fell in node 1, 5 in node 2, 4 in node 4 and 1 in node 6. The RMSE in node 1 was 1846.89, far less than the total RMSE of 3107.13. One obvious conclusion one can draw is that CART is more useful for prediction than the other methods, because its rules give companies the flexibility to decide which customer to put in which node and to choose the optimum number of nodes for their analysis.

Figure 3 : CART : Tree details showing the splitting rules at each node

A summary of the rules is given as:

1. if (CM_JULY_DEC95 <= 2278.66 && CM_JAN_JUNE95 <= 3534.06) then y = 1511.64
2. if (CM_JULY_DEC95 <= 2278.66 && CM_JAN_JUNE95 > 3534.06 && CM_JAN_JUNE95 <= 12252.1) then y = 5932.26
3. if (CM_JAN_JUNE95 <= 12252.1 && CM_JULY_DEC95 > 2278.66 && CM_JULY_DEC95 <= 2464.75) then y = 24996
4. if (CM_JAN_JUNE95 <= 12252.1 && CM_JULY_DEC95 > 2464.75 && TOTAL_FREQUENCY <= 14) then y = 6350.25
5. if (CM_JAN_JUNE95 <= 12252.1 && CM_JULY_DEC95 > 2464.75 && TOTAL_FREQUENCY > 14) then y = 19044.4
6. if (CM_JAN_JUNE95 > 12252.1) then y = 38126.7

where y is the median predicted value at the terminal node.
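Transcribed directly, the six rules amount to the following scoring function (a hypothetical transcription for illustration; rule 1's implicit CM_JAN_JUNE95 <= 12252.1 condition is subsumed by its <= 3534.06 test):

    def cart_predicted_value(cm_jan_june95, cm_july_dec95, total_frequency):
        # Terminal-node medians from the CART tree of Figure 3.
        if cm_jan_june95 > 12252.1:                      # rule 6
            return 38126.7
        if cm_july_dec95 <= 2278.66:
            if cm_jan_june95 <= 3534.06:                 # rule 1
                return 1511.64
            return 5932.26                               # rule 2
        if cm_july_dec95 <= 2464.75:                     # rule 3
            return 24996.0
        if total_frequency <= 14:                        # rule 4
            return 6350.25
        return 19044.4                                   # rule 5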


4. Results and Comparison of Models : We have used various machine learning techniques to calculate the future value of 24 customers from a sample of 89 customers. The techniques include SVM, WNN, Additive Regression and the K-star method in Knime using the Weka predictor, CART in SPM and MLP in NeuroShell. We found that MLP gave the least error amongst all these models, but we find CART to be more useful, as it is more helpful in taking decisions by setting splitting rules and also predicts more accurately for a greater section of the test sample by splitting the sample into various nodes. We find that companies can make better decisions with the help of these rules and the segmentation technique in CART. A detailed summary of the final results of the competing models is given in Table 2. One limitation of our study is that we have predicted the future value of only the next time period. Besides this, the error percentage is relatively high, because of the small dataset we have. We believe that these models will be able to perform better on a large dataset with more input variables, including customer demographics, customer behaviour etc.

Table 2 : Comparison of Competing Models

Model           Correlation coefficient   Root mean squared error   Mean absolute error   Root relative squared error
SVMreg          0.8889                    3315.25                   2513.03               48.0%
SMOreg          0.8884                    3311.98                   2499.48               47.9%
Additive Reg.   0.8950                    3062.19                   2203.76               44.3%
K-star          0.9102                    3203.57                   2233.21               46.4%
MLP             NA                        2986.77                   2107.10               43.8%
CART            NA                        3107.13                   2343.82               45.3%
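For reference, the four columns of Table 2 can be computed as in this sketch (standard Weka-style definitions; the tools' rounding may differ):

    import numpy as np

    def table2_metrics(actual, predicted):
        # Correlation coefficient, RMSE, MAE and root relative squared
        # error (error relative to always predicting the mean), in percent.
        a = np.asarray(actual, dtype=float)
        p = np.asarray(predicted, dtype=float)
        err = p - a
        rmse = float(np.sqrt(np.mean(err ** 2)))
        mae = float(np.mean(np.abs(err)))
        rrse = float(np.sqrt(np.sum(err ** 2) / np.sum((a - a.mean()) ** 2)))
        corr = float(np.corrcoef(a, p)[0, 1])
        return corr, rmse, mae, 100.0 * rrse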

Figure 4 : Plot of root relative squared error (%) vs. model (MLP, Additive Reg., CART, K-Star, SMOreg, SVMreg)


5. Conclusion and Directions of future research: In this paper we have presented a review of various approaches and modelling techniques to determine Customer Lifetime Value. We have also covered the traditional techniques used to calculate customer loyalty and found that CLV is a better metric compared to these measures. The most common approaches used to measure CLV are the aggregate approach and the individual approach. We also see that the type of approach used to calculate CLV depends on the type of data available and the type of result which a firm wants. Further, we have reviewed various modelling techniques to determine CLV, which include RFM models, computer science and stochastic models, econometric models, diffusion models and also relationship-level and service-level models. We see that the most frequently applied techniques to determine CLV parameters, or the relationships between them, include Pareto/NBD models, decision trees, artificial neural networks, genetic algorithms and support vector machines.

We have also presented a study of measuring CLV by means of various machine learning techniques. Emphasis has been given to capturing the non-linear patterns in the data, which was available for a set of 89 customers having a 2-year transaction history. We have used Classification and Regression Trees (CART), Support Vector Machines (SVM), SVM using SMO, Additive Regression, the K-Star method, Multilayer Perceptron (MLP) and Wavelet Neural Network (WNN) for the calculation of the future value of 24 customers. Further, we see that although MLP gives the best result amongst all these models, we would still recommend using CART to calculate CLV, as it segments the customers into various nodes and calculates more precisely for a larger segment of test-case customers. Besides, the splitting rules would also help any firm to understand better the classification of a customer into a particular segment and hence derive more profit out of them.

The main limitation of our study has been the projection of the future value of customers till only the next period, mainly due to the limitation of the dataset we had. This also resulted in some high error rates even amongst the best models. These limitations can be overcome by using datasets which give more information about customer behaviour, demographics etc. Besides, a large dataset will be useful for making better predictions, as the training parameters can be estimated better. For better estimation on small datasets, techniques like k-fold cross-validation, which we have not covered, can again be taken as an area of future research. We have also not given much emphasis to feature selection and the relationship between the input variables used to calculate CLV. Producing better results with an integrated approach on this dataset is again an area of future research.


References:

Aeron, H., Kumar, A. and Janakiraman, M. (2010) 'Application of data mining techniques for customer lifetime value parameters: a review', Int. J. Business Information Systems, Vol. 6, No. 4, pp.514-529.

Au, W., Chan, K., & Yao, X. (2003). A novel evolutionary data mining algorithm with

applications to churn prediction. IEEE Transactions on Evolutionary Computation, 7(6), 532–

545.

Becerra, V. M., Galvao, R. K. H., & Abou-Seada, M. (2005). Neural and wavelet network models for financial distress classification. Data Mining and Knowledge Discovery, 11, 35-55. doi:10.1007/s10618-005-1360-0

Berger, P. D. and Nasr, N. I. (1998), Customer lifetime value: Marketing models and

applications. Journal of Interactive Marketing, 12: 17–30

Blattberg, Robert C., Getz, G. and Thomas, J.S. (2001), 'Customer Equity: Building and Managing Relationships as Valuable Assets', Boston, MA: Harvard Business School Press.

Chauhan, N., Ravi, V. and Karthik Chandra, D.: Differential evolution trained wavelet neural networks: Application to bankruptcy prediction in banks. Expert Syst. Appl. 36(4): 7659-7665 (2009)

Cristianini, N. and J. Shawe-Taylor (2000). An Introduction to Support Vector Machines.

Cambridge, UK: Cambridge University Press.

Donkers, B., P.C. Verhoef and M.G. de Jong (2007) Modeling CLV: a Test of Competing Models in the Insurance Industry, Quantitative Marketing and Economics, 5 (2), 163-190.

Dries F. Benoit, Dirk Van den Poel: Benefits of quantile regression for the analysis of customer lifetime value in a contractual setting: An application in financial services. Expert Syst. Appl. 36(7): 10475-10484 (2009)

Dwyer, R.F (1997) ‘Customer lifetime valuation to support marketing decision making’,

Journal of Direct Marketing, Vol. 11, No. 4, pp.205 – 219.

Fader, Peter S., Bruce G. S. Hardie, and Ka Lok Lee (2005), "RFM and CLV: Using Iso-Value Curves for Customer Base Analysis", Journal of Marketing Research, 42 (November), 415-430.

Gupta, Sunil, Donald R. Lehmann and Jennifer Ames Stuart (2004), “Valuing Customers”,

Journal of Marketing Research, 41 (1), 7-18.

______, Hanssens, D., Hardie, B., Kahn, W., Kumar, V., and Lin, N. Modelling Customer

Lifetime Value. Journal of Service Research, 9, 2006, 139-155.


Hansotia, B. J. and P. Wang (1997), ‘Analytical challenges in customer acquisition’. Journal of

Direct Marketing 11(2), 7-19.

Haenlein, M., Kaplan, A.M., Beeser, A.J. (2007) A model to determine customer lifetime

value in a retail banking context, European management journal.

Hu, M. and Tsoukalas, C., “Explaining Consumer Choice through Neural Networks: The

Stacked Generalization Approach,” European Journal of Operational Research, Vol. 146, No.

3, 2003, 650-661.

Kalbfleisch, J. D. and R. L. Prentice. (1980), Statistical Analysis of Failure Time Data, New

York: Wiley

Kim, Y., Street, N. (2004). An intelligent recommendation system for customer targeting: A

data mining approach. Decision Support Systems, 37(2), 215-228

Kumar, V. and Werner J. Reinartz (2006), Customer Relationship Management: A Databased Approach. New York: John Wiley.

_______ and Morris George (2007), Journal of the Academy of Marketing Science, 35:157-

171.

Levinthal, D. and M. Fichman. (1988). “Dynamics of Interorganizational Attachments:

Auditor Client Relationships”. Administrative Science Quarterly, 33, 345-69.

Malthouse, Edward C. 2009. The Results from the Lifetime Value and Customer Equity

Modeling Competition. Journal of Interactive Marketing, Vol. 23 (2009), pp. 272-275.

Malthouse, C.E and Blattberg, C.R. (2005) ‘Can we predict customer lifetime value’, Journal

of Interactive Marketing, Vol. 19, No. 1, pp.2 – 16.

Khajvand, Mahboubeh and Tarokh, Mohammad Jafar. Estimating customer future value of different customer segments based on adapted RFM model in retail banking context. Procedia Computer Science, 3:1327-1332, 2011.

Reinartz, Werner, Jacquelyn Thomas and V. Kumar (2005), "Balancing Acquisition and Retention Resources to Maximize Customer Profitability", Journal of Marketing, 69 (1), 63-79.

Rumelhart, David E.; Hinton, Geoffrey E., Williams, Ronald J. (8 October 1986). "Learning

representations by back-propagating errors". Nature 323 (6088): 533–536

Rust, R. T., K. N. Lemon, and V. A. Zeithaml (2004), ‘Return on marketing: Using customer

equity to focus marketing strategy’. Journal of Marketing 68, 109-127.

Thomas, Jacquelyn (2001), "A Methodology for Linking Customer Acquisition to Customer Retention", Journal of Marketing Research, 38 (2), 262-268.


Thomas, J.S., Blattberg R.C., and Fox, E.J. (2004, February), “Recapturing lost customers”,

Journal of Marketing Research, 41, 31-45.

V. Kumar, Customer Lifetime Value – The path to profitability, Foundations and Trends in

Marketing, vol 2, no 1, pp 1-96, 2007.

Vapnik, V. (1998). Statistical Learning Theory. New York: Wiley.

Venkatesan, R. and V. Kumar (2004), 'A customer lifetime value framework for customer

selections and resource allocation strategy'. Journal of Marketing, 68, 106-125 (October).

Villanueva J., S. Yoo, and D.M. Hanssens, "The Impact of Marketing-Induced vs. Word-of-

Mouth Customer Acquisition on Customer Equity," Journal of Marketing Research, February

2008.

Vinay Kumar, K., V. Ravi, Mahil Carr, N. Raj Kiran: Software development cost estimation

using wavelet neural networks. Journal of Systems and Software 81(11): 1853-1867 (2008)

Yoo S. & D.M. Hanssens, "Modeling the Sales and Customer Equity Effects of the Marketing

Mix," revised, February 2005, working paper, University of California, Los Angeles, Anderson

School of Management.

Zhang, Q. (1997). Using wavelet network in nonparametric estimation. IEEE Transactions on Neural Networks, 8(2): 227-236.