
CustoVal: Estimating Customer Lifetime Value Using Machine Learning Techniques

Dept. of CIS - Senior Design 2013-2014∗

Charu Jangid [email protected]
Univ. of Pennsylvania, Philadelphia, PA

Trisha Kothari [email protected]
Univ. of Pennsylvania, Philadelphia, PA

Jarred Spear [email protected]
Univ. of Pennsylvania, Philadelphia, PA

Edward Wadsworth [email protected]
Univ. of Pennsylvania, Philadelphia, PA

ABSTRACT

The estimation of Customer Lifetime Value (CLV) is one of the core pillars of strategy development and marketing. CLV is a dollar measurement of the long-term relationship between a customer and a company, revealing how much that customer is worth over a period of time. CLV is particularly powerful as a predictor when considering customer acquisition, and when selecting the optimal service levels to provide to different customer groups.

Current methods for estimating CLV build a single model using the entire population of customers as input. This not only loses the granularity in the data, but also leads to poor targeting and ineffective advertising to consumers. This paper demonstrates the advantages of intelligently combining smaller, targeted models, building separate models for customers buying different product categories. This enables firms to use the data they have about customers effectively to make intelligent strategic decisions.

Using a sample dataset, the proposed implementation of Multitask learning [2], in which knowledge learned is shared between related categories, yields a strong improvement in CLV forecast accuracy over a single, large model on product categories with low transaction counts (fewer than 150). Additionally, the same method reduces the standard deviation of error compared to the single large model. Most significantly, the Multitask learning models tend to outperform single models when categories have sparse training data, traditionally considered a harder task. These results indicate that Multitask learning techniques can lead to a better outcome than current industry standards, and may be a better alternative to the existing methodology.

1. INTRODUCTION

In marketing, Customer Lifetime Value (CLV) is a prediction of the net profit attributed to the entire future relationship a company has with a customer. Accurate and timely calculation of CLV is important: knowing which customers are likely to be highly valuable to a company, as well as which will cause the firm to lose money, can help greatly in tailoring the marketing, product offerings, and purchasing experience to specific customer segments in order to maximize revenues and profits [15]. CLV calculations allow firms to build a picture of which types of customers are valuable, and under which circumstances they become valuable. This can drastically improve the return on money spent on customer acquisition and marketing. This paper provides a method for using CLV to predict which product categories are valuable, an estimation that is particularly difficult for new categories with sparse data.

∗ Advisor: Dr. Eric Eaton ([email protected]).

This paper proposes a model for calculating CLV without losing the granularity in the data, unlike other models. Whereas many methods used today take all of a given customer's transactions into account when calculating their CLV, the proposed model subdivides each customer's transactions by product category and then calculates each customer's CLV specific to each product category. For example, rather than calculating one CLV for a customer who has bought both televisions and cameras, the proposed model would calculate a separate CLV for that customer in each product category (the total CLV would then be the sum of the two). This, however, introduces a problem: since each category has less data to build a model upon than the original model, some categories' models will be weak and largely overfitted if their data is used in isolation.

To solve this problem without compromising the advantages of tuning models to specific categories, the model incorporates Multitask learning. Multitask learning is a branch of machine learning consisting of several algorithms for sharing knowledge between different tasks. Product categories with fewer transactions (and therefore inherently weaker models) are bolstered by information gained from more popular and related product categories. This method leaves popular categories' models relatively unchanged, while greatly strengthening less popular categories' models. In short, the proposed model is designed to address the following problem statement:

Given sample transactions, predict Customer Lifetime Value at the product category level, even if data per category is sparse.

The model will take as input a predefined transaction log,


and it will output a CLV for each customer in each product category they have made purchases in. Additionally, useful graphics and other visualizations are produced based on the output to make the data more accessible and useful in deriving key insights.

This document examines related efforts at solving this problem, then outlines the schema and software implementation of the proposed system, and finally discusses the results of the system on sample data and looks toward potential improvements. First, it reviews current research on calculating CLV to show how this approach is novel and useful. Second, it describes the proposed model and its corresponding step-by-step implementation of an algorithm to predict CLV using Multitask learning. Then, it expounds on the evaluation criteria used to determine the success of this project and how well the model measures up to them. Finally, suggestions for future improvements or expansions are put forward and explained.

2. RELATED WORK

Several models have been used over the years to quantify CLV. They have ranged from simplistic to complex, and they have incorporated ideas from fields as diverse as mathematics, econometrics, and computer science. The Multitask learning model makes use of some of these models, and aims to improve on all of them when data is sparse in some areas but rich in others.

2.1 RFM Model

One of the first attempts to gauge customer value was a system called the RFM model, standing for Recency, Frequency, and Monetary Value. These models were originally intended to determine how successful a direct marketing campaign (e.g., postcards) would be for each customer individually. They incorporate only three variables: the time since the customer's last purchase, the average time between a customer's purchases, and how much money the customer spends on a given purchase (Wei et al. [19]). The most typical application of this model breaks customers into quintiles on each of the three factors, yielding 125 groups of customers. Based on some formula, the cost of marketing to each cell and the predicted value of each cell are calculated, and a breakeven line is formed. The company then deploys its marketing campaign to target only those cells it considers potentially profitable.

This technique can fairly readily be applied to determine CLV. It is simple, intuitive, and does not require a large amount of complicated data. Essentially, a company wishing to determine the CLV of its customers would assume a constant direct marketing campaign for the remainder of each customer's lifetime, and then determine the profitability of each customer (Cheng et al. [3]). However, this technique has a number of drawbacks. The concept of a constant marketing campaign is impractical; the campaign itself would be incredibly expensive, and customers would eventually grow immune to it. Looking only at transaction histories without accounting for demographic factors could lead to major oversights. And unless the formula is manually tweaked to backtest better, there is no impetus for the model to learn from past successes and failures.
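As an illustration, the quintile-based cell assignment described above can be sketched as follows. The customer tuples, field order, and scoring convention (5 = most favorable quintile) are invented for this sketch, not taken from the paper:

```python
def quintile_score(values, higher_is_better=True):
    """Map each value to a score 1..5 by quintile rank (5 = most favorable)."""
    n = len(values)
    ranked = sorted(range(n), key=lambda i: values[i])
    scores = [0] * n
    for rank, idx in enumerate(ranked):
        s = rank * 5 // n + 1            # bucket the rank into 1..5
        # Low recency (days since last purchase) is good, so invert if needed.
        scores[idx] = s if higher_is_better else 6 - s
    return scores

# Hypothetical customers: (days since last purchase, purchase count, avg spend).
customers = [(5, 12, 80.0), (40, 3, 20.0), (10, 8, 55.0),
             (90, 1, 15.0), (2, 20, 120.0)]
r_scores = quintile_score([c[0] for c in customers], higher_is_better=False)
f_scores = quintile_score([c[1] for c in customers])
m_scores = quintile_score([c[2] for c in customers])
# One of the 125 possible RFM cells per customer, e.g. "555" for the best.
cells = [f"{r}{f}{m}" for r, f, m in zip(r_scores, f_scores, m_scores)]
```

A real deployment would then attach a cost and predicted value to each of the 125 cells and market only to those above the breakeven line.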

The proposed approach improves on the basic RFM model in a variety of ways. Most prominently, the technique incorporates learning, thereby increasing the strength of the model over time. It also accounts for sequential data, rather than selecting a fixed time period and looking only at that bucket of data. Finally, as has been shown elsewhere (Gupta et al. [8]), models which incorporate more than these three factors do a better job of predicting CLV than RFM models. Although RFM was not developed explicitly to calculate CLV, it is an important benchmark to compare against.

2.2 Econometric Models

Some models seek to take more covariates into account than probability models, which typically use only the recency, frequency, and monetary value of transactions to estimate the different elements of CLV.

An example of this is the use of proportional hazard models to estimate how long customers will remain with the firm. The general form of these equations is

λ(t,X) = λ0(t) exp(βX)

where λ is the hazard function, the instantaneous rate at which the customer leaves the firm at time t given that they have remained with the firm until time t, and X is a set of covariates which may depend on time. λ0 is the base hazard rate; typical choices include the exponential model (in which case the hazard is memoryless) and the Weibull model (to capture time dependence). See Cox [4] for the original paper on proportional hazard models, and Knott et al. [11] for an example of use.

Once this hazard rate has been established, the survivor function is calculated as

S(t) = P(T ≥ t) = exp(−∫_0^t λ(u) du)

where T is the actual time at which the customer leaves the firm. For the exponential distribution, the hazard function is constant; this makes estimating the survival function relatively simple. There is also the following:

P(T < t) = 1 − S(t)

If the times at which customers leave the firm are known, the likelihood of customers leaving can be estimated. Hence, to fit the function parameters (β), the total likelihood can be maximized (equivalently, the negative log-likelihood minimized).
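A minimal sketch of the survivor function under the two base hazards mentioned above (exponential and Weibull); the covariate term exp(βX) is omitted for brevity, and the parameter values are illustrative:

```python
import math

def survival_constant_hazard(lam, t):
    # Exponential (memoryless) base hazard: the integral of λ over [0, t]
    # is λ·t, so S(t) = exp(-λ·t).
    return math.exp(-lam * t)

def survival_weibull(lam, k, t):
    # Weibull base hazard λ0(u) = k·λ·(λu)^(k-1); its integral over [0, t]
    # is (λ·t)^k, which captures time dependence in the dropout process.
    return math.exp(-((lam * t) ** k))

# P(T < 12): probability the customer has left within 12 periods.
churn_prob = 1 - survival_constant_hazard(0.05, 12)
```

Setting k = 1 in the Weibull form recovers the constant-hazard case, which is a quick sanity check on an implementation.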

While this use of covariates takes more into consideration than probability models do, it treats aspects of CLV independently (for example, handling customer retention and the amount that customers spend separately). The proposed approach treats CLV as the final goal, allowing the shared basis, a set of basis vectors that defines each of the tasks, to act across these components.

Furthermore, these models treat all customers equally. The proposed approach extracts more from the given data by dividing the customers into segments, while still allowing each segment to learn from the others via the shared basis.

2.3 Persistence Model

Persistence models focus on modeling the behavior of components such as acquisition, retention, and cross-selling. These models can be used to study how a change in one variable (such as a customer acquisition campaign) impacts other system variables over time. This approach can be used to study the impact of advertising, discounting, and product quality on customer equity. It projects the long-run, or equilibrium, behavior of a variable or group of variables of interest.

For example, a firm's acquisition campaign may be successful and bring in new customers (consumer response). That success may prompt the firm to invest in additional campaigns (performance feedback) and possibly finance these campaigns by diverting funds from other parts of its marketing mix (decision rules). At the same time, the firm's competitors, fearful of a decline in market share, may counter with their own acquisition campaigns (competitive reaction). Depending on the relative strength of these influence mechanisms, a long-run outcome will emerge that may or may not be favorable to the initiating firm. Dynamic systems can be developed to study the long-run impact of a variable and its relationship with other variables.

Persistence models are well suited to CLV because it is a long-term performance metric. Such models help quantify the relative importance of the various influence mechanisms in long-term customer value development, including customer selection, method of acquisition, word-of-mouth generation, and competitive reaction. However, this approach cannot perform well when there is little data to work with, as it relies on consumer behaviors that only reveal themselves over longer periods of time.

2.4 Diffusion Model

Unlike the other approaches, the prime focus of the diffusion model is Customer Equity (CE). The motivation for this more aggregated approach is that CLV often restricts itself to customer selection, customer segmentation, campaign management, and customer targeting. A broader approach that integrates the CLV of both current and future customers can help produce a strategic metric useful for higher-level executives.

In essence, CE is the sum of the CLV of future and current customers. There are two key approaches to measuring CE. The first is the production of probability models of acquiring a certain consumer using disaggregate data (Thomas 2001 [17]; Thomas, Blattberg, and Fox 2004 [18]). This is primarily the methodology used by the approaches described so far. The alternative is to use aggregate data and diffusion/growth models to predict the number of customers likely to be acquired in the future.

Gupta, Lehmann, and Stuart [9] showed that CE is a good estimate for 60% of the five companies investigated. The exceptions included eBay and Amazon, an indication of the model's weaknesses when applied to larger online retail firms. A particularly interesting insight of this approach is the relative importance of marketing and financial instruments: a 1% change in retention would affect CE by 5%, in stark contrast with a similar change in the discount rate producing only a 0.9% change in CE. For example, a $45 million expenditure by Puffs facial tissues to increase its ad awareness ratings by 0.3 points would compound to eventually produce $58.1 million in CE.

Diffusion models can also be used to assess the value of a lost customer (Hogan, Lemon, and Libai [10]). The key premise of this proposition is that a firm that loses a consumer loses not only that customer's CLV, but also the word-of-mouth effect generated by that consumer. This indirect effect can compound to a four-fold effect, as investigated in the online banking industry. Incorporating lost consumers in the model is of key importance in predicting lifelong customer value, due to the triggering of a butterfly effect on the strength of the customer base.

2.5 Probability Models

A survey of existing techniques for calculating CLV shows that the Pareto/NBD model (Schmittlein, Morrison, and Colombo [16]) is one of the most widely used. The proposed model applies the Pareto/NBD to each separate product category. Fader et al. [7], a paper that implements the Pareto/NBD model, defines CLV as follows:

CLV = margin × transaction revenue × DET

Margin: how much profit the firm makes per sale to this customer, a value between 0 and 1. This data is typically unavailable to researchers (for large companies, financial statements may be useful), but it does not greatly affect trends in the data.

Transaction Value: how much the customer is expected to spend during each transaction. Multiplied by the margin, this gives the monetary value of each transaction to the firm.

DET (Discounted Expected Transactions): the total number of transactions a customer is expected to make in her lifetime. Transactions further in the future are discounted in order to take the time value of money into account.
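A small worked sketch of the CLV formula above. The per-period transaction count, discount rate, margin, and transaction value are illustrative numbers, not estimates from the paper:

```python
def discounted_expected_transactions(per_period, discount_rate, periods):
    # Discount each future period's expected transaction count to the present,
    # so later transactions contribute less (time value of money).
    return sum(per_period / (1 + discount_rate) ** t
               for t in range(1, periods + 1))

def clv(margin, transaction_value, det):
    # CLV = margin x transaction revenue x DET, as defined above.
    return margin * transaction_value * det

# Illustrative numbers: 2 expected transactions per period over 3 periods,
# a 10% per-period discount rate, a 25% margin, and $40 per transaction.
det = discounted_expected_transactions(2.0, 0.10, 3)
value = clv(0.25, 40.0, det)
```

Note how the same 6 expected transactions are worth less than their face value once discounted (DET ≈ 4.97 rather than 6).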

For model-fitting purposes, it may not be realistic to calculate lifetime transactions; this is a difficult number to test against, since customers may stay with a firm for many years. A cleaner approach for model fitting is to choose a certain time horizon and estimate the number of transactions a customer will make in that period; this can then be compared to actual data for training. Similarly, if margin information is unavailable, it is better to leave it out of the calculations while training models. Therefore, a new metric, "Future Customer Spend" (FCS), is defined to be tested against. Note that once the models have been estimated, their parameters can be used to build full CLV estimates again.

FCS(t) = trans(t) × transaction revenue

where trans(t) is the expected number of transactions in the next t periods.

Two factors need to be estimated to calculate FCS: the number of future transactions a customer is likely to make, and the expected revenue per transaction. Following Fader et al. [7], these factors are assumed to be independent and are modeled separately, using the Pareto/NBD and Gamma/Gamma models respectively.

2.5.1 Pareto/NBD

The Pareto/NBD model is used to predict the future transactions a customer will make, and relies on the following assumptions:

• Customers are "alive" for a certain amount of time, after which they are permanently inactive.


• While alive, customer i's purchasing behavior follows a Poisson process with transaction rate λi.

• Heterogeneity in transaction rates across customers follows a gamma distribution with parameters r and α:

g(λi | r, α) = α^r λi^(r−1) e^(−λiα) / Γ(r)

• The time that a customer i remains active follows an exponential distribution with dropout rate µi.

• Heterogeneity in dropout rates across customers follows a gamma distribution with parameters s and β:

g(µi | s, β) = β^s µi^(s−1) e^(−µiβ) / Γ(s)

• λi and µi vary independently across customers.

This technique uses only the length of time the customer was observed for, the time of the customer's last transaction, and how many transactions the customer made. All other information, such as demographics or the times of earlier transactions, is unused.

Note that a model for any customer's number of transactions is obtained with four parameters which are the same for the entire population: r, α, s, and β.

For a detailed note on the implementation and derivation of the model, see Fader and Hardie [6].
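The generative assumptions above can be sketched as a simulation. The parameter values here are purely illustrative, and the paper's α and β are treated as rate parameters of the gamma distributions, matching the densities given above:

```python
import random

def simulate_customer(r, alpha, s, beta, horizon, rng):
    """Draw one customer's purchase history from the Pareto/NBD assumptions."""
    # Individual rates come from population-level gamma distributions; since
    # the densities above use rate parameters, the scale is their inverse.
    lam = rng.gammavariate(r, 1.0 / alpha)   # transaction rate λi
    mu = rng.gammavariate(s, 1.0 / beta)     # dropout rate µi
    # Lifetime is exponential with rate µi, truncated at the observation window.
    lifetime = min(rng.expovariate(mu), horizon)
    # While alive, purchases arrive as a Poisson process with rate λi.
    t, purchase_times = 0.0, []
    while True:
        t += rng.expovariate(lam)
        if t > lifetime:
            break
        purchase_times.append(t)
    return purchase_times

rng = random.Random(0)
# Illustrative parameters only; horizon of 52 periods (e.g. weeks).
histories = [simulate_customer(0.5, 10.0, 0.6, 12.0, 52.0, rng)
             for _ in range(1000)]
```

Simulating from the assumptions like this is a useful check that a fitted model reproduces the observed distribution of transaction counts.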

2.5.2 Gamma/Gamma

The Gamma/Gamma model is used to predict how much a customer will spend on each transaction. Its assumptions:

• The amounts customer i spends on transactions are i.i.d. gamma variables with shape parameter pi and scale parameter vi.

• p is constant across customers, and vi is distributed across customers according to a gamma distribution with shape parameter q and scale parameter γ.

Note that a model for any customer's spending per transaction is obtained with three parameters which are the same for the entire population: p, q, and γ.

For more details on implementation and derivation, see Fader et al. [7].
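A small simulation of the Gamma/Gamma assumptions, following the document's parameterisation (constant shape p, customer-level scale vi drawn from a gamma distribution); all numeric parameter values are invented for illustration:

```python
import random

def simulate_spends(p, q, gamma_scale, n_customers, tx_per_customer, rng):
    """Draw per-transaction spend amounts under the Gamma/Gamma assumptions."""
    spends = []
    for _ in range(n_customers):
        # Customer-level scale vi varies across customers: vi ~ Gamma(q, γ).
        v = rng.gammavariate(q, gamma_scale)
        # Spends for this customer are i.i.d. Gamma(shape p, scale vi).
        spends.extend(rng.gammavariate(p, v) for _ in range(tx_per_customer))
    return spends

rng = random.Random(1)
spends = simulate_spends(p=2.0, q=3.0, gamma_scale=5.0,
                         n_customers=2000, tx_per_customer=5, rng=rng)
# Under this parameterisation the population mean spend is
# p * E[vi] = p * q * gamma_scale = 30 in this illustrative setup.
mean_spend = sum(spends) / len(spends)
```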

2.6 Multitask learning

Multitask learning comprises a broad set of tools and techniques for learning several related tasks through a shared representation. One of the strengths of this approach is the ability of the generated model to leverage commonalities in the data. By learning several related tasks in parallel, each task's model can be improved.

Multitask learning has been shown to work particularly well in situations of sparse data [12]. In such situations, each task can be represented as a linear combination of underlying basis tasks with sparse coefficients. The degree of sharing corresponds to how much the tasks' sparsity patterns inherently overlap. This makes it possible to capture granularity in the data by leveraging similarity in basis-representation patterns.

3. SYSTEM MODEL

Current industry standards follow a relatively simple approach to CLV estimation: most firms take the product of past transaction values and the length of the future forecast period to produce a numerical CLV. More refined models use the Pareto/NBD and Gamma/Gamma models to predict future CLV, as shown in the Related Work section.

First, a brief overview of the proposed model is provided. Second, the current standard probability models are examined and their shortcomings noted. Third, the proposed model is expounded upon in more detail. Finally, some key applications of the model are listed.

The proposed system extends both naive and advanced current industry standards by incorporating multitask learning. When the solutions to certain tasks are related through some underlying structure, an agent may use multitask learning to share knowledge between the tasks. In the proposed model, the transactions are first segmented by product category. Second, an individual model is produced for each product category, outputting parameters for the Pareto/NBD and Gamma/Gamma distributions. Third, sparse coding is performed on the shared basis, which includes all parameters for each product category, further refining the parameters of the individual categories. Finally, the refined parameters obtained from the multitask approach are used to calculate the expected value of CLV over a given time span.
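The first step of this pipeline, segmenting the transaction log by product category, might look like the following sketch; the transaction tuple layout is assumed for illustration:

```python
from collections import defaultdict

def segment_by_category(transactions):
    """Step 1: split the transaction log into one list per product category."""
    # Each transaction is assumed to look like
    # (customer_id, category, timestamp, amount); the field order is illustrative.
    by_category = defaultdict(list)
    for tx in transactions:
        by_category[tx[1]].append(tx)
    return dict(by_category)

log = [("c1", "TV", "2012-01-03", 499.0),
       ("c2", "Camera", "2012-02-10", 120.0),
       ("c1", "Camera", "2012-03-21", 89.0)]
segments = segment_by_category(log)
```

Each resulting per-category list then feeds an independent Pareto/NBD and Gamma/Gamma fit in step 2.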

Figure 1 shows how the model splits products by product category and applies a Pareto/NBD and Gamma/Gamma model to each category individually. In Figure 1, the samples drawn from product categories 1, 2, and 3 are sufficient to fit the statistical model to each of those categories individually. For simplicity of the figure, α and β represent the model parameters, rather than the full seven parameters used in reality. For each fitted model, parameter vectors α and β describe the product category: α1 and β1 are the model parameters for category 1, α2 and β2 for category 2, and α3 and β3 for category 3.

Now consider a new product category, category 4, with sparse data. Data is considered sparse if the training set has fewer than 150 data points from which to produce a well-fitted model. The model then encounters the statistical problem that a small sample may not capture the full range of the population. Any model estimated for such a category would be subject to overfitting, primarily due to the lack of data points for generating an accurate model. This can have drastically negative effects, including misallocation of resources and ineffective strategy and planning. The problem may arise either because product category 4 is new, or because of inherent changes in the market structure yielding less data.

At this point, the single-model approach would be to simply fit a model across all the product categories and then apply that model's parameters to describe category 4 (along with all other categories). The separate-model approach aims to avoid discarding granularity; the implementation of Multitask learning is used to solve this problem instead (as illustrated in Figures 2 and 3).

Figure 1: Fitting per-category models. Category 4 is a new product category in which a limited number of transactions have been observed.

Figure 2: The proposed model lies somewhere between single-category models and a large shared model.

Figure 3: Shared knowledge is obtained by fitting a model to the parameters of the individual models. The new shared model is then used to supplement the parameters of the sparse models.

Multitask learning enables knowledge acquisition for related tasks through a shared representation. This form of learning gives categories with insufficient data an opportunity to learn from knowledge previously obtained by models that have sufficient data to produce a good fit. It is crucial to note that the degree of sharing is not absolute, but rather tailored to each dataset. This produces a representation carefully fitted to the data being used, rather than a generic model for CLV estimation. After the individual models are estimated using the probability models, a shared basis is produced. The shared basis is a set of linearly independent vectors whose linear combinations can represent the entire vector set. This shared basis includes a row of the seven estimated parameters, four from the Pareto/NBD and three from the Gamma/Gamma, for each of the product categories.

In the multitask learning paradigm, several related CLV estimation tasks are learned together by sharing knowledge between them. It is important to note that tasks in the shared basis can selectively share information with other tasks, resulting in a more independent and dynamic system. Consider the representation of each task's parameter vector as a linear combination of a finite number of underlying basis tasks. The coefficients of the linear combination are generally sparse for the tasks being estimated. Moreover, there is overlap in the sparsity patterns between different tasks, which determines the degree of sharing in the model. Since each task is represented by only a vector of seven parameters, the model exists in a low-dimensional subspace, allowing the tasks to overlap with each other in multiple bases. This form of machine learning is known as sparse coding, a technique for representing an input signal by a small number of basis functions [13]. Sparse coding outputs a basis set and a weight vector which, when combined, reconstruct the signal, integrating shared knowledge from other tasks. The product of the basis matrix and the weight vector hence produces a new matrix of refined parameters for each product category.

Consider a finite but large set of transaction data in thetraining set X = [x1, x2, x3...xn] in Rm×n. Here, n is thenumber product categories being modelled, and m is theseven parameters described above. Let D in Rm×k be thedictionary representation, wherein each column representsa basis vector. Hence, X admits a sparse approximationover the dictionary with a atoms. Let L(x,D) be a lossfunction which must be minimized for a well fitted D. Thenumber of product categories, n, is larger than the sevenmodel parameters implemented, as is typical for applicationsof this type of learning. Moreover, the degree of sharing, k, isat most equal to n, since sparse coding is implemented ratherthan an overcomplete dictionary. This degree of sharing isa control knob that can be moved along the spectrum tofind the optimal solution. The loss function L(x,D) is theoptimal value of the Lasso regularization problem with theregularization parameter λ:

    L(x, D) = min_{α ∈ R^k}  (1/2) ||x − Dα||₂² + λ ||α||₁

A fundamental characteristic of this technique is that the ℓ1 penalty results in a sparse solution for α in the basis pursuit. Sparse coding hence enables a representation of shared basis vectors, which further refines the individual parameters.
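As a concrete illustration, the loss above can be minimized with off-the-shelf dictionary-learning tools. The sketch below uses scikit-learn's DictionaryLearning in place of the MATLAB SPAMS toolbox used by the actual system, and synthetic data in place of the fitted Pareto/NBD and Gamma/Gamma parameters; note that scikit-learn stores one category per row, the transpose of the R^(m×n) convention in the text.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

# Synthetic stand-in for the parameter matrix: one row per product
# category, seven columns for the model parameters (the real values
# come from the Pareto/NBD + Gamma/Gamma fitting step).
n_categories, n_params = 20, 7
X = rng.gamma(shape=2.0, scale=1.0, size=(n_categories, n_params))

# Learn a dictionary D of k basis tasks and sparse codes alpha, i.e.
# minimize (1/2)||x - D alpha||_2^2 + lam*||alpha||_1 per category.
k, lam = 5, 0.1
dl = DictionaryLearning(n_components=k, alpha=lam,
                        transform_algorithm='lasso_lars',
                        transform_alpha=lam, random_state=0)
codes = dl.fit_transform(X)   # sparse coefficients, shape (n, k)
D = dl.components_            # basis tasks, shape (k, 7)

# Product of codes and basis: the refined parameter matrix that
# integrates knowledge shared across product categories.
X_refined = codes @ D
```

The ℓ1 penalty drives many entries of each row of `codes` to zero, so each category is rebuilt from only a few basis tasks, and categories whose codes overlap effectively share knowledge.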

One of the challenges of sparse coding is determining the optimal degree of sharing, which may lie anywhere on the spectrum between completely individualized models and a single model with all data shared. In order to determine the best degree of sharing, the model uses an out-of-sample testing procedure to compare the CLV estimates produced by the shared representation against the actual CLV. The k producing the model with the least percentage error is considered optimal; the development set is thus leveraged to ensure that an overfitted model is not chosen. These parameters are then used to estimate the conditional expected value of the amount a customer will spend, in order to determine the CLV per category for each customer. For any customer, the aggregate of the CLV over all product categories is the projected future CLV.
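The out-of-sample selection of the degree of sharing can be sketched as a simple grid search. Everything below is a synthetic stand-in: the data, the CLV proxy, and the error measure replace the real development-set evaluation, which would plug the refined parameters back into the conditional expected-spend formula.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(1)
n_categories, n_params = 20, 7

# Stand-ins for the fitted per-category parameters and for the CLV
# actually observed in the development period.
X_train = rng.gamma(2.0, 1.0, size=(n_categories, n_params))
dev_clv = rng.gamma(2.0, 50.0, size=n_categories)

def mean_abs_pct_error(refined, actual):
    # Toy CLV proxy (rescaled row sums) purely so the loop runs
    # end to end; not the Pareto/NBD conditional expectation.
    proxy = refined.sum(axis=1)
    predicted = proxy * actual.mean() / proxy.mean()
    return float(np.mean(np.abs(predicted - actual) / actual))

best_k, best_err = None, float('inf')
for k in range(1, 9):  # candidate degrees of sharing
    dl = DictionaryLearning(n_components=k, alpha=0.1,
                            transform_algorithm='lasso_lars',
                            transform_alpha=0.1, random_state=0)
    refined = dl.fit_transform(X_train) @ dl.components_
    err = mean_abs_pct_error(refined, dev_clv)
    if err < best_err:
        best_k, best_err = k, err
```

Because the comparison is made on held-out development data rather than on the training fit, a k that merely memorizes the training parameters is penalized, which is how overfitting is avoided.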

A robust implementation of the system is provided, and experiments are run to compare the system with existing models. This enables testing the validity and reliability of the proposed model. Further details are provided in Section 4.


Figure 4: System implementation pipeline.

4. SYSTEM IMPLEMENTATION

In order to compare the proposed model to existing models, two different models are generated using the available data: one uses the existing single-model approach in industry, and the second applies the proposed solution (the shared knowledge model). In order to evaluate the strength of the proposed model, it was tested on data comprising 82,000 anonymized rows of transactions of a large online retailer, Amazon, from January 2012 to December 2012. The data (see Appendix A) contains rows of transaction data, consisting of several variables such as time of transaction, the relevant product category, item purchased, customer ID, zip code, average income level, etc. Note that since the data is anonymized, for CLV estimation each customer is identified by their unique customer ID. The data was obtained from ComScore, through the Wharton Research Data Services (WRDS).

Before starting, transactions are divided into training, development, and testing sets, with the training set preceding the development and testing sets chronologically. A two-month span is used for each of the sets. As described in Section 3, the metric used to test results is Future Customer Spend (FCS). The training period is used to forecast how much customers will spend in the testing period. The development set is used to determine the degree of sharing that is optimal for the given dataset; this is then verified by observing actual spending behavior in the test period.
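Assuming each transaction row carries a date, the chronological split might look like the following pandas sketch (column names and dates are illustrative, not the actual ComScore field names):

```python
import pandas as pd

# Toy transaction log in the shape of the ComScore data.
tx = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 2, 1],
    'category':    ['books', 'music', 'books', 'toys', 'music', 'books'],
    'amount':      [12.0, 8.5, 30.0, 15.0, 9.0, 22.0],
    'date': pd.to_datetime(['2012-01-05', '2012-02-20', '2012-03-10',
                            '2012-04-02', '2012-05-15', '2012-06-01']),
})

# Chronological two-month windows: train precedes dev precedes test.
train = tx[(tx.date >= '2012-01-01') & (tx.date < '2012-03-01')]
dev   = tx[(tx.date >= '2012-03-01') & (tx.date < '2012-05-01')]
test  = tx[(tx.date >= '2012-05-01') & (tx.date < '2012-07-01')]
```

Splitting by time rather than at random is what makes the evaluation a genuine forecast: every transaction the model is scored on postdates everything it was fitted on.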

To measure the effectiveness of building separate models for each category, the following steps are conducted:

i. The Pareto/NBD and Gamma/Gamma distributions are used to estimate CLV per customer, using all transactions observed in the training period for each category;

ii. These parameters are then fed into a matrix containing the 7 parameters defining all the product categories;

iii. Sparse coding with 100 iterations is performed on the matrix, with an increasing degree of sharing;

iv. The matrix of refined parameters obtained for each degree of sharing is then used to estimate the conditional expected value of the CLV on the development set. This process is repeated for each of the matrices produced in the training set, for every degree of sharing;

v. For each degree of sharing, the estimated CLV produced by the refined parameter matrix is compared to the actual CLV of the customers. The optimal degree of sharing is the one with the least absolute percentage error in CLV;

vi. The parameter matrix of the optimal degree of sharing is then used to calculate CLV over the testing period, enabling a robust methodology of CLV estimation;

vii. For evaluation of the system, the absolute percentage error in CLV estimation by the system model is compared to that of existing industry standards.

The entire system is an end-to-end pipeline implemented in MATLAB. The pipeline takes as input the file containing the rows of transactions, and outputs the CLV of each customer. In order to perform sparse coding, the open-source Sparse Modelling Software developed by the French Institute for Research in Computer Science and Automation (INRIA) is used, in particular the mexTrainDL routine ([14]). This enables a generic and intuitive system, whereby users can focus primarily on catering to the market, rather than putting arduous effort into plugging discrete pieces together.

The end-to-end pipeline also enabled running several experiments to determine whether certain dependent variables have a statistically significant effect. A key experiment, testing whether the model performs better with sparse data, was run by increasing the number of buckets the data was divided into. The number of buckets was varied between 5 and 8 in order to vary the sparsity of the training and development data used. Through tuning of the input parameters to the system, the entire package was made modular, leading to fewer errors in model design as well as a much more intuitive user experience.

Key insights are visualized through a consumer-facing web application which plots the geographic distribution of the customer lifetime values across all product categories using the Google Maps API. Besides the Google Maps API, the application also makes use of the Bootstrap framework for stylistic elements, the D3 framework ([1]) for rendering bar and pie charts, and the Springy.js framework ([5]) for force-directed distance graphs that are used to show relationships between different product categories. More details are provided in Section 6.


5. RESULTS

The effective system implementation described above leads to key results testing the validity and reliability of the proposed model. The overall results are obtained by comparing the shared model approach with the individual model and the industry standard (overall model). There are two prime findings of this model that extend industry standards: (i) multitask learning produces much more robust results in product categories with sparse data; (ii) sharing knowledge decreases the variance in the error observed.

In order to evaluate the estimation error for the shared and individual model approaches, the absolute percentage error for each product category is plotted against the number of customers in each product category. As can be seen in Figure 5, for product categories with fewer than 150 customers who made transactions, the shared model approach brings the absolute error down to as low as 0.6, compared to the 0.9 error obtained by the individual model. Therefore, in order to model product categories with fewer than 150 data points, the shared model approach should be used. This also suggests that a combination model would work best for the entire data set: use the shared model approach for product categories with fewer than 150 customers who have made transactions, and the individual model approach for product categories with more than 150 such customers.

Figure 5: The shared model approach improves es-timation of CLV for small product categories.

Other than the absolute error per category, it is also interesting to measure the mean absolute error across categories. The fraction of customers taken into account in the model is varied, and the mean absolute error is calculated for both the shared model and the industry standard model approaches. For different fractions of customers, the average error is similar between the proposed model and the industry standard. According to this measure of error, the proposed model is on track with current implementations, but presents better recall than existing models.

Figure 6 shows the difference between the standard deviation of error for the different fractions of customers taken into account in the models. A much lower standard deviation is found in the shared model as compared to the industry standard approach. This is a strong signal that CLV can be calculated more consistently across categories in the shared model approach, leading to an overall better product category fit. This is likely due to the fact that a model built using all categories heavily favors large categories, leaving small categories prone to high degrees of error. Once again, this suggests a hybrid-model approach for an overarching solution, as described above.

Figure 6: The shared model approach leads to lowervariation in error estimations.

A final study was done to test the stability of the results. For comparison purposes, the degree of sharing k was fixed at 5. Then, three models were trained on four months of data, and tested on two months of subsequent data in order to find the error in estimation.

                    Seg. 1   Seg. 2   Seg. 3   Seg. 4
  Mean Size         105.34   274.43   411.00   703.60
  Categories        116      40       6        5
  Individual Model  4.38     0.52     0.36     0.35
  Overall Model     3.89     1.27     0.52     0.24
  Shared Model      1.68     0.59     0.71     0.35

Table 1: Mean error rates for the different models are shown. The shared model performs best with small product categories.

This process was repeated five times with five time periods, starting on January 1, January 16, February 1, February 15, and March 1. All categories were then pooled and divided into the four segments shown in Table 1. Segment 1 contains all categories with fewer than 200 customers making transactions in the training period; segment 2 contains


all categories with more than 200 and fewer than 400 such customers; and segments 3 and 4 follow the pattern (400-600 and 600-800 such customers, respectively).
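The segment boundaries above amount to a simple bucketing rule, sketched below; the category names and sizes are made up for illustration.

```python
# Boundaries from the text: <200, 200-400, 400-600, 600-800 customers
# transacting in the training period.
def segment(n_customers):
    if n_customers < 200:
        return 1
    elif n_customers < 400:
        return 2
    elif n_customers < 600:
        return 3
    return 4

# Hypothetical per-category customer counts.
category_sizes = {'books': 150, 'music': 350, 'toys': 410, 'video': 700}
segments = {c: segment(n) for c, n in category_sizes.items()}
# segments == {'books': 1, 'music': 2, 'toys': 3, 'video': 4}
```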

The results of this study are shown in Table 1. Even repeated across a larger sample, it still holds true that the shared model outperforms the other models for small product categories, while remaining competitive for larger product categories.

6. FINAL PRODUCT

The final product, CustoVal, is a consumer-facing front-end web application that uses JavaScript, HTML, CSS, PHP, and Python technologies. Once the output from the CLV analysis is obtained, it is transferred via a Python script to the cloud and stored in a format that can be read directly by the application.

The application focuses on providing visualizations and insights to marketing professionals, to enable informed decision-making on how much to spend on targeting customers from a specific product category and geographic location, as well as how to maximize a customer's spend across all product categories. The application has two modes: (i) the overall insight mode, highlighting key insights for all product categories, and (ii) the per-category mode, enabling the user to pick a product category and view insights specific to that category.

The overall view is shown in Figure 7. The Google Maps API effectively highlights the correlation between geographic location and CLV by comprehensively depicting the geographic layout of future customer transactions. The customers are represented as circles on the US map at the location where they made their purchase. The size of each circle represents the transaction value, and overlapping transactions result in circles of higher color intensity, indicating higher frequencies of transactions. In addition to the geographical distribution, the "all category" view also provides three pie charts that show the share of transactions made in each of the product categories for all customers, the top 10% of customers, and the top 5% of customers.

The per-category view, displayed in Figure 8, also provides a geographical visualization, but only of the customers that made purchases in that product category. In addition, this view provides the customer acquisition cost for that product category, as well as a force graph showcasing the relationship of the product category with others. The spring graph is developed by calculating the mean distance of the product category's parameters from the parameters of the other product categories. This visualization is meant to provide marketing professionals with insights as to which product categories are similar to which others, and how they can cross-sell to customers across product categories.
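One plausible reading of this distance computation is the mean absolute difference between the seven-parameter vectors of two categories; that interpretation, along with the category names and parameter values below, is an assumption made for illustration.

```python
import numpy as np

# Made-up seven-parameter vectors for three categories.
params = {
    'books': np.array([1.0, 0.5, 2.0, 0.30, 1.2, 0.8, 0.1]),
    'music': np.array([1.1, 0.4, 1.9, 0.35, 1.0, 0.9, 0.2]),
    'toys':  np.array([3.0, 2.5, 0.2, 1.80, 0.4, 0.1, 2.2]),
}

def edge_length(a, b):
    # Assumed metric: mean absolute difference of the parameter
    # vectors, used as the rest length of a spring-graph edge.
    return float(np.mean(np.abs(params[a] - params[b])))

names = list(params)
distances = {(a, b): edge_length(a, b)
             for i, a in enumerate(names)
             for b in names[i + 1:]}
```

Categories with similar fitted parameters then sit close together in the Springy.js force-directed layout, which is what conveys "similar categories" to the marketer at a glance.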

Along with the geographic rendering, several other tools such as the Bootstrap and D3 frameworks ([1]) are used for rendering the graphs and visualizing the data in an intuitive manner. Moreover, the Springy.js framework ([5]) is used to depict force-directed distance graphs, indicating the correlation between different product categories observed in the data.

These are the features currently available in Version 1 of CustoVal, allowing firms to leverage research in Machine Learning to make strategic decisions regarding customer acquisition and targeting. There are several opportunities for

Figure 7: The overall view allows marketers to viewa comparison between product categories.

the types of applications the proposed model could have in industry, especially in enhancing marketing strategies and reducing customer acquisition costs. Some of the most important applications are enumerated below:

1. Recover important customer spending potential across different product categories, allowing more specific targeting of high-value customers for a product category through marketing campaigns.

2. Improve retargeting and ROI (Return on Investment) for customers by showing them ads in other product categories of value to them.

3. Accurately predict customer spending potential for new product categories that have not yet seen many transactions.

On a concluding note, the key goal of this proposed model is to truly extend the current industry benchmark on estimating Customer Lifetime Value. These metrics can help firms leverage the data they hold on customers, particularly in situations with sparse data, such as the launch of a new product category or a fundamental change in tastes or market structure. Future iterations of the product will involve a more integrated pipeline that enables marketing managers to upload spreadsheets in Excel format, which is used often in industry. Future work also involves providing more insights


Figure 8: The individual category view provides a more in-depth analysis of the customers of each product category.

on new product categories in which customers could potentially be valuable, as well as directly designing ad campaigns for the Google and Facebook ad platforms.

7. ETHICS

This model carries relatively few unexpected ethical issues beyond those native to customer analytics. The three chief problems are economic risk to a company which uses the model, privacy issues for the customers being analyzed, and stereotype propagation.

A company which used this model to inform its customer acquisition or other marketing strategies would be exposed to economic risk if the projections failed to reflect reality. That is, a company might misallocate resources and suffer monetary losses by using the model. This risk is common to any system that predicts CLV, and little can be done to mitigate it except continuous improvement of the models. It is crucial to note, however, that the proposed model must not be used as a magical cure for CLV estimation, but rather as an extension to current industry standards for greater accuracy in conditions of sparsity.

Use of this model, or of any CLV system which uses data on individual customers, also poses an ethical issue with regard to the privacy of customers. Customers may not realize that their individual transaction records are being scrutinized and used to build future, possibly personalized, marketing campaigns. This problem can be ameliorated by anonymizing customers before the model is used. Though this limits the ability of a company to build personalized campaigns, the data yielded would still be useful, and most privacy concerns would be allayed.

Another potential source of future contention could be the negative effects of targeting consumers. Several customers may have growing tastes, and estimating their importance to a firm, particularly in the early stages of the development of those tastes, could have a spillover effect on the advertising they are subject to, as well as on the products they purchase in the future. This can lead to a self-reinforcing feedback loop, which may result in a firm's view of customer tastes becoming extremely categorical and discrete. This can have larger social impact through stereotype and image propagation. It can be prevented by not taking extremely new and developing consumers into account in the development of the CLV model, as well as by carefully ensuring that the advertising avoids reinforcing negative portrayals and stereotypes.

8. FUTURE DEVELOPMENT

Models of customer purchasing behavior typically fall into four categories, defined by the type of purchasing behavior and the type of customer relationship. Purchasing behavior can be either discrete (e.g. a company with quarterly donation requests) or continuous (like most grocery stores: you can buy at any time). Customer relationships can be contractual (e.g. subscription-based services like cable television) or non-contractual.

CustoVal uses the Pareto/NBD model, which applies to non-contractual, continuous-time purchasing. The reasoning behind this choice was that most purchasing follows this behavior, yet it is the most difficult to model; hence, any improvement over current systems would have an impact. Future development of CustoVal would explore the other types of purchasing. The pipeline design allows replacing only very specific parts of the system, which would make this process easier.

Furthermore, the system could be implemented as a hybrid-model approach rather than strictly using the individual, overall, or shared models. It has been seen that the shared model does best on small product categories, while the larger product categories perform better with individual models. Thus, it would be worth investigating an implementation which chooses the optimal type of modelling based on product category size. Further research would be needed to find the point at which to distinguish between small and large categories.
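Such a hybrid rule could be as simple as a size threshold. In the sketch below, the 150-customer cut-off is borrowed from the Figure 5 discussion; finding the true cut-off is exactly the open research question noted above.

```python
# Assumed threshold from the results section; the optimal value is
# an open question for future work.
SMALL_CATEGORY_THRESHOLD = 150

def choose_model(n_customers):
    """Pick the model type expected to estimate CLV best for a
    category with the given number of transacting customers."""
    if n_customers < SMALL_CATEGORY_THRESHOLD:
        return 'shared'      # sparse data: borrow strength across tasks
    return 'individual'      # ample data: a dedicated model suffices
```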

Moreover, the proposed system model weighted each of the categories equally for CLV estimation. This tended to work well, since a sparse history indicates a lower estimated CLV value. However, instead of relying primarily on transaction history, the model could potentially be extended to take characteristics of the consumer, such as income, age, and gender, into account. This could result in the product category models being weighted differently, with each of the weights learned from the characteristics of the consumers. This would incorporate an interesting element of customer features into the analytics, providing more room for experimental research.

9. REFERENCES

[1] Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. D3: Data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 17(12):2301–2309, 2011.

[2] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[3] Ching-Hsue Cheng and You-Shyang Chen. Classifying the segmentation of customer value via RFM model and RS theory. Expert Systems with Applications, 36(3):4176–4184, 2009.

[4] David R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society, Series B (Methodological), pages 187–220, 1972.

[5] Jos de Jong and Giovanni E. Pazienza. Browser-based graph visualization of dynamic data with visgraph. In Graph Drawing, page 518. Springer.

[6] Peter S. Fader and Bruce G.S. Hardie. A note on deriving the Pareto/NBD model and related expressions, November 2005.

[7] Peter S. Fader, Bruce G.S. Hardie, and Ka Lok Lee. RFM and CLV: Using iso-value curves for customer base analysis. Journal of Marketing Research, pages 415–430, 2005.

[8] Sunil Gupta, Dominique Hanssens, Bruce Hardie, William Kahn, V. Kumar, Nathaniel Lin, Nalini Ravishanker, and S. Sriram. Modeling customer lifetime value. Journal of Service Research, 9(2):139–155, 2006.

[9] Sunil Gupta, Donald Lehmann, and Jennifer Ames Stuart. Valuing customers. Journal of Marketing Research, pages 7–18, 2004.

[10] John E. Hogan, Katherine N. Lemon, and Barak Libai. What is the true value of a lost customer? Journal of Service Research, 5(3):196–208, 2003.

[11] Aaron Knott, Andrew Hayes, and Scott A. Neslin. Next-product-to-buy models for cross-selling applications. Journal of Interactive Marketing, 16(3):59–75, 2002.

[12] Abhishek Kumar and Hal Daumé III. Learning task grouping and overlap in multi-task learning. arXiv preprint arXiv:1206.6417, 2012.

[13] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 689–696. ACM, 2009.

[14] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, March 2010.

[15] Phillip E. Pfeifer. On the use of customer lifetime value as a limit on acquisition spending. Journal of Database Marketing, 7:81–86, 1999.

[16] David C. Schmittlein, Donald G. Morrison, and Richard Colombo. Counting your customers: Who are they and what will they do next? Management Science, 33(1):1–24, 1987.

[17] Jacquelyn S. Thomas. A methodology for linking customer acquisition to customer retention. Journal of Marketing Research, pages 262–268, 2001.

[18] Jacquelyn S. Thomas, Robert C. Blattberg, and Edward J. Fox. Recapturing lost customers. Journal of Marketing Research, pages 31–45, 2004.

[19] Jo-Ting Wei, Shih-Yen Lin, and Hsin-Hung Wu. A review of the application of the RFM model. African Journal of Business Management, 4(19):4199–4206, 2010.

APPENDIX

A. DATA

From ComScore, via the Wharton Research Data Services (WRDS) database, it is possible to obtain transaction data for major websites such as Amazon.com, with Machine ID (a unique identifier for each customer), transaction date and time, purchase amount and details, session details, demographic information, and zip code for each customer. The proposed model uses only the Machine ID, the number of transactions made by each customer, the date of the last transaction, and the average amount spent per transaction. Zip code is used to produce a geographic representation of the data.
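The reduction from raw transaction rows to the per-customer fields listed above can be sketched with a pandas group-by; the column names are illustrative, not the actual WRDS schema.

```python
import pandas as pd

# Toy rows in the shape of the ComScore extract.
tx = pd.DataFrame({
    'machine_id': [1, 1, 2, 2, 2],
    'amount':     [10.0, 20.0, 5.0, 5.0, 8.0],
    'date': pd.to_datetime(['2012-01-03', '2012-02-10', '2012-01-20',
                            '2012-03-05', '2012-04-11']),
})

# Per-customer summary used by the model: number of transactions,
# date of the last transaction, and average spend per transaction.
summary = tx.groupby('machine_id').agg(
    n_transactions=('amount', 'size'),
    last_transaction=('date', 'max'),
    avg_spend=('amount', 'mean'),
)
```

These three quantities correspond to the frequency, recency, and monetary-value inputs that the Pareto/NBD and Gamma/Gamma fitting steps consume.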