Uplift Modeling for Randomized Experiments and
Observational Studies
by
Xiao Fang
B.S., University of Science and Technology of China (2011)
S.M., Massachusetts Institute of Technology (2013)
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2018
© Massachusetts Institute of Technology 2018. All rights reserved.
Author ......................... Signature redacted
Department of Electrical Engineering and Computer Science
December 1, 2017
Certified by ........... Signature redacted ............... David Simchi-Levi
Professor, Institute of Data, Systems, and Society,
Department of Civil and Environmental Engineering,
Operations Research Center
Thesis Supervisor
Accepted by .............. Signature redacted ............... Leslie A. Kolodziejski
Professor of Electrical Engineering and Computer Science
Chair, Department Committee on Graduate Students
Uplift Modeling for Randomized Experiments and
Observational Studies
by
Xiao Fang
Submitted to the Department of Electrical Engineering and Computer Science
on December 1, 2017, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy in Computer Science
Abstract
Uplift modeling refers to the problem of identifying, from a set of treatments, the candidate that leads to the most desirable outcome based on subject characteristics. Most work in the last century focuses on the average effect of a treatment across a population of interest but ignores subject heterogeneity, which is very common in the real world. Recently there has been an explosion of empirical settings that make it possible to infer individualized treatment responses.
We first consider the problem with data from randomized experiments. We put forward an unbiased estimate of the expected response, which makes it possible to evaluate an uplift model with multiple treatments. This is the first evaluation metric for uplift models in the literature that aligns with the problem objective. Based on this evaluation metric, we design an ensemble tree-based algorithm (CTS) for uplift modeling. The splitting criterion and termination conditions are derived with consideration of the special structure of the uplift modeling problem. Experimental results on synthetic data and industry data show the advantage of our specialized uplift modeling algorithm over the separate model approach and other existing uplift modeling algorithms.
We next prove the asymptotic properties of a simplified CTS algorithm. The exhaustive search for locally optimal splitting points makes it difficult to theoretically analyze tree-based algorithms. Thus we adapt dyadic splits to the CTS algorithm and obtain a bound on the regret, the expected performance difference between the oracle and our algorithm. The convergence rate of the regret depends on the feature dimension, which emphasizes the importance of feature selection. While model performance usually improves with the number of features, it requires exponentially more data to approximate the optimal treatment rule. Choosing the appropriate model complexity and selecting the most powerful features are critical to achieving desirable performance.
Finally we study the uplift modeling problem in the context of observational studies. In observational studies, treatment selection is influenced by subject characteristics. As a result, baseline characteristics often differ systematically between treatments, so confounding factors need to be untangled before valid predictions can be made. We combine a modification of the standard feed-forward architecture with our CTS algorithm to optimize predictive accuracy and minimize the feature distribution distance between treatments. Experimental results on synthetic data show that the combination of neural network feature representation and an ensemble tree-based model is promising for handling real-world problems.
Thesis Supervisor: David Simchi-Levi
Title: Professor, Institute of Data, Systems, and Society,
Department of Civil and Environmental Engineering,
Operations Research Center
This thesis is dedicated to my grandmother, Caichun Hong,
January 12, 1933-September 1, 2017.
Acknowledgments
This thesis is a reflection of my rewarding journey at MIT. I am fortunate and privi-
leged to be accompanied and supported by many incredible people.
First and foremost, I would like to express my deepest gratitude to my advisor,
Professor David Simchi-Levi. His keen vision of research and high professional stan-
dard has made a huge influence on me. David has taught me how to be an effective
researcher and the art of guiding research projects from start to finish. David gave
me a lot of freedom in exploring new ideas and experimenting with new methods,
while provided critical insights to guide the direction. This thesis would not have
been possible without his guidance, support and encouragement.
I would like to thank my thesis committee members, Prof. Peter Szolovits and
Prof. Luca Daniel, for taking the time to serve on my committee, and for their critical
insights that helped improve the quality of this thesis. I greatly appreciate their
commitment and flexibility amid such busy schedules.
I am also lucky to have collaborated with wonderful people during my PhD. The
work in Chapters 2 and 3 was done in close collaboration with Yan Zhao. Together we
discussed research, wrote code, and wrote papers. We have been through a lot of ups
and downs, paper rejections and acceptances. Conversations with Yan, whether on
research or on life, always brought me a lot of inspiration and fun. I cannot ask for
a better partner and friend. This thesis was funded by the MIT-Accenture Alliance,
which gave me the opportunity to work with numerous amazing people at Accenture:
Jason Coverston, Matthew O'Kane, and Andrew Fano. Their business perspective
helped shape the project.
I would like to express my gratitude to all my friends, both near and far. Their
friendship and support always make my life enjoyable. Special thanks to Xue Feng,
my ex-officemate, ex-roommate, and workmate-to-be, who has become my closest
friend after traveling the world together and sharing moods at different stages of life.
To my friends at MIT, Yin Wang, Tianli Zhou, Linsen Chong, Nate Bailey, Yiqun Hu,
and Peter Zhang: it has been my great pleasure to share the office space, together
with so much joy, with them. All the days and nights spent with them in and outside
the office are the best memories of my PhD life.
Finally, I owe the deepest appreciation to my parents for their unconditional love
and constant support. They instilled in me the value of education and a love of
learning at a very young age. They always encourage me to pursue my dreams with
their best wishes, even though I have to be thousands of miles away. None of my
achievements would be possible without their understanding.
Contents
1 Introduction 17
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.1 Uplift Modeling Problem . . . . . . . . . . . . . . . . . . . . . 17
1.1.2 Challenges and Literature Review . . . . . . . . . . . . . . . . 18
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 Uplift Modeling for Randomized Experiments 27
2.1 Problem Formulation and Model Evaluation . . . . . . . . . . . . . . 27
2.1.1 Problem Formulation and Notation . . . . . . . . . . . . . . . 27
2.1.2 Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 The CTS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.1 Advantage of CTS Over SMA . . . . . . . . . . . . . . . . . . 37
2.3.2 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.3 Priority Boarding Data . . . . . . . . . . . . . . . . . . . . . . 43
2.3.4 Seat Reservation Data . . . . . . . . . . . . . . . . . . . . . . 47
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3 Theoretical Analysis 55
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.1.1 Multi-Armed Bandit Problem . . . . . . . . . . . . . . . . . . 55
3.1.2 Tree-Based Algorithm . . . . . . . . . . . . . . . . . . . . . . 56
3.2 Consistency Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3 Asymptotic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.1 Algorithm CTS.0 . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3.2 Upper Bound on Expected Regret . . . . . . . . . . . . . . . . 67
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4 Uplift Modeling for Observational Studies 77
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.1 Synthetic Data 1 . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3.2 Synthetic Data 2 . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3.3 Simulation based on Real Data . . . . . . . . . . . . . . . . . 88
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5 Conclusion 91
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A Parameter Selection for CTS 95
B Variable Selection 97
C Parameter Selection for NN-CTS 99
List of Figures
2-1 An example of uplift curve [Rzepakowski 2012] . . . . . . . . . . . . . 30
2-2 True decision boundary of optimal treatment on the 2-D example . . 38
2-3 Comparison of performance between different approaches on the 2-D
example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2-4 Comparison of decision boundary between different approaches on the
2-D example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2-5 Averaged expected response of different algorithms on the synthetic
data. The error bars are 95% margin of error computed with results
from 10 training datasets. For each data size, all algorithms are trained
with the same 10 datasets. . . . . . . . . . . . . . . . . . . . . . . . . 42
2-6 Change in revenue per passenger with respect to passenger features . 44
2-7 Expected revenue per passenger from priority boarding based on dif-
ferent models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2-8 Modified uplift curves of different algorithms for the priority boarding
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2-9 Expected summed response per passenger from priority boarding with
the consideration of user utility. . . . . . . . . . . . . . . . . . . . . . 48
2-10 Expected response per passenger from priority boarding with the con-
sideration of user utility. . . . . . . . . . . . . . . . . . . . . . . . . . 49
2-11 Expected revenue per passenger from seat reservation when applying
different pricing models. . . . . . . . . . . . . . . . . . . . . . . . . . 51
2-12 Modified uplift curves of SMA-RF and CTS on the seat reservation data. 52
3-1 The optimal treatment rule for the 2D example. The vertical boundary
along X1 axis is located at X1 = 50. The vertical axis X2 is a discrete
variable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3-2 The treatment rule reconstructed by UCTS and CTS for the 2D ex-
ample. Plots on each are generated from the same training set. The
horizontal axis is feature X1 and the vertical axis feature X2 . The la-
bels and ticks of the axes are the same as those in Fig. 3-1 and omitted
here for simplicity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3-3 Average expected response under UCTS and CTS models for the 2D
example computed from 50 training sets with 95% confidence interval
shown as the error bar. . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3-4 Illustration of discretization error in 2-D space. The diagonal curve
represents the true decision boundary, with left bottom part assigned
to treatment yellow and right upper part assigned to treatment blue. 64
3-5 An example of feature space partition and treatment assignment for
Algorithm CTS.0 when X^d = [0, 1]^2 and M = 10. Subspace #1 is
assigned to treatment yellow and subspace #2 to an arbitrary treatment. 65
4-1 Causal graphical models of randomized experiment (left) and observa-
tional studies (right) . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4-2 Neural network architecture for uplift modeling . . . . . . . . . . . . 81
4-3 Averaged expected response of different algorithms on the synthetic
data simulating observational studies. The error bars are 95% margin
of error computed with results from 10 training datasets. For each
data size, all algorithms are trained with the same 10 datasets. ..... 84
4-4 Feature distribution after t-distributed stochastic neighbor embedding
(t-SN E ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4-5 Averaged expected response of different algorithms on the synthetic
data simulating randomized experiments. The error bars are 95% mar-
gin of error computed with results from 10 training datasets. For each
data size, all algorithms are trained with the same 10 datasets. ..... 87
5-1 The change of average revenue for different treatments over time with
90% confidence band (red: default price, blue: high price) . . . . . . . 92
List of Tables
1.1 Literature on uplift modeling . . . . . . . . . . . . . . . . . . . . . . 22
4.1 Standard errors of 10 repeated experiments for datasets simulating
observational studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Standard errors of 10 repeated experiments simulating randomized ex-
periment data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.3 Standard errors for 10 repeated experiments on IHDP dataset . . . . 89
B.1 Variable selection on seat reservation data . . . . . . . . . . . . . . . 97
C.1 Parameters and ranges . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Chapter 1
Introduction
1.1 Background
1.1.1 Uplift Modeling Problem
In the real world, we often face the treatment selection problem, where we need to
identify from a set of alternatives the candidate that leads to the most desirable
outcome. For example, airline companies need to set the price that generates the
highest revenue for priority boarding; doctors want to know which medication is
most effective at reducing blood pressure.
Most work in the last century focuses on the average effect of a treatment across a
population of interest but ignores subject heterogeneity, which is very common in
the real world. A price level that leads to the highest overall revenue might put off
passengers in some subpopulation. A medical treatment that is the most effective
over the entire patient population may be ineffective or even detrimental for patients
with certain conditions. This limitation exists because historically most datasets have
been too small to meaningfully explore heterogeneity of treatment response; recently,
however, there has been an explosion of empirical settings that make it possible to
infer individualized treatment responses based on subject characteristics.
The prediction of the optimal treatment based on subject characteristics is of great
interest and has critical applications. This personalized treatment selection problem,
also referred to as Uplift Modeling in the literature, is the main topic of this thesis.
1.1.2 Challenges and Literature Review
Neyman [Neyman 1935] and Fisher [Fisher 1935] discussed the use of randomized ex-
periments to attribute difference in responses to different treatments. Randomized
experiments, also called randomized controlled trials, are considered the gold stan-
dard approach for estimating treatment effects. In a randomized experiment, subjects
are randomly assigned a treatment and their responses are recorded. Uplift model-
ing also crucially relies on randomized experiments. Random treatment allocation
ensures that treatment status will not be confounded with either measured or un-
measured baseline characteristics. Therefore, the treatment effects can be estimated
by comparing responses between different treatments [Greenland 1999].
At the same time, a widespread accumulation of observational data is available
in many fields such as healthcare, economics, and education. For data collected in
observational studies, we do not have control over, or even a full understanding of,
the mechanism by which treatment is assigned to each subject, while subject charac-
teristics, treatment, and outcomes are all recorded. For example, a dataset may record
a patient's diabetic status under one of two anti-diabetic medications, as well as other
lab test results. However, which medication is taken by the patient may depend on
the patient's gender, past diagnoses, or economic status, which is unknown to us.
Although we do not control the treatment assignment procedure, we still want to
make use of these data to predict the optimal treatment given subject characteristics;
doing so requires more care than with data obtained from randomized experiments.
Despite the fact that every training example is accompanied by a response value,
uplift modeling is different from conventional classification/regression problems. This
key difference poses unique challenges. In a classification/regression problem, one
trains different models and selects the one that yields the most reliable prediction.
This requires sensible cross-validation strategies along with potential feature engi-
neering: we separate data into a training set and a test set, learn on the training
data, predict responses on the test data, and compare to the ground truth. However,
data collected for uplift modeling is effectively unlabeled, because we can only observe
the response under the assigned treatment and will never know the responses under
alternative treatments; thus it is impossible to know whether the chosen treatment is
optimal for any individual subject. This is known as the Fundamental Problem of
Causal Inference [Holland 1986]. The lack of the true value of the quantity that we
are trying to predict (the optimal treatment) makes it impossible to use standard
machine learning algorithms to estimate the optimal treatment directly.
One approach to Uplift Modeling is the Separate Model Approach (SMA). Data
is separated into groups by treatments, and a prediction model is built for each
treatment. When a new subject arrives, we assign the treatment that has the best
predicted response.
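The two-step recipe just described, fit one response model per treatment, then give each new subject the treatment with the highest predicted response, can be sketched as follows. This is an illustrative sketch, not the thesis's code; a plain least-squares fit stands in for whatever per-treatment regressor one prefers.

```python
import numpy as np

def fit_sma(X, t, y):
    """Separate Model Approach: fit one linear response model
    (ordinary least squares, with intercept) per treatment."""
    models = {}
    for k in np.unique(t):
        mask = (t == k)
        A = np.hstack([X[mask], np.ones((mask.sum(), 1))])
        coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        models[k] = coef
    return models

def sma_assign(models, X_new):
    """Assign each subject the treatment with the highest predicted response."""
    A = np.hstack([X_new, np.ones((X_new.shape[0], 1))])
    keys = sorted(models)
    preds = np.column_stack([A @ models[k] for k in keys])
    return np.array(keys)[preds.argmax(axis=1)]
```

On noiseless data with linear per-treatment responses this recovers the optimal rule exactly; the weaknesses of SMA discussed below appear once the uplift signal is much weaker than the response itself.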
The main advantage of this approach is its simplicity. It does not require devel-
opment of new algorithms and software, and any state-of-the-art machine learning
algorithm can be used in either a regression setting or a classification one. If uplift,
the difference in response between treatments, is strongly correlated with the class
attribute itself, or if the amount of training data is sufficient for the models to predict
response accurately, SMA will perform very well. We have seen applications of SMA
in [Hansotia 2002] for direct marketing and in [Manahan 2005] for customer retention.
However, when the uplift signal is much weaker than the response and the uplift
follows a distinct pattern from the response, SMA is often outperformed by other
methods [Lo 2002][Radcliffe 2011]. One reason is the mismatch between the training
objective and the problem's purpose. SMA trains models to accurately predict the
response under a single treatment, but what really matters is accurately identifying
the treatment with the best response. Since data is always noisy and insufficient,
there is error in each model trained under a single treatment, and these errors accu-
mulate when the predicted responses are compared to estimate uplift and select the
optimal treatment. [Radcliffe 2011] illustrates this phenomenon in a simulation study.
SMA will also lead to misguided feature selection. For example, the separate models
will prioritize features that have high predictive power for the response under an
individual treatment, while ignoring features that best predict the difference in re-
sponses across treatments. When the cross-treatment feature effect is relatively weak
compared to the individual-treatment feature effect and the training data is not abun-
dant, the separate model approach will focus more on the individual-treatment feature
effect, which is not what we want.
Because of the disadvantages of the Separate Model Approach, a number of al-
gorithms have been proposed to directly model the uplift effect. One approach is
the Class Transformation method introduced in [Jaskowski 2012]. In the case of a
binary response variable and a dataset balanced between treatments, it creates a new
target variable, which can be calculated from training data, and builds a relation-
ship between the new variable and the individualized treatment effect. However,
these two assumptions make it too restrictive. [Athey 2016] generalized it to unbal-
anced treatment assignment and continuous response variables by incorporating the
propensity score.
Much more research aims to model uplift directly through modification of
well-known classification machine learning algorithms, both parametric and non-
parametric. [Lo 2002] proposed a logistic regression formulation which explicitly
includes interaction terms between features and treatments. [Zaniewicz 2013]
adapts Support Vector Machines to predict the treatment effect. Nonparametric meth-
ods are also broadly used and have achieved good performance in real-world applica-
tions. [Alemi 2009] and [Su 2012] used an adaptation of K-Nearest Neighbors for uplift
modeling: treatment assignment is based on the empirically best treatment as measured
on the K training samples closest to each subject. The most popular and successful
approach is tree-based algorithms. [Chickering 2000] modifies the standard decision
tree construction procedure [Breiman 1984] by forcing a split on the treatment at each
leaf node. [Hansotia 2002] employs a splitting criterion that maximizes the difference
between the treatment-control probability differences in the left and right child
nodes. [Rzepakowski 2010] chooses splitting points that maximize the distributional
difference between the two child nodes as measured by weighted Kullback-Leibler
divergence and weighted squared Euclidean distance. [Radcliffe 2011] fits a
linear model to each candidate split and uses the significance of the interaction term
as a measure of split quality. [Guelman 2014a] selects as the splitting variable the one
whose interaction with the treatment has the smallest p-value in a hypothesis test;
the splitting point is then selected to maximize the interaction effect.
[Rzepakowski 2015] uses Bagging or Random Forests on uplift trees
and demonstrates experimentally that this results in significant performance improve-
ment.
However, all the above research can only handle binary treatments, which limits
its application to real-world problems. Literature on the more general multi-
treatment uplift problem is limited. The causal K-nearest neighbors method, originally
intended for a single treatment in [Alemi 2009] and [Su 2012], can be naturally gener-
alized to multiple treatments. [Rzepakowski 2012] extends the tree-based algorithm
described in [Rzepakowski 2010] to the multiple-treatment case by using the weighted
sum of pairwise distributional divergences as the splitting criterion. [Chen] proposes
a multinomial logit formulation in which treatments are incorporated as binary fea-
tures, with interaction terms between treatments and features included explicitly.
Finite-sample convergence guarantees for the model parameters and an out-of-sample
performance guarantee are also presented. Both [Rzepakowski 2012] and [Chen] can
handle both binary and discrete responses. In Table 1.1, we list the existing literature
in this field and classify it by whether the response is binary, discrete, or continuous
and whether there is a single treatment or multiple treatments. This thesis fills the
gap by handling multiple treatments and either discrete or continuous responses.
              Single treatment                        Multiple treatments
              (treatment vs control)

Binary        Chickering (2000), Lo (2002),           Rzepakowski (2012),
Response      Hansotia (2002), Alemi (2009),          Chen et al. (2015)
              Rzepakowski (2010), Radcliffe (2011),
              Su (2012), Zaniewicz (2013),
              Guelman (2014)

Discrete      Alemi (2009), Radcliffe (2011),         Rzepakowski (2012),
Response      Rzepakowski (2012), Su (2012)           Chen et al. (2015)

Continuous    Alemi (2009), Su (2012),                This Thesis
Response      Radcliffe (2011)

Table 1.1: Literature on uplift modeling
One critical challenge in uplift modeling is how to accurately evaluate model
performance offline using experimental data, given the counterfactual nature of the
problem. In the current literature, Qini curves [Radcliffe 2007] and uplift curves
[Rzepakowski 2010] have been used in the binary-treatment case. The uplift curve
measures the gain in cumulative response as a function of the percentage of the
population targeted. This is useful for seeing whether the treatment has a globally
positive or negative effect when targeting part of the population, but it does not
tell us the expected response if treatments are assigned following the uplift model.
Being unable to measure the key quantity that the problem actually aims to maximize
makes it impossible to do parameter tuning and model selection.
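For concreteness, here is a minimal sketch of how an uplift curve of the kind described above can be computed for a binary treatment. The exact normalization varies across papers; this version, an assumption rather than necessarily the convention of [Rzepakowski 2010], estimates the incremental response at each targeted fraction as (treated mean minus control mean) times the number targeted.

```python
import numpy as np

def uplift_curve(score, t, y, n_points=10):
    """One common form of uplift curve for a binary treatment (t in {0, 1}).
    Subjects are ranked by predicted uplift `score` (descending); at each
    targeted fraction we estimate the incremental response among the targeted
    group as (treated mean - control mean) * number targeted."""
    order = np.argsort(-score)
    t, y = t[order], y[order]
    n = len(score)
    fracs, gains = [], []
    for f in np.linspace(0.1, 1.0, n_points):
        m = int(round(f * n))
        treated = y[:m][t[:m] == 1]
        control = y[:m][t[:m] == 0]
        gain = 0.0
        if len(treated) and len(control):
            gain = (treated.mean() - control.mean()) * m
        fracs.append(f)
        gains.append(gain)
    return np.array(fracs), np.array(gains)
```

Note that the curve summarizes the global gain from targeting, which is exactly why it cannot substitute for the expected-response metric introduced in this thesis.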
We face further challenges when encountering data from observational studies.
Cochran defined an observational study as an empirical investigation in which the
"objective is to elucidate cause-and-effect relationships ... [in settings in which]
it is not feasible to use controlled experimentation, in the sense of being able to im-
pose the procedures or treatments whose effects it is desired to discover, or to assign
subjects at random to different procedures" [Cochran 1965]. Lack of randomization
in observational studies may result in large differences in the observed subject charac-
teristics between treatments. These differences can lead to biased estimates of
treatment effects. The observed treatments depend on variables which might also
affect the response, and this results in confounding.
The propensity score, first defined by [Rosenbaum 1983] as the probability of treatment
assignment conditional on subject characteristics, allows one to design and analyze an
observational study so that it mimics a randomized experiment. The propensity score
is a balancing score: among a set of subjects whose propensity scores are the same,
the distribution of observed subject characteristics will be the same across treatment
groups. While in randomized experiments the true propensity score is defined
by the study design, it is not known in observational studies. Logistic regression is
frequently used to estimate it, with the treatment variable as the outcome
and subject characteristics as the predictor variables. Different methods using the
propensity score, like matching, stratification, or regression adjustment, can produce
unbiased estimates of treatment effects [Austin 2011]. However, these methods
can only estimate the average treatment effect over the whole population rather than
an individual-level effect, and they do not attempt to directly model the relationship
between characteristics, treatments, and responses. Given these limitations, we would
like a method that integrates feature balancing and response prediction well and is
able to predict the optimal treatment at the individual level.
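As an illustration of the standard machinery just described (not the method this thesis proposes), the following sketch estimates propensity scores with a hand-rolled logistic regression and uses them in an inverse-propensity-weighted (IPW) estimate of the average treatment effect. The gradient-descent settings and the clipping threshold are arbitrary assumptions.

```python
import numpy as np

def fit_propensity(X, t, lr=0.1, n_iter=2000):
    """Logistic regression for P(T=1 | X), fit by plain gradient descent
    (binary t in {0, 1}); returns the weight vector (last entry = intercept)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ w))
        w -= lr * A.T @ (p - t) / len(t)
    return w

def propensity(w, X):
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return 1.0 / (1.0 + np.exp(-A @ w))

def ipw_ate(X, t, y, w):
    """Inverse-propensity-weighted estimate of the average treatment effect.
    Clipping avoids exploding weights for extreme propensity estimates."""
    e = np.clip(propensity(w, X), 1e-3, 1 - 1e-3)
    return np.mean(t * y / e - (1 - t) * y / (1 - e))
```

As the text notes, such an estimate recovers only the population-average effect; it says nothing about which treatment is best for an individual subject, which is what the method of Chapter 4 targets.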
1.2 Overview
We now describe the contributions of this thesis. First, we put forward an unbiased
estimate of the expected response for an uplift model with data collected from randomized
experiments. This evaluation metric works for an arbitrary number of treatments and
general response types, and does not require the treatment probabilities to be evenly
distributed. This opens up the application of uplift modeling to broader areas. We
further introduce the modified uplift curve, which provides a consistent illustration of
the uplift ability of a model.
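The estimate itself is defined formally in Chapter 2. As a hedged preview of the general idea, an inverse-probability form like the following is unbiased for E[Y | T = h(X)] under randomization with known treatment probabilities; whether this matches the thesis's exact formula is an assumption here.

```python
import numpy as np

def expected_response_estimate(y, t, h_x, p):
    """Estimate of E[Y | T = h(X)] from randomized-experiment logs.
    y   : observed responses
    t   : treatments actually assigned (randomly, with known probabilities)
    h_x : treatments the uplift model h would assign
    p   : p[i] = probability that subject i received treatment t[i]
    Only subjects whose logged treatment matches the model's choice
    contribute, each reweighted by 1/p[i] to remove the selection effect."""
    match = (t == h_x).astype(float)
    return float(np.mean(match * y / p))
```

Because the estimate is a simple average of observed quantities, it supports exactly the parameter tuning and model selection that the uplift curve alone cannot.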
We then propose a tree-construction procedure with a splitting criterion that ex-
plicitly optimizes the performance of the tree as measured on the training data, and
a set of termination conditions designed to avoid some extreme cases. An ensemble
of trees is used to mitigate the overfitting problem that commonly occurs with a
single tree. We refer to our algorithm as the CTS algorithm, where the name stands
for Contextual Treatment Selection. The performance of CTS is tested on three
benchmark data sets. The first is a 50-dimensional synthetic data set. The latter two
are randomized experiment data provided by our industry collaborators. On all of
the data sets, CTS demonstrates superior performance compared to other applicable
methods, which include the Separate Model Approach with Random Forest/Support
Vector Regression/K-Nearest Neighbors/AdaBoost, and Uplift Random Forest
(upliftRF) as implemented in the R uplift package [Guelman 2014].
To obtain the asymptotic properties of the tree-based algorithm, we adopt the idea
of dyadic decision trees, which split only at the midpoint of each dimension. This
family of decision trees can approximate complex decision boundaries while making
the theoretical analysis tractable. With the help of the special structure of dyadic
decision trees, we prove that a simplified algorithm is asymptotically optimal under
mild regularity conditions, with a convergence rate dependent on the data size and
feature dimension.
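A dyadic split of the kind just described is simple enough to sketch in a few lines: each node splits a hyper-rectangular cell exactly at the midpoint of one coordinate, so the set of reachable cells is fixed in advance. This is an illustrative fragment, not the simplified CTS implementation analyzed in Chapter 3.

```python
def dyadic_split(cell, dim):
    """Split a hyper-rectangle at the midpoint of one dimension.
    `cell` is a list of (lo, hi) intervals, one per feature; returns the
    two child cells. Unlike an exhaustive-search split, the split point
    is deterministic, which is what makes the analysis tractable."""
    lo, hi = cell[dim]
    mid = (lo + hi) / 2.0
    left, right = list(cell), list(cell)
    left[dim] = (lo, mid)
    right[dim] = (mid, hi)
    return left, right
```

Restricting splits to this dyadic family is exactly the trade-off discussed above: a fixed, analyzable partition structure in exchange for giving up locally optimal split points.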
In contrast to expensive experimental data, observational data is often available in
much larger quantities, especially in fields such as medicine, economics, and sociology.
In the case of observational studies, we use a modified neural network architecture to
achieve a balanced feature distribution. The new features are fed as input to the ensemble
tree model to predict the optimal treatment. Experimental results on synthetic data
show that the combination of neural network feature representation and tree-based
uplift modeling achieves better performance than state-of-the-art approaches.
Moreover, it shows an advantage on randomized experiment data when the training
dataset is small. This property is particularly helpful at the early stage of randomized
experiments.
The remainder of the thesis is organized as follows. In Chapter 2, we first introduce
the formulation of multi-treatment uplift modeling and present the unbiased esti-
mate of the expected response for an uplift model, then describe the CTS algorithm
in detail. The setup and results of the experimental evaluation are presented
afterwards. The modified uplift curve is also introduced in this chapter. Chapter 3
presents a generic treatment selection algorithm and the analysis of its convergence
properties. We then introduce the algorithm combining feature representation and
CTS for uplift modeling in observational studies in Chapter 4. Finally, we provide
some concluding remarks in Chapter 5.
Chapter 2
Uplift Modeling for Randomized
Experiments
2.1 Problem Formulation and Model Evaluation
We first describe the mathematical formulation of uplift problems and the notation
used throughout this thesis.¹ A new method to evaluate uplift modeling is introduced
afterwards.
2.1.1 Problem Formulation and Notation
During the experiment to collect data, we record the features X ∈ X^d, the assigned
treatment T ∈ {1, ..., K}, and the response Y for each subject. In the feature
vector, we use subscripts to indicate specific features, i.e., X_j is the jth feature and x_j
its realization. For each subject, there are K potential responses, one under each potential
treatment, denoted Y_1, Y_2, ..., Y_K. The larger the value of Y, the
more desirable the outcome. The value of Y is decided by X and T, and we care
about the conditional expectation

    μ(x, t) = E[Y | X = x, T = t]    (2.1)

¹ Upper case letters are used to denote random variables and lower case letters their realizations; boldface for vectors and normal typeface for scalars.
In the price optimization example mentioned in the Introduction, where the company
wants to maximize revenue, X would be the passenger's characteristics, such as
flight date/time, flight fare, the number of people traveling together, etc.; T would be
the different price levels; and Y would be the revenue.
We make the assumption that the treatment assignment T is independent of the
K potential outcomes conditional on X. This property is known as unconfoundedness [?]

    {Y_1, Y_2, ..., Y_K} ⫫ T | X

Unconfoundedness makes the estimation of counterfactual outcomes from nearest-neighbor
matching and other local methods consistent in general.
Suppose we have a dataset S_N = {(x^(i), t^(i), y^(i)), i = 1, ..., N} containing N
joint realizations of (X, T, Y) from a randomized experiment. The superscript (i)
indexes the samples. The objective of uplift modeling is to learn a mapping h from
the feature space to the treatment space, h: X^d → {1, ..., K}, such that the expected
response conditional on this mapping, E[Y | T = h(X)], is maximized.
As with classification and regression problems, the optimal expected response
achievable in an uplift problem is determined by the underlying data generation
process. The optimal treatment assignment rule h* satisfies the point-wise optimality
condition, i.e., for all x ∈ X^d,

    h*(x) ∈ arg max_{t=1,...,K} E[Y | X = x, T = t].    (2.2)

We would like to learn an uplift model with a treatment rule h that is close to h*.
An evaluation metric is a prerequisite to measure the gap between the two and to
guide model training.
2.1.2 Model Evaluation
Evaluation of uplift models is harder than in the more standard supervised machine
learning setting. In uplift modeling, we can only observe the subject response under
one and only one treatment, and we never know the ground-truth optimal treatment
for each subject. The only way to obtain ground truth is to actually run the algorithm
in a live experiment, which is economically inefficient or practically infeasible. Even
when it is possible to apply the model in a live experiment, it would be helpful to
get a sense of approximately how the model will perform before conducting it.
Therefore, we need an evaluation metric that uses offline data collected from a
previous randomized experiment.
Because it is impossible to observe responses under all the treatments for
an individual, it is difficult to define a loss measure for each observation. As a result,
most of the uplift literature resorts to aggregated measures, such as bins of uplifts
in [Ascarza 2016], uplift curves in [Rzepakowski 2012], [Rzepakowski 2015] and
[Nassif 2013], and the Qini measure in [Radcliffe 2007]. The uplift curve was put forward
in the context of a binary treatment, where one sample of the population is selected and
subjected to the treatment, called the treatment set, and another selected sample is not
subjected to the treatment, called the control set. We set aside a test set for the treatment
and control groups and score them using the model. We subtract the response on the
p percent highest-scored subjects in the control set from the response on the p percent
highest-scored subjects in the treatment set as an estimate of the gain from performing
the treatment on the p percent most preferable subjects. The result of such a subtraction,
as p varies, is called an uplift curve, with an example shown in Fig. 2-1. The straight line with a
positive slope represents the beneficial global effect if the whole population is offered the
treatment randomly. In the uplift curve, each point corresponds to the inferred uplift
gain; the higher this value, the better the model. The uplift curve is useful to see whether the
treatment has a globally positive or negative effect, and it helps us choose the percentage
that maximizes the gain as the limit of the population to be targeted. The continuity
of uplift curves makes it possible to calculate the area under the curve as a way
to evaluate and compare different uplift models. This measure is thus similar to the
well-known AUC (area under the ROC curve) in a binary classification setup.
The problem with the uplift curve is that there is no guarantee that the highest-scoring
examples in the treatment and control groups are similar. Moreover, the uplift curve is
Figure 2-1: An example of an uplift curve [Rzepakowski 2012]
not an ideal evaluation metric, as it does not measure the key quantity of interest: the
expected response. Take airline pricing as an example. Suppose that in the training data,
the average revenue for the control group is $4.5 and that for the treatment group is
$5.1. We are not able to tell from the uplift curve what the average revenue will be if
we follow the model and offer each customer the price that the model predicts will
generate higher revenue. This raises the need for an evaluation metric that aligns
directly with the problem objective.
The new evaluation metric is based on the introduction of a new random variable,
which can be calculated from recorded samples. When a randomized experiment is
conducted, treatments are assigned randomly and independently of the features.
Suppose the probability that a subject is assigned to treatment t is p_t. Note that
treatments are not necessarily evenly distributed. This happens, in particular, when
we would like to learn the effect of a treatment over a default one, and set aside a small
portion of the population to be exposed to the new treatment while keeping the majority
unaffected by the experiment. We assume that p_t > 0 for t = 1, ..., K to exclude
trivial cases. See Lemma 1 for the definition and its property.
Lemma 1. Given an uplift model h, define a new random variable

    Z = Σ_{t=1}^{K} (Y / p_t) · 1{h(X) = t} · 1{T = t}    (2.3)

where 1{·} is the 0/1 indicator function. Then

    E[Z] = E[Y | T = h(X)].

Proof. The proof is straightforward using the law of total expectation.

    E[Z] = Σ_{t=1}^{K} E[(Y / p_t) · 1{h(X) = t} | T = t] · P{T = t}
         = Σ_{t=1}^{K} E[Y · 1{h(X) = t} | T = t]
         = Σ_{t=1}^{K} E[Y | h(X) = t, T = t] · P{h(X) = t}
         = E[Y | h(X) = T].
The definition of Z consists of K terms, of which at most one is nonzero, and the
nonzero term can only occur when the assigned treatment matches the predicted
treatment. Eq. (2.4) gives a more explicit expression for the calculation of z from
an observed sample {x, t, y}:

    z = 0          if t ≠ h(x),
    z = y / p_t    if t = h(x).    (2.4)
Lemma 1 establishes the relationship between the new variable and the expected response
our problem aims to optimize. This, together with the fact that the sample
average is an unbiased estimate of the expectation, leads to the following theorem.

Theorem 2. The sample average

    z̄ = (1/N) Σ_{i=1}^{N} z^(i)    (2.5)

is an unbiased estimate of E[Y | T = h(X)].
To our knowledge, this is the first evaluation method applicable to evaluating uplift
models offline. Moreover, we can compute the confidence interval of z̄, which helps
to estimate the plausible range of E[Y | T = h(X)]. With the definition of Z, we are
able to evaluate model performance offline on an existing dataset obtained from a
randomized experiment before applying the model in a live experiment. This guidance
is particularly valuable when live experimentation is costly.
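To make the estimator concrete, here is a minimal sketch in Python (the function name and array-based interface are our own, not from the thesis) that computes z for each sample via Eq. (2.4) and averages them per Eq. (2.5):

```python
import numpy as np

def expected_response_estimate(t, y, predicted, p_assign):
    """Unbiased estimate of E[Y | T = h(X)] (Theorem 2).

    t         : (N,) treatments assigned in the randomized experiment
    y         : (N,) observed responses
    predicted : (N,) treatments h(x^(i)) predicted by the uplift model
    p_assign  : dict mapping treatment t -> assignment probability p_t
    """
    p_t = np.array([p_assign[ti] for ti in t])
    # Eq. (2.4): z = y / p_t when the assignment matches the prediction, else 0
    z = np.where(predicted == t, y / p_t, 0.0)
    return z.mean()  # Eq. (2.5)
```

For example, a model that always predicts treatment 1 is scored by the average of y / p_1 over exactly those subjects that actually received treatment 1.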
Being able to estimate the expected response with training data, we introduce the
modified uplift curve, which provides an operating characteristic of an uplift model.
Given an uplift model, we compute for each subject the difference between the expected
response under the predicted optimal treatment and that under the control, and then
rank the test subjects by this difference from high to low. The top p percent of the
test subjects are assigned to their predicted optimal treatment, and the rest to the
control. The expected response under this new assignment rule is then estimated with
Eq. (2.5). The resulting graph, with the percentage of the population subject to
treatment on the horizontal axis and the expected response on the vertical axis, is
the modified uplift curve.
The reason we introduce such a curve is that in the real world there usually
exists a certain level of risk when subjects are exposed to treatments, such as disruption
to the customer experience, unexpected side effects, etc. As a result, the experimenter
wants to limit the percentage of the population exposed to treatment while still obtaining
as much benefit from personalization as possible. We can select an operating point on
the modified uplift curve to balance the percentage of the population exposed to
treatment against the expected response, according to practical needs.
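The construction above can be sketched as follows, assuming a model interface (our own hypothetical choice) that supplies an (N, K) array of predicted expected responses, one column per treatment:

```python
import numpy as np

def modified_uplift_curve(t, y, scores, p_assign, control=1,
                          percents=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Points of the modified uplift curve.

    scores   : (N, K) predicted expected responses; column k-1 is treatment k
    control  : the treatment used for the untargeted remainder
    p_assign : dict mapping treatment -> assignment probability in the data
    """
    best = scores.argmax(axis=1) + 1
    # rank subjects by predicted gain of the optimal treatment over the control
    gain = scores.max(axis=1) - scores[:, control - 1]
    order = np.argsort(-gain)
    p_t = np.array([p_assign[ti] for ti in t])
    curve = []
    for p in percents:
        n_top = int(round(p * len(y)))
        assign = np.full(len(y), control)
        assign[order[:n_top]] = best[order[:n_top]]  # top p% get the model's choice
        z = np.where(assign == t, y / p_t, 0.0)      # Eq. (2.4)
        curve.append(z.mean())                       # Eq. (2.5)
    return list(percents), curve
```

Each curve point is just the Theorem 2 estimate applied to the mixed assignment rule "predicted treatment for the top p percent, control for the rest".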
2.2 The CTS Algorithm
Decision trees have been used successfully in many diverse areas, such as character
recognition, medical diagnosis, and speech recognition. Their capability to break down
a complex decision-making process into a collection of simpler decisions makes them
powerful in performance and easy to interpret. However, a single tree suffers
from overfitting, as its predictions are highly sensitive to noise in the
training set; Random Forest was put forward to reduce the variance without
increasing the bias [Breiman 2001]. The idea of random forest is that an aggregation of
many different decent-looking trees outperforms a single highly-optimized tree
[Breiman 1984]. It applies the general techniques of bootstrap sampling and feature
bagging to tree learners. An ensemble of trees is built, where each tree is constructed
using a random sample of the training set drawn with replacement, and only a
random subset of features is considered as candidates for the split at each step.
This bootstrapping procedure de-correlates the trees by showing them different training
sets and more balanced selections of different features as splits, which helps reduce
variance and smooths sharp decision boundaries. The success of random forest has
been proven in theory and in practice [Delgado 2014]. We have also seen excellent
performance of tree ensembles in the uplift modeling problem [Rzepakowski 2015].
We follow the tree-ensemble methodology to generate a forest for uplift modeling.
We refer to it as the CTS algorithm, which stands for Contextual Treatment Selection.
With the ability to assign the subspace-wise optimal treatment, CTS approximates
the true point-wise treatment assignment. The two key elements in building a tree are
the splitting criterion and the termination conditions. We introduce below how we
design them specifically for the uplift modeling problem.
A tree grows by recursive binary splitting from the root node, where the root node
represents the feature space X^d and each split creates two subspaces further down. At
a node corresponding to a feature subspace φ, we must decide whether to split
and, if so, how. Without a split, the highest expected response
is obtained by picking the best single treatment for φ:

    μ(φ) = max_{t=1,...,K} E[Y | X ∈ φ, T = t]    (2.6)
The benefit of adding a split s that divides φ into the left child-subspace φ_l and
the right child-subspace φ_r is that we can select the best treatment for each child-subspace
separately, and these may differ. This added flexibility can lead to an increased
expected response for φ overall. The increase from the split is

    Δμ(s) = P{X ∈ φ_l | X ∈ φ} · max_{t_l=1,...,K} E[Y | X ∈ φ_l, T = t_l]
          + P{X ∈ φ_r | X ∈ φ} · max_{t_r=1,...,K} E[Y | X ∈ φ_r, T = t_r]
          − max_{t=1,...,K} E[Y | X ∈ φ, T = t]    (2.7)–(2.8)

The optimal split is selected to maximize this increase Δμ at each step. The question
then comes down to estimating each component of Δμ(s).
The expected response after the split is a sum of the expected responses of the two
subspaces, weighted by the conditional probability of a subject falling into each
subspace given that it is in φ. The weights account for the imbalance between the
numbers of data points falling into the two child-subspaces. To avoid redundant notation,
let φ' stand for either child-subspace, φ_l or φ_r. We use p̂(φ'|φ) to denote the
estimate of the conditional probability that a subject is in φ' given that it is in φ,
and ŷ_t(φ') the estimate of the conditional expected response in subspace φ' under
treatment t.

The conditional probability is estimated using the sample fraction

    p̂(φ'|φ) = Σ_i 1{x^(i) ∈ φ'} / Σ_i 1{x^(i) ∈ φ}    (2.9)
The expected response in a subspace for each treatment is estimated using the
sample average

    ŷ_t(φ') = Σ_i y^(i) · 1{x^(i) ∈ φ'} · 1{t^(i) = t} / Σ_i 1{x^(i) ∈ φ'} · 1{t^(i) = t}    (2.10)

However, this sample-average estimate requires further care. As the tree grows
deeper and deeper, there are fewer and fewer samples in the feature subspace
corresponding to each node, which can lead to larger variance and larger estimation
error. Thus we need remedies to reduce the variance and avoid the influence
of outliers. Let n_t(φ') be the number of samples in φ' with treatment equal to t;
we instead estimate the conditional expectation as follows:

    Q̂_t(φ') = Q̂_t(φ)    if n_t(φ') < min_split,
    Q̂_t(φ') = (Σ_i y^(i) · 1{x^(i) ∈ φ'} · 1{t^(i) = t} + Q̂_t(φ) · n_reg)
             / (Σ_i 1{x^(i) ∈ φ'} · 1{t^(i) = t} + n_reg)    otherwise,    (2.11)

where min_split and n_reg are user-defined parameters.
The modification consists of two aspects. First, min_split is a threshold such that
we inherit the estimate from the parent node if the number of samples for some
treatment falls below it. Second, n_reg is a regularization parameter for the sample
average that prevents outliers from being misleading. The larger the value
of n_reg, the more weight the parent-subspace estimate Q̂_t(φ) carries in shrinking
the child-subspace sample average toward it. Based on our experiments,
it is usually helpful to set n_reg to a small positive integer. Note that Q̂_t(φ') is
defined recursively: the effect of the sample-average estimation is passed down to all
descendant nodes.
Summarizing the above, we estimate the increase in the expected response from
a candidate split s as

    Δμ̂(s) = p̂(φ_l|φ) · max_{t=1,...,K} Q̂_t(φ_l)
           + p̂(φ_r|φ) · max_{t=1,...,K} Q̂_t(φ_r) − max_{t=1,...,K} Q̂_t(φ).    (2.12)
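The estimates in Eqs. (2.9)–(2.12) can be sketched as follows (the helper names are ours; `parent_q` holds the recursively defined parent-node estimates Q̂_t(φ)):

```python
import numpy as np

def child_estimates(y, t, in_child, parent_q, min_split, n_reg, K):
    """Q_t(child) for t = 1..K per Eq. (2.11): inherit the parent estimate
    when a treatment has too few samples, otherwise shrink the child's
    sample average toward the parent estimate with pseudo-count n_reg."""
    q = np.empty(K)
    for k in range(1, K + 1):
        mask = in_child & (t == k)
        n = mask.sum()
        if n < min_split:
            q[k - 1] = parent_q[k - 1]
        else:
            q[k - 1] = (y[mask].sum() + parent_q[k - 1] * n_reg) / (n + n_reg)
    return q

def split_gain(y, t, go_left, parent_q, min_split, n_reg, K):
    """Estimated increase in expected response from a split, Eq. (2.12)."""
    p_left = go_left.mean()  # sample fraction, Eq. (2.9)
    q_l = child_estimates(y, t, go_left, parent_q, min_split, n_reg, K)
    q_r = child_estimates(y, t, ~go_left, parent_q, min_split, n_reg, K)
    return p_left * q_l.max() + (1 - p_left) * q_r.max() - parent_q.max()
```

Among the candidate splits considered at a node, the one with the largest `split_gain` is chosen.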
As for the termination conditions that determine when splitting should stop, we need
to take into consideration the special structure of uplift modeling training data. In CTS,
a node is a terminal node if any of the following conditions is satisfied.
1. There does not exist a split that leads to non-negative gain in the estimated
expected response.
2. The number of samples is less than the user-defined parameter min_split for
all treatments.
3. All the samples in the node have the same response value.
The first condition is similar to that of common decision trees: splitting should not
stop until every possible split would damage the performance of the current tree.
We allow a split with zero gain to be performed because a non-profitable split at the
current step may lead to profitable splits in future steps. The second condition allows
us to split a node as long as at least one treatment contains enough samples.
When the data distribution among treatments is uneven, there may be cases with
enough data samples from one treatment but not enough from another. We should
continue splitting, as the objective function may still increase owing to the treatment
with enough data samples. This condition, together with the formulation of the
conditional response expectation, allows a tree to grow to its full extent while
ensuring reliable estimates of the expected response. The third condition avoids
splitting a pure node whose samples all have the same response. Without this
condition, we would continue splitting the pure node as long as there are enough data
samples, even though doing so brings no actual gain in the objective function.
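The three conditions translate into a compact check (a sketch; the function and argument names are ours, and `best_gain` is the largest estimated gain over the candidate splits at the node):

```python
import numpy as np

def is_terminal(y, t, best_gain, min_split, K):
    """Return True if the node should become a leaf under CTS's rules."""
    if best_gain is None or best_gain < 0:  # condition 1: no split with gain >= 0
        return True
    if max(np.sum(t == k) for k in range(1, K + 1)) < min_split:
        return True                          # condition 2: too few samples for every treatment
    if np.unique(y).size == 1:               # condition 3: pure node
        return True
    return False
```

Note that a gain of exactly zero does not terminate the node, matching the first condition above.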
Combining the splitting criterion and termination conditions to construct each
single tree, we outline the CTS algorithm in Algorithm 1.
2.3 Experimental Evaluation
In this section, we first use a toy example to illustrate the difference between the
separate model approach and our CTS algorithm, which highlights the importance of a
Algorithm 1 CTS - Contextual Treatment Selection
Input: training data S_N, number of samples in each tree B (B < N), number of trees
ntree, number of variables to be considered for a split mtry (1 ≤ mtry ≤ d), the
minimum number of samples required for a split min_split, the regularity factor
n_reg
Training:
For n = 1 : ntree
1. Draw B samples from S_N with replacement to create S_n. Samples are drawn
proportionally from each treatment.
2. Build a tree from S_n. At each node, we draw mtry coordinates at random,
then find the split with the largest increase in expected response among the
mtry coordinates as measured by the splitting criterion defined in Eq. (2.12).
3. The output of each tree is a partition of the feature space as represented
by the terminal nodes and, for each terminal node, the estimate of the
expected response under each treatment.
Prediction: Given a new point in the feature space, the predicted expected response
under a treatment is the average of the predictions from all the trees. The optimal
treatment is the one with the largest predicted expected response.
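Algorithm 1's outer loop and prediction rule can be sketched as below, with `build_tree` as a hypothetical stand-in for the recursive tree construction of Section 2.2 (assumed to return an object whose `predict` method maps an (N, d) feature array to an (N, K) array of per-treatment expected responses):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_cts_forest(x, t, y, build_tree, ntree, B, K, **tree_params):
    """Train ntree trees, each on B bootstrap samples drawn proportionally
    from each treatment (step 1 of Algorithm 1)."""
    forest = []
    for _ in range(ntree):
        idx = np.concatenate([
            rng.choice(np.flatnonzero(t == k),
                       size=max(1, round(B * np.mean(t == k))), replace=True)
            for k in range(1, K + 1)])
        forest.append(build_tree(x[idx], t[idx], y[idx], **tree_params))
    return forest

def predict_optimal_treatment(forest, x_new):
    """Average the per-treatment predictions over all trees, then take the
    treatment with the largest predicted expected response."""
    avg = np.mean([tree.predict(x_new) for tree in forest], axis=0)
    return avg.argmax(axis=1) + 1
```

The mtry feature subsampling and the min_split/n_reg parameters would live inside `build_tree`, which is why they are only passed through here.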
specialized algorithm for the uplift modeling problem. We then compare the experimental
results of the CTS algorithm and other applicable uplift modeling methods on both
synthetic data and real-world data. For synthetic data, we are able to analyze
model performance from a more comprehensive perspective, since we know the true
data model. With the help of our industry collaborators, we test model performance
on two large-scale real-world datasets, which shows how each model works when we
have little knowledge of the underlying data generation model.
2.3.1 Advantage of CTS Over SMA
Suppose we have a 2-D feature space X = (X_1, X_2). Both X_1 and X_2 follow the
uniform distribution over [0, 100]. There are two treatments. The response functions
under the two treatments are defined as below:

    Y ~ U[0, x_1]                          if T = 1,
    Y ~ U[0, x_1] + U[0, x_2]/10 − 2.5     if T = 2.
For this particular data model, the response under treatment 1 is independent of
X_2, while under treatment 2 there is a small uplift effect associated with X_2, which is
neutralized by an overall drop (the −2.5 term). The response under each treatment
is determined to a large extent by feature X_1, the individual-treatment feature. Yet
which treatment is optimal is completely determined by X_2, the cross-treatment
feature. The expected response over the entire population is 25 for both treatments.
However, the expected response under treatment 1 is greater than that under
treatment 2 when x_2 < 50, and the other way around when x_2 > 50. Fig. 2-2 illustrates
the pattern of the optimal treatment. An oracle who knows the data generation
process can employ the optimal treatment rule shown in Fig. 2-2 and obtain the optimal
expected response of 25.625.
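For reference, the data model and the oracle's expected response can be written out as follows (function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_toy(n_per_treatment):
    """Draw samples from the 2-D data model above (n per treatment)."""
    n = 2 * n_per_treatment
    x = rng.uniform(0, 100, size=(n, 2))
    t = np.repeat([1, 2], n_per_treatment)
    y = rng.uniform(0, x[:, 0])                        # U[0, x1] under both arms
    y = np.where(t == 2, y + rng.uniform(0, x[:, 1]) / 10 - 2.5, y)
    return x, t, y

def oracle_expected_response(x):
    """E[Y] under the optimal rule: use treatment 2 exactly when x2 > 50."""
    return x[:, 0] / 2 + np.maximum(0.0, x[:, 1] / 20 - 2.5)
```

Averaging `oracle_expected_response` over the uniform feature distribution recovers 25 + 0.625 = 25.625.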
Figure 2-2: True decision boundary of optimal treatment on the 2-D example
The performance of Random Forest (RF) and CTS is illustrated in Fig. 2-3. The
horizontal axis shows the number of samples per treatment in the training data. The
samples are generated using the data model as described above. We test 6 different
data sizes per treatment, namely 2000, 4000, 8000, 10000, 20000, and 40000. For each
data size, we repeat the following procedure 10 times: generate a training dataset
of the prescribed size; train two models, one with RF and the other with CTS, on the
training set; and compute the expected response after deploying each of the two models,
using the true distribution. The expected response of RF and CTS plotted in Fig. 2-3
is the average over the 10 trials.
[Plot: expected response versus the number of training samples per treatment, with curves for the oracle, CTS, random forest, and the optimal single treatment.]
Figure 2-3: Comparison of performance between different approaches on the 2-D example
As shown in Fig. 2-3, while the performance of both CTS and RF improves
with the size of the training data, CTS outperforms Random Forest at all the training
sizes tested. This demonstrates the benefit of modeling data from all treatments
together rather than separately. When an RF model is built separately for each treatment,
X_2 appears either completely irrelevant (for treatment 1) or insignificant (for treatment
2). However, by considering data from both treatments, CTS is able to capture the
weak yet systematic uplift effect caused by X_2. The difference between these two
modeling approaches becomes more obvious when one looks at the decision boundaries
they generate. Fig. 2-4 illustrates the decision boundaries of the models trained
with RF and CTS when the training size is 40000 per treatment.
Figure 2-4: Comparison of decision boundaries between different approaches on the 2-D example
As we can see from Fig. 2-4, the decision boundary from CTS is much more similar
to the true decision boundary than that from RF. While the RF decision boundary does
not exhibit a clear pattern, the CTS decision boundary suggests a partition between
the upper and lower halfspaces. The numerous vertical lines present
in the RF decision boundary are a sign that the model partitions the feature space more
frequently on X_1 than on X_2, which is precisely what we would expect.
2.3.2 Synthetic Data
The second synthetic dataset we consider lies in a feature space that is a fifty-dimensional
hypercube with side length 10, i.e., X^d = [0, 10]^50. Each feature is uniformly
distributed, i.e., X_j ~ U[0, 10] for j = 1, ..., 50. There are
four different treatments, T = 1, 2, 3, 4. The response under each treatment is defined
as below:

    Y = f(X) + U[0, αX_1] + ε    if T = 1,
    Y = f(X) + U[0, αX_2] + ε    if T = 2,
    Y = f(X) + U[0, αX_3] + ε    if T = 3,
    Y = f(X) + U[0, αX_4] + ε    if T = 4.
The response consists of three components. The first term f(X) is the systematic
dependence on the feature vector and is identical across all treatments. We choose f to
be a mixture of 50 exponential functions to reflect complex real-world scenarios:

    f(X_1, ..., X_50) = Σ_{i=1}^{50} a_i exp{−b_{i1}|X_1 − c_{i1}| − ... − b_{i,50}|X_50 − c_{i,50}|},    (2.13)

where the a_i, b_ij, and c_ij are chosen randomly.
The second term U[0, αX_t] represents the treatment effect and differs across
treatments. We set α to 0.4, which is roughly 5% of E[|f(X)|], so that the treatment
effect is an order of magnitude smaller than the main effect. The third term ε is
zero-mean Gaussian noise, i.e., ε ~ N(0, σ²); σ is set to 0.8, which is twice the
amplitude of the treatment effect.

Under this particular data model, the expected response is the same for all
treatments, i.e., E[Y | T = t] = 5.18 for t = 1, 2, 3, 4. The expected response under the
optimal treatment rule, E[Y | T = h*(X)], is 5.79.²
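A sketch of this generator is below. The coefficient ranges are our own illustrative stand-ins; the exact fixed values used in the thesis are in the linked dataset (footnote 2), so the constants 5.18 and 5.79 above will not be reproduced by these stand-in coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

D, ALPHA, SIGMA = 50, 0.4, 0.8
# stand-in coefficients: the thesis fixes randomly chosen values (footnote 2);
# the ranges below are our own illustrative choices
a = rng.uniform(0, 1, size=D)
b = rng.uniform(0, 0.1, size=(D, D))
c = rng.uniform(0, 10, size=(D, D))

def f(x):
    """Mixture of 50 exponentials, Eq. (2.13); x has shape (N, 50)."""
    # exponent of term i for sample n: -sum_j b[i, j] * |x[n, j] - c[i, j]|
    expo = -np.einsum('ij,nij->ni', b, np.abs(x[:, None, :] - c[None, :, :]))
    return np.exp(expo) @ a

def sample_response(x, t):
    """Y = f(X) + U[0, ALPHA * X_t] + Gaussian noise, per the data model above."""
    x_t = x[np.arange(len(t)), t - 1]  # treatment-specific feature X_t
    return f(x) + rng.uniform(0, ALPHA * x_t) + rng.normal(0, SIGMA, size=len(t))
```

The symmetric construction makes every treatment equally good on average while leaving a small, feature-dependent signal for an uplift model to find.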
This is a multi-treatment uplift problem with continuous response, so five methods are
applicable: CTS, and the Separate Model Approach with Random Forest (SMA-RF),
K-Nearest Neighbors (SMA-KNN), Support Vector Regression with a radial basis
kernel (SMA-SVR), or AdaBoost (SMA-Ada). We developed an R package for CTS,
and use the scikit-learn implementations for the other algorithms. The performance of

² Exact values of the data model parameters and the datasets can be found at this Dropbox link: https://www.dropbox.com/sh/sf7nu2uw8tcwreu/AAAhqQnaUpR5vCfxSsYsM4Tda?dl=0
[Plot: averaged expected response versus training data size, with curves for Optimal, Single Trt, SMA-Ada, SMA-RF, SMA-KNN, SMA-SVR, and CTS.]
Figure 2-5: Averaged expected response of different algorithms on the synthetic data. The error bars are 95% margins of error computed with results from 10 training datasets. For each data size, all algorithms are trained with the same 10 datasets.
these 5 methods is compared across different training data sizes. We train the models
on training data with 500, 2000, 4000, 8000, 16000, and 32000 samples per treatment,
and evaluate each model using Monte Carlo simulation on the true data model. This
is done 10 times independently to compute the average result and the margin of error.
To ensure a consistent comparison, all the methods are trained on the
same 10 datasets for each data size. All models are tuned carefully with validation
or cross-validation; details of the parameter selection procedure for each
algorithm can be found in Appendix A.
We plot the averaged expected response of the 5 methods in Fig. 2-5, with the
vertical bars showing the 95% margin of error computed from the 10 independent
trials. The single-treatment expected response (short-dashed line without markers)
and the optimal expected response (long-dashed line without markers) are included as
a reference.
Fig. 2-5 shows that with 2000 samples per treatment, the performance of CTS
already exceeds that of the separate model approach, and the advantage continues to
grow as the training size increases. CTS achieves almost oracle performance with 32000
samples per treatment. Note the significant difference between CTS and Random
Forest: they are both tree-based algorithms, with the only essential difference lying in
the splitting criterion and termination conditions, yet their results are far from
similar. Among the algorithms for the separate model approach, the support vector
regressor performs best. This is probably because the underlying data model
is a mixture of exponentials, which is consistent with the assumption of SVR with
radial basis kernels. In contrast, the other separate model approach algorithms work
only slightly better than assigning a fixed treatment, even at the largest training
size. Thus the separate model approach can only achieve uplift if the model trained
for each treatment is accurate enough, while model mis-specification will seriously
degrade the performance.
2.3.3 Priority Boarding Data
We present our work with a European airline as an example application of randomized
experiments and uplift modeling to the online pricing problem. Airlines led
the business world in developing variable pricing of airfares, embracing the
truism that all customers are different and have different needs. Our collaborating
airline changed the price of priority boarding between €5 and €7 at some point and
provided us a dataset containing passengers' purchase records under the two price
levels. Our initial analysis found that there indeed exists customer
heterogeneity in price sensitivity. Fig. 2-6 shows the change in revenue per passenger
with respect to passenger features. For illustration purposes, we show only two
features: the number of days between booking and departure, and the flight fare. The
red areas represent customers preferring the high price, green areas represent customers
preferring the low price, and yellow areas represent customers neutral between the two
prices. This confirms that the purchasing behavior of passengers varies significantly
and that it can be beneficial to customize price offerings based on certain characteristics.
Figure 2-6: Change in revenue per passenger with respect to passenger features
We derive a total of 9 features from the raw transaction records: the origin station,
the origin-destination pair, the departure weekday, the arrival weekday, the number
of days between flight booking and departure, the flight fare, the flight fare per passenger,
the flight fare per passenger per mile, and the group size. The data is randomly split into
a training set (225,000 samples per treatment) and a test set (75,000 samples per
treatment). We test six methods: the separate model approach with Random Forest
(SMA-RF), Support Vector Machine (SMA-SVM), AdaBoost (SMA-Ada), and K-Nearest
Neighbors (SMA-KNN), as well as the uplift Random Forest method implemented in
[Guelman 2014], and CTS. In this dataset, the response is discrete: €5/€7
if there is a purchase and 0 if not. Thus we model it as a binary classification problem
for the first 5 methods: 1 for purchase and 0 for non-purchase. The expected revenue
is then the product of the purchase probability and the corresponding price. As CTS can
handle discrete responses, we use the revenue as the response directly. Cross-validation
Figure 2-7: Expected revenue per passenger from priority boarding based on different models.
is conducted to select the optimal parameters. See the Appendix for details on parameter
tuning.

The expected revenue per passenger on the test set is plotted in Fig. 2-7. If only
one price is offered to the whole population, the overall revenue per passenger is the same:
€0.42. The figure shows that customized pricing models can significantly increase
the revenue, from €0.42 to €0.52. This roughly 24% uplift could lead to a remarkable gain in
profit for an airline with tens of millions of scheduled passengers per year. We again
see the importance of specialized algorithms for uplift modeling from this test case.
We compare the modified uplift curves for the 6 methods in Fig. 2-8. CTS performs
the best at all operating points of the population percentage. The upliftRF algorithm
ranks next and outperforms the separate model approach. It is worth noting the
performance of SMA-RF: the sharp rise at the beginning of the curve and the sharp
Figure 2-8: Modified uplift curves of different algorithms for the priority boarding data.
decline at the end show that it is very accurate at identifying subpopulations for
which the treatment is highly beneficial or extremely harmful, while the (almost)
straight line in the middle reflects its failure to predict the treatment effect for
the vast majority. Consistent with their poor performance in expected revenue, SMA-SVM
and SMA-KNN rank lowest. This is probably because the distance calculation cannot
handle categorical variables well, and because of the mis-specification of the SVM.
Though revenue maximization is the most studied topic in dynamic pricing, we
also care about how the pricing policy affects users. A user has an objective
measure of her well-being when she takes an action, and this measure is called the utility
level in economics. The standard way to view utility is as representing an individual's
choices: a user's actions reveal her preferences, and utility is a mathematical
construction for modeling choice and preference [Wong 1978]. In the dynamic pricing
problem, when a user makes a purchase at an offered price, she is willing to pay
that price to get the priority boarding service and thus receives positive utility;
on the contrary, she receives zero utility if she does not make a purchase. To consider
the benefit of both the seller (the airline) and the buyer (the passenger), we take the
sum of revenue and utility as the response.
Learning the utility function usually requires empirical behavioral studies, which
are not the topic of this thesis; we therefore assume that a user gets utility 5 or 2.5
if she makes a purchase at an offered price of 5 or 7, respectively, and gets utility
0 if she does not make a purchase. Note that the airline is the decision-maker and
decides which price is offered to which user; the user only has the right to decide
whether she will accept the offered price, not to choose between the two price
levels. Under this formulation, we follow the previous approach and train
different models: SMA-RF, upliftRF, and CTS. The expected summed response per
passenger on the test set is shown in Fig. 2-9. Not surprisingly, CTS outperforms
the other customized pricing policies and the single pricing policies. We take a
further look at what revenue and user utility look like separately under these
pricing policies in Fig. 2-10. We can see that CTS achieves the best results in both
revenue and user utility, even though we have not optimized them separately during
model training. This example shows that customized pricing is a win-win choice for
both seller and buyer: it captures the market's consumer surplus and transfers this
surplus to the seller, while with careful modeling it does not hurt users' interests.
2.3.4 Seat Reservation Data
Following the success of customized pricing for priority boarding, we started a
collaboration with another airline company and applied our model to seat reservation
pricing.
We conducted a well-designed randomized experiment with four price levels -
low (L), medium low (ML), medium high (MH) and high (H). The response is the
revenue from each transaction. We derive features from the raw transaction records,
including the booking hour, the booking weekday, the travel hour, the travel weekday,
Figure 2-9: Expected summed response per passenger from priority boarding with theconsideration of user utility.
Figure 2-10: Expected response per passenger from priority boarding with the considera-tion of user utility.
the number of days between the last login and the next flight, the fare class, the
zone code (all flight routes are divided into 3 zones, and prices are set for
different zones), whether the passenger returns to the website after the flight
ticket purchase, the journey travel time, the segment travel time, the number of
passengers, the quantity available, etc.
The experiment collected a dataset with 213,488, 176,637, 160,576 and 214,515
samples for the four treatments, respectively. We use 50% for training, 30% for
validation, and 20% for testing. During the experiment, prices on different routes
differed under the same level for practical reasons, and one transaction may include
the purchase of multiple seats, so we need to model the response as a continuous
variable. Thus uplift random forest is not applicable, as it cannot handle continuous
responses. Among the separate model approaches, we only test SMA-RF because it far
surpassed the other models in the previous example on the priority boarding data.
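A per-treatment 50/30/20 split like the one described above can be sketched as follows; the function and variable names are illustrative, not the thesis code.

```python
# Split (x, t, y) samples into train/validation/test sets, drawing the
# given fractions separately within each treatment group so that all three
# sets preserve the treatment proportions.
import random
from collections import defaultdict

def split_by_treatment(data, fractions=(0.5, 0.3, 0.2), seed=0):
    rng = random.Random(seed)
    by_treatment = defaultdict(list)
    for sample in data:
        by_treatment[sample[1]].append(sample)  # group on the treatment label t
    train, val, test = [], [], []
    for samples in by_treatment.values():
        rng.shuffle(samples)
        n = len(samples)
        a = int(fractions[0] * n)
        b = int((fractions[0] + fractions[1]) * n)
        train += samples[:a]
        val += samples[a:b]
        test += samples[b:]
    return train, val, test
```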
We compare the average revenue per customer under different pricing models in
Fig. 2-11. By offering a single price in the entire market, the maximum expected
revenue is $1.87 per passenger, achieved with price level H. With personalized
treatment assignment, the expected revenue increases to $2.37 with SMA-RF and $3.35
with CTS. Our specialized algorithm for uplift modeling improves the airline's
revenue by as much as 80%. The modified uplift curves of SMA-RF and CTS are shown
in Fig. 2-12. We can see that CTS outperforms SMA-RF at every operating point of
population percentage.
Figure 2-11: Expected revenue per passenger from seat reservation when applying differentpricing models.
Figure 2-12: Modified uplift curves of SMA-RF and CTS on the seat reservation data.
2.4 Summary
In this chapter, we present a complete framework for the uplift modeling problem.
In Section 2.1.1, we first describe the mathematical formulation of uplift modeling
problems. The definition of an objective function points out how to design uplift
modeling algorithms that align with the problem objective. We then put forward an
evaluation metric that can measure uplift algorithms offline with randomized
experiment data in Section 2.1.2. With this evaluation metric, in Section 2.2, we
design an ensemble tree-based algorithm with a specialized splitting criterion that
maximizes the quantity of interest directly, and with termination conditions suited
to the special structure of the uplift modeling problem and its training data.
Experimental results on synthetic data and real-world data in Section 2.3 show that
our algorithm outperforms the separate model approach and existing uplift modeling
algorithms.
In Section 2.3.3, we applied our algorithm to a dynamic pricing problem, where it
achieved superior performance, not only increasing the airline's revenue but also
preserving users' utility. This is particularly instructive for the airline industry.
Since deregulation in 1978, airlines have had the freedom to set prices themselves,
and the industry has become more and more competitive. High fixed costs, driven by
expensive jets, large labor forces, and skyrocketing fuel costs, have caused many
legacy airlines to struggle, some even filing for bankruptcy at one point. However,
closing down a large unprofitable airline would be a politically unpalatable
decision, as various stakeholders cannot afford it: it involves the loss of thousands
of jobs, inconvenience to hundreds of thousands of travelers, and millions in losses
for the airline's creditors [Cornia 2011]. Instead of cut-throat pricing in the
entire market, customized pricing is a more sustainable strategy. By charging higher
prices to customers with lower price elasticity, or equivalently a higher
willingness-to-pay, airlines can increase the average markup of prices and thus
increase their profits. Our algorithm provides a systematic way to identify and
separate customers with different price elasticities, though further work is needed
to accurately evaluate user utility, and how to combine user utility with revenue
deserves more discussion.
Chapter 3
Theoretical Analysis
3.1 Introduction
In addition to the performance comparison on different datasets, we would like to
further analyze the algorithm theoretically. First, a consistency analysis gives a
basic theoretical guarantee for the algorithm. Moreover, in the general context of
the treatment selection problem, we are interested in how the algorithm performs
compared to an oracle that knows the true optimal treatment for every subject. The
gap between the expected response achieved by the oracle and the one achieved by a
treatment assignment model without knowledge of the response function, referred to
as regret, has been studied extensively in the multi-armed bandit (MAB) problem.
3.1.1 Multi-Armed Bandit Problem
A multi-armed bandit problem can be considered an online version of the treatment
selection problem. With subjects arriving sequentially, the decision maker chooses
one of the possible actions (treatments) and observes its response at each time
step. The goal is to maximize the sum of responses over all time steps. With unknown
response functions for the different treatments, the decision maker faces an
exploration-exploitation tradeoff: subjects are split across treatments according
to the designed policy at the beginning (exploration phase), and the currently best
performing treatment is then selected for the rest of the horizon once the decision
maker is confident of its performance (exploitation phase).
In context-free multi-armed bandit problems, the first bounds on the regret for
the model with finite action space were obtained in the classic paper [Lai 1985],
which introduced the technique of upper confidence bounds for the asymptotic
analysis of regret. Regret-optimal algorithms for non-stochastic bandit problems
are provided in [Auer 2003]. Continuous treatment spaces and response functions
satisfying a Lipschitz condition were studied in [Kleinberg 2005a] [Kleinberg 2005b].
In recent years, the multi-armed bandit problem with side information has attracted
more interest. [Wang 2005a] introduced the contextual stochastic bandit problem. It
was generalized to the multivariate linear case by [Rusmevichientong 2010], and to
the multivariate and nonparametric case by [Perchet 2013]. [Alina 2011] present an
algorithm with regret guarantees comparable to those in the standard supervised
learning setting.
Though the multi-armed bandit approach can achieve near-optimal performance in
theory, it suffers from the uncertainty of the treatment allocation process, and it
cannot be evaluated offline because of the interactive nature of the problem.
Meanwhile, advances in operations management enable companies to quickly collect
and process large amounts of data, which provides abundant resources for offline
algorithms like CTS.
3.1.2 Tree-Based Algorithm
We have seen in Chapter 2 that tree-based algorithms outperform other
parametric/nonparametric methods, and tree-based models have also proven to be
reliable predictive algorithms in many application areas. However, the statistical
properties of tree-based models are not yet fully understood. [Wager 2017] develops
a causal forest for estimating heterogeneous treatment effects, which is pointwise
consistent for the true treatment effect and has an asymptotically Gaussian and
centered sampling distribution. This is the first set of results on random forests
in the context of personalized treatment selection. The results in [Wager 2017],
though limited to a single treatment (treatment vs. control), show that tree-based
models are substantially more powerful than classical methods based on
nearest-neighbor matching, not only in experimental performance but also in
theoretical properties.
In contrast to the most popular classification and regression trees (CART), which
grow a full tree by orthogonally splitting the axes at locally optimal splitting
points, there is a newer family of decision trees, dyadic decision trees (DDT),
which attain nearly optimal (in a minimax sense) rates of convergence for a broad
range of classification problems [Gey 2015]. DDTs are restricted to axis-orthogonal
dyadic splits, i.e., each dimension can only be split at its midpoint. DDTs can
approximate complex decision boundaries, and the restriction to dyadic splits makes
it possible to globally optimize a complexity-penalized empirical risk criterion
[Scott 2005]. We apply dyadic splits to our CTS algorithm and prove that this
modified algorithm, able to handle multiple treatments, is asymptotically optimal
under mild regularity conditions.
3.2 Consistency Analysis
An unbiased contextual treatment selection (UCTS) algorithm is put forward in
[Zhao 2017] and proven to be consistent under some mild assumptions. In the CTS
algorithm, an exhaustive search is conducted to find the best splitting point at
each step. This approach is locally optimal and prone to bias from outliers. The
issue is exaggerated in our CTS algorithm, since the increase in expected response
is calculated using data samples from all treatments, so the selection of the
splitting point can be affected by outliers from any treatment. The effects of
extreme values accumulate as successive splits group similar extreme values
together, and thus more and more bias is introduced.
In order to reduce the bias, UCTS uses different data sets for model training and
response prediction. It randomly splits the training set S^N into the approximation
set S^A and the estimation set S^E. The sizes of the approximation set and the
estimation set are determined by a user-defined parameter rho ∈ (0, 1): for each
treatment, a fraction rho of the examples in S^N is sampled to build S^A, and the
rest belongs to S^E.
Algorithm 2 Unbiased Contextual Treatment Selection (UCTS)
Input: training data S^N, fraction of data used for partition generation rho, number
of trees ntree, number of variables to be considered for a split mtry (1 ≤ mtry ≤ d),
the minimum number of samples required for a split min_split, the regularity
factor n_reg, the tree balance factor alpha
Training:
For n = 1 : ntree
1. Draw round(rho × N) samples from S^N to create the approximation set S^A.
Samples are drawn proportionally from each treatment. The estimation set
S^E = S^N \ S^A.
2. Build a tree from S^A. At each step of the growing process, one coordinate
is drawn at random with probability pi, or mtry coordinates are drawn at
random with probability 1 − pi. We perform the split with the largest increase
in expected response among all the alpha-regular splits on the selected
coordinate or coordinates. The output of this step is the set of nodes Φ of
the tree.
3. With S^E we estimate the conditional expectation under each treatment for
all the nodes in Φ as described in Section 3.1.3.
Prediction: Given a test point, the predicted expected response under a treatment
is the average of the predictions from all the trees. The optimal treatment is the
one with the largest predicted expected response.
Each tree is built using the splitting criteria and termination conditions
described in Section 2.2 with the data in S^A. Let φ be a node in the tree, and
denote the data samples in S^E that fall into φ with treatment t as S^E(φ, t). If
S^E(φ, t) is not empty, then the estimate of the conditional expected response in φ
under treatment t is the sample average; otherwise, the estimate is inherited from
the parent node. With the estimates for the root node calculated initially, we can
obtain the estimates for all nodes, either using sample averages or by traversing
back level by level. The UCTS algorithm is outlined in Algorithm 2.
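The estimation step described above can be sketched as follows, assuming each node's samples from S^E are available as (x, t, y) triples; the function and variable names are illustrative, not the thesis code.

```python
# Per-node estimation on the estimation set S^E: use the sample average of
# the responses under treatment t, and fall back to the parent node's
# estimate when the node has no S^E samples with that treatment.
def estimate_node(samples, treatments, parent_estimates=None):
    """samples: (x, t, y) triples from S^E falling into this node."""
    estimates = {}
    for t in treatments:
        ys = [y for _, ti, y in samples if ti == t]
        if ys:
            estimates[t] = sum(ys) / len(ys)      # sample average on S^E
        elif parent_estimates is not None:
            estimates[t] = parent_estimates[t]    # inherit from the parent node
        else:
            estimates[t] = 0.0                    # root node with no data (edge case)
    return estimates
```

Computing the root estimates first and then descending level by level reproduces the traversal described in the text.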
To recall the problem formulation in Section 2.1.1, in uplift modeling the quantity
of interest is the conditional expectation for an observed subject x and an assigned
treatment t,
μ(x, t) = E[Y | X = x, T = t].   (3.1)
Suppose we have a dataset S^N = {(x^(i), t^(i), y^(i)), i = 1, ..., N} containing N
joint realizations of (X, T, Y) from a randomized experiment. The objective of
uplift modeling is to learn a mapping h_N from the feature space to the treatment
space, X^d → {1, ..., K}, such that the expected response conditional on this
mapping, E[Y | T = h(X)], is maximized. The optimal treatment assignment rule h*
satisfies the point-wise optimality condition, i.e., ∀x ∈ X^d,
h*(x) ∈ arg max_{t=1,...,K} E[Y | X = x, T = t].   (3.2)
Now we define the consistency of an uplift algorithm as follows.
Definition 3.2.1. An uplift algorithm is L2 consistent if
E{μ[X, h*(X)] − μ[X, h_N(X)]}² → 0 as N → ∞,   (3.3)
where the expectation is taken over both the test example X and the training set S^N.
The consistency result is derived from the following assumptions.
1. Features are uniformly distributed in the d-dimensional unit hypercube, i.e.,
X ∼ U[0, 1]^d.
2. The response is bounded, i.e., there exists C_Y > 0 such that |Y| ≤ C_Y.
3. The conditional expectation function μ(x, t) satisfies a Lipschitz condition,
i.e., there exists a constant C_L > 0 such that ∀x_1, x_2, ∀t ∈ {1, ..., K},
|μ(x_1, t) − μ(x_2, t)| ≤ C_L ‖x_1 − x_2‖.   (3.4)
4. The parameters alpha and min_split are chosen properly with N such that
lim_{N→∞} k/N = 0 and lim_{N→∞} (ln N)/k = 0, because a UCTS tree is an
(α, k, l)-regular tree¹ with α = alpha, l = min_split and k = alpha · min_split.
If the above assumptions are satisfied, then a treatment selection rule h_N
constructed by the UCTS algorithm is L2 consistent. A detailed proof can be found
in [Zhao 2017]. The intuition is straightforward: we can tune the parameter
min_split so that the dimension of a leaf node vanishes, and with it the
within-node variance of the response; and the leaf node response can be estimated
to arbitrary accuracy if k → ∞ as N → ∞.
We take a further look at the difference between CTS and UCTS on a simple
2-dimensional example. The first feature X_1 is continuous and uniformly distributed
between 0 and 100, i.e., X_1 ∼ U[0, 100]. The second feature X_2 is discrete with
values A, B, C, each occurring with probability 1/3. There are two treatments, and
the response under each treatment is defined as follows:
If T = 1, Y ∼ U[0, X_1].
If T = 2, Y ∼ 0.8 × U[0, X_1] + 5 if X_2 = B, and Y ∼ 1.2 × U[0, X_1] − 5 if
X_2 = A or C.
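The data model above can be simulated directly. This is a hedged sketch (with a hypothetical 50/50 randomized treatment assignment, which the text does not specify):

```python
# Generator for the 2-D example: X1 ~ U[0, 100], X2 uniform on {A, B, C};
# Y ~ U[0, X1] under treatment 1, and a scaled/shifted U[0, X1] draw under
# treatment 2 depending on X2.
import random

def draw_sample(rng):
    x1 = rng.uniform(0, 100)
    x2 = rng.choice("ABC")
    t = rng.choice([1, 2])            # assumed randomized assignment
    base = rng.uniform(0, x1)         # a draw from U[0, X1]
    if t == 1:
        y = base
    elif x2 == "B":
        y = 0.8 * base + 5
    else:                             # x2 is A or C
        y = 1.2 * base - 5
    return (x1, x2), t, y
```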
The optimal treatment rule for this data model is plotted in Fig. 3-1. The vertical
boundary in the middle is located at X_1 = 50. We use different ranges along the
X_2 axis to represent its different values for illustration purposes. In Fig. 3-1 we
see a sharp change around X_1 = 50; it is worth noting that this is because the
discrete values of X_2 are plotted as if continuous, while the actual difference
between treatments changes smoothly with X_1 for each value of X_2. It is therefore
more challenging to identify the correct treatment around X_1 = 50. In addition,
the variance of the response grows quadratically with X_1, so the two algorithms'
abilities to identify the optimal treatment should differ more when X_1 is large,
because CTS is more susceptible to extreme values than UCTS.
We compare the decision boundary reconstructed by the two algorithms on three
¹An uplift tree is (α, k, l)-regular for some 0 < α ≤ 0.5 if each split leaves at least a fraction α of the available training examples on each side of the split and each leaf node contains at least k training examples for some k ∈ ℕ. In each leaf node, there are at most l training examples for each treatment, with l ∈ ℕ and l ≥ 2k.
Figure 3-1: The optimal treatment rule for the 2D example. The vertical boundary along the X_1 axis is located at X_1 = 50. The vertical axis X_2 is a discrete variable.
different training sets in Fig. 3-2. It shows that the decision boundary generated
by UCTS is much smoother than that generated by CTS for all training sets. This
phenomenon is more pronounced on the right side of each plot, where the variance of
the response is higher and there are more extreme values, which is in line with our
expectation.
Having seen this smoother behavior, we would like to know how UCTS compares in
prediction performance. We train UCTS models and CTS models on 50 training sets
and calculate the expected response under the true data model. The average expected
response and the 95% confidence interval are plotted in Fig. 3-3. It shows that
UCTS is fully competitive with CTS, and does not sacrifice performance for
smoothness.
3.3 Asymptotic Properties
Here we prove the convergence rate of a simplified version of the CTS algorithm.
Algorithm CTS.0, as outlined below, partitions the feature space in a deterministic
and mostly uniform manner, which eases the theoretical analysis of its convergence
Figure 3-2: The treatment rules reconstructed by UCTS and CTS for the 2D example. Plots in each row are generated from the same training set. The horizontal axis is feature X_1 and the vertical axis feature X_2. The labels and ticks of the axes are the same as in Fig. 3-1 and omitted for simplicity.
Figure 3-3: Average expected response under UCTS and CTS models for the 2D examplecomputed from 50 training sets with 95% confidence interval shown as the error bar.
property, while also providing some guidance on CTS.
If we knew the conditional expectation E[Y | X = x, T = k] = μ(x, k) for every
point x in the feature space and every treatment, we could partition the feature
space X^d into K subspaces Φ_1, ..., Φ_K such that
∀x ∈ Φ_k, k ∈ arg max_{k'=1,...,K} μ(x, k').
It is easy to see that we could achieve the oracle performance by simply assigning
Φ_k to treatment k for k = 1, ..., K. The challenge is that the true boundaries of
Φ_1, ..., Φ_K can be arbitrarily complicated for practical problems. What is needed
is a tractable way to approximate these partitions. One option is discretization:
we can repeatedly split the feature space with coordinate-perpendicular cuts until
the entire feature space is divided into a set of small "boxes", and assign each
box to the empirically best treatment. Intuitively, the smaller the boxes are, the
better they approximate the true decision boundary. However, since we need to infer
the optimal treatment from samples, and smaller boxes mean fewer samples in each box, we
want to restrict the size of the boxes so that most boxes contain sufficient
samples to make correct assignments. Fig. 3-4 shows how discretization error arises
from coordinate-perpendicular cuts. Understanding the tradeoff between the error
from discretization and the error from misassignment gives us insight into how to
design an algorithm with small overall error. This is precisely the goal of this
section.
Figure 3-4: Illustration of discretization error in 2-D space. The diagonal curve represents the true decision boundary, with the bottom-left part assigned to treatment yellow and the top-right part assigned to treatment blue. The discretization error depends on the diameter of the box; in boxes that do not intersect the boundary, the discretization error is 0.
Section 3.3.1 describes a simple algorithm based on discretization, named CTS.0.
In Section 3.3.2 we derive upper bounds for both the discretization error and the
misassignment error of CTS.0. We prove that, under mild regularity conditions, the
expected difference in performance between CTS.0 and an optimal treatment rule h*
approaches 0 as the number of samples goes to infinity, and that the rate of
convergence is bounded above by O(N^{-1/(2(d+1))}), where N is the number of
samples and d the
Figure 3-5: An example of feature space partition and treatment assignment for Algorithm CTS.0 when X^d = [0, 1]² and M = 10. Subspace φ_1 is assigned to treatment yellow and subspace φ_2 to an arbitrary treatment.
dimension of the feature space.
3.3.1 Algorithm CTS.0
We first define some notation to facilitate the description of the algorithm. Given
a data set S^N and a subspace φ ⊆ X^d, where X^d is a bounded subspace of R^d,
define
n̂(φ, k) = Σ_{i=1}^{N} 1{x^(i) ∈ φ, t^(i) = k},   (3.5)
ŷ(φ, k) = Σ_{i=1}^{N} y^(i) 1{x^(i) ∈ φ, t^(i) = k}.   (3.6)
If n̂(φ, k) > 0, define
ȳ(φ, k) = ŷ(φ, k) / n̂(φ, k).   (3.7)
In other words, n̂(φ, k) is the number of samples in subspace φ with treatment equal
to k, and ȳ(φ, k) is the average response of these samples.
CTS.0 is outlined in Algorithm 3. It contains two phases. Phase 1 divides the
feature space into M subspaces through repeated partitioning at the midpoints.
Phase 2 assigns each subspace to an empirically best treatment based on the training
data. Fig. 3-5 shows an example of the two phases for M = 10 in a 2D feature space.
Algorithm 3 CTS.0
Input: feature space X^d, number of partitions M, data S^N
Phase 1 - Partitioning the feature space
1: Φ = {X^d}
2: dim = 1
3: for m = 1 : ⌊log₂ M⌋ do
4:   for every element of Φ do
5:     replace it with the two subspaces created by splitting it equally at the dim-th dimension
6:   end for
7:   dim = (dim mod d) + 1
8: end for
9: dim = (dim mod d) + 1
10: for j = 1 : (M − 2^⌊log₂ M⌋) do
11:   replace the j-th element in Φ with the two subspaces created by splitting it equally at the dim-th dimension
12: end for
Phase 2 - Assigning partitions to treatments
13: for m = 1 : M do
14:   if n̂(φ_m, k) > 0 for k = 1, ..., K then
15:     k#_m = arg max_{k=1,...,K} {ȳ(φ_m, k)}, breaking ties randomly if the optimal solution is not unique
16:   else
17:     k#_m = 0
18:   end if
19: end for
Output: The treatment rule h#: ∀x ∈ φ_m, m = 1, ..., M,
h#(x) = k#_m if k#_m ≠ 0, and h#(x) = 1 if k#_m = 0.
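A minimal sketch of the two phases on the unit hypercube (helper names are illustrative, not the thesis code): Phase 1 halves boxes cyclically over the dimensions until exactly M boxes exist, and Phase 2 picks the empirically best arm per box, returning 0 when some arm has no samples. Unlike the algorithm, ties are broken deterministically here.

```python
# Phase 1: dyadic partition of [0,1]^d into exactly M axis-aligned boxes.
import math

def _halve(box, dim):
    lo, hi = box[dim]
    mid = (lo + hi) / 2
    left, right = list(box), list(box)
    left[dim] = (lo, mid)
    right[dim] = (mid, hi)
    return [left, right]

def partition(d, M):
    boxes = [[(0.0, 1.0)] * d]
    dim = 0
    for _ in range(int(math.log2(M))):          # full rounds of halving
        boxes = [half for box in boxes for half in _halve(box, dim)]
        dim = (dim + 1) % d
    extra = M - 2 ** int(math.log2(M))          # remaining splits to reach M
    boxes = [half for box in boxes[:extra]
             for half in _halve(box, dim)] + boxes[extra:]
    return boxes

# Phase 2 for a single box: empirical best arm, or 0 if an arm is unobserved.
def best_treatment(samples, K):
    """samples: (t, y) pairs falling inside the box."""
    totals = {k: [0.0, 0] for k in range(1, K + 1)}
    for t, y in samples:
        totals[t][0] += y
        totals[t][1] += 1
    if any(n == 0 for _, n in totals.values()):
        return 0                                # some treatment unobserved
    return max(totals, key=lambda k: totals[k][0] / totals[k][1])
```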
3.3.2 Upper Bound on Expected Regret
In this section we aim to understand how fast the performance of a treatment rule
trained with CTS.0 approaches the oracle performance as the training size increases.
Let h# denote a treatment rule generated by CTS.0. Define the expected regret as
E{ max_{k=1,...,K} [μ(X, k)] − μ(X, h#(X)) },   (3.8)
which is the difference between the expected response under the optimal treatment
rule and that under a treatment rule generated by CTS.0 with training set S^N. Note
that the expectation is taken over both the training set S^N and the test point X.
In order to bound the expected regret we need a few assumptions.
1. The feature space X^d is a subspace of R^d and there exist R_U ≥ R_L > 0 such
that, for i = 1, ..., d,
R_L ≤ max_{x, x' ∈ X^d} |x_i − x'_i| ≤ R_U.
2. For a subspace φ ⊆ X^d, denote by f(φ) the probability that X ∈ φ. We assume
there exists a p > 0 such that f(φ) ≥ p · Volume(φ) for all φ ⊆ X^d.
3. The response Y is bounded, i.e., there exists Y_max > 0 such that |Y| ≤ Y_max.
4. The conditional expectation μ(x, k) = E[Y | X = x, T = k] satisfies a Lipschitz
condition, i.e., there exists a C_L > 0 such that ∀x, x' ∈ X^d, ∀k ∈ {1, ..., K},
|μ(x, k) − μ(x', k)| ≤ C_L ‖x − x'‖.
5. The N observations in S^N are independent of each other. The treatment values
follow the discrete uniform distribution on {1, ..., K} and are independent of the
features.
Theorem 3. If the above assumptions are satisfied, then for any given number of
partitions M, the expected regret of CTS.0 is bounded above as follows:
E{ max_{k=1,...,K} [μ(X, k)] − μ(X, h#(X)) }
≤ 4 C_L R_U d^{1/2} M^{-1/d}   (discretization error)
+ 2 Y_max K (1 − p R_L^d / (2^d M K))^N + 8 K² Y_max e^{-1/2} M N^{-1/2}.   (misassignment error)   (3.9)
It is clear from Eq. (3.9) that the bound on the discretization error decreases as
the number of partitions increases, while the bound on the misassignment error, as
expected, increases with the number of partitions. The following corollary shows
how we can achieve the asymptotically tightest bound by setting a proper value
for M.
Corollary 1. Under the same conditions as Theorem 3, by setting
M = round(N^{d/(2(d+1))}), we have
E{ max_{k=1,...,K} [μ(X, k)] − μ(X, h#(X)) } = O(N^{-1/(2(d+1))}).
The significance of Corollary 1 is twofold. First, it proves that the performance
difference between an optimal treatment rule and a treatment rule constructed from
data approaches zero as the number of samples increases to infinity. Second, the
regret decays at a rate no slower than O(N^{-1/(2(d+1))}). The dependence on d in
the upper bound emphasizes the necessity of feature selection: while the performance
of the optimal treatment rule improves with the number of features in the model, it
also requires exponentially more data to approximate the optimal treatment rule to
a given accuracy. Choosing the appropriate model complexity and selecting the most
powerful features are critical to achieving desirable performance.
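As a numeric illustration of Corollary 1, the prescribed M = round(N^{d/(2(d+1))}) and the resulting rate N^{-1/(2(d+1))} can be computed directly; the numbers below are arithmetic consequences of the formulas, not experimental results.

```python
# Number of partitions and regret rate prescribed by Corollary 1.
def partitions_and_rate(N, d):
    M = round(N ** (d / (2 * (d + 1))))
    rate = N ** (-1 / (2 * (d + 1)))
    return M, rate

# For N = 10**6 samples: d = 2 gives M = 100 and rate 0.1, while d = 10
# gives a much slower rate, illustrating the cost of extra dimensions.
```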
The rest of this section is devoted to the proof of Theorem 3, which builds on
three facilitating lemmas. Lemma 4 describes how the size of the partitions of the
feature space decreases with the number of partitions.
Lemma 4. Let X^d be a d-dimensional feature space with all dimensions bounded
above by R_U and below by R_L. Let Φ = {φ_1, ..., φ_M} be the partition of X^d
produced by Phase 1 of Algorithm CTS.0. Then, ∀φ_m ∈ Φ, φ_m is bounded above in
all dimensions by 2 R_U M^{-1/d} and below by (1/2) R_L M^{-1/d}.
Proof. The proof is straightforward and therefore omitted.
Lemma 5 proves that the performance difference between a point-wise optimal
treatment rule and a subspace-wise optimal treatment rule is proportional to the
size of the subspace.
Lemma 5. Let φ be a subspace of X^d with all dimensions bounded above by R. Denote
the conditional expectation of Y as μ(φ, k) = E[Y | X ∈ φ, T = k]. Let
k* ∈ arg max_{k=1,...,K} [μ(φ, k)]. If μ(·) satisfies the Lipschitz condition on φ,
i.e., there exists some C_L > 0 such that ∀x, x' ∈ φ, ∀k ∈ {1, ..., K},
|μ(x, k) − μ(x', k)| ≤ C_L ‖x − x'‖,
then we have
E{ max_{k=1,...,K} [μ(X, k)] − μ(X, k*) | X ∈ φ } ≤ 2 C_L d^{1/2} R.
Proof. ∀x ∈ φ, let k⁺ ∈ arg max_{k=1,...,K} [μ(x, k)]. By the Lipschitz condition,
μ(φ, k*) − C_L d^{1/2} R ≤ μ(x, k*) ≤ μ(φ, k*) + C_L d^{1/2} R,
μ(φ, k⁺) − C_L d^{1/2} R ≤ μ(x, k⁺) ≤ μ(φ, k⁺) + C_L d^{1/2} R,
and therefore, since μ(φ, k⁺) ≤ μ(φ, k*),
μ(x, k⁺) − μ(x, k*) ≤ μ(φ, k⁺) + C_L d^{1/2} R − [μ(φ, k*) − C_L d^{1/2} R]
≤ 2 C_L d^{1/2} R.
Because the bound in the inequality above does not depend on x, we have
E{ max_{k=1,...,K} [μ(X, k)] − μ(X, k*) | X ∈ φ }
≤ E{ 2 C_L d^{1/2} R | X ∈ φ } = 2 C_L d^{1/2} R.
Lemma 6 states that the probability of selecting a sub-optimal treatment for a
subspace decreases exponentially with the number of samples.
Lemma 6. Let φ be a subspace of X^d with all dimensions bounded above by R. Let
k* ∈ arg max_{k=1,...,K} μ(φ, k). Given dataset S^N, let k# be the treatment
assigned to φ by Phase 2 of Algorithm CTS.0. Then,
∀k' ∈ {1, ..., K} \ arg max_{k=1,...,K} μ(φ, k), we have
Pr{k# = k'} ≤ 2 exp{ − Δ² f²(φ) N / (32 K² Y²_max) },
where Δ = μ(φ, k*) − μ(φ, k').
Proof. Let a = (1/2)[μ(φ, k') + μ(φ, k*)]. Then
Pr{k# = k'}
≤ Pr{ ŷ(φ, k') n̂(φ, k*) ≥ ŷ(φ, k*) n̂(φ, k') }
= 1 − Pr{ ŷ(φ, k') n̂(φ, k*) < ŷ(φ, k*) n̂(φ, k') }
≤ 1 − Pr{ ŷ(φ, k') − a n̂(φ, k') < 0, ŷ(φ, k*) − a n̂(φ, k*) > 0 }
= Pr{ ŷ(φ, k') − a n̂(φ, k') ≥ 0 or ŷ(φ, k*) − a n̂(φ, k*) ≤ 0 }
≤ Pr{ ŷ(φ, k') − a n̂(φ, k') ≥ 0 } + Pr{ ŷ(φ, k*) − a n̂(φ, k*) ≤ 0 }.
For i = 1, ..., N, define
Z_i = Y_i 1{X_i ∈ φ, T_i = k'} − a 1{X_i ∈ φ, T_i = k'}.
It is simple to show that E[Z_i] = −(Δ/(2K)) f(φ) and that Z_i is bounded between
−2Y_max and 2Y_max. Because Z_1, ..., Z_N are independent, we can apply Hoeffding's
inequality:
Pr{ ŷ(φ, k') − a n̂(φ, k') ≥ 0 }
= Pr{ Σ_{i=1}^N Z_i ≥ 0 }
= Pr{ Σ_{i=1}^N Z_i − [−(Δ/(2K)) f(φ) N] ≥ (Δ/(2K)) f(φ) N }
≤ exp{ − 2 [(Δ/(2K)) f(φ) N]² / (N (4 Y_max)²) }
= exp{ − Δ² f²(φ) N / (32 K² Y²_max) }.
In a similar way we can prove
Pr{ ŷ(φ, k*) − a n̂(φ, k*) ≤ 0 } ≤ exp{ − Δ² f²(φ) N / (32 K² Y²_max) }.
Combining these results, we have
Pr{k# = k'} ≤ 2 exp{ − Δ² f²(φ) N / (32 K² Y²_max) }.
Proof of Theorem 3
Proof. Assuming M and X^d are given, the partition Φ generated by Phase 1 of
Algorithm CTS.0 is deterministic. For m = 1, ..., M, let
k*_m ∈ arg max_{k=1,...,K} μ(φ_m, k). We can decompose the expected regret into
two parts:
E{ max_{k=1,...,K} [μ(X, k)] − μ(X, h#(X)) }
= Σ_{m=1}^M E{ max_{k=1,...,K} [μ(X, k)] − μ(X, k#_m) | X ∈ φ_m } f(φ_m)
= Σ_{m=1}^M E{ max_{k=1,...,K} [μ(X, k)] − μ(X, k*_m) | X ∈ φ_m } f(φ_m)
+ Σ_{m=1}^M E{ μ(X, k*_m) − μ(X, k#_m) | X ∈ φ_m } f(φ_m)
= part 1 + part 2.
Part 1 is the expected difference in response between using the point-wise optimal
treatment and using the subspace-wise optimal treatment. Combining Lemma 4 and
Lemma 5, we can bound part 1 as
part 1 ≤ Σ_{m=1}^M 2 C_L d^{1/2} · 2 R_U M^{-1/d} f(φ_m) = 4 C_L R_U d^{1/2} M^{-1/d}.
Part 2 is the expected regret from assigning subspaces to sub-optimal treatments;
we need Lemma 6 to bound it. For m = 1, ..., M, let
B_m = arg max_{k=1,...,K} μ(φ_m, k) and Δ_m(k) = μ(φ_m, k*_m) − μ(φ_m, k). Then
E{ μ(X, k*_m) − μ(X, k#_m) | X ∈ φ_m }
= Σ_{k' ∉ B_m} [μ(φ_m, k*_m) − μ(φ_m, k')] Pr{k#_m = k'}
≤ 2 Y_max Pr{k#_m = 0} + Σ_{k' ∉ B_m ∪ {0}} Δ_m(k') Pr{k#_m = k'}
≤ 2 Y_max K [1 − f(φ_m)/K]^N
+ Σ_{k' ∉ B_m ∪ {0}} Δ_m(k') 2 exp{ − Δ_m²(k') f²(φ_m) N / (32 K² Y²_max) }.
Because the volume of φ_m is not smaller than R_L^d / (2^d M) (Lemma 4), we have
2 Y_max K [1 − f(φ_m)/K]^N ≤ 2 Y_max K (1 − p R_L^d / (2^d M K))^N.
The real-valued function g(x) = x e^{−a x²} satisfies g(x) ≤ (2ae)^{-1/2} for
a > 0, therefore
Σ_{k' ∉ B_m ∪ {0}} Δ_m(k') 2 exp{ − Δ_m²(k') f²(φ_m) N / (32 K² Y²_max) }
≤ 8 K² Y_max / (e^{1/2} f(φ_m) N^{1/2}).
Part 2 can then be bounded as
part 2 = Σ_{m=1}^M E{ μ(X, k*_m) − μ(X, k#_m) | X ∈ φ_m } f(φ_m)
≤ Σ_{m=1}^M f(φ_m) · 2 Y_max K (1 − p R_L^d / (2^d M K))^N
+ Σ_{m=1}^M f(φ_m) · 8 K² Y_max / (e^{1/2} f(φ_m) N^{1/2})
≤ 2 Y_max K (1 − p R_L^d / (2^d M K))^N + 8 K² Y_max e^{-1/2} M N^{-1/2}.
Combining part 1 and part 2 completes the proof.
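Lemma 6's exponential bound can be sanity-checked by simulation. This is a hedged sketch with hypothetical parameters (f(φ) = 1, K = 2, Y_max = 1, gap Δ), not an experiment from the thesis:

```python
# Monte Carlo check that the empirical misassignment probability stays
# below the Lemma 6 bound 2*exp(-Delta^2 f^2 N / (32 K^2 Ymax^2)).
import math
import random

def misassignment_rate(N, delta, trials=200, seed=0):
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        sums = {1: [0.0, 0], 2: [0.0, 0]}
        for _ in range(N):
            t = rng.choice([1, 2])                  # uniform assignment (Assumption 5)
            mean = 0.5 + (delta / 2 if t == 1 else -delta / 2)
            y = mean + rng.uniform(-0.3, 0.3)       # bounded response, |Y| <= 1
            sums[t][0] += y
            sums[t][1] += 1
        # empirical means; max(n, 1) guards the (vanishingly rare) empty arm
        means = {t: s / max(n, 1) for t, (s, n) in sums.items()}
        if means[2] >= means[1]:                    # suboptimal arm chosen
            wrong += 1
    return wrong / trials

def lemma6_bound(N, delta, K=2, y_max=1.0, f=1.0):
    return 2 * math.exp(-delta**2 * f**2 * N / (32 * K**2 * y_max**2))
```

The analytical bound is loose, so for moderate N the empirical misassignment frequency should sit well below it.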
3.4 Summary
In this chapter, we present a theoretical analysis of tree-based algorithms in uplift modeling. By separating the training data into one set for model training and another for response prediction, we prove that the algorithm is consistent. Thus our algorithm is asymptotically optimal with infinite training data. Moreover, experimental results show that this approach does not sacrifice prediction performance. Applying dyadic splits, we obtain an upper bound on the expected regret of our algorithm and the asymptotically tightest bound of the regret.
The two techniques used here provide some guidance for the theoretical analysis of tree-based algorithms in general. Decision trees have shown superior performance in many applications, but their statistical properties need further exploration. Using different datasets for model training and response prediction is a common way to reduce the bias from outliers. The consistency analysis is critical as it ensures that the algorithm
is asymptotically optimal. Dyadic decision trees can attain nearly optimal rates of convergence and approximate complex decision boundaries. The restriction to dyadic splits makes the asymptotic analysis more tractable.
Chapter 4
Uplift Modeling for Observational
Studies
4.1 Introduction
Though a randomized experiment is the ideal method to collect data for treatment selection, it is not the final answer: many restrictions limit its generalizability. For example, randomized experiments are often restricted to patients with limited disease, comorbidity, and concomitant medications. In these contexts, observational (nonrandomized) studies play a role. Nowadays, there is a growing interest in using observational studies to estimate treatment effects on responses, due to the
using observational studies to estimate the treatment effects on responses due to the
widespread accumulation of data in fields such as healthcare, education, and ecology.
In observational studies, treatment selection is influenced by subject characteris-
tics. For example, we have an electronic health record dataset collected over several
years. For each patient, we have access to lab tests and past diagnoses of patients,
their medications, and responses, but we do not have complete knowledge of why a specific medication was given to a patient, which may depend heavily on the patient's characteristics. For example, richer patients might better afford certain medications, or a specific medication may be contraindicated for patients with a past disease. As a result, baseline characteristics of subjects often differ systematically between different treatments.
Figure 4-1: Causal graphical models of a randomized experiment (left) and an observational study (right)
We use causal graphical models to illustrate the difference between randomized experiments and observational studies in Figure 4-1. Causal graphical models are tools to visualize causal relationships between variables [Elwert 2013]. Nodes are variables and edges are causal relationships. In Figure 4-1, Y represents the response variable, X represents the features that can be observed but not influenced, T represents the assigned treatment, and U_Y represents all the unobserved exogenous background factors that influence Y. While a great deal of research is aimed at discovering the structure of the causal graph embedded in the observed data correlations [Heckerman 1995] [Hackerman 1997], we assume the causal graph structure is given a priori. Though our focus is not to deduce the causal links, causal graphs are suggestive of bias and can provide a starting point for identifying variables that must be measured and controlled to obtain unconfounded effect estimates.
Unlike randomized experiments, observational studies do not automatically con-
trol for treatment selection biases. The treatments observed in the data depend on
variables which might also affect the response. For example, richer patients might bet-
ter afford certain medications, and job training might only be given to those motivated
enough to seek it. This gives rise to the challenge of untangling these confounding fac-
tors and making valid predictions. Therefore, statistical methods involving matching,
stratification, and covariance adjustment are needed.
Numerous methods exist to handle confounding, including propensity score matching [Mani 2006], marginal structural models [Robins 2000], and g-estimation [Bielbyy 1977]. Doubly robust methods combine re-weighting the samples and covariate adjustment in clever ways to reduce model bias. These methods do not lend themselves immediately to estimating an individual treatment effect and predicting the optimal treatment afterwards. Adapting them for that purpose remains unresolved.
We will show in Section 4.2 that our algorithm solves this challenge by learning a rep-
resentation of data that makes distributions more similar, and training an ensemble
tree model on top of it.
4.2 Algorithm
Let us consider the case of a binary treatment, where one sample of the population is selected and subjected to the treatment, called the treatment set with t = 1, and another selected sample does not receive the treatment, called the control set with t = 0.
Uplift modeling with observational data requires amending the direct modeling
approach to balance feature distributions. A naive way of obtaining a balanced rep-
resentation is to use only features that are already well balanced, i.e., features which
have a similar distribution across treatments. However, imbalanced features can be highly predictive of the response and should not always be discarded. A middle ground is to restrict the influence of imbalanced features on the response. Under linear assumptions, we can use a re-weighting matrix to transform the feature space and achieve a trade-off between its predictive capability and its balance.
However, in various conditions, linear assumptions are not satisfied and the underlying data structure can be very complicated. Deep neural networks have been shown to successfully learn good representations [Bengio 2013]. [Johansson 2016] put forward a modification of the standard feed-forward architecture with fully connected layers. The first few hidden layers are used to learn a representation Φ of the input x. The layers following Φ take the treatment assignment as additional input and generate a prediction of the response. This framework, called CFR, jointly learns the response functions for all treatments by minimizing a weighted sum of the loss and the Integral Probability Metric (IPM) distance between the distributions under treatment and control induced by the feature representation. This model allows for learning complex non-linear representations and response functions with large flexibility.

We combine CFR and CTS to exploit both of their advantages in our model: the neural network architecture learns a feature representation that forces balanced distributions among treatments, and the CTS module takes the new features as input to predict the optimal treatment. The model framework is shown in
Figure 4-2. Here q_0(·) and q_1(·) are the response functions under control and treatment, L is a loss function, and IPM_G is an integral probability metric. The representation Φ and response function q are chosen to minimize a sum of predictive accuracy and representation-space imbalance, as in the following objective. This algorithm, called NN-CTS, is outlined in Algorithm 4.

\[
\min_{q, \Phi} \; \frac{1}{N} \sum_{i=1}^{N} w_i \cdot L\big(q(\Phi(x_i), t_i), y_i\big) + \lambda \cdot \mathcal{R}(q) + \alpha \cdot \mathrm{IPM}_G\big(\{\Phi(x_i)\}_{i: t_i=0}, \{\Phi(x_i)\}_{i: t_i=1}\big),
\]

\[
\text{with } w_i = \frac{t_i}{2u} + \frac{1 - t_i}{2(1 - u)}, \quad \text{where } u = \frac{1}{N} \sum_{i=1}^{N} t_i,
\]

and \mathcal{R} is a model complexity term.
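To make the pieces of this objective concrete, here is a minimal numpy sketch. The squared-error loss and the squared distance between mean representations (a simple linear member of an IPM family) are our illustrative choices, not fixed by the thesis, and the model-complexity term R(q) is omitted:

```python
import numpy as np

def balancing_weights(t):
    """w_i = t_i/(2u) + (1 - t_i)/(2(1 - u)), u = treated fraction,
    so treated and control contribute equally to the factual loss."""
    t = np.asarray(t, dtype=float)
    u = t.mean()
    return t / (2.0 * u) + (1.0 - t) / (2.0 * (1.0 - u))

def linear_mmd(phi0, phi1):
    """Squared distance between mean representations: one simple
    member of the IPM family (our choice for illustration)."""
    return float(np.sum((phi0.mean(axis=0) - phi1.mean(axis=0)) ** 2))

def nn_cts_objective(preds, y, t, phi, alpha=1.0):
    """Weighted factual loss (squared error assumed for L) plus the
    imbalance penalty; the complexity term R(q) is omitted."""
    t = np.asarray(t)
    w = balancing_weights(t)
    factual = np.mean(w * (np.asarray(preds) - np.asarray(y)) ** 2)
    return factual + alpha * linear_mmd(phi[t == 0], phi[t == 1])
```

In a full implementation this objective would be minimized over the network weights by gradient descent; the sketch only evaluates it for given predictions and representations.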
4.3 Experimental Results
Evaluating uplift modeling for observational studies is infeasible with real-world data, as the population distribution and the ground truth of the optimal treatment are both unknown; thus we use synthetic datasets with known distributions and response functions to compare model performance.
We report a few more results to compare response prediction accuracy. For a subject x and each potential treatment t, we denote the potential response as Y_t(x). The quantity Y_1(x) - Y_0(x), called the individualized treatment effect (ITE), is of high interest. Another commonly sought quantity is the average treatment effect,
Figure 4-2: Neural network architecture for uplift modeling
ATE = E_{x∼p(x)}[ITE(x)]. Some quantities of interest are the RMSE of the estimated individual treatment effect, denoted ε_ITE; the absolute error in the estimated average treatment effect, denoted ε_ATE; and the Precision in Estimation of Heterogeneous Effect (PEHE),

\[
\mathrm{PEHE} = \frac{1}{n} \sum_{i=1}^{n} \Big( \big(\hat{y}_1(x_i) - \hat{y}_0(x_i)\big) - \big(Y_1(x_i) - Y_0(x_i)\big) \Big)^2.
\]
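These error measures are straightforward to compute once predicted and true potential outcomes are available for a test set. A minimal sketch, under one plausible reading of the definitions above (in particular, PEHE is taken as the mean squared error of the estimated individual effects):

```python
import numpy as np

def uplift_metrics(y1_hat, y0_hat, y1_true, y0_true):
    """Return (e_ITE, e_ATE, PEHE): RMSE of estimated individual
    effects, absolute error of the average effect, and mean squared
    error of the estimated individual effects."""
    ite_hat = np.asarray(y1_hat) - np.asarray(y0_hat)
    ite_true = np.asarray(y1_true) - np.asarray(y0_true)
    e_ite = np.sqrt(np.mean((ite_hat - ite_true) ** 2))
    e_ate = abs(ite_hat.mean() - ite_true.mean())
    pehe = np.mean((ite_hat - ite_true) ** 2)
    return e_ite, e_ate, pehe
```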
4.3.1 Synthetic Data 1
The feature space is the fifty-dimensional hypercube of side length 10, i.e., X = [0, 10]^{50}. Features are uniformly distributed in the feature space, i.e., X_d ∼ U[0, 10] for d = 1, ..., 50. There are two treatments, T ∈ {0, 1}. The response under each treatment is defined as below.

\[
Y = \begin{cases}
f(X) + U[0, a X_1] + \epsilon & \text{if } T = 0,\\
f(X) + U[0, a X_2] + \epsilon & \text{if } T = 1.
\end{cases} \tag{4.1}
\]
The response is the sum of three components.

* The first term f(X) defines the systematic dependence of the response on the features and is identical for all treatments. Specifically, f is chosen to be a
Algorithm 4 NN-CTS - Contextual Treatment Selection with balanced features

Input: training data S_N = (x^1, t^1, y^1), ..., (x^N, t^N, y^N), scaling parameter α > 0, loss function L(·, ·), representation network Φ_W with initial weights W, outcome network h_V with initial weights V, function family G for the IPM, number of samples in each tree B (B ≤ N), number of trees ntree, number of variables to be considered for a split mtry (1 ≤ mtry ≤ d), the minimum number of samples required for a split min-split, the regularity factor n_reg

Training:
1. Update the representation network weights W and outcome network weights V until the convergence criterion is met.
2. Train ntree trees using the training data features transformed by the representation network.

Prediction: Given a new point x in the feature space, the new feature is the output of the representation network Φ_W, and the predicted expected response under a treatment is the average of the predictions from all the trees. The optimal treatment is the one with the largest predicted expected response.
mixture of 50 exponential functions so that it is complex enough to reflect real-world scenarios:

\[
f(x_1, \ldots, x_{50}) = \sum_{i=1}^{50} a'_i \exp\big\{-b'_{i1}|x_1 - c'_{i1}| - \cdots - b'_{i,50}|x_{50} - c'_{i,50}|\big\}, \tag{4.2}
\]

where the a'_i, b'_{ij}, and c'_{ij} are chosen randomly.
" The second term U[0, aX] is the treatment effect and is unique for each treat-
ment t. In many applications we would expect the treatment effect to be of
a lower order of magnitude of the main effect, so we set a to be 0.4 which is
roughly 5% of E[If(X)|].
" The third term E is the zero-mean Gaussian noise, i.e. c ~ N(0, a2 ). Note that
the standard deviation o- of the noise term is identical for all treatment. o is
82
set to 0.8 which is twice the amplitude of the treatment effect a.
Under this particular data model, the expected response is the same for both treatments, i.e., E[Y | T = t] = 5.18 for t = 0, 1. The expected response under the optimal treatment rule, E[Y | T = h*(X)], is 5.79.
Data under treatments 0 and 1 follow these distributions:

\[
P(X \mid T = 0) = U^{49}[0, 10] \times \mathrm{Tri}(0, 10, 0), \qquad
P(X \mid T = 1) = U^{49}[0, 10] \times \mathrm{Tri}(0, 10, 10).
\]
This is to simulate the real world situation that subjects are assigned to treatment
with different probability distribution based on their characteristics.
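To make the data-generating process concrete, here is a hedged numpy sketch. The ranges of the random constants a'_i, b'_{ij}, c'_{ij} and the choice of which coordinate follows the triangular distribution are our own assumptions (the text does not pin them down; we take the last coordinate):

```python
import numpy as np

rng = np.random.default_rng(0)
D, NCOMP, a_eff, sigma = 50, 50, 0.4, 0.8

# Mixture-of-exponentials main effect, eq. (4.2); the ranges of the
# random constants below are our own assumption.
A = rng.uniform(0.0, 1.0, size=NCOMP)          # a'_i
B = rng.uniform(0.0, 1.0, size=(NCOMP, D))     # b'_ij
C = rng.uniform(0.0, 10.0, size=(NCOMP, D))    # c'_ij

def f(x):
    return np.sum(A * np.exp(-np.sum(B * np.abs(x - C), axis=1)))

def sample(t, n):
    """Observational design: 49 coordinates are U[0, 10]; the last is
    triangular with mode 0 (t = 0) or 10 (t = 1). The response follows
    eq. (4.1): main effect plus U[0, a*X_1] or U[0, a*X_2] plus noise."""
    x = rng.uniform(0.0, 10.0, size=(n, D))
    x[:, -1] = rng.triangular(0.0, 0.0 if t == 0 else 10.0, 10.0, size=n)
    effect_dim = 0 if t == 0 else 1
    y = np.array([f(xi) for xi in x])
    y += rng.uniform(0.0, a_eff * x[:, effect_dim])  # treatment effect
    y += rng.normal(0.0, sigma, size=n)              # Gaussian noise
    return x, y
```

Under this design the last-coordinate distribution differs sharply between the treatment groups, which is exactly the kind of imbalance the representation learning step is meant to remove.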
We compare the performance of 5 different methods. They are NN-CTS, CFR,
CTS, Separate Model Approach with Random Forest (SMA-RF), Support Vector Re-
gressor with Radial Basis Kernel (SMA-SVR). CFR is implemented as a feed-forward
neural network with 3 fully-connected exponential-linear layers for the representation
network and 3 for the outcome network. Layer sizes are 100 for all layers. The model
is trained using Adam [Kingma 2014]. Layers corresponding to the outcome are regularized with a small ℓ2 weight decay. These algorithms are tested under increasing
training data size, specifically 10000, 20000, 30000, 40000, 50000 and 60000 samples
per treatment. For each size, 10 training data sets are generated so that we can com-
pute the margin of error of the results. To ensure consistency in comparison, for each
data size, all the methods are tested with the same 10 datasets. The performance of
a model is evaluated using Monte Carlo simulation and the true data model.
The performance of the 5 methods is plotted in Fig. 4-3. For reference, we also plot the single-treatment expected response and the optimal expected response. The vertical bars are the 95% margins of error computed with results from 10 training datasets. Response prediction accuracy results are presented in Table 4.1. We can see from Fig. 4-3 that the specialized algorithms (NN-CTS, CFR, CTS) surpass the separate model approach, and the advantage continues to grow as the training size
(We use Tri(a, b, c) to denote a triangular distribution with lower limit a, upper limit b, and mode c.)
Figure 4-3: Averaged expected response of different algorithms on the synthetic data simulating observational studies (series: Optimal, Single Trt, SMA-RF, SMA-SVR, CTS, CFR, NN-CTS). The error bars are 95% margins of error computed with results from 10 training datasets. For each data size, all algorithms are trained with the same 10 datasets.
increases. The performance of NN-CTS is significantly better than that of CTS for all data sizes. This shows the advantage of feature representation when data distributions differ among treatments. NN-CTS also outperforms CFR, though both use a neural network to learn new features. The combination of a neural network feature representation and an ensemble tree model for response prediction is promising for handling real-world problems.
Table 4.1: Standard errors of 10 repeated experiments for datasets simulating observationalstudies
To better visualize the feature distributions, we use t-distributed stochastic neighbor embedding (t-SNE) to project the high-dimensional feature vectors to a 2-dimensional space and plot them in Fig. 4-4. Red and blue dots represent treatments 0 and 1, respectively. Note the blue scattered points spread along the outermost circle in the left two figures; they show the clear differences between feature distributions in the original data, while this difference disappears in the feature representation shown in the right two figures.
4.3.2 Synthetic Data 2
In this section, we consider a data set with the same feature space and response function as in Section 4.3.1, except that the data under the two treatments are sampled from the uniform distribution U^{50}[0, 10]. This is a randomized experiment, but we will examine next how the different models behave.
We compare the performance of 5 different methods. They are NN-CTS, CFR,
CTS, Separate Model Approach with Random Forest (SMA-RF), Support Vector
Regressor with Radial Basis Kernel (SMA-SVR). These algorithms are tested under
increasing training data size, specifically 10000, 20000, 30000, 40000, 50000 and 60000
samples per treatment. For each size, 10 training data sets are generated so that we
         ε_ITE      ε_ATE      PEHE
SMA-RF   2.4 ± 0.1  0.7 ± 0.1  4.1 ± 0.2
SMA-SVR  1.9 ± 0.2  0.6 ± 0.1  3.7 ± 0.2
CTS      1.7 ± 0.1  0.4 ± 0.1  2.9 ± 0.1
CFR      1.4 ± 0.1  0.3 ± 0.1  2.1 ± 0.1
NN-CTS   1.2 ± 0.1  0.3 ± 0.1  1.9 ± 0.0
Figure 4-4: Feature distribution after t-distributed stochastic neighbor embedding (t-SNE). Panels: original data (N=10000), original data (N=60000); feature representation (N=10000), feature representation (N=60000).
Figure 4-5: Averaged expected response of different algorithms on the synthetic data simulating randomized experiments (series: Optimal, Single Trt, SMA-RF, SMA-SVR, CTS, CFR, NN-CTS). The error bars are 95% margins of error computed with results from 10 training datasets. For each data size, all algorithms are trained with the same 10 datasets.
can compute the margin of error of the results. The performance of a model is
evaluated using Monte Carlo simulation and the true data model.
The performance of the 5 methods is plotted in Fig. 4-5. Response prediction accuracy results when the data size is 60000 are presented in Table 4.2. When data sizes are small (10000 and 20000), NN-CTS and CFR perform better than CTS. This is because, even though the data distributions under different treatments are the same, the generated data may look different when the data size is very small and the feature space is very large. Therefore, methods with feature representation help reduce this bias on small datasets. This is helpful in the real world, as large datasets may be unavailable in some industries, such as insurance, or at the beginning stage of an experiment. When data sizes are large, NN-CTS and CFR suffer from the error introduced by an unnecessary feature representation.
Table 4.2: Standard errors of 10 repeated experiments for data simulating randomized experiments
4.3.3 Simulation based on Real Data
A semi-simulated dataset based on the Infant Health and Development Program
(IHDP) was introduced by [Hill 2011]. The IHDP data studies the effect of high-
quality child care and home visits on future cognitive test scores in a real randomized
experiment. [Hill 2011] proposes an experiment which uses a simulated outcome and artificially introduces imbalance between treated and control subjects by removing a subset of the treated group. The dataset consists of 747 subjects in total, 139 in the treatment group and 608 in the control group. There are 25 covariates measuring properties of the child and their mother for each subject. See Table 4.3 for results. It shows that NN-CTS achieves lower errors than the separate model approach and CFR, though we would still need to compare their expected-response performance in a live experiment.
         ε_ITE      ε_ATE      PEHE
SMA-RF   2.1 ± 0.2  0.5 ± 0.0  3.7 ± 0.2
SMA-SVR  1.7 ± 0.2  0.5 ± 0.1  2.9 ± 0.2
CTS      1.4 ± 0.1  0.4 ± 0.1  2.2 ± 0.1
CFR      1.1 ± 0.1  0.3 ± 0.1  1.7 ± 0.1
NN-CTS   1.1 ± 0.1  0.2 ± 0.1  1.5 ± 0.0
Table 4.3: Standard errors for 10 repeated experiments on IHDP dataset
4.4 Summary
In this chapter, we present a model for uplift modeling with observational study data. As there is no automatic control for treatment selection bias, we need additional techniques to untangle the confounding factors. Seeing the success of neural networks in various feature representation applications, we adopt the feed-forward architecture with fully connected layers of [Johansson 2016] to learn a feature representation that jointly minimizes the response prediction error and the feature distribution distance between treated and control groups. CTS takes the new features as input and predicts the optimal treatment. Experimental results on synthetic data show that the combination of neural network feature representation and an ensemble tree model can handle the challenges encountered in observational studies well. However, evaluation with data collected in observational studies still remains an open question.
         ε_ITE      ε_ATE      PEHE
SMA-RF   3.2 ± 0.2  0.8 ± 0.0  4.9 ± 0.2
SMA-SVR  2.8 ± 0.2  0.5 ± 0.0  3.7 ± 0.2
CTS      1.8 ± 0.1  0.3 ± 0.0  2.1 ± 0.1
CFR      1.7 ± 0.0  0.3 ± 0.0  1.6 ± 0.1
NN-CTS   1.6 ± 0.1  0.3 ± 0.0  1.6 ± 0.0
Chapter 5
Conclusion
5.1 Summary
We have seen uplift modeling put forward initially in the fields of marketing and insurance, but it has much broader applications in many areas, such as healthcare, economics, and sociology. Data collected from randomized experiments and observational studies can both be used to analyze individualized treatment effects and achieve uplift by personalized treatment assignment.
In this thesis, we first present the formulation of uplift modeling in a more general
framework. Based on this formulation, we put forward an unbiased estimate of the
expected response for an uplift model. This evaluation metric expands the scope
of uplift modeling as we are able to evaluate a broad range of models from binary
treatments to multiple treatments, from discrete responses to continuous responses.
Based on this evaluation metric, we present our algorithm, Contextual Treatment
Selection (CTS), for uplift modeling with randomized experiment data. This fills a critical vacancy in the field of uplift modeling by handling multiple treatments and continuous responses. Experimental results on synthetic and industry-provided data demonstrate that CTS can lead to a significant increase in expected response relative to other applicable methods. To further analyze the asymptotic properties, we apply dyadic splits, which split each axis at its midpoint, and obtain the convergence rate of a generic CTS algorithm.
5.2 Future Work
As machine learning is becoming a major tool for researchers and policy makers
across different fields such as healthcare and economics, contextual treatment selection
becomes a crucial issue for the practice of machine learning. In most machine learning problems, we assume that the underlying data model is stable. We also make this assumption throughout the thesis. However, things are always changing in the real world, and machine learning in non-stationary environments is a widely researched topic. This issue also arises in the uplift modeling problem. For example, in the airline
customized pricing problem, passengers' purchasing behavior can change over time. Reasons include, but are not limited to, business strategy changes at the airline and competing airlines, important holidays like Christmas/New Year, exogenous events like terrorism or natural disasters, the general economic situation, etc. Fig. 5-1 shows how the average revenue for different treatments changes over time from the start of our randomized experiment. Without complete knowledge of this information, it is usually difficult to incorporate it in models.
Figure 5-1: The change of average revenue for different treatments over time (Apr 27 to May 11), with 90% confidence bands (red: default price, blue: high price)
Although such time heterogeneity exists, the change evolves relatively slowly compared to the data collection procedure, as we can also see from Fig. 5-1. In the case of seat reservation in Chapter 2, data collected from a 3-week randomized experiment may be enough to train an uplift model that performs at a satisfactory level for three months. The common approach is to periodically retrain the model with new data so that the model always represents the latest user behavior pattern. However, choosing the appropriate data collection period and model refresh frequency needs to be treated carefully, with further investigation of the problem itself, and may require a lot of trial and error. Designing a systematic way to update models automatically is well worth exploring, which can help not only uplift modeling but also broader areas of machine learning.
Open questions that remain include how to generalize the method for observational studies to cases where more than one treatment is in question, deriving better optimization algorithms, and using richer discrepancy measures. Theoretical considerations include the choice of the IPM weight α, and the integration of our work with more complicated causal models such as those with hidden confounding or instrumental variables.
Appendix A
Parameter Selection for CTS
Here we describe the details of the parameter tuning that produced the results in Chapter 2 of this thesis.
* CTS: Fixed parameters are the number of trees ntree=100, the number of fea-
tures considered at each split mtry=25 and the regularization factor nreg=3.
The minimum number of samples required for a split min-split is selected
among [25, 50, 100, 200, 400, 800, 1600, 3200, 6400] (large values are omitted when they exceed the dataset size). For synthesized data, min-split is selected using 5-fold cross-validation when the training data size is 500, 1000, 2000, 4000 and 8000 samples per treatment. Because of time constraints, for sample sizes 16000 and 32000, min-split is selected using validation (half training/half test) on one data set and kept the same for the other nine data sets.
* RF: Fixed parameters are the number of trees nestimators=100, the number
of features considered at each split max_features=25. The minimum number
of samples in leaf node nodesize is tuned with 5-fold cross-validation among
[1,5,10,20].
* KNN: The number of neighbors nneighbors is tuned with 5-fold cross-validation
among [5,10,20,40].
" SVR: The regularization parameter C and the value of the insensitive-zone c are
95
determined analytically using the method proposed in [Cherkassky 2004]. The
spread parameter of the radial basis kernel -y is selected among [10-4, 10-3, 10-2, 10-1]
using 5-fold cross-validation.
* Ada: The number of estimators nestimators=100. Square loss.
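The 5-fold cross-validation used throughout these selections can be sketched generically; `fit` and `score` are placeholder callables standing in for training a model with the candidate parameter and scoring it on held-out data (higher is better):

```python
import numpy as np

def five_fold_cv_select(candidates, fit, score, X, y, n_folds=5, seed=0):
    """Return the candidate parameter with the best average
    validation score across n_folds folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    best_param, best_score = None, -np.inf
    for param in candidates:
        scores = []
        for k in range(n_folds):
            val = folds[k]
            train = np.concatenate(
                [folds[j] for j in range(n_folds) if j != k])
            model = fit(param, X[train], y[train])
            scores.append(score(model, X[val], y[val]))
        if np.mean(scores) > best_score:
            best_param, best_score = param, np.mean(scores)
    return best_param
```

For example, min-split for CTS would be selected by passing the candidate list above and a `score` that returns the (negated) validation error of a forest trained with that min-split.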
Appendix B
Variable Selection
Here we describe how we select variables for the real world data in Section 2.3.
Variable selection is critical to real-world problems as we have little knowledge of the
underlying data model. New variables may bring new information to the prediction,
while irrelevant, redundant, or weak variables may cause overfitting and reduce the
interpretability of the model. We first extract meaningful variables from the raw data (e.g., the transaction record of each flight ticket purchase), and then select the optimal variable set using a forward-greedy approach.
* Seat Reservation Data

  Available variables: booking hour, booking weekday, travel hour, travel weekday, number of days between the last login and the next flight, zone code, fare class, days between the last login and the closest holiday, group size, flight ticket price

  Optimal variable set: booking hour, booking weekday, travel hour, travel weekday, number of days between the last login and the next flight, zone code, fare class
Table B.1: Variable selection on seat reservation data
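The forward-greedy selection described in this appendix can be sketched as follows; `evaluate` is a placeholder for whatever validation metric (e.g., cross-validated expected response) scores a candidate variable subset:

```python
def forward_greedy_select(variables, evaluate):
    """Forward-greedy variable selection: starting from the empty set,
    repeatedly add the variable that most improves evaluate(subset)
    (higher is better); stop when no addition helps."""
    selected = []
    best = evaluate(selected)
    while True:
        gains = [(evaluate(selected + [v]), v)
                 for v in variables if v not in selected]
        if not gains:
            break
        top_score, top_var = max(gains)
        if top_score <= best:
            break
        selected.append(top_var)
        best = top_score
    return selected
```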
Appendix C
Parameter Selection for NN-CTS
We follow the approach described in C.1 of [Johansson 2016] for hyperparameter selection. See Table C.1 for a description of the hyperparameters and search ranges.
Table C.1: Parameters and ranges

Parameter                            Range
Imbalance parameter α                {10^{k/2}}
Number of representation layers      {1, 2, 3}
Number of hypothesis layers          {1, 2, 3}
Dimension of representation layers   {20, 50, 100, 200}
Dimension of hypothesis layers       {20, 50, 100, 200}
Batch size                           {100, 200, 500, 700}
Bibliography
[Agostino 1995] Ralph B. D'Agostino, and Heidy Kwan (1995). Measuring effectiveness: what to expect without a randomized control group. Medical Care, vol. 33, no. 4, pages 95-105.

[Agostino 1998] Ralph B. D'Agostino, Jr (1998). Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statistics in Medicine, 17, pages 2265-2281.

[Agrawal 1994] Rakesh Agrawal, and Ramakrishnan Srikant (1994). Fast algorithms for mining association rules. In Proceedings of the VLDB Conference, pages 487-499.

[Alemi 2009] Farrokh Alemi, Harold Erdman, Igor Griva, and Charles H. Evans (2009). Improved Statistical Methods Are Needed to Advance Personalized Medicine. The Open Translational Medical Journal, 1, 16-20.

[Alina 2011] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E. Schapire (2011). Contextual bandit algorithms with supervised learning guarantees. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR Workshop and Conference Proceedings Volume 15.

[Angrist 2008] Joshua D. Angrist and Jorn-Steffen Pischke (2008). Mostly harmless econometrics: An empiricist's companion. Princeton University Press.

[Ascarza 2016] Eva Ascarza (2016). Retention futility: Targeting high risk customers might be ineffective. Available at SSRN.

[Athey 2016] Susan Athey and Guido Imbens (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, vol. 113, no. 27, pages 7353-7360.

[Auer 2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer (2002). Finite time analysis of the multi-armed bandit problem. Machine Learning, vol. 47, pages 235-256.

[Auer 2003] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire (2003). The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, vol. 32, no. 1, pages 48-77.
[Austin 2011] Peter C. Austin (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, vol. 46, no. 3, pp. 399-424.

[Bang 2005] Heejung Bang, James M. Robins (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, vol. 61, no. 4, pp. 962-973.

[Ben 2007] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira (2007). Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, pp. 137-144.

[Bengio 2013] Yoshua Bengio, Aaron Courville, Pascal Vincent (2013). Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798-1828.

[Bottou 2013] Leon Bottou, Jonas Peters, Joaquin Quinonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, Ed Snelson (2013). Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, vol. 14, no. 1, pp. 3207-3260.

[Bielbyy 1977] William T. Bielby and Robert M. Hauser (1977). Structural equation models. Annual Review of Sociology, vol. 3, pp. 137-161.
[Breiman 1984] Leo Breiman, Jerome Friedman, Charles J. Stone, and R. A. Olshen (1984). Classification and Regression Trees. Wadsworth Statistics/Probability.

[Breiman 2001] Leo Breiman (2001). Random forests. Machine Learning, vol. 45, no. 1, pp. 5-32.

[Chen] Xin Chen, Zack Owen, Clark Pixton, and David Simchi-Levi. A Statistical Learning Approach to Personalization in Revenue Management.

[Chen 2016] Tianqi Chen and Carlos Guestrin (2016). XGBoost: A scalable tree boosting system. arXiv:1603.02754.

[Cherkassky 2004] Vladimir Cherkassky, Yunqian Ma (2004). Practical selection of SVM parameters and noise estimation for SVM regression. Neural Networks, 17(1), 113-126.

[Chickering 2000] David Maxwell Chickering and David Heckerman (2000). A decision theoretic approach to targeted advertising. Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 82-88.

[Cochran 1965] W. G. Cochran (1965). The planning of observational studies of human populations. Journal of the Royal Statistical Society, 128, pp. 134-155.

[Cornia 2011] Marco Cornia, Kristopher S. Gerardi, Adam Hale Shapiro (2011). Price Discrimination and Business-Cycle Risk. Atlanta, Federal Reserve Bank of Atlanta.
[Cortes 2014] Corinna Cortes, and Mehryar Mohri (2014). Domain adaptation andsample bias correction theory and algorithm for regression. Theoretical ComputerScience, 519:103-126.
[Delgado 2014] M. Fernandez-Delgado, Eva Cernadas, Senen Barro, and DinaniAmorim (2014). Do we Need Hundreds of Classifiers to Solve Real World Classifi-cation Problems?, Journal of Machine Learning Research, vol. 15, pp. 3133-3181.
[Eisenstaein 2007] Eric L. Eisenstein, Kevin J. Anstrom, David F. Kong, Linda K.Shaw, Robert H. Tuttle, Daniel B. Mark, Judith M. Kramer, Robert A. Harring-ton, David B. Matchar, David E. Kandzari, Eric D. Peterson, Kevin A. Schulman,Robert M. Califf (2007). Clopidogrel use and long-term clinical outcomes afterdrug-eluting stent implantation. JAMA, 297:159-168.
[Elwert 2013] Felix Elwert (2013). Graphical causal models. In Handbook of causalanalysis for social research, pp. 245-273, Springer.
[Fisher 19351 Ronald A. Fisher (1935). The Design of Experiments, London: Oliverand Boyd.
[Flaxman 2005] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMa-han (2005). Online convex optimization in the bandit setting: gradient descentwithout a gradient. In Proceedings of the Annual ACM-SIAM Symposium on Dis-crete Algorithms, pp. 385-394.
[Freedman 1997] David Freedman (1997). From association to causation via regres-sion. Advances in Applied Mathematics, vol. 18, no. 1, pp. 59-110.
[Gey 2015] Servane Gey, and Elodie Nedelec (2005). Model selection for cart regres-sion trees. IEEE Transaction on Information Theory, vol. 51, no. 2, pp. 658-670.
[Greenland 1999] Sander Greenland, Judea Pearl, and James M. Robins (1999). Causal Diagrams for Epidemiologic Research. Epidemiology, 10, pp. 37-48.
[Guelman 2014a] Leo Guelman, Montserrat Guillen, and Ana M. Perez-Marin (2014a). A survey of personalized treatment models for pricing strategies in insurance. Insurance: Mathematics and Economics, 58, pp. 68-76.
[Guelman 2014b] Leo Guelman, Montserrat Guillen, and Ana M. Perez-Marin (2014b). Optimal personalized treatment rules for marketing interventions: A review of methods, a new proposal, and an insurance case study. Technical report.
[Guelman 2014] Leo Guelman (2014). uplift: Uplift Modeling. R package version 0.3.5. http://CRAN.R-project.org/package=uplift
[Hansotia 2002] Behram Hansotia and Brad Rukstales (2002). Incremental value modeling. Journal of Interactive Marketing, 16(3):35.
[Heckerman 1995] David Heckerman (1995). A Bayesian approach to learning causal networks. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 285-295.
[Heckerman 1997] David Heckerman (1997). Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1(1):79-119.
[Hill 2011] Jennifer L. Hill (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1).
[Holland 1986] Paul W. Holland (1986). Statistics and causal inference. Journal of the American Statistical Association, vol. 81, no. 396, pp. 945-960.
[Holland 1988] Paul W. Holland and Dorothy T. Thayer (1988). Differential item performance and the Mantel-Haenszel procedure. Test Validity, pp. 129-145.
[Huupponen 2013] Risto Huupponen and Jorma Viikari (2013). Statins and the risk of developing diabetes. BMJ, 346.
[Jaskowski 2012] Maciej Jaskowski and Szymon Jaroszewicz (2012). Uplift modeling for clinical trial data. In ICML Workshop on Clinical Data Analysis.
[Johansson 2016] Fredrik D. Johansson, Uri Shalit, and David Sontag (2016). Learning representations for counterfactual inference. In Proceedings of the 33rd International Conference on Machine Learning (ICML).
[Kingma 2014] Diederik P. Kingma and Jimmy Ba (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kleinberg 2005a] Robert D. Kleinberg (2005a). Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pp. 697-704.
[Kleinberg 2005b] Robert D. Kleinberg (2005b). Online Decision Problems with Large Strategy Sets. PhD thesis, Massachusetts Institute of Technology, June 2005.
[Kuusisto 2015] Finn C. Kuusisto (2015). Machine Learning for Medical Decision Support and Individualized Treatment Assignment. PhD thesis, University of Wisconsin-Madison.
[Lai 1985] T. L. Lai and Herbert Robbins (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22.
[Li 2013] Jiuyong Li, Thuc Duy Le, Lin Liu, Jixue Liu, Zhou Jin, and Bingyu Sun (2013). Mining causal association rules. 2013 IEEE 13th International Conference on Data Mining Workshops, pp. 114-123.
[Li 2015] Jiuyong Li, Thuc Duy Le, Lin Liu, Jixue Liu, Zhou Jin, Bingyu Sun, and Saisai Ma (2015). From observational studies to causal rule mining. ACM Transactions on Intelligent Systems and Technology (TIST), 7(2):14.
[Lo 2002] Victor S. Y. Lo (2002). The true lift model: A novel data mining approach to response modeling in database marketing. ACM SIGKDD Explorations Newsletter, 4(2):78-86.
[Manahan 2005] Charles Manahan (2005). A proportional hazards approach to campaign list selection. In SAS User Group International (SUGI) 30 Proceedings.
[Mani 2006] Subramani Mani, Peter L. Spirtes, and Gregory F. Cooper (2006). A theoretical study of Y structures for causal discovery. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence.
[McClellan 1994] Mark McClellan, Barbara J. McNeil, and Joseph P. Newhouse (1994). Does more intensive treatment of acute myocardial infarction in the elderly reduce mortality? Analysis using instrumental variables. JAMA, 272:859-866.
[Nassif 2013] Houssam Nassif, Finn Kuusisto, Elizabeth S. Burnside, and Jude W. Shavlik (2013). Uplift modeling with ROC: An SRL case study. In ILP (Late Breaking Papers), pp. 40-45.
[Neyman 1935] J. Neyman (1935). Statistical Problems in Agricultural Experiments. Journal of the Royal Statistical Society, vol. II, no. 2, pp. 107-180.
[Perchet 2013] Vianney Perchet and Philippe Rigollet (2013). The multi-armed bandit problem with covariates. Annals of Statistics, vol. 41, no. 2, pp. 692-721.
[Radcliffe 2007] Nicholas J. Radcliffe (2007). Using control groups to target on predicted lift: Building and assessing uplift models. Direct Marketing Analytics Journal, An Annual Publication from the Direct Marketing Association Analytics Council, pp. 14-21.
[Radcliffe 2008] Nicholas J. Radcliffe (2008). Hillstrom's MineThatData email analytics challenge: An approach using uplift modeling. Stochastic Solutions Limited, pp. 1-19.
[Radcliffe 2011] Nicholas J. Radcliffe and Patrick D. Surry (2011). Real-world uplift modeling with significance-based uplift trees. White Paper TR-2011-1, Stochastic Solutions.
[Robins 2000] James Robins, Miguel Angel Hernan, and Babette Brumback (2000). Marginal structural models and causal inference in epidemiology. Epidemiology, pp. 550-560.
[Rosenbaum 1983] Paul R. Rosenbaum and Donald B. Rubin (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, vol. 70, no. 1, pp. 41-55.
[Rosenbaum 2002] Paul R. Rosenbaum (2002). Observational studies. Springer.
[Rubin 1974] Donald B. Rubin (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.
[Rubin 2007] Donald B. Rubin (2007). The design versus the analysis of observational studies for causal effects: parallels with the design of randomized studies. Statistics in Medicine, 26:20-36.
[Rusmevichientong 2010] Paat Rusmevichientong and John N. Tsitsiklis (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35:395-411.
[Rzepakowski 2010] Piotr Rzepakowski and Szymon Jaroszewicz (2010). Decision trees for uplift modeling. 2010 IEEE International Conference on Data Mining, pp. 441-450.
[Rzepakowski 2012] Piotr Rzepakowski and Szymon Jaroszewicz (2012). Decision trees for uplift modeling with single and multiple treatments. Knowledge and Information Systems, 32(2):303-327.
[Rzepakowski 2015] Michal Soltys, Szymon Jaroszewicz, and Piotr Rzepakowski (2015). Ensemble methods for uplift modeling. Data Mining and Knowledge Discovery, 29:1531-1559.
[Sekhon 2008] Jasjeet Sekhon (2008). The Neyman-Rubin model of causal inference and estimation via matching methods. The Oxford Handbook of Political Methodology, pp. 271-299.
[Silverstein 2000] Craig Silverstein, Sergey Brin, Rajeev Motwani, and Jeff Ullman (2000). Scalable techniques for mining causal structures. Data Mining and Knowledge Discovery, 4(2-3):163-192.
[Scott 2005] Clayton Scott (2005). Tree pruning with subadditive penalties. IEEE Transactions on Signal Processing, vol. 53, no. 12, pp. 4518-4525.
[Stukel 2007] Therese A. Stukel, Elliott S. Fisher, David E. Wennberg, David A. Alter, Daniel J. Gottlieb, and Marian J. Vermeulen (2007). Analysis of observational studies in the presence of treatment selection bias: effects of invasive cardiac management on AMI survival using propensity score and instrumental variable methods. JAMA, 297:278-285.
[Su 2012] Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A. Levine, and Xin Yan (2012). Facilitating score and causal inference trees for large observational studies. The Journal of Machine Learning Research, 13(10):2955-2994.
[Wager 2017] Stefan Wager and Susan Athey (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, forthcoming.
[Wang 2005a] Chih-Chun Wang, Sanjeev R. Kulkarni, and H. Vincent Poor (2005a). Arbitrary side observations in bandit problems. Advances in Applied Mathematics, vol. 34, no. 4, pp. 903-938.
[Wang 2005b] Chih-Chun Wang, Sanjeev R. Kulkarni, and H. Vincent Poor (2005b). Bandit problems with side observations. IEEE Transactions on Automatic Control, vol. 50, no. 3, pp. 338-355.
[Wong 1978] Stanley Wong (1978). Foundations of Paul Samuelson's Revealed Preference Theory: A Study by the Method of Rational Reconstruction. Routledge, ISBN 0-7100-8643-1.
[Zaniewicz 2013] Lukasz Zaniewicz and Szymon Jaroszewicz (2013). Support vector machines for uplift modeling. IEEE 13th International Conference on Data Mining Workshops, pp. 131-138.
[Zhao 2017] Yan Zhao, Xiao Fang, and David Simchi-Levi (2017). A Practically Competitive and Provably Consistent Algorithm for Uplift Modeling. IEEE International Conference on Data Mining.