Uplift Modeling for Randomized Experiments and
Observational Studies
by
Xiao Fang
B.S., University of Science and Technology of China (2011)
S.M., Massachusetts Institute of Technology (2013)
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2018
© Massachusetts Institute of Technology 2018. All rights reserved.
Author ......................... Signature redacted
Department of Electrical Engineering and Computer Science
December 1, 2017
Certified by ........... Signature redacted ............... David Simchi-Levi
Professor, Institute of Data, Systems, and Society,
Department of Civil and Environmental Engineering,
Operations Research Center
Thesis Supervisor
Accepted by .............. Signature redacted ............... Leslie A. Kolodziejski
Professor of Electrical Engineering and Computer Science
Chair, Department Committee on Graduate Students
Uplift Modeling for Randomized Experiments and
Observational Studies
by
Xiao Fang
Submitted to the Department of Electrical Engineering and Computer Science
on December 1, 2017, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy in Computer Science
Abstract
Uplift modeling refers to the problem of identifying, from a set of treatments, the candidate that leads to the most desirable outcome based on subject characteristics. Most work in the last century focuses on the average effect of a treatment across a population of interest but ignores subject heterogeneity, which is very common in the real world. Recently there has been an explosion of empirical settings that make it possible to infer individualized treatment responses.
We first consider the problem with data from randomized experiments. We put forward an unbiased estimate of the expected response, which makes it possible to evaluate an uplift model with multiple treatments. This is the first evaluation metric for uplift models in the literature that aligns with the problem objective. Based on this evaluation metric, we design an ensemble tree-based algorithm (CTS) for uplift modeling. The splitting criterion and termination conditions are derived with consideration of the special structure of the uplift modeling problem. Experimental results on synthetic data and industry data show the advantage of our specialized uplift modeling algorithm over the separate model approach and other existing uplift modeling algorithms.
We next prove the asymptotic properties of a simplified CTS algorithm. The exhaustive search for locally optimal splitting points makes it difficult to theoretically analyze tree-based algorithms. Thus we adapt dyadic splits to the CTS algorithm and obtain a bound on the regret, the expected performance difference between the oracle and our algorithm. The convergence rate of the regret depends on the feature dimension, which emphasizes the importance of feature selection. While model performance usually improves with the number of features, it requires exponentially more data to approximate the optimal treatment rule. Choosing the appropriate model complexity and selecting the most powerful features are critical to achieving desirable performance.
Finally we study the uplift modeling problem in the context of observational studies. In observational studies, treatment selection is influenced by subject characteristics. As a result, baseline characteristics often differ systematically between treatments, so confounding factors need to be untangled before valid predictions can be made. We combine a modification of the standard feed-forward architecture with our CTS algorithm to optimize predictive accuracy and minimize the feature distribution distance between treatments. Experimental results on synthetic data show that the combination of neural network feature representation and an ensemble tree-based model is promising for handling real-world problems.
Thesis Supervisor: David Simchi-Levi
Title: Professor, Institute of Data, Systems, and Society,
Department of Civil and Environmental Engineering,
Operations Research Center
This thesis is dedicated to my grandmother, Caichun Hong,
January 12, 1933-September 1, 2017.
Acknowledgments
This thesis is a reflection of my rewarding journey at MIT. I am fortunate and privi-
leged to be accompanied and supported by many incredible people.
First and foremost, I would like to express my deepest gratitude to my advisor,
Professor David Simchi-Levi. His keen vision of research and high professional stan-
dard has made a huge influence on me. David has taught me how to be an effective
researcher and the art of guiding research projects from start to finish. David gave
me a lot of freedom in exploring new ideas and experimenting with new methods,
while provided critical insights to guide the direction. This thesis would not have
been possible without his guidance, support and encouragement.
I would like to thank my thesis committee members, Prof. Peter Szolovits and
Prof. Luca Daniel, for taking the time to serve on my committee, and for their critical
insights that helped improve the quality of this thesis. I greatly appreciate their
commitment and flexibility amid such busy schedules.
I am also lucky to have collaborated with wonderful people during my PhD. The
work in Chapters 2 and 3 was done in close collaboration with Yan Zhao. Together we
discussed research, wrote code, and wrote papers. We have been through a lot of ups
and downs, paper rejections and acceptances. Conversations with Yan, whether on
research or on life, always brought me a lot of inspiration and fun. I cannot ask for
a better partner and friend. This thesis was funded by the MIT-Accenture Alliance,
which gave me the opportunity to work with numerous amazing people at Accenture:
Jason Coverston, Matthew O'Kane, and Andrew Fano. Their business perspective
helped shape the project.
I would like to express my gratitude to all my friends, both near and far. Their
friendship and support always make my life enjoyable. Special thanks to Xue Feng,
my ex-officemate, ex-roommate, and workmate-to-be, who has become my closest
friend after traveling the world together and sharing moods at different stages of life.
To my friends at MIT, Yin Wang, Tianli Zhou, Linsen Chong, Nate Bailey, Yiqun Hu,
and Peter Zhang: it has been my great pleasure to share the office space, together
with so much joy, with them. All the days and nights spent with them in and outside
the office are the best memories of my PhD life.
Finally, I owe the deepest appreciation to my parents for their unconditional love
and constant support. They instilled in me the value of education and a love of
learning at a very young age. They always encourage me to pursue my dreams with
their best wishes, even though I have to be thousands of miles away. None of my
achievements would be possible without their understanding.
Contents
1 Introduction 17
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.1 Uplift Modeling Problem . . . . . . . . . . . . . . . . . . . . . 17
1.1.2 Challenges and Literature Review . . . . . . . . . . . . . . . . 18
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 Uplift Modeling for Randomized Experiments 27
2.1 Problem Formulation and Model Evaluation . . . . . . . . . . . . . . 27
2.1.1 Problem Formulation and Notation . . . . . . . . . . . . . . . 27
2.1.2 Model Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 The CTS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.1 Advantage of CTS Over SMA . . . . . . . . . . . . . . . . . . 37
2.3.2 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.3 Priority Boarding Data . . . . . . . . . . . . . . . . . . . . . . 43
2.3.4 Seat Reservation Data . . . . . . . . . . . . . . . . . . . . . . 47
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3 Theoretical Analysis 55
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.1.1 Multi-Armed Bandit Problem . . . . . . . . . . . . . . . . . . 55
3.1.2 Tree-Based Algorithm . . . . . . . . . . . . . . . . . . . . . . 56
3.2 Consistency Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3 Asymptotic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.1 Algorithm CTS.0 . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3.2 Upper Bound on Expected Regret . . . . . . . . . . . . . . . . 67
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4 Uplift Modeling for Observational Studies 77
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.1 Synthetic Data 1 . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3.2 Synthetic Data 2 . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3.3 Simulation based on Real Data . . . . . . . . . . . . . . . . . 88
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5 Conclusion 91
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A Parameter Selection for CTS 95
B Variable Selection 97
C Parameter Selection for NN-CTS 99
List of Figures
2-1 An example of uplift curve [Rzepakowski 2012] . . . . . . . . . . . . . 30
2-2 True decision boundary of optimal treatment on the 2-D example . . 38
2-3 Comparison of performance between different approaches on the 2-D
example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2-4 Comparison of decision boundary between different approaches on the
2-D example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2-5 Averaged expected response of different algorithms on the synthetic
data. The error bars are 95% margin of error computed with results
from 10 training datasets. For each data size, all algorithms are trained
with the same 10 datasets. . . . . . . . . . . . . . . . . . . . . . . . . 42
2-6 Change in revenue per passenger with respect to passenger features . 44
2-7 Expected revenue per passenger from priority boarding based on dif-
ferent models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2-8 Modified uplift curves of different algorithms for the priority boarding
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2-9 Expected summed response per passenger from priority boarding with
the consideration of user utility. . . . . . . . . . . . . . . . . . . . . . 48
2-10 Expected response per passenger from priority boarding with the con-
sideration of user utility. . . . . . . . . . . . . . . . . . . . . . . . . . 49
2-11 Expected revenue per passenger from seat reservation when applying
different pricing models. . . . . . . . . . . . . . . . . . . . . . . . . . 51
2-12 Modified uplift curves of SMA-RF and CTS on the seat reservation data. 52
3-1 The optimal treatment rule for the 2D example. The vertical boundary
along X1 axis is located at X1 = 50. The vertical axis X2 is a discrete
variable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3-2 The treatment rule reconstructed by UCTS and CTS for the 2D ex-
ample. Plots on each are generated from the same training set. The
horizontal axis is feature X1 and the vertical axis feature X2 . The la-
bels and ticks of the axes are the same as those in Fig. 3-1 and omitted
here for simplicity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3-3 Average expected response under UCTS and CTS models for the 2D
example computed from 50 training sets with 95% confidence interval
shown as the error bar. . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3-4 Illustration of discretization error in 2-D space. The diagonal curve
represents the true decision boundary, with left bottom part assigned
to treatment yellow and right upper part assigned to treatment blue. 64
3-5 An example of feature space partition and treatment assignment for
Algorithm CTS.0 when X^d = [0, 1]^2 and M = 10. Subspace #1 is
assigned to treatment yellow and subspace #2 to an arbitrary treatment. 65
4-1 Causal graphical models of randomized experiment (left) and observa-
tional studies (right) . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4-2 Neural network architecture for uplift modeling . . . . . . . . . . . . 81
4-3 Averaged expected response of different algorithms on the synthetic
data simulating observational studies. The error bars are 95% margin
of error computed with results from 10 training datasets. For each
data size, all algorithms are trained with the same 10 datasets. ..... 84
4-4 Feature distribution after t-distributed stochastic neighbor embedding
(t-SN E ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4-5 Averaged expected response of different algorithms on the synthetic
data simulating randomized experiments. The error bars are 95% mar-
gin of error computed with results from 10 training datasets. For each
data size, all algorithms are trained with the same 10 datasets. ..... 87
5-1 The change of average revenue for different treatments over time with
90% confidence band (red: default price, blue: high price) . . . . . . . 92
List of Tables
1.1 Literature on uplift modeling . . . . . . . . . . . . . . . . . . . . . . 22
4.1 Standard errors of 10 repeated experiments for datasets simulating
observational studies . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Standard errors of 10 repeated experiments simulating randomized ex-
periment data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.3 Standard errors for 10 repeated experiments on IHDP dataset . . . . 89
B.1 Variable selection on seat reservation data . . . . . . . . . . . . . . . 97
C.1 Parameters and ranges . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Chapter 1
Introduction
1.1 Background
1.1.1 Uplift Modeling Problem
In the real world, we often face the treatment selection problem, where we need to
identify from a set of alternatives the candidate that leads to the most desirable
outcome. For example, airline companies need to set the price that generates the
highest revenue for priority boarding; doctors want to know which medication is
most effective at reducing blood pressure.
Most work in the last century focuses on the average effect of a treatment across a
population of interest but ignores subject heterogeneity, which is very common in
the real world. A price level that leads to the highest overall revenue might put off
passengers in some subpopulation. A medical treatment that is the most effective
over the entire patient population may be ineffective or even detrimental for patients
with certain conditions. This limitation exists because historically most datasets have
been too small to meaningfully explore heterogeneity of treatment response; recently,
however, there has been an explosion of empirical settings that make it possible to
infer individualized treatment responses based on subject characteristics.
The prediction of the optimal treatment based on subject characteristics is of great
interest and has critical applications. This personalized treatment selection problem,
also referred to as Uplift Modeling in the literature, is the main topic of this thesis.
1.1.2 Challenges and Literature Review
Neyman [Neyman 1935] and Fisher [Fisher 1935] discussed the use of randomized ex-
periments to attribute difference in responses to different treatments. Randomized
experiments, also called randomized controlled trials, are considered the gold stan-
dard approach for estimating treatment effects. In a randomized experiment, subjects
are randomly assigned a treatment and their responses are recorded. Uplift model-
ing also crucially relies on randomized experiments. Random treatment allocation
ensures that treatment status will not be confounded with either measured or un-
measured baseline characteristics. Therefore, the treatment effects can be estimated
by comparing responses between different treatments [Greenland 1999].
At the same time, a widespread accumulation of observational data is available
in many fields such as healthcare, economics, and education. For data collected in
observational studies, we do not have control over, or even a full understanding of,
the mechanism by which treatment is assigned to each subject, while subject charac-
teristics, treatment, and outcomes are all recorded. For example, a dataset may record
a patient's diabetic status under one of two anti-diabetic medications, as well as other
lab test results. However, which medication is taken by the patient may depend on
the patient's gender, past diagnoses, or economic status, which is unknown to us.
Although we do not control the treatment assignment procedure, we still want to
make use of these data to predict the optimal treatment given subject characteristics;
doing so requires more care than with data obtained from randomized experiments.
Despite the fact that every training example is accompanied by a response value,
uplift modeling is different from conventional classification/regression problems. This
key difference poses unique challenges. In a classification/regression problem, one
trains different models and selects the one that yields the most reliable prediction.
This requires sensible cross-validation strategies along with potential feature engi-
neering: we separate data into a training set and a test set, learn on the training
data, predict responses on the test data, and compare to the ground truth. However,
data collected for uplift modeling is effectively unlabeled, because we can only observe
the response under the assigned treatment and will never know the responses under
alternative treatments; thus it is impossible to know whether the chosen treatment is
optimal for any individual subject. This is known as the Fundamental Problem of
Causal Inference [Holland 1986]. The lack of the true value of the quantity that we
are trying to predict (the optimal treatment) makes it impossible to use standard
machine learning algorithms to estimate the optimal treatment directly.
One approach to Uplift Modeling is the Separate Model Approach (SMA). Data
is separated into groups by treatments, and a prediction model is built for each
treatment. When a new subject arrives, we assign the treatment that has the best
predicted response.
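The two-step recipe just described, fit one response model per treatment, then give each new subject the treatment with the highest predicted response, can be sketched as follows. This is an illustrative sketch, not the thesis's code; a plain least-squares fit stands in for whatever per-treatment regressor one prefers.

```python
import numpy as np

def fit_sma(X, t, y):
    """Separate Model Approach: fit one linear response model
    (ordinary least squares, with intercept) per treatment."""
    models = {}
    for k in np.unique(t):
        mask = (t == k)
        A = np.hstack([X[mask], np.ones((mask.sum(), 1))])
        coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        models[k] = coef
    return models

def sma_assign(models, X_new):
    """Assign each subject the treatment with the highest predicted response."""
    A = np.hstack([X_new, np.ones((X_new.shape[0], 1))])
    keys = sorted(models)
    preds = np.column_stack([A @ models[k] for k in keys])
    return np.array(keys)[preds.argmax(axis=1)]
```

On noiseless data with linear per-treatment responses this recovers the optimal rule exactly; the weaknesses of SMA discussed below appear once the uplift signal is much weaker than the response itself.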
The main advantage of this approach is its simplicity. It does not require devel-
opment of new algorithms and software, and any state-of-the-art machine learning
algorithm can be used in either a regression setting or a classification one. If uplift,
the difference in response between treatments, is strongly correlated with the class
attribute itself, or if the amount of training data is sufficient for the models to predict
response accurately, SMA will perform very well. We have seen applications of SMA
in [Hansotia 2002] for direct marketing and in [Manahan 2005] for customer retention.
However, when the uplift signal is much weaker than the response and the uplift
follows a distinct pattern from the response, SMA is often outperformed by other
methods [Lo 2002][Radcliffe 2011]. One reason is the mismatch between the training
objective and the problem's purpose. SMA trains models to accurately predict the
response under a single treatment, but what really matters is accurately identifying
the treatment with the best response. Since data is always noisy and insufficient,
there is error in each model trained under a single treatment, and these errors accu-
mulate when the predicted responses are compared to estimate uplift and select the
optimal treatment. [Radcliffe 2011] illustrates this phenomenon in a simulation study.
SMA will also lead to misguided feature selection. For example, the separate models
will prioritize features that have high predictive power for the response under an
individual treatment, while ignoring features that best predict the difference in re-
sponses across treatments. When the cross-treatment feature effect is relatively weak
compared to the individual-treatment feature effect and the training data is not abun-
dant, the separate model approach will focus more on the individual-treatment feature
effect, which is not what we want.
Because of the disadvantages of the Separate Model Approach, a number of al-
gorithms have been proposed to directly model the uplift effect. One approach is
the Class Transformation method introduced in [Jaskowski 2012]. In the case of a
binary response variable and a dataset balanced between treatments, it creates a new
target variable, which can be calculated from training data, and builds a relation-
ship between the new variable and the individualized treatment effect. However,
these two assumptions make it too restrictive. [Athey 2016] generalized it to unbal-
anced treatment assignment and continuous response variables by incorporating the
propensity score.
Much more research aims to model uplift directly through modification of
well-known classification machine learning algorithms, both parametric and non-
parametric. [Lo 2002] proposed a logistic regression formulation which explicitly
includes interaction terms between features and treatments. [Zaniewicz 2013]
adapts Support Vector Machines to predict the treatment effect. Nonparametric meth-
ods are also broadly used and have achieved good performance in real-world applica-
tions. [Alemi 2009] and [Su 2012] used an adaptation of K-Nearest Neighbors for uplift
modeling: treatment assignment is based on the empirically best treatment as measured
on the K training samples closest to each subject. The most popular and successful
approach is tree-based algorithms. [Chickering 2000] modifies the standard decision
tree construction procedure [Breiman 1984] by forcing a split on the treatment at each
leaf node. [Hansotia 2002] employs a splitting criterion that maximizes the difference
between the treatment-control probability differences in the left and right child
nodes. [Rzepakowski 2010] chooses splitting points that maximize the distributional
difference between the two child nodes as measured by weighted Kullback-Leibler
divergence and weighted squared Euclidean distance. [Radcliffe 2011] fits a
linear model to each candidate split and uses the significance of the interaction term
as a measure of split quality. [Guelman 2014a] selects as the splitting variable the one
whose interaction with the treatment has the smallest p-value in a hypothesis test;
the splitting point is then selected to maximize the interaction effect.
[Rzepakowski 2015] uses Bagging or Random Forests on uplift trees
and demonstrates experimentally that this results in significant performance improve-
ment.
However, all the above research can only handle binary treatments, which limits
its application to real-world problems. Literature on the more general multi-
treatment uplift problem is limited. The causal K-nearest neighbors method, originally
intended for a single treatment in [Alemi 2009] and [Su 2012], can be naturally gener-
alized to multiple treatments. [Rzepakowski 2012] extends the tree-based algorithm
described in [Rzepakowski 2010] to the multiple-treatment case by using the weighted
sum of pairwise distributional divergences as the splitting criterion. [Chen] proposes
a multinomial logit formulation in which treatments are incorporated as binary fea-
tures, with interaction terms between treatments and features included explicitly.
Finite-sample convergence guarantees for the model parameters and an out-of-sample
performance guarantee are also presented. Both [Rzepakowski 2012] and [Chen] can
handle both binary and discrete responses. In Table 1.1, we list the existing literature
in this field and classify it by whether the response is binary, discrete, or continuous
and whether there is a single treatment or multiple treatments. This thesis fills the
gap by handling multiple treatments and either discrete or continuous responses.
              Single treatment                        Multiple treatments
              (treatment vs control)

Binary        Chickering (2000), Lo (2002),           Rzepakowski (2012),
Response      Hansotia (2002), Alemi (2009),          Chen et al. (2015)
              Rzepakowski (2010), Radcliffe (2011),
              Su (2012), Zaniewicz (2013),
              Guelman (2014)

Discrete      Alemi (2009), Radcliffe (2011),         Rzepakowski (2012),
Response      Rzepakowski (2012), Su (2012)           Chen et al. (2015)

Continuous    Alemi (2009), Su (2012),                This Thesis
Response      Radcliffe (2011)

Table 1.1: Literature on uplift modeling
One critical challenge in uplift modeling is how to accurately evaluate model
performance offline using experimental data, given the counterfactual nature of the
problem. In the current literature, Qini curves [Radcliffe 2007] and uplift curves
[Rzepakowski 2010] have been used in the binary-treatment case. The uplift curve
measures the gain in cumulative response as a function of the percentage of the
population targeted. This is useful for seeing whether the treatment has a globally
positive or negative effect when targeting part of the population, but it does not
tell us the expected response if treatments are assigned following the uplift model.
Being unable to measure the key quantity that the problem actually aims to maximize
makes it impossible to do parameter tuning and model selection.
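For concreteness, here is a minimal sketch of how an uplift curve of the kind described above can be computed for a binary treatment. The exact normalization varies across papers; this version, an assumption rather than necessarily the convention of [Rzepakowski 2010], estimates the incremental response at each targeted fraction as (treated mean minus control mean) times the number targeted.

```python
import numpy as np

def uplift_curve(score, t, y, n_points=10):
    """One common form of uplift curve for a binary treatment (t in {0, 1}).
    Subjects are ranked by predicted uplift `score` (descending); at each
    targeted fraction we estimate the incremental response among the targeted
    group as (treated mean - control mean) * number targeted."""
    order = np.argsort(-score)
    t, y = t[order], y[order]
    n = len(score)
    fracs, gains = [], []
    for f in np.linspace(0.1, 1.0, n_points):
        m = int(round(f * n))
        treated = y[:m][t[:m] == 1]
        control = y[:m][t[:m] == 0]
        gain = 0.0
        if len(treated) and len(control):
            gain = (treated.mean() - control.mean()) * m
        fracs.append(f)
        gains.append(gain)
    return np.array(fracs), np.array(gains)
```

Note that the curve summarizes the global gain from targeting, which is exactly why it cannot substitute for the expected-response metric introduced in this thesis.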
We face further challenges when encountering data from observational studies.
Cochran defined an observational study as an empirical investigation in which the
"objective is to elucidate cause-and-effect relationships ... [in settings in which]
it is not feasible to use controlled experimentation, in the sense of being able to im-
pose the procedures or treatments whose effects it is desired to discover, or to assign
subjects at random to different procedures" [Cochran 1965]. Lack of randomization
in observational studies may result in large differences in the observed subject charac-
teristics between treatments. These differences can lead to biased estimates of
treatment effects. The observed treatments depend on variables which might also
affect the response, and this results in confounding.
The propensity score, first defined by [Rosenbaum 1983] as the probability of treatment
assignment conditional on subject characteristics, allows one to design and analyze an
observational study so that it mimics a randomized experiment. The propensity score
is a balancing score: among a set of subjects whose propensity scores are the same,
the distribution of observed subject characteristics will be the same across treatment
groups. While in randomized experiments the true propensity score is defined
by the study design, it is not known in observational studies. Logistic regression is
frequently used to estimate it, with the treatment variable as the outcome
and subject characteristics as the predictor variables. Different methods using the
propensity score, like matching, stratification, or regression adjustment, can produce
unbiased estimates of treatment effects [Austin 2011]. However, these methods
can only estimate the average treatment effect over the whole population rather than
an individual-level effect, and they do not attempt to directly model the relationship
between characteristics, treatments, and responses. Given these limitations, we would
like a method that integrates feature balancing and response prediction well and is
able to predict the optimal treatment at the individual level.
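As an illustration of the standard machinery just described (not the method this thesis proposes), the following sketch estimates propensity scores with a hand-rolled logistic regression and uses them in an inverse-propensity-weighted (IPW) estimate of the average treatment effect. The gradient-descent settings and the clipping threshold are arbitrary assumptions.

```python
import numpy as np

def fit_propensity(X, t, lr=0.1, n_iter=2000):
    """Logistic regression for P(T=1 | X), fit by plain gradient descent
    (binary t in {0, 1}); returns the weight vector (last entry = intercept)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ w))
        w -= lr * A.T @ (p - t) / len(t)
    return w

def propensity(w, X):
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return 1.0 / (1.0 + np.exp(-A @ w))

def ipw_ate(X, t, y, w):
    """Inverse-propensity-weighted estimate of the average treatment effect.
    Clipping avoids exploding weights for extreme propensity estimates."""
    e = np.clip(propensity(w, X), 1e-3, 1 - 1e-3)
    return np.mean(t * y / e - (1 - t) * y / (1 - e))
```

As the text notes, such an estimate recovers only the population-average effect; it says nothing about which treatment is best for an individual subject, which is what the method of Chapter 4 targets.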
1.2 Overview
We now describe the contributions of this thesis. First, we put forward an unbiased
estimate of the expected response for an uplift model with data collected from randomized
experiments. This evaluation metric works for an arbitrary number of treatments and
general response types, and does not require the treatment probabilities to be evenly
distributed. This opens up the application of uplift modeling to broader areas. We
further introduce the modified uplift curve, which provides a consistent illustration of
the uplift ability of a model.
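The estimate itself is defined formally in Chapter 2. As a hedged preview of the general idea, an inverse-probability form like the following is unbiased for E[Y | T = h(X)] under randomization with known treatment probabilities; whether this matches the thesis's exact formula is an assumption here.

```python
import numpy as np

def expected_response_estimate(y, t, h_x, p):
    """Estimate of E[Y | T = h(X)] from randomized-experiment logs.
    y   : observed responses
    t   : treatments actually assigned (randomly, with known probabilities)
    h_x : treatments the uplift model h would assign
    p   : p[i] = probability that subject i received treatment t[i]
    Only subjects whose logged treatment matches the model's choice
    contribute, each reweighted by 1/p[i] to remove the selection effect."""
    match = (t == h_x).astype(float)
    return float(np.mean(match * y / p))
```

Because the estimate is a simple average of observed quantities, it supports exactly the parameter tuning and model selection that the uplift curve alone cannot.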
We then propose a tree-construction procedure with a splitting criterion that ex-
plicitly optimizes the performance of the tree as measured on the training data, and
a set of termination conditions designed to avoid some extreme cases. An ensemble
of trees is used to mitigate the overfitting problem that commonly occurs with a
single tree. We refer to our algorithm as the CTS algorithm, where the name stands
for Contextual Treatment Selection. The performance of CTS is tested on three
benchmark data sets. The first is a 50-dimensional synthetic data set. The latter two
are randomized experiment data provided by our industry collaborators. On all of
the data sets, CTS demonstrates superior performance compared to other applicable
methods, which include the Separate Model Approach with Random Forest/Support
Vector Regression/K-Nearest Neighbors/AdaBoost, and Uplift Random Forest
(upliftRF) as implemented in the R uplift package [Guelman 2014].
To obtain the asymptotic properties of the tree-based algorithm, we adopt the idea
of dyadic decision trees, which split only at the midpoint of each dimension. This
family of decision trees can approximate complex decision boundaries while making
the theoretical analysis tractable. With the help of the special structure of dyadic
decision trees, we prove that a simplified algorithm is asymptotically optimal under
mild regularity conditions, with a convergence rate dependent on the data size and
feature dimension.
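A dyadic split of the kind just described is simple enough to sketch in a few lines: each node splits a hyper-rectangular cell exactly at the midpoint of one coordinate, so the set of reachable cells is fixed in advance. This is an illustrative fragment, not the simplified CTS implementation analyzed in Chapter 3.

```python
def dyadic_split(cell, dim):
    """Split a hyper-rectangle at the midpoint of one dimension.
    `cell` is a list of (lo, hi) intervals, one per feature; returns the
    two child cells. Unlike an exhaustive-search split, the split point
    is deterministic, which is what makes the analysis tractable."""
    lo, hi = cell[dim]
    mid = (lo + hi) / 2.0
    left, right = list(cell), list(cell)
    left[dim] = (lo, mid)
    right[dim] = (mid, hi)
    return left, right
```

Restricting splits to this dyadic family is exactly the trade-off discussed above: a fixed, analyzable partition structure in exchange for giving up locally optimal split points.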
In contrast to expensive experimental data, observational data is often available in
much larger quantities, especially in fields such as medicine, economics, and sociology.
In the case of observational studies, we use a modified neural network architecture to
achieve a balanced feature distribution. The new features are fed as input to the ensemble
tree model to predict the optimal treatment. Experimental results on synthetic data
show that the combination of neural network feature representation and tree-based
uplift modeling achieves better performance than state-of-the-art approaches.
Moreover, it shows an advantage on randomized experiment data when the training
dataset is small. This property is particularly helpful at the early stage of randomized
experiments.
The remainder of the thesis is organized as follows. In Chapter 2, we first introduce
the formulation of multi-treatment uplift modeling and present the unbiased esti-
mate of the expected response for an uplift model, then describe the CTS algorithm
in detail. The setup and results of the experimental evaluation are presented
afterwards. The modified uplift curve is also introduced in this chapter. Chapter 3
presents a generic treatment selection algorithm and the analysis of its convergence
properties. We then introduce the algorithm combining feature representation and
CTS for uplift modeling in observational studies in Chapter 4. Finally, we provide
some concluding remarks in Chapter 5.
Chapter 2
Uplift Modeling for Randomized
Experiments
2.1 Problem Formulation and Model Evaluation
We first describe the mathematical formulation of uplift problems and the notation
used throughout this thesis.¹ A new method to evaluate uplift modeling is introduced
afterwards.
2.1.1 Problem Formulation and Notation
During the experiment to collect data, we record the features X ∈ X^d, the assigned
treatment T ∈ {1, ..., K}, and the response Y for each subject. In the feature
vector, we use subscripts to indicate specific features, i.e., X_j is the jth feature and x_j
its realization. For each subject, there are K potential responses, one under each potential
treatment, denoted Y_1, Y_2, ..., Y_K. The larger the value of Y, the
more desirable the outcome. The value of Y is decided by X and T, and we care
about the conditional expectation

    μ(x, t) = E[Y | X = x, T = t]    (2.1)

¹ Upper case letters are used to denote random variables and lower case letters their realizations; boldface for vectors and normal typeface for scalars.
In the price optimization example mentioned in the Introduction, where the company
wants to maximize revenue, X would be the passenger's characteristics, such as
flight date/time, flight fare, the number of people traveling together, etc.; T would be
the different price levels; and Y would be the revenue.
We make the assumption that the treatment assignment T is independent of the
K potential outcomes conditional on X. This property is known as unconfoundedness [?]

    {Y_1, Y_2, ..., Y_K} ⫫ T | X

Unconfoundedness makes the estimation of counterfactual outcomes from nearest-neighbor
matching and other local methods consistent in general.
Suppose we have a dataset S_N = {(x^(i), t^(i), y^(i)), i = 1, ..., N} containing N
joint realizations of (X, T, Y) from a randomized experiment. The superscript (i)
indexes the samples. The objective of uplift modeling is to learn a mapping h from
the feature space to the treatment space, h: X^d → {1, ..., K}, such that the expected
response conditional on this mapping, E[Y | T = h(X)], is maximized.
As with classification and regression problems, the optimal expected response
achievable in an uplift problem is determined by the underlying data generation
process. The optimal treatment assignment rule h* satisfies the point-wise optimality
condition, i.e., for all x ∈ X^d,

    h*(x) ∈ arg max_{t=1,...,K} E[Y | X = x, T = t].    (2.2)

We would like to learn an uplift model with a treatment rule h that is close to h*.
An evaluation metric is a prerequisite to measure the gap between the two and to
guide model training.
2.1.2 Model Evaluation
Evaluation of uplift models is harder than in the more standard supervised machine
learning setting. In uplift modeling, we can only observe the subject response under
one and only one treatment, and we never know the ground-truth optimal treatment
for each subject. The only way to obtain ground truth is to actually run the algorithm
in a live experiment, which is economically inefficient or practically infeasible. Even
when it is possible to apply the model in a live experiment, it would be helpful to
get a sense of approximately how the model will perform before conducting it.
Therefore, we need an evaluation metric that uses offline data collected from a
previous randomized experiment.
Because it is impossible to observe responses under all the treatments for
an individual, it is difficult to define a loss measure for each observation. As a result,
most of the uplift literature resorts to aggregated measures, such as bins of uplifts
in [Ascarza 2016], uplift curves in [Rzepakowski 2012], [Rzepakowski 2015] and
[Nassif 2013], and the Qini measure in [Radcliffe 2007]. The uplift curve was put forward
in the context of a binary treatment, where one sample of the population is selected and
subjected to the treatment, called the treatment set, and another selected sample is not
subjected to the treatment, called the control set. We set aside a test set for the treatment
and control groups and score them using the model. We subtract the response on the
p percent highest-scored subjects in the control set from the response on the p percent
highest-scored subjects in the treatment set as an estimate of the gain from performing
the treatment on the p percent most preferable subjects. The result of such a subtraction,
as p varies, is called an uplift curve, with an example shown in Fig. 2-1. The straight line with a
positive slope represents the beneficial global effect if the whole population is offered the
treatment randomly. In the uplift curve, each point corresponds to the inferred uplift
gain; the higher this value, the better the model. The uplift curve is useful to see whether the
treatment has a globally positive or negative effect, and it helps us choose the percentage
that maximizes the gain as the limit of the population to be targeted. The continuity
of uplift curves makes it possible to calculate the area under the curve as a way
to evaluate and compare different uplift models. This measure is thus similar to the
well-known AUC (area under the ROC curve) in a binary classification setup.
The problem with the uplift curve is that there is no guarantee that the highest-scoring
examples in the treatment and control groups are similar. Moreover, the uplift curve is
Figure 2-1: An example of an uplift curve [Rzepakowski 2012]
not an ideal evaluation metric, as it does not measure the key quantity of interest: the
expected response. Take airline pricing as an example. Suppose that in the training data,
the average revenue for the control group is $4.5 and that for the treatment group is
$5.1. We are not able to tell from the uplift curve what the average revenue will be if
we follow the model and offer each customer the price that the model predicts will
generate higher revenue. This raises the need for an evaluation metric that aligns
directly with the problem objective.
The new evaluation metric is based on the introduction of a new random variable,
which can be calculated from recorded samples. When a randomized experiment is
conducted, treatments are assigned randomly and independently of the features.
Suppose the probability that a subject is assigned to treatment t is p_t. Note that
treatments are not necessarily evenly distributed. This happens, in particular, when
we would like to learn the effect of a treatment over a default one, and set aside a small
portion of the population to be exposed to the new treatment while keeping the majority
unaffected by the experiment. We assume that p_t > 0 for t = 1, ..., K to exclude
trivial cases. See Lemma 1 for the definition and its property.
Lemma 1. Given an uplift model h, define a new random variable

    Z = Σ_{t=1}^{K} (Y / p_t) · 1{h(X) = t} · 1{T = t}    (2.3)

where 1{·} is the 0/1 indicator function. Then

    E[Z] = E[Y | T = h(X)].

Proof. The proof is straightforward using the law of total expectation.

    E[Z] = Σ_{t=1}^{K} E[(Y / p_t) · 1{h(X) = t} | T = t] · P{T = t}
         = Σ_{t=1}^{K} E[Y · 1{h(X) = t} | T = t]
         = Σ_{t=1}^{K} E[Y | h(X) = t, T = t] · P{h(X) = t}
         = E[Y | h(X) = T].
The definition of Z consists of K terms, of which at most one is nonzero, and the
nonzero term can only occur when the assigned treatment matches the predicted
treatment. Eq. (2.4) gives a more explicit expression for the calculation of z from
an observed sample {x, t, y}:

    z = 0          if t ≠ h(x),
    z = y / p_t    if t = h(x).    (2.4)
Lemma 1 establishes the relationship between the new variable and the expected response
our problem aims to optimize. This, together with the fact that the sample
average is an unbiased estimate of the expectation, leads to the following theorem.

Theorem 2. The sample average

    z̄ = (1/N) Σ_{i=1}^{N} z^(i)    (2.5)

is an unbiased estimate of E[Y | T = h(X)].
To our knowledge, this is the first evaluation method applicable to evaluating uplift
models offline. Moreover, we can compute the confidence interval of z̄, which helps
to estimate the plausible range of E[Y | T = h(X)]. With the definition of Z, we are
able to evaluate model performance offline on an existing dataset obtained from a
randomized experiment before applying the model in a live experiment. This guidance
is particularly valuable when live experimentation is costly.
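To make the estimator concrete, here is a minimal sketch in Python (the function name and array-based interface are our own, not from the thesis) that computes z for each sample via Eq. (2.4) and averages them per Eq. (2.5):

```python
import numpy as np

def expected_response_estimate(t, y, predicted, p_assign):
    """Unbiased estimate of E[Y | T = h(X)] (Theorem 2).

    t         : (N,) treatments assigned in the randomized experiment
    y         : (N,) observed responses
    predicted : (N,) treatments h(x^(i)) predicted by the uplift model
    p_assign  : dict mapping treatment t -> assignment probability p_t
    """
    p_t = np.array([p_assign[ti] for ti in t])
    # Eq. (2.4): z = y / p_t when the assignment matches the prediction, else 0
    z = np.where(predicted == t, y / p_t, 0.0)
    return z.mean()  # Eq. (2.5)
```

For example, a model that always predicts treatment 1 is scored by the average of y / p_1 over exactly those subjects that actually received treatment 1.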
Being able to estimate the expected response with training data, we introduce the
modified uplift curve, which provides an operating characteristic of an uplift model.
Given an uplift model, we compute for each subject the difference between the expected
response under the predicted optimal treatment and that under the control, and then
rank the test subjects by this difference from high to low. The top p percent of the
test subjects are assigned to their predicted optimal treatment, and the rest to the
control. The expected response under this new assignment rule is then estimated with
Eq. (2.5). The resulting graph, with the percentage of the population subject to
treatment on the horizontal axis and the expected response on the vertical axis, is
the modified uplift curve.
The reason we introduce such a curve is that in the real world there usually
exists a certain level of risk when subjects are exposed to treatments, such as disruption
to the customer experience, unexpected side effects, etc. As a result, the experimenter
wants to limit the percentage of the population exposed to treatment while still obtaining
as much benefit from personalization as possible. We can select an operating point on
the modified uplift curve to balance the percentage of the population exposed to
treatment against the expected response, according to practical needs.
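The construction above can be sketched as follows, assuming a model interface (our own hypothetical choice) that supplies an (N, K) array of predicted expected responses, one column per treatment:

```python
import numpy as np

def modified_uplift_curve(t, y, scores, p_assign, control=1,
                          percents=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Points of the modified uplift curve.

    scores   : (N, K) predicted expected responses; column k-1 is treatment k
    control  : the treatment used for the untargeted remainder
    p_assign : dict mapping treatment -> assignment probability in the data
    """
    best = scores.argmax(axis=1) + 1
    # rank subjects by predicted gain of the optimal treatment over the control
    gain = scores.max(axis=1) - scores[:, control - 1]
    order = np.argsort(-gain)
    p_t = np.array([p_assign[ti] for ti in t])
    curve = []
    for p in percents:
        n_top = int(round(p * len(y)))
        assign = np.full(len(y), control)
        assign[order[:n_top]] = best[order[:n_top]]  # top p% get the model's choice
        z = np.where(assign == t, y / p_t, 0.0)      # Eq. (2.4)
        curve.append(z.mean())                       # Eq. (2.5)
    return list(percents), curve
```

Each curve point is just the Theorem 2 estimate applied to the mixed assignment rule "predicted treatment for the top p percent, control for the rest".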
2.2 The CTS Algorithm
Decision trees have been used successfully in many diverse areas, such as character
recognition, medical diagnosis, and speech recognition. Their capability to break down
a complex decision-making process into a collection of simpler decisions makes them
powerful in performance and easy to interpret. However, a single tree suffers
from overfitting, as its predictions are highly sensitive to noise in the
training set; Random Forest was put forward to reduce the variance without
increasing the bias [Breiman 2001]. The idea of random forest is that an aggregation of
many different decent-looking trees outperforms a single highly-optimized tree
[Breiman 1984]. It applies the general techniques of bootstrap sampling and feature
bagging to tree learners. An ensemble of trees is built, where each tree is constructed
using a random sample of the training set drawn with replacement, and only a
random subset of features is considered as candidates for the split at each step.
This bootstrapping procedure de-correlates the trees by showing them different training
sets and more balanced selections of different features as splits, which helps reduce
variance and smooths sharp decision boundaries. The success of random forest has
been proven in theory and in practice [Delgado 2014]. We have also seen excellent
performance of tree ensembles in the uplift modeling problem [Rzepakowski 2015].
We follow the tree-ensemble methodology to generate a forest for uplift modeling.
We refer to it as the CTS algorithm, which stands for Contextual Treatment Selection.
With the ability to assign the subspace-wise optimal treatment, CTS approximates
the true point-wise treatment assignment. The two key elements in building a tree are
the splitting criterion and the termination conditions. We introduce below how we
design them specifically for the uplift modeling problem.
A tree grows by recursive binary splitting from the root node, where the root node
represents the feature space X^d and each split creates two subspaces further down. At
a node corresponding to a feature subspace φ, we must decide whether to split
and, if so, how. Without a split, the highest expected response
is obtained by picking the best single treatment for φ:

    μ(φ) = max_{t=1,...,K} E[Y | X ∈ φ, T = t]    (2.6)
The benefit of adding a split s that divides φ into the left child-subspace φ_l and
the right child-subspace φ_r is that we can select the best treatment for each child-subspace
separately, and these may differ. This added flexibility can lead to an increased
expected response for φ overall. The increase from the split is

    Δμ(s) = P{X ∈ φ_l | X ∈ φ} · max_{t_l=1,...,K} E[Y | X ∈ φ_l, T = t_l]
          + P{X ∈ φ_r | X ∈ φ} · max_{t_r=1,...,K} E[Y | X ∈ φ_r, T = t_r]
          − max_{t=1,...,K} E[Y | X ∈ φ, T = t]    (2.7)–(2.8)

The optimal split is selected to maximize this increase Δμ at each step. The question
then comes down to estimating each component of Δμ(s).
The expected response after the split is a sum of the expected responses of the two
subspaces, weighted by the conditional probability of a subject falling into each
subspace given that it is in φ. The weights account for the imbalance between the
numbers of data points falling into the two child-subspaces. To avoid redundant notation,
let φ' stand for either child-subspace, φ_l or φ_r. We use p̂(φ'|φ) to denote the
estimate of the conditional probability that a subject is in φ' given that it is in φ,
and ŷ_t(φ') the estimate of the conditional expected response in subspace φ' under
treatment t.

The conditional probability is estimated using the sample fraction

    p̂(φ'|φ) = Σ_i 1{x^(i) ∈ φ'} / Σ_i 1{x^(i) ∈ φ}    (2.9)
The expected response in a subspace for each treatment is estimated using the
sample average

    ŷ_t(φ') = Σ_i y^(i) · 1{x^(i) ∈ φ'} · 1{t^(i) = t} / Σ_i 1{x^(i) ∈ φ'} · 1{t^(i) = t}    (2.10)

However, this sample-average estimate requires further care. As the tree grows
deeper and deeper, there are fewer and fewer samples in the feature subspace
corresponding to each node, which can lead to larger variance and larger estimation
error. Thus we need remedies to reduce the variance and avoid the influence
of outliers. Let n_t(φ') be the number of samples in φ' with treatment equal to t;
we instead estimate the conditional expectation as follows:

    Q̂_t(φ') = Q̂_t(φ)    if n_t(φ') < min_split,
    Q̂_t(φ') = (Σ_i y^(i) · 1{x^(i) ∈ φ'} · 1{t^(i) = t} + Q̂_t(φ) · n_reg)
             / (Σ_i 1{x^(i) ∈ φ'} · 1{t^(i) = t} + n_reg)    otherwise,    (2.11)

where min_split and n_reg are user-defined parameters.
The modification consists of two aspects. First, min_split is a threshold such that
we inherit the estimate from the parent node if the number of samples for some
treatment falls below it. Second, n_reg is a regularization parameter for the sample
average that prevents outliers from being misleading. The larger the value
of n_reg, the more weight the parent-subspace estimate Q̂_t(φ) carries in shrinking
the child-subspace sample average toward it. Based on our experiments,
it is usually helpful to set n_reg to a small positive integer. Note that Q̂_t(φ') is
defined recursively: the effect of the sample-average estimation is passed down to all
descendant nodes.
Summarizing the above, we estimate the increase in the expected response from
a candidate split s as

    Δμ̂(s) = p̂(φ_l|φ) · max_{t=1,...,K} Q̂_t(φ_l)
           + p̂(φ_r|φ) · max_{t=1,...,K} Q̂_t(φ_r) − max_{t=1,...,K} Q̂_t(φ).    (2.12)
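The estimates in Eqs. (2.9)–(2.12) can be sketched as follows (the helper names are ours; `parent_q` holds the recursively defined parent-node estimates Q̂_t(φ)):

```python
import numpy as np

def child_estimates(y, t, in_child, parent_q, min_split, n_reg, K):
    """Q_t(child) for t = 1..K per Eq. (2.11): inherit the parent estimate
    when a treatment has too few samples, otherwise shrink the child's
    sample average toward the parent estimate with pseudo-count n_reg."""
    q = np.empty(K)
    for k in range(1, K + 1):
        mask = in_child & (t == k)
        n = mask.sum()
        if n < min_split:
            q[k - 1] = parent_q[k - 1]
        else:
            q[k - 1] = (y[mask].sum() + parent_q[k - 1] * n_reg) / (n + n_reg)
    return q

def split_gain(y, t, go_left, parent_q, min_split, n_reg, K):
    """Estimated increase in expected response from a split, Eq. (2.12)."""
    p_left = go_left.mean()  # sample fraction, Eq. (2.9)
    q_l = child_estimates(y, t, go_left, parent_q, min_split, n_reg, K)
    q_r = child_estimates(y, t, ~go_left, parent_q, min_split, n_reg, K)
    return p_left * q_l.max() + (1 - p_left) * q_r.max() - parent_q.max()
```

Among the candidate splits considered at a node, the one with the largest `split_gain` is chosen.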
As for the termination conditions that determine when splitting should stop, we need
to take into consideration the special structure of uplift modeling training data. In CTS,
a node is a terminal node if any of the following conditions is satisfied.
1. There does not exist a split that leads to non-negative gain in the estimated
expected response.
2. The number of samples is less than the user-defined parameter min_split for
all treatments.
3. All the samples in the node have the same response value.
The first condition is similar to that of common decision trees: splitting should not
stop until every possible split would damage the performance of the current tree.
We allow a split with zero gain to be performed because a non-profitable split at the
current step may lead to profitable splits in future steps. The second condition allows
us to split a node as long as at least one treatment contains enough samples.
When the data distribution among treatments is uneven, there may be cases with
enough data samples from one treatment but not enough from another. We should
continue splitting, as the objective function may still increase owing to the treatment
with enough data samples. This condition, together with the formulation of the
conditional response expectation, allows a tree to grow to its full extent while
ensuring reliable estimates of the expected response. The third condition avoids
splitting a pure node whose samples all have the same response. Without this
condition, we would continue splitting the pure node as long as there are enough data
samples, even though doing so brings no actual gain in the objective function.
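The three conditions translate into a compact check (a sketch; the function and argument names are ours, and `best_gain` is the largest estimated gain over the candidate splits at the node):

```python
import numpy as np

def is_terminal(y, t, best_gain, min_split, K):
    """Return True if the node should become a leaf under CTS's rules."""
    if best_gain is None or best_gain < 0:  # condition 1: no split with gain >= 0
        return True
    if max(np.sum(t == k) for k in range(1, K + 1)) < min_split:
        return True                          # condition 2: too few samples for every treatment
    if np.unique(y).size == 1:               # condition 3: pure node
        return True
    return False
```

Note that a gain of exactly zero does not terminate the node, matching the first condition above.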
Combining the splitting criterion and termination conditions to construct each
single tree, we outline the CTS algorithm in Algorithm 1.
2.3 Experimental Evaluation
In this section, we first use a toy example to illustrate the difference between the
separate model approach and our CTS algorithm, which highlights the importance of a
Algorithm 1 CTS - Contextual Treatment Selection
Input: training data S_N, number of samples in each tree B (B < N), number of trees
ntree, number of variables to be considered for a split mtry (1 ≤ mtry ≤ d), the
minimum number of samples required for a split min_split, the regularity factor
n_reg
Training:
For n = 1 : ntree
1. Draw B samples from S_N with replacement to create S_n. Samples are drawn
proportionally from each treatment.
2. Build a tree from S_n. At each node, we draw mtry coordinates at random,
then find the split with the largest increase in expected response among the
mtry coordinates as measured by the splitting criterion defined in Eq. (2.12).
3. The output of each tree is a partition of the feature space as represented
by the terminal nodes and, for each terminal node, the estimate of the
expected response under each treatment.
Prediction: Given a new point in the feature space, the predicted expected response
under a treatment is the average of the predictions from all the trees. The optimal
treatment is the one with the largest predicted expected response.
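Algorithm 1's outer loop and prediction rule can be sketched as below, with `build_tree` as a hypothetical stand-in for the recursive tree construction of Section 2.2 (assumed to return an object whose `predict` method maps an (N, d) feature array to an (N, K) array of per-treatment expected responses):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_cts_forest(x, t, y, build_tree, ntree, B, K, **tree_params):
    """Train ntree trees, each on B bootstrap samples drawn proportionally
    from each treatment (step 1 of Algorithm 1)."""
    forest = []
    for _ in range(ntree):
        idx = np.concatenate([
            rng.choice(np.flatnonzero(t == k),
                       size=max(1, round(B * np.mean(t == k))), replace=True)
            for k in range(1, K + 1)])
        forest.append(build_tree(x[idx], t[idx], y[idx], **tree_params))
    return forest

def predict_optimal_treatment(forest, x_new):
    """Average the per-treatment predictions over all trees, then take the
    treatment with the largest predicted expected response."""
    avg = np.mean([tree.predict(x_new) for tree in forest], axis=0)
    return avg.argmax(axis=1) + 1
```

The mtry feature subsampling and the min_split/n_reg parameters would live inside `build_tree`, which is why they are only passed through here.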
specialized algorithm for the uplift modeling problem. We then compare the experimental
results of the CTS algorithm and other applicable uplift modeling methods on both
synthetic data and real-world data. For synthetic data, we are able to analyze
model performance from a more comprehensive perspective, since we know the true
data model. With the help of our industry collaborators, we test model performance
on two large-scale real-world datasets, which shows how each model works when we
have little knowledge of the underlying data generation model.
2.3.1 Advantage of CTS Over SMA
Suppose we have a 2-D feature space X = (X_1, X_2). Both X_1 and X_2 follow the
uniform distribution over [0, 100]. There are two treatments. The response functions
under the two treatments are defined as below:

    Y ~ U[0, x_1]                          if T = 1,
    Y ~ U[0, x_1] + U[0, x_2]/10 − 2.5     if T = 2.
For this particular data model, the response under treatment 1 is independent of
X_2, while under treatment 2 there is a small uplift effect associated with X_2, which is
neutralized by an overall drop (the −2.5 term). The response under each treatment
is determined to a large extent by feature X_1, the individual-treatment feature. Yet
which treatment is optimal is completely determined by X_2, the cross-treatment
feature. The expected response over the entire population is 25 for both treatments.
However, the expected response under treatment 1 is greater than that under
treatment 2 when x_2 < 50, and the other way around when x_2 > 50. Fig. 2-2 illustrates
the pattern of the optimal treatment. An oracle who knows the data generation
process can employ the optimal treatment rule shown in Fig. 2-2 and obtain the optimal
expected response of 25.625.
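For reference, the data model and the oracle's expected response can be written out as follows (function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_toy(n_per_treatment):
    """Draw samples from the 2-D data model above (n per treatment)."""
    n = 2 * n_per_treatment
    x = rng.uniform(0, 100, size=(n, 2))
    t = np.repeat([1, 2], n_per_treatment)
    y = rng.uniform(0, x[:, 0])                        # U[0, x1] under both arms
    y = np.where(t == 2, y + rng.uniform(0, x[:, 1]) / 10 - 2.5, y)
    return x, t, y

def oracle_expected_response(x):
    """E[Y] under the optimal rule: use treatment 2 exactly when x2 > 50."""
    return x[:, 0] / 2 + np.maximum(0.0, x[:, 1] / 20 - 2.5)
```

Averaging `oracle_expected_response` over the uniform feature distribution recovers 25 + 0.625 = 25.625.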
Figure 2-2: True decision boundary of optimal treatment on the 2-D example
The performance of Random Forest (RF) and CTS is illustrated in Fig. 2-3. The
horizontal axis shows the number of samples per treatment in the training data. The
samples are generated using the data model as described above. We test 6 different
data sizes per treatment, namely 2000, 4000, 8000, 10000, 20000, and 40000. For each
data size, we repeat the following procedure 10 times: generate a training dataset
of the prescribed size; train two models, one with RF and the other with CTS, on the
training set; and compute the expected response after deploying each of the two models,
using the true distribution. The expected response of RF and CTS plotted in Fig. 2-3
is the average over the 10 trials.
[Plot: expected response versus the number of training samples per treatment, with curves for the oracle, CTS, random forest, and the optimal single treatment.]
Figure 2-3: Comparison of performance between different approaches on the 2-D example
As shown in Fig. 2-3, while the performance of both CTS and RF improves
with the size of the training data, CTS outperforms Random Forest at all the training
sizes tested. This demonstrates the benefit of modeling data from all treatments
together rather than separately. When an RF model is built separately for each treatment,
X_2 appears either completely irrelevant (for treatment 1) or insignificant (for treatment
2). However, by considering data from both treatments, CTS is able to capture the
weak yet systematic uplift effect caused by X_2. The difference between these two
modeling approaches becomes more obvious when one looks at the decision boundaries
they generate. Fig. 2-4 illustrates the decision boundaries of the models trained
with RF and CTS when the training size is 40000 per treatment.
Figure 2-4: Comparison of decision boundaries between different approaches on the 2-D example
As we can see from Fig. 2-4, the decision boundary from CTS is much more similar
to the true decision boundary than that from RF. While the RF decision boundary does
not exhibit a clear pattern, the CTS decision boundary suggests a partition between
the upper and lower halfspaces. The numerous vertical lines present
in the RF decision boundary are a sign that the model partitions the feature space more
frequently on X_1 than on X_2, which is precisely what we would expect.
2.3.2 Synthetic Data
The second synthetic dataset we consider lies in a feature space that is a fifty-dimensional
hypercube with side length 10, i.e., X^d = [0, 10]^50. Each feature is uniformly
distributed, i.e., X_j ~ U[0, 10] for j = 1, ..., 50. There are
four different treatments, T = 1, 2, 3, 4. The response under each treatment is defined
as below:

    Y = f(X) + U[0, αX_1] + ε    if T = 1,
    Y = f(X) + U[0, αX_2] + ε    if T = 2,
    Y = f(X) + U[0, αX_3] + ε    if T = 3,
    Y = f(X) + U[0, αX_4] + ε    if T = 4.
The response consists of three components. The first term f(X) is the systematic
dependence on the feature vector and is identical across all treatments. We choose f to
be a mixture of 50 exponential functions to reflect complex real-world scenarios:

    f(X_1, ..., X_50) = Σ_{i=1}^{50} a_i exp{−b_{i1}|X_1 − c_{i1}| − ... − b_{i,50}|X_50 − c_{i,50}|},    (2.13)

where the a_i, b_ij, and c_ij are chosen randomly.
The second term U[0, αX_t] represents the treatment effect and differs across
treatments. We set α to 0.4, which is roughly 5% of E[|f(X)|], so that the treatment
effect is an order of magnitude smaller than the main effect. The third term ε is
zero-mean Gaussian noise, i.e., ε ~ N(0, σ²); σ is set to 0.8, which is twice the
amplitude of the treatment effect.

Under this particular data model, the expected response is the same for all
treatments, i.e., E[Y | T = t] = 5.18 for t = 1, 2, 3, 4. The expected response under the
optimal treatment rule, E[Y | T = h*(X)], is 5.79.²
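A sketch of this generator is below. The coefficient ranges are our own illustrative stand-ins; the exact fixed values used in the thesis are in the linked dataset (footnote 2), so the constants 5.18 and 5.79 above will not be reproduced by these stand-in coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

D, ALPHA, SIGMA = 50, 0.4, 0.8
# stand-in coefficients: the thesis fixes randomly chosen values (footnote 2);
# the ranges below are our own illustrative choices
a = rng.uniform(0, 1, size=D)
b = rng.uniform(0, 0.1, size=(D, D))
c = rng.uniform(0, 10, size=(D, D))

def f(x):
    """Mixture of 50 exponentials, Eq. (2.13); x has shape (N, 50)."""
    # exponent of term i for sample n: -sum_j b[i, j] * |x[n, j] - c[i, j]|
    expo = -np.einsum('ij,nij->ni', b, np.abs(x[:, None, :] - c[None, :, :]))
    return np.exp(expo) @ a

def sample_response(x, t):
    """Y = f(X) + U[0, ALPHA * X_t] + Gaussian noise, per the data model above."""
    x_t = x[np.arange(len(t)), t - 1]  # treatment-specific feature X_t
    return f(x) + rng.uniform(0, ALPHA * x_t) + rng.normal(0, SIGMA, size=len(t))
```

The symmetric construction makes every treatment equally good on average while leaving a small, feature-dependent signal for an uplift model to find.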
This is a multi-treatment uplift problem with continuous response, so five methods are
applicable: CTS, and the Separate Model Approach with Random Forest (SMA-RF),
K-Nearest Neighbors (SMA-KNN), Support Vector Regression with a radial basis
kernel (SMA-SVR), or AdaBoost (SMA-Ada). We developed an R package for CTS,
and use the scikit-learn implementations for the other algorithms. The performance of

² Exact values of the data model parameters and the datasets can be found at this Dropbox link: https://www.dropbox.com/sh/sf7nu2uw8tcwreu/AAAhqQnaUpR5vCfxSsYsM4Tda?dl=0
[Plot: averaged expected response versus training data size, with curves for Optimal, Single Trt, SMA-Ada, SMA-RF, SMA-KNN, SMA-SVR, and CTS.]
Figure 2-5: Averaged expected response of different algorithms on the synthetic data. The error bars are 95% margins of error computed with results from 10 training datasets. For each data size, all algorithms are trained with the same 10 datasets.
these 5 methods is compared across different training data sizes. We train the models
on training data with 500, 2000, 4000, 8000, 16000, and 32000 samples per treatment,
and evaluate each model using Monte Carlo simulation on the true data model. This
is done 10 times independently to compute the average result and the margin of error.
To ensure a consistent comparison, all the methods are trained on the
same 10 datasets for each data size. All models are tuned carefully with validation
or cross-validation; details of the parameter selection procedure for each
algorithm can be found in Appendix A.
We plot the averaged expected response of the 5 methods in Fig. 2-5, with the
vertical bars showing the 95% margin of error computed from the 10 independent
trials. The single-treatment expected response (short-dashed line without markers)
and the optimal expected response (long-dashed line without markers) are included as
a reference.
Fig. 2-5 shows that with 2000 samples per treatment, the performance of CTS
already exceeds that of the separate model approach, and the advantage continues to
grow as the training size increases. CTS achieves almost oracle performance with 32000
samples per treatment. Note the significant difference between CTS and Random
Forest: they are both tree-based algorithms, with the only essential difference lying in
the splitting criterion and termination conditions, yet their results are far from
similar. Among the algorithms for the separate model approach, the support vector
regressor performs best. This is probably because the underlying data model
is a mixture of exponentials, which is consistent with the assumption of SVR with
radial basis kernels. In contrast, the other separate model approach algorithms work
only slightly better than assigning a fixed treatment, even at the largest training
size. Thus the separate model approach can only achieve uplift if the model trained
for each treatment is accurate enough, while model mis-specification will seriously
degrade the performance.
2.3.3 Priority Boarding Data
We present our work with a European airline as an example application of randomized
experiments and uplift modeling to the online pricing problem. Airlines led
the business world in developing variable pricing of airfares, embracing the
truism that all customers are different and have different needs. Our collaborating
airline changed the price of priority boarding between €5 and €7 at some point and
provided us a dataset containing passengers' purchase records under the two price
levels. Our initial analysis found that there indeed exists customer
heterogeneity in price sensitivity. Fig. 2-6 shows the change in revenue per passenger
with respect to passenger features. For illustration purposes, we show only two
features: the number of days between booking and departure, and the flight fare. The
red areas represent customers preferring the high price, green areas represent customers
preferring the low price, and yellow areas represent customers neutral between the two
prices. This confirms that the purchasing behavior of passengers varies significantly
and that it can be beneficial to customize price offerings based on certain characteristics.
Figure 2-6: Change in revenue per passenger with respect to passenger features
We derive a total of 9 features from the raw transaction records: the origin station,
the origin-destination pair, the departure weekday, the arrival weekday, the number
of days between flight booking and departure, the flight fare, the flight fare per passenger,
the flight fare per passenger per mile, and the group size. The data is randomly split into
a training set (225,000 samples per treatment) and a test set (75,000 samples per
treatment). We test six methods: the separate model approach with Random Forest
(SMA-RF), Support Vector Machine (SMA-SVM), AdaBoost (SMA-Ada), and K-Nearest
Neighbors (SMA-KNN), as well as the uplift Random Forest method implemented in
[Guelman 2014], and CTS. In this dataset, the response is discrete: €5/€7
if there is a purchase and 0 if not. Thus we model it as a binary classification problem
for the first 5 methods: 1 for purchase and 0 for non-purchase. The expected revenue
is then the product of the purchase probability and the corresponding price. As CTS can
handle discrete responses, we use the revenue as the response directly. Cross-validation
Figure 2-7: Expected revenue per passenger from priority boarding based on different models.
is conducted to select the optimal parameters. See the Appendix for details on parameter
tuning.

The expected revenue per passenger on the test set is plotted in Fig. 2-7. If only
one price is offered to the whole population, the overall revenue per passenger is the same:
€0.42. The figure shows that customized pricing models can significantly increase
the revenue, from €0.42 to €0.52. This roughly 24% uplift could lead to a remarkable gain in
profit for an airline with tens of millions of scheduled passengers per year. We again
see the importance of specialized algorithms for uplift modeling from this test case.
We compare the modified uplift curves for the 6 methods in Fig. 2-8. CTS performs
the best at all operating points of the population percentage. The upliftRF algorithm
ranks next and outperforms the separate model approach. It is worth noting the
performance of SMA-RF: the sharp rise at the beginning of the curve and the sharp
Figure 2-8: Modified uplift curves of different algorithms for the priority boarding data.
decline at the end show that it is very accurate at identifying subpopulations for
which the treatment is highly beneficial or extremely harmful, while the (almost)
straight line in the middle reflects its failure to predict the treatment effect for
the vast majority. Consistent with their poor performance in expected revenue, SMA-SVM
and SMA-KNN rank lowest. This is probably because the distance calculation cannot
handle categorical variables well, and because of the mis-specification of the SVM.
Though revenue maximization is the most studied topic in dynamic pricing, we
also care about how the pricing policy affects users. A user has an objective
measure of her well-being when she takes an action, and this measure is called the utility
level in economics. The standard way to view utility is as representing an individual's
choices: a user's actions reveal her preferences, and utility is a mathematical
construction for modeling choice and preference [Wong 1978]. In the dynamic pricing
problem, when a user makes a purchase at an offered price, she is willing to pay
that price to get the priority boarding service and thus receives positive utility;
on the contrary, she receives zero utility if she does not make a purchase. To consider
the benefit of both the seller (the airline) and the buyer (the passenger), we take the
sum of revenue and utility as the response.
Learning the utility function usually requires empirical behavioral studies, which
are not the topic of this thesis; we therefore assume that a user gets utility 5 or 2.5
if she makes a purchase at an offered price of 5 or 7, respectively, and gets utility
0 if she does not make a purchase. Note that the airline is the decision-maker and
decides which price is offered to which user; the user only has the right to decide
whether she will accept the offered price, not to choose between the two price
levels. Under this formulation, we follow the previous approach and train
different models: SMA-RF, upliftRF, and CTS. The expected summed response per
passenger on the test set is shown in Fig. 2-9. Not surprisingly, CTS outperforms
the other customized pricing policies and the single pricing policies. We take a
further look at what revenue and user utility look like separately under these
pricing policies in Fig. 2-10. We can see that CTS achieves the best results in both
revenue and user utility, even though we have not optimized them separately during
model training. This example shows that customized pricing is a win-win choice for
both seller and buyer: it captures the market's consumer surplus and transfers this
surplus to the seller, while with careful modeling it does not hurt users' interests.
2.3.4 Seat Reservation Data
Following the success of customized pricing for priority boarding, we started a
collaboration with another airline company and applied our model to seat reservation
pricing.
We conducted a well-designed randomized experiment with four price levels -
low (L), medium low (ML), medium high (MH) and high (H). The response is the
revenue from each transaction. We derive features from the raw transaction records,
including the booking hour, the booking weekday, the travel hour, the travel weekday,
Figure 2-9: Expected summed response per passenger from priority boarding with theconsideration of user utility.
Figure 2-10: Expected response per passenger from priority boarding with the considera-tion of user utility.
the number of days between the last login and the next flight, the fare class, the
zone code (all flight routes are divided into 3 zones, and prices are set for
different zones), whether the passenger returns to the website after the flight
ticket purchase, the journey travel time, the segment travel time, the number of
passengers, the quantity available, etc.
The experiment collected a dataset with 213,488, 176,637, 160,576 and 214,515
samples for the four treatments, respectively. We use 50% for training, 30% for
validation, and 20% for testing. During the experiment, prices on different routes
differed under the same level for practical reasons, and one transaction may include
the purchase of multiple seats, so we need to model the response as a continuous
variable. Thus uplift random forest is not applicable, as it cannot handle continuous
responses. Among the separate model approaches, we only test SMA-RF because it far
surpassed the other models in the previous example on the priority boarding data.
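A per-treatment 50/30/20 split like the one described above can be sketched as follows; the function and variable names are illustrative, not the thesis code.

```python
# Split (x, t, y) samples into train/validation/test sets, drawing the
# given fractions separately within each treatment group so that all three
# sets preserve the treatment proportions.
import random
from collections import defaultdict

def split_by_treatment(data, fractions=(0.5, 0.3, 0.2), seed=0):
    rng = random.Random(seed)
    by_treatment = defaultdict(list)
    for sample in data:
        by_treatment[sample[1]].append(sample)  # group on the treatment label t
    train, val, test = [], [], []
    for samples in by_treatment.values():
        rng.shuffle(samples)
        n = len(samples)
        a = int(fractions[0] * n)
        b = int((fractions[0] + fractions[1]) * n)
        train += samples[:a]
        val += samples[a:b]
        test += samples[b:]
    return train, val, test
```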
We compare the average revenue per customer under different pricing models in
Fig. 2-11. By offering a single price in the entire market, the maximum expected
revenue is $1.87 per passenger, achieved with price level H. With personalized
treatment assignment, the expected revenue increases to $2.37 with SMA-RF and $3.35
with CTS. Our specialized algorithm for uplift modeling improves the airline's
revenue by as much as 80%. The modified uplift curves of SMA-RF and CTS are shown
in Fig. 2-12. We can see that CTS outperforms SMA-RF at every operating point of
population percentage.
Figure 2-11: Expected revenue per passenger from seat reservation when applying differentpricing models.
Figure 2-12: Modified uplift curves of SMA-RF and CTS on the seat reservation data.
2.4 Summary
In this chapter, we present a complete framework for the uplift modeling problem.
In Section 2.1.1, we first describe the mathematical formulation of uplift modeling
problems. The definition of an objective function points out how to design uplift
modeling algorithms that align with the problem objective. We then put forward an
evaluation metric that can measure uplift algorithms offline with randomized
experiment data in Section 2.1.2. With this evaluation metric, in Section 2.2, we
design an ensemble tree-based algorithm with a specialized splitting criterion that
maximizes the quantity of interest directly, and with termination conditions suited
to the special structure of the uplift modeling problem and its training data.
Experimental results on synthetic data and real-world data in Section 2.3 show that
our algorithm outperforms the separate model approach and existing uplift modeling
algorithms.
In Section 2.3.3, we applied our algorithm to a dynamic pricing problem, where it
achieved superior performance, not only increasing the airline's revenue but also
preserving users' utility. This is particularly instructive for the airline industry.
Since deregulation in 1978, airlines have had the freedom to set prices themselves,
and the industry has become more and more competitive. High fixed costs, driven by
expensive jets, large labor forces, and skyrocketing fuel costs, have caused many
legacy airlines to struggle, some even filing for bankruptcy at one point. However,
closing down a large unprofitable airline would be a politically unpalatable
decision, as various stakeholders cannot afford it: it involves the loss of thousands
of jobs, inconvenience to hundreds of thousands of travelers, and millions in losses
for the airline's creditors [Cornia 2011]. Instead of cut-throat pricing in the
entire market, customized pricing is a more sustainable strategy. By charging higher
prices to customers with lower price elasticity, or equivalently a higher
willingness-to-pay, airlines can increase the average markup of prices and thus
increase their profits. Our algorithm provides a systematic way to identify and
separate customers with different price elasticities, though further work is needed
to accurately evaluate user utility, and how to combine user utility with revenue
deserves more discussion.
Chapter 3
Theoretical Analysis
3.1 Introduction
In addition to the performance comparison on different datasets, we would like to
further analyze the algorithm theoretically. First, a consistency analysis gives a
basic theoretical guarantee for the algorithm. Moreover, in the general context of
the treatment selection problem, we are interested in how the algorithm performs
compared to an oracle that knows the true optimal treatment for every subject. The
gap between the expected response achieved by the oracle and the one achieved by a
treatment assignment model without knowledge of the response function, referred to
as regret, has been studied extensively in the multi-armed bandit (MAB) problem.
3.1.1 Multi-Armed Bandit Problem
A multi-armed bandit problem can be considered an online version of the treatment
selection problem. With subjects arriving sequentially, the decision maker chooses
one of the possible actions (treatments) and observes its response at each time
step. The goal is to maximize the sum of responses over all time steps. With unknown
response functions for the different treatments, the decision maker faces an
exploration-exploitation tradeoff: subjects are split across treatments according
to the designed policy at the beginning (exploration phase), and the currently best
performing treatment is then selected for the rest of the horizon once the decision
maker is confident of its performance (exploitation phase).
In context-free multi-armed bandit problems, the first bounds on the regret for
the model with finite action space were obtained in the classic paper [Lai 1985],
which introduced the technique of upper confidence bounds for the asymptotic
analysis of regret. Regret-optimal algorithms for non-stochastic bandit problems
are provided in [Auer 2003]. Continuous treatment spaces and response functions
satisfying a Lipschitz condition were studied in [Kleinberg 2005a] [Kleinberg 2005b].
In recent years, the multi-armed bandit problem with side information has attracted
more interest. [Wang 2005a] introduced the contextual stochastic bandit problem. It
was generalized to the multivariate linear case by [Rusmevichientong 2010], and to
the multivariate and nonparametric case by [Perchet 2013]. [Alina 2011] present an
algorithm with regret guarantees comparable to those in the standard supervised
learning setting.
Though the multi-armed bandit approach can achieve near-optimal performance in
theory, it suffers from the uncertainty of the treatment allocation process, and it
cannot be evaluated offline because of the interactive nature of the problem.
Meanwhile, advances in operations management enable companies to quickly collect
and process large amounts of data, which provides abundant resources for offline
algorithms like CTS.
3.1.2 Tree-Based Algorithm
We have seen in Chapter 2 that tree-based algorithms outperform other
parametric/nonparametric methods, and tree-based models have also proven to be
reliable predictive algorithms in many application areas. However, the statistical
properties of tree-based models are not yet fully understood. [Wager 2017] develops
a causal forest for estimating heterogeneous treatment effects, which is pointwise
consistent for the true treatment effect and has an asymptotically Gaussian and
centered sampling distribution. This is the first set of results on random forests
in the context of personalized treatment selection. The results in [Wager 2017],
though limited to a single treatment (treatment vs. control), show that tree-based
models are substantially more powerful than classical methods based on
nearest-neighbor matching, not only in experimental performance but also in
theoretical properties.
In contrast to the most popular classification and regression trees (CART), which
grow a full tree by orthogonally splitting the axes at locally optimal splitting
points, there is a newer family of decision trees, dyadic decision trees (DDT),
which attain nearly optimal (in a minimax sense) rates of convergence for a broad
range of classification problems [Gey 2015]. DDTs are restricted to axis-orthogonal
dyadic splits, i.e., each dimension can only be split at its midpoint. DDTs can
approximate complex decision boundaries, and the restriction to dyadic splits makes
it possible to globally optimize a complexity-penalized empirical risk criterion
[Scott 2005]. We apply dyadic splits to our CTS algorithm and prove that this
modified algorithm, able to handle multiple treatments, is asymptotically optimal
under mild regularity conditions.
3.2 Consistency Analysis
An unbiased contextual treatment selection (UCTS) algorithm is put forward in
[Zhao 2017] and proven to be consistent under some mild assumptions. In the CTS
algorithm, an exhaustive search is conducted to find the best splitting point at
each step. This approach is locally optimal and prone to bias from outliers. The
issue is exaggerated in our CTS algorithm, since the increase in expected response
is calculated using data samples from all treatments, so the selection of the
splitting point can be affected by outliers from any treatment. The effects of
extreme values accumulate as successive splits group similar extreme values
together, and thus more and more bias is introduced.
In order to reduce the bias, UCTS uses different data sets for model training and
response prediction. It randomly splits the training set S^N into the approximation
set S^A and the estimation set S^E. The sizes of the approximation set and the
estimation set are determined by a user-defined parameter rho ∈ (0, 1): for each
treatment, a fraction rho of the examples in S^N is sampled to build S^A, and the
rest belongs to S^E.
Algorithm 2 Unbiased Contextual Treatment Selection (UCTS)
Input: training data S^N, fraction of data used for partition generation rho, number
of trees ntree, number of variables to be considered for a split mtry (1 ≤ mtry ≤ d),
the minimum number of samples required for a split min_split, the regularity
factor n_reg, the tree balance factor alpha
Training:
For n = 1 : ntree
1. Draw round(rho × N) samples from S^N to create the approximation set S^A.
Samples are drawn proportionally from each treatment. The estimation set
S^E = S^N \ S^A.
2. Build a tree from S^A. At each step of the growing process, one coordinate
is drawn at random with probability pi, or mtry coordinates are drawn at
random with probability 1 − pi. We perform the split with the largest increase
in expected response among all the alpha-regular splits on the selected
coordinate or coordinates. The output of this step is the set of nodes Φ of
the tree.
3. With S^E we estimate the conditional expectation under each treatment for
all the nodes in Φ as described in Section 3.1.3.
Prediction: Given a test point, the predicted expected response under a treatment
is the average of the predictions from all the trees. The optimal treatment is the
one with the largest predicted expected response.
Each tree is built using the splitting criteria and termination conditions
described in Section 2.2 with the data in S^A. Let φ be a node in the tree, and
denote the data samples in S^E that fall into φ with treatment t as S^E(φ, t). If
S^E(φ, t) is not empty, then the estimate of the conditional expected response in φ
under treatment t is the sample average; otherwise, the estimate is inherited from
the parent node. With the estimates for the root node calculated initially, we can
obtain the estimates for all nodes, either using sample averages or by traversing
back level by level. The UCTS algorithm is outlined in Algorithm 2.
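The estimation step described above can be sketched as follows, assuming each node's samples from S^E are available as (x, t, y) triples; the function and variable names are illustrative, not the thesis code.

```python
# Per-node estimation on the estimation set S^E: use the sample average of
# the responses under treatment t, and fall back to the parent node's
# estimate when the node has no S^E samples with that treatment.
def estimate_node(samples, treatments, parent_estimates=None):
    """samples: (x, t, y) triples from S^E falling into this node."""
    estimates = {}
    for t in treatments:
        ys = [y for _, ti, y in samples if ti == t]
        if ys:
            estimates[t] = sum(ys) / len(ys)      # sample average on S^E
        elif parent_estimates is not None:
            estimates[t] = parent_estimates[t]    # inherit from the parent node
        else:
            estimates[t] = 0.0                    # root node with no data (edge case)
    return estimates
```

Computing the root estimates first and then descending level by level reproduces the traversal described in the text.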
To recall the problem formulation in Section 2.1.1, in uplift modeling the quantity
of interest is the conditional expectation for an observed subject x and an assigned
treatment t,
μ(x, t) = E[Y | X = x, T = t].   (3.1)
Suppose we have a dataset S^N = {(x^(i), t^(i), y^(i)), i = 1, ..., N} containing N
joint realizations of (X, T, Y) from a randomized experiment. The objective of
uplift modeling is to learn a mapping h_N from the feature space to the treatment
space, X^d → {1, ..., K}, such that the expected response conditional on this
mapping, E[Y | T = h(X)], is maximized. The optimal treatment assignment rule h*
satisfies the point-wise optimality condition, i.e., ∀x ∈ X^d,
h*(x) ∈ arg max_{t=1,...,K} E[Y | X = x, T = t].   (3.2)
Now we define the consistency of an uplift algorithm as follows.
Definition 3.2.1. An uplift algorithm is L2 consistent if
E{μ[X, h*(X)] − μ[X, h_N(X)]}² → 0 as N → ∞,   (3.3)
where the expectation is taken over both the test example X and the training set S^N.
The consistency result is derived from the following assumptions.
1. Features are uniformly distributed in the d-dimensional unit hypercube, i.e.,
X ∼ U[0, 1]^d.
2. The response is bounded, i.e., there exists C_Y > 0 such that |Y| ≤ C_Y.
3. The conditional expectation function μ(x, t) satisfies a Lipschitz condition,
i.e., there exists a constant C_L > 0 such that ∀x_1, x_2, ∀t ∈ {1, ..., K},
|μ(x_1, t) − μ(x_2, t)| ≤ C_L ‖x_1 − x_2‖.   (3.4)
4. The parameters alpha and min_split are chosen properly with N such that
lim_{N→∞} k/N = 0 and lim_{N→∞} (ln N)/k = 0, because a UCTS tree is an
(α, k, l)-regular tree¹ with α = alpha, l = min_split and k = alpha · min_split.
If the above assumptions are satisfied, then a treatment selection rule h_N
constructed by the UCTS algorithm is L2 consistent. A detailed proof can be found
in [Zhao 2017]. The intuition is straightforward: we can tune the parameter
min_split so that the dimension of a leaf node vanishes, and with it the
within-node variance of the response; and the leaf node response can be estimated
to arbitrary accuracy if k → ∞ as N → ∞.
We take a further look at the difference between CTS and UCTS on a simple
2-dimensional example. The first feature X_1 is continuous and uniformly distributed
between 0 and 100, i.e., X_1 ∼ U[0, 100]. The second feature X_2 is discrete with
values A, B, C, each occurring with probability 1/3. There are two treatments, and
the response under each treatment is defined as follows:
If T = 1, Y ∼ U[0, X_1].
If T = 2, Y ∼ 0.8 × U[0, X_1] + 5 if X_2 = B, and Y ∼ 1.2 × U[0, X_1] − 5 if
X_2 = A or C.
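The data model above can be simulated directly. This is a hedged sketch (with a hypothetical 50/50 randomized treatment assignment, which the text does not specify):

```python
# Generator for the 2-D example: X1 ~ U[0, 100], X2 uniform on {A, B, C};
# Y ~ U[0, X1] under treatment 1, and a scaled/shifted U[0, X1] draw under
# treatment 2 depending on X2.
import random

def draw_sample(rng):
    x1 = rng.uniform(0, 100)
    x2 = rng.choice("ABC")
    t = rng.choice([1, 2])            # assumed randomized assignment
    base = rng.uniform(0, x1)         # a draw from U[0, X1]
    if t == 1:
        y = base
    elif x2 == "B":
        y = 0.8 * base + 5
    else:                             # x2 is A or C
        y = 1.2 * base - 5
    return (x1, x2), t, y
```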
The optimal treatment rule for this data model is plotted in Fig. 3-1. The vertical
boundary in the middle is located at X_1 = 50. We use different ranges along the
X_2 axis to represent its different values for illustration purposes. In Fig. 3-1 we
see a sharp change around X_1 = 50; it is worth noting that this is because the
discrete values of X_2 are plotted as if continuous, while the actual difference
between treatments changes smoothly with X_1 for each value of X_2. It is therefore
more challenging to identify the correct treatment around X_1 = 50. In addition,
the variance of the response grows quadratically with X_1, so the two algorithms'
abilities to identify the optimal treatment should differ more when X_1 is large,
because CTS is more susceptible to extreme values than UCTS.
We compare the decision boundary reconstructed by the two algorithms on three
¹An uplift tree is (α, k, l)-regular for some 0 < α ≤ 0.5 if each split leaves at least a fraction α of the available training examples on each side of the split and each leaf node contains at least k training examples for some k ∈ ℕ. In each leaf node, there are at most l training examples for each treatment, with l ∈ ℕ and l ≥ 2k.
Figure 3-1: The optimal treatment rule for the 2D example. The vertical boundary along the X_1 axis is located at X_1 = 50. The vertical axis X_2 is a discrete variable.
different training sets in Fig. 3-2. It shows that the decision boundary generated
by UCTS is much smoother than that generated by CTS for all training sets. This
phenomenon is more pronounced on the right side of each plot, where the variance of
the response is higher and there are more extreme values, which is in line with our
expectation.
Having seen this smoother behavior, we would like to know how UCTS compares in
prediction performance. We train UCTS models and CTS models on 50 training sets
and calculate the expected response under the true data model. The average expected
response and the 95% confidence interval are plotted in Fig. 3-3. It shows that
UCTS is fully competitive with CTS, and does not sacrifice performance for
smoothness.
3.3 Asymptotic Properties
Here we prove the convergence rate of a simplified version of the CTS algorithm.
Algorithm CTS.0, as outlined below, partitions the feature space in a deterministic
and mostly uniform manner, which eases the theoretical analysis of its convergence
Figure 3-2: The treatment rules reconstructed by UCTS and CTS for the 2D example. Plots in each row are generated from the same training set. The horizontal axis is feature X_1 and the vertical axis feature X_2. The labels and ticks of the axes are the same as in Fig. 3-1 and omitted for simplicity.
Figure 3-3: Average expected response under UCTS and CTS models for the 2D examplecomputed from 50 training sets with 95% confidence interval shown as the error bar.
property, while also providing some guidance on CTS.
If we knew the conditional expectation E[Y | X = x, T = k] = μ(x, k) for every
point x in the feature space and every treatment, we could partition the feature
space X^d into K subspaces Φ_1, ..., Φ_K such that
∀x ∈ Φ_k, k ∈ arg max_{k'=1,...,K} μ(x, k').
It is easy to see that we could achieve the oracle performance by simply assigning
Φ_k to treatment k for k = 1, ..., K. The challenge is that the true boundaries of
Φ_1, ..., Φ_K can be arbitrarily complicated for practical problems. What is needed
is a tractable way to approximate these partitions. One option is discretization:
we can repeatedly split the feature space with coordinate-perpendicular cuts until
the entire feature space is divided into a set of small "boxes", and assign each
box to the empirically best treatment. Intuitively, the smaller the boxes are, the
better they approximate the true decision boundary. However, since we need to infer
the optimal treatment from samples, and smaller boxes mean fewer samples in each box, we
want to restrict the size of the boxes so that most boxes contain sufficient
samples to make correct assignments. Fig. 3-4 shows how discretization error arises
from coordinate-perpendicular cuts. Understanding the tradeoff between the error
from discretization and the error from misassignment gives us insight into how to
design an algorithm with small overall error. This is precisely the goal of this
section.
Figure 3-4: Illustration of discretization error in 2-D space. The diagonal curve represents the true decision boundary, with the bottom-left part assigned to treatment yellow and the top-right part assigned to treatment blue. The discretization error depends on the diameter of the box; in boxes that do not intersect the boundary, the discretization error is 0.
Section 3.3.1 describes a simple algorithm based on discretization, named CTS.0.
In Section 3.3.2 we derive upper bounds for both the discretization error and the
misassignment error of CTS.0. We prove that, under mild regularity conditions, the
expected difference in performance between CTS.0 and an optimal treatment rule h*
approaches 0 as the number of samples goes to infinity, and that the rate of
convergence is bounded above by O(N^{-1/(2(d+1))}), where N is the number of
samples and d the
Figure 3-5: An example of feature space partition and treatment assignment for Algorithm CTS.0 when X^d = [0, 1]² and M = 10. Subspace φ_1 is assigned to treatment yellow and subspace φ_2 to an arbitrary treatment.
dimension of the feature space.
3.3.1 Algorithm CTS.0
We first define some notation to facilitate the description of the algorithm. Given
a data set S^N and a subspace φ ⊆ X^d, where X^d is a bounded subspace of R^d,
define
n̂(φ, k) = Σ_{i=1}^{N} 1{x^(i) ∈ φ, t^(i) = k},   (3.5)
ŷ(φ, k) = Σ_{i=1}^{N} y^(i) 1{x^(i) ∈ φ, t^(i) = k}.   (3.6)
If n̂(φ, k) > 0, define
ȳ(φ, k) = ŷ(φ, k) / n̂(φ, k).   (3.7)
In other words, n̂(φ, k) is the number of samples in subspace φ with treatment equal
to k, and ȳ(φ, k) is the average response of these samples.
CTS.0 is outlined in Algorithm 3. It contains two phases. Phase 1 divides the
feature space into M subspaces through repeated partitioning at the midpoints.
Phase 2 assigns each subspace to an empirically best treatment based on the training
data. Fig. 3-5 shows an example of the two phases for M = 10 in a 2D feature space.
Algorithm 3 CTS.0
Input: feature space X^d, number of partitions M, data S^N
Phase 1 - Partitioning the feature space
1: Φ = {X^d}
2: dim = 1
3: for m = 1 : ⌊log₂ M⌋ do
4:   for every element of Φ do
5:     replace it with the two subspaces created by splitting it equally at the dim-th dimension
6:   end for
7:   dim = (dim mod d) + 1
8: end for
9: dim = (dim mod d) + 1
10: for j = 1 : (M − 2^⌊log₂ M⌋) do
11:   replace the j-th element in Φ with the two subspaces created by splitting it equally at the dim-th dimension
12: end for
Phase 2 - Assigning partitions to treatments
13: for m = 1 : M do
14:   if n̂(φ_m, k) > 0 for k = 1, ..., K then
15:     k#_m = arg max_{k=1,...,K} {ȳ(φ_m, k)}, breaking ties randomly if the optimal solution is not unique
16:   else
17:     k#_m = 0
18:   end if
19: end for
Output: The treatment rule h#: ∀x ∈ φ_m, m = 1, ..., M,
h#(x) = k#_m if k#_m ≠ 0, and h#(x) = 1 if k#_m = 0.
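A minimal sketch of the two phases on the unit hypercube (helper names are illustrative, not the thesis code): Phase 1 halves boxes cyclically over the dimensions until exactly M boxes exist, and Phase 2 picks the empirically best arm per box, returning 0 when some arm has no samples. Unlike the algorithm, ties are broken deterministically here.

```python
# Phase 1: dyadic partition of [0,1]^d into exactly M axis-aligned boxes.
import math

def _halve(box, dim):
    lo, hi = box[dim]
    mid = (lo + hi) / 2
    left, right = list(box), list(box)
    left[dim] = (lo, mid)
    right[dim] = (mid, hi)
    return [left, right]

def partition(d, M):
    boxes = [[(0.0, 1.0)] * d]
    dim = 0
    for _ in range(int(math.log2(M))):          # full rounds of halving
        boxes = [half for box in boxes for half in _halve(box, dim)]
        dim = (dim + 1) % d
    extra = M - 2 ** int(math.log2(M))          # remaining splits to reach M
    boxes = [half for box in boxes[:extra]
             for half in _halve(box, dim)] + boxes[extra:]
    return boxes

# Phase 2 for a single box: empirical best arm, or 0 if an arm is unobserved.
def best_treatment(samples, K):
    """samples: (t, y) pairs falling inside the box."""
    totals = {k: [0.0, 0] for k in range(1, K + 1)}
    for t, y in samples:
        totals[t][0] += y
        totals[t][1] += 1
    if any(n == 0 for _, n in totals.values()):
        return 0                                # some treatment unobserved
    return max(totals, key=lambda k: totals[k][0] / totals[k][1])
```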
3.3.2 Upper Bound on Expected Regret
In this section we aim to understand how fast the performance of a treatment rule
trained with CTS.0 approaches the oracle performance as the training size increases.
Let h# denote a treatment rule generated by CTS.0. Define the expected regret as
E{ max_{k=1,...,K} [μ(X, k)] − μ(X, h#(X)) },   (3.8)
which is the difference between the expected response under the optimal treatment
rule and that under a treatment rule generated by CTS.0 with training set S^N. Note
that the expectation is taken over both the training set S^N and the test point X.
In order to bound the expected regret we need a few assumptions.
1. The feature space X^d is a subspace of R^d and there exist R_U ≥ R_L > 0 such
that, for i = 1, ..., d,
R_L ≤ max_{x, x' ∈ X^d} |x_i − x'_i| ≤ R_U.
2. For a subspace φ ⊆ X^d, denote by f(φ) the probability that X ∈ φ. We assume
there exists a p > 0 such that f(φ) ≥ p · Volume(φ) for all φ ⊆ X^d.
3. The response Y is bounded, i.e., there exists Y_max > 0 such that |Y| ≤ Y_max.
4. The conditional expectation μ(x, k) = E[Y | X = x, T = k] satisfies a Lipschitz
condition, i.e., there exists a C_L > 0 such that ∀x, x' ∈ X^d, ∀k ∈ {1, ..., K},
|μ(x, k) − μ(x', k)| ≤ C_L ‖x − x'‖.
5. The N observations in S^N are independent of each other. The treatment values
follow the discrete uniform distribution on {1, ..., K} and are independent of the
features.
Theorem 3. If the above assumptions are satisfied, then for any given number of
partitions M, the expected regret of CTS.0 is bounded above as follows:
E{ max_{k=1,...,K} [μ(X, k)] − μ(X, h#(X)) }
≤ 4 C_L R_U d^{1/2} M^{-1/d}   (discretization error)
+ 2 Y_max K (1 − p R_L^d / (2^d M K))^N + 8 K² Y_max e^{-1/2} M N^{-1/2}.   (misassignment error)   (3.9)
It is clear from Eq. (3.9) that the bound on the discretization error decreases as
the number of partitions increases, while the bound on the misassignment error, as
expected, increases with the number of partitions. The following corollary shows
how we can achieve the asymptotically tightest bound by setting a proper value
for M.
Corollary 1. Under the same conditions as Theorem 3, by setting
M = round(N^{d/(2(d+1))}), we have
E{ max_{k=1,...,K} [μ(X, k)] − μ(X, h#(X)) } = O(N^{-1/(2(d+1))}).
The significance of Corollary 1 is twofold. First, it proves that the performance
difference between an optimal treatment rule and a treatment rule constructed from
data approaches zero as the number of samples increases to infinity. Second, the
regret decays at a rate no slower than O(N^{-1/(2(d+1))}). The dependence on d in
the upper bound emphasizes the necessity of feature selection: while the performance
of the optimal treatment rule improves with the number of features in the model, it
also requires exponentially more data to approximate the optimal treatment rule to
a given accuracy. Choosing the appropriate model complexity and selecting the most
powerful features are critical to achieving desirable performance.
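As a numeric illustration of Corollary 1, the prescribed M = round(N^{d/(2(d+1))}) and the resulting rate N^{-1/(2(d+1))} can be computed directly; the numbers below are arithmetic consequences of the formulas, not experimental results.

```python
# Number of partitions and regret rate prescribed by Corollary 1.
def partitions_and_rate(N, d):
    M = round(N ** (d / (2 * (d + 1))))
    rate = N ** (-1 / (2 * (d + 1)))
    return M, rate

# For N = 10**6 samples: d = 2 gives M = 100 and rate 0.1, while d = 10
# gives a much slower rate, illustrating the cost of extra dimensions.
```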
The rest of this section is devoted to the proof of Theorem 3, which builds on
three facilitating lemmas. Lemma 4 describes how the size of the partitions of the
feature space decreases with the number of partitions.
Lemma 4. Let X^d be a d-dimensional feature space with all dimensions bounded
above by R_U and below by R_L. Let Φ = {φ_1, ..., φ_M} be the partition of X^d
produced by Phase 1 of Algorithm CTS.0. Then, ∀φ_m ∈ Φ, φ_m is bounded above in
all dimensions by 2 R_U M^{-1/d} and below by (1/2) R_L M^{-1/d}.
Proof. The proof is straightforward and therefore omitted.
Lemma 5 proves that the performance difference between a point-wise optimal
treatment rule and a subspace-wise optimal treatment rule is proportional to the
size of the subspace.
Lemma 5. Let φ be a subspace of X^d with all dimensions bounded above by R. Denote
the conditional expectation of Y as μ(φ, k) = E[Y | X ∈ φ, T = k]. Let
k* ∈ arg max_{k=1,...,K} [μ(φ, k)]. If μ(·) satisfies the Lipschitz condition on φ,
i.e., there exists some C_L > 0 such that ∀x, x' ∈ φ, ∀k ∈ {1, ..., K},
|μ(x, k) − μ(x', k)| ≤ C_L ‖x − x'‖,
then we have
E{ max_{k=1,...,K} [μ(X, k)] − μ(X, k*) | X ∈ φ } ≤ 2 C_L d^{1/2} R.
Proof. ∀x ∈ φ, let k⁺ ∈ arg max_{k=1,...,K} [μ(x, k)]. By the Lipschitz condition,
μ(φ, k*) − C_L d^{1/2} R ≤ μ(x, k*) ≤ μ(φ, k*) + C_L d^{1/2} R,
μ(φ, k⁺) − C_L d^{1/2} R ≤ μ(x, k⁺) ≤ μ(φ, k⁺) + C_L d^{1/2} R,
and therefore, since μ(φ, k⁺) ≤ μ(φ, k*),
μ(x, k⁺) − μ(x, k*) ≤ μ(φ, k⁺) + C_L d^{1/2} R − [μ(φ, k*) − C_L d^{1/2} R]
≤ 2 C_L d^{1/2} R.
Because the bound in the inequality above does not depend on x, we have
E{ max_{k=1,...,K} [μ(X, k)] − μ(X, k*) | X ∈ φ }
≤ E{ 2 C_L d^{1/2} R | X ∈ φ } = 2 C_L d^{1/2} R.
Lemma 6 states that the probability of selecting a sub-optimal treatment for a
subspace decreases exponentially with the number of samples.
Lemma 6. Let φ be a subspace of X^d with all dimensions bounded above by R. Let
k* ∈ arg max_{k=1,...,K} μ(φ, k). Given dataset S^N, let k# be the treatment
assigned to φ by Phase 2 of Algorithm CTS.0. Then,
∀k' ∈ {1, ..., K} \ arg max_{k=1,...,K} μ(φ, k), we have
Pr{k# = k'} ≤ 2 exp{ − Δ² f²(φ) N / (32 K² Y²_max) },
where Δ = μ(φ, k*) − μ(φ, k').
Proof. Let a = (1/2)[μ(φ, k') + μ(φ, k*)]. Then
Pr{k# = k'}
≤ Pr{ ŷ(φ, k') n̂(φ, k*) ≥ ŷ(φ, k*) n̂(φ, k') }
= 1 − Pr{ ŷ(φ, k') n̂(φ, k*) < ŷ(φ, k*) n̂(φ, k') }
≤ 1 − Pr{ ŷ(φ, k') − a n̂(φ, k') < 0, ŷ(φ, k*) − a n̂(φ, k*) > 0 }
= Pr{ ŷ(φ, k') − a n̂(φ, k') ≥ 0 or ŷ(φ, k*) − a n̂(φ, k*) ≤ 0 }
≤ Pr{ ŷ(φ, k') − a n̂(φ, k') ≥ 0 } + Pr{ ŷ(φ, k*) − a n̂(φ, k*) ≤ 0 }.
For i = 1, ..., N, define
Z_i = Y_i 1{X_i ∈ φ, T_i = k'} − a 1{X_i ∈ φ, T_i = k'}.
It is simple to show that E[Z_i] = −(Δ/(2K)) f(φ) and that Z_i is bounded between
−2Y_max and 2Y_max. Because Z_1, ..., Z_N are independent, we can apply Hoeffding's
inequality:
Pr{ ŷ(φ, k') − a n̂(φ, k') ≥ 0 }
= Pr{ Σ_{i=1}^N Z_i ≥ 0 }
= Pr{ Σ_{i=1}^N Z_i − [−(Δ/(2K)) f(φ) N] ≥ (Δ/(2K)) f(φ) N }
≤ exp{ − 2 [(Δ/(2K)) f(φ) N]² / (N (4 Y_max)²) }
= exp{ − Δ² f²(φ) N / (32 K² Y²_max) }.
In a similar way we can prove
Pr{ ŷ(φ, k*) − a n̂(φ, k*) ≤ 0 } ≤ exp{ − Δ² f²(φ) N / (32 K² Y²_max) }.
Combining these results, we have
Pr{k# = k'} ≤ 2 exp{ − Δ² f²(φ) N / (32 K² Y²_max) }.
Proof of Theorem 3
Proof. Assuming M and X^d are given, the partition Φ generated by Phase 1 of
Algorithm CTS.0 is deterministic. For m = 1, ..., M, let
k*_m ∈ arg max_{k=1,...,K} μ(φ_m, k). We can decompose the expected regret into
two parts:
E{ max_{k=1,...,K} [μ(X, k)] − μ(X, h#(X)) }
= Σ_{m=1}^M E{ max_{k=1,...,K} [μ(X, k)] − μ(X, k#_m) | X ∈ φ_m } f(φ_m)
= Σ_{m=1}^M E{ max_{k=1,...,K} [μ(X, k)] − μ(X, k*_m) | X ∈ φ_m } f(φ_m)
+ Σ_{m=1}^M E{ μ(X, k*_m) − μ(X, k#_m) | X ∈ φ_m } f(φ_m)
= part 1 + part 2.
Part 1 is the expected difference in response between using the point-wise optimal
treatment and using the subspace-wise optimal treatment. Combining Lemma 4 and
Lemma 5, we can bound part 1 as
part 1 ≤ Σ_{m=1}^M 2 C_L d^{1/2} · 2 R_U M^{-1/d} f(φ_m) = 4 C_L R_U d^{1/2} M^{-1/d}.
Part 2 is the expected regret from assigning subspaces to sub-optimal treatments;
we need Lemma 6 to bound it. For m = 1, ..., M, let
B_m = arg max_{k=1,...,K} μ(φ_m, k) and Δ_m(k) = μ(φ_m, k*_m) − μ(φ_m, k). Then
E{ μ(X, k*_m) − μ(X, k#_m) | X ∈ φ_m }
= Σ_{k' ∉ B_m} [μ(φ_m, k*_m) − μ(φ_m, k')] Pr{k#_m = k'}
≤ 2 Y_max Pr{k#_m = 0} + Σ_{k' ∉ B_m ∪ {0}} Δ_m(k') Pr{k#_m = k'}
≤ 2 Y_max K [1 − f(φ_m)/K]^N
+ Σ_{k' ∉ B_m ∪ {0}} Δ_m(k') 2 exp{ − Δ_m²(k') f²(φ_m) N / (32 K² Y²_max) }.
Because the volume of φ_m is not smaller than R_L^d / (2^d M) (Lemma 4), we have
2 Y_max K [1 − f(φ_m)/K]^N ≤ 2 Y_max K (1 − p R_L^d / (2^d M K))^N.
The real-valued function g(x) = x e^{−a x²} satisfies g(x) ≤ (2ae)^{-1/2} for
a > 0, therefore
Σ_{k' ∉ B_m ∪ {0}} Δ_m(k') 2 exp{ − Δ_m²(k') f²(φ_m) N / (32 K² Y²_max) }
≤ 8 K² Y_max / (e^{1/2} f(φ_m) N^{1/2}).
Part 2 can then be bounded as
part 2 = Σ_{m=1}^M E{ μ(X, k*_m) − μ(X, k#_m) | X ∈ φ_m } f(φ_m)
≤ Σ_{m=1}^M f(φ_m) · 2 Y_max K (1 − p R_L^d / (2^d M K))^N
+ Σ_{m=1}^M f(φ_m) · 8 K² Y_max / (e^{1/2} f(φ_m) N^{1/2})
≤ 2 Y_max K (1 − p R_L^d / (2^d M K))^N + 8 K² Y_max e^{-1/2} M N^{-1/2}.
Combining part 1 and part 2 completes the proof.
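Lemma 6's exponential bound can be sanity-checked by simulation. This is a hedged sketch with hypothetical parameters (f(φ) = 1, K = 2, Y_max = 1, gap Δ), not an experiment from the thesis:

```python
# Monte Carlo check that the empirical misassignment probability stays
# below the Lemma 6 bound 2*exp(-Delta^2 f^2 N / (32 K^2 Ymax^2)).
import math
import random

def misassignment_rate(N, delta, trials=200, seed=0):
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        sums = {1: [0.0, 0], 2: [0.0, 0]}
        for _ in range(N):
            t = rng.choice([1, 2])                  # uniform assignment (Assumption 5)
            mean = 0.5 + (delta / 2 if t == 1 else -delta / 2)
            y = mean + rng.uniform(-0.3, 0.3)       # bounded response, |Y| <= 1
            sums[t][0] += y
            sums[t][1] += 1
        # empirical means; max(n, 1) guards the (vanishingly rare) empty arm
        means = {t: s / max(n, 1) for t, (s, n) in sums.items()}
        if means[2] >= means[1]:                    # suboptimal arm chosen
            wrong += 1
    return wrong / trials

def lemma6_bound(N, delta, K=2, y_max=1.0, f=1.0):
    return 2 * math.exp(-delta**2 * f**2 * N / (32 * K**2 * y_max**2))
```

The analytical bound is loose, so for moderate N the empirical misassignment frequency should sit well below it.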
3.4 Summary
In this chapter, we present a theoretical analysis of tree-based algorithms in uplift modeling. By separating the training data into one set for model training and another for response prediction, we prove that the algorithm is consistent. Thus our algorithm is asymptotically optimal with infinite training data. Moreover, experimental results show that this approach does not sacrifice prediction performance. Applying dyadic splits, we obtain an upper bound on the expected regret of our algorithm and the asymptotically tightest bound of the regret.
The two techniques used here provide some guidance for the theoretical analysis of tree-based algorithms in general. Decision trees have shown superior performance in many applications, but their statistical properties need further exploration. Using different datasets for model training and response prediction is a common way to reduce the bias from outliers. The consistency analysis is critical as it ensures that the algorithm
is asymptotically optimal. Dyadic decision trees can attain nearly optimal rates of convergence and approximate complex decision boundaries. The restriction to dyadic splits makes the asymptotic analysis more tractable.
Chapter 4
Uplift Modeling for Observational
Studies
4.1 Introduction
Though a randomized experiment is the ideal method to collect data for treatment selection, it is not the final answer: many restrictions limit its generalizability. For example, randomized experiments are often restricted to patients with limited disease, comorbidity, and concomitant medications. In these contexts, observational (nonrandomized) studies play a role. Nowadays, there is a growing interest in using observational studies to estimate treatment effects on responses, due to the
using observational studies to estimate the treatment effects on responses due to the
widespread accumulation of data in fields such as healthcare, education, and ecology.
In observational studies, treatment selection is influenced by subject characteris-
tics. For example, we have an electronic health record dataset collected over several
years. For each patient, we have access to lab tests and past diagnoses of patients,
their medications, and responses, but we do not have complete knowledge of why a specific medication was given to a patient, which may depend heavily on the patient's characteristics. For example, richer patients might better afford certain medications, or a specific medication may be contraindicated for patients with a past disease. As a result, baseline characteristics of subjects often differ systematically between different treatments.
Figure 4-1: Causal graphical models of a randomized experiment (left) and an observational study (right)
We use causal graphical models to illustrate the difference between randomized experiments and observational studies in Figure 4-1. Causal graphical models are tools to visualize causal relationships between variables [Elwert 2013]. Nodes are variables and edges are causal relationships. In Figure 4-1, Y represents the response variable, X represents the features that can be observed but not influenced, T represents the assigned treatment, and U_Y represents all the unobserved exogenous background factors that influence Y. While a great deal of research is aimed at discovering the structure of the causal graph embedded in the observed data correlations [Heckerman 1995] [Hackerman 1997], we assume the causal graph structure is given a priori. Though our focus is not to deduce the causal links, causal graphs are suggestive of bias and can provide a starting point for identifying variables that must be measured and controlled to obtain unconfounded effect estimates.
Unlike randomized experiments, observational studies do not automatically con-
trol for treatment selection biases. The treatments observed in the data depend on
variables which might also affect the response. For example, richer patients might bet-
ter afford certain medications, and job training might only be given to those motivated
enough to seek it. This gives rise to the challenge of untangling these confounding fac-
tors and making valid predictions. Therefore, statistical methods involving matching,
stratification, and covariance adjustment are needed.
Numerous methods exist to handle confounding, including propensity score matching [Mani 2006], marginal structural models [Robins 2000], and g-estimation [Bielbyy 1977]. Doubly robust methods combine re-weighting the samples and covariate adjustment in clever ways to reduce model bias. These methods do not lend themselves immediately to estimating an individual treatment effect and predicting the optimal treatment afterwards. Adapting them for that purpose remains unresolved.
We will show in Section 4.2 that our algorithm solves this challenge by learning a rep-
resentation of data that makes distributions more similar, and training an ensemble
tree model on top of it.
4.2 Algorithm
Let us consider the case of a binary treatment, where one sample of the population is selected and subjected to the treatment, called the treatment set with t = 1, and another selected sample does not receive the treatment, called the control set with t = 0.
Uplift modeling with observational data requires amending the direct modeling
approach to balance feature distributions. A naive way of obtaining a balanced rep-
resentation is to use only features that are already well balanced, i.e., features which
have a similar distribution across treatments. However, imbalanced features can be highly predictive of the response and should not always be discarded. A middle ground is to restrict the influence of imbalanced features on the response. Under linear assumptions, we can use a re-weighting matrix to transform the feature space and achieve a trade-off between its predictive capability and its balance.
However, in various conditions, linear assumptions are not satisfied and the underlying data structure can be very complicated. Deep neural networks have been shown to successfully learn good representations [Bengio 2013]. [Johansson 2016] put forward a modification of the standard feed-forward architecture with fully connected layers. The first few hidden layers are used to learn a representation Φ of the input x. The layers following Φ take the treatment assignment as additional input and generate a prediction of the response. This framework, called CFR, jointly learns the response functions for all treatments by minimizing a weighted sum of the loss and the Integral Probability Metric (IPM) distance between the distributions under treatment and control induced by the feature representation. This model allows for learning complex non-linear representations and response functions with large flexibility.

We combine CFR and CTS to exploit both of their advantages in our model: the neural network architecture learns a feature representation that forces balanced distributions among treatments, and the CTS module takes the new features as input to predict the optimal treatment. The model framework is shown in
Figure 4-2. Here q_0(·) and q_1(·) are the response functions under control and treatment, L is a loss function, and IPM_G is an integral probability metric. The representation Φ and response function q are chosen to minimize a sum of predictive accuracy and representation-space imbalance, as in the following objective. This algorithm, called NN-CTS, is outlined in Algorithm 4.

\[
\min_{q, \Phi} \; \frac{1}{N} \sum_{i=1}^{N} w_i \cdot L\big(q(\Phi(x_i), t_i), y_i\big) + \lambda \cdot \mathcal{R}(q) + \alpha \cdot \mathrm{IPM}_G\big(\{\Phi(x_i)\}_{i: t_i=0}, \{\Phi(x_i)\}_{i: t_i=1}\big),
\]

\[
\text{with } w_i = \frac{t_i}{2u} + \frac{1 - t_i}{2(1 - u)}, \quad \text{where } u = \frac{1}{N} \sum_{i=1}^{N} t_i,
\]

and \mathcal{R} is a model complexity term.
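To make the pieces of this objective concrete, here is a minimal numpy sketch. The squared-error loss and the squared distance between mean representations (a simple linear member of an IPM family) are our illustrative choices, not fixed by the thesis, and the model-complexity term R(q) is omitted:

```python
import numpy as np

def balancing_weights(t):
    """w_i = t_i/(2u) + (1 - t_i)/(2(1 - u)), u = treated fraction,
    so treated and control contribute equally to the factual loss."""
    t = np.asarray(t, dtype=float)
    u = t.mean()
    return t / (2.0 * u) + (1.0 - t) / (2.0 * (1.0 - u))

def linear_mmd(phi0, phi1):
    """Squared distance between mean representations: one simple
    member of the IPM family (our choice for illustration)."""
    return float(np.sum((phi0.mean(axis=0) - phi1.mean(axis=0)) ** 2))

def nn_cts_objective(preds, y, t, phi, alpha=1.0):
    """Weighted factual loss (squared error assumed for L) plus the
    imbalance penalty; the complexity term R(q) is omitted."""
    t = np.asarray(t)
    w = balancing_weights(t)
    factual = np.mean(w * (np.asarray(preds) - np.asarray(y)) ** 2)
    return factual + alpha * linear_mmd(phi[t == 0], phi[t == 1])
```

In a full implementation this objective would be minimized over the network weights by gradient descent; the sketch only evaluates it for given predictions and representations.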
4.3 Experimental Results
Evaluating uplift modeling for observational studies is infeasible with real-world data, as the population distribution and the ground truth of the optimal treatment are both unknown; thus we use synthetic datasets with known distributions and response functions to compare model performance.
We report a few more results to compare response prediction accuracy. For a subject x and each potential treatment t, we denote the potential response as Y_t(x). The quantity Y_1(x) - Y_0(x), called the individualized treatment effect (ITE), is of high interest. Another commonly sought quantity is the average treatment effect,
Figure 4-2: Neural network architecture for uplift modeling
ATE = E_{x∼p(x)}[ITE(x)]. Some quantities of interest are the RMSE of the estimated individual treatment effect, denoted ε_ITE; the absolute error in the estimated average treatment effect, denoted ε_ATE; and the Precision in Estimation of Heterogeneous Effect (PEHE),

\[
\mathrm{PEHE} = \frac{1}{n} \sum_{i=1}^{n} \Big( \big(\hat{y}_1(x_i) - \hat{y}_0(x_i)\big) - \big(Y_1(x_i) - Y_0(x_i)\big) \Big)^2.
\]
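These error measures are straightforward to compute once predicted and true potential outcomes are available for a test set. A minimal sketch, under one plausible reading of the definitions above (in particular, PEHE is taken as the mean squared error of the estimated individual effects):

```python
import numpy as np

def uplift_metrics(y1_hat, y0_hat, y1_true, y0_true):
    """Return (e_ITE, e_ATE, PEHE): RMSE of estimated individual
    effects, absolute error of the average effect, and mean squared
    error of the estimated individual effects."""
    ite_hat = np.asarray(y1_hat) - np.asarray(y0_hat)
    ite_true = np.asarray(y1_true) - np.asarray(y0_true)
    e_ite = np.sqrt(np.mean((ite_hat - ite_true) ** 2))
    e_ate = abs(ite_hat.mean() - ite_true.mean())
    pehe = np.mean((ite_hat - ite_true) ** 2)
    return e_ite, e_ate, pehe
```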
4.3.1 Synthetic Data 1
The feature space is the fifty-dimensional hypercube of side length 10, i.e., X = [0, 10]^{50}. Features are uniformly distributed in the feature space, i.e., X_d ∼ U[0, 10] for d = 1, ..., 50. There are two treatments, T ∈ {0, 1}. The response under each treatment is defined as below.

\[
Y = \begin{cases}
f(X) + U[0, a X_1] + \epsilon & \text{if } T = 0,\\
f(X) + U[0, a X_2] + \epsilon & \text{if } T = 1.
\end{cases} \tag{4.1}
\]
The response is the sum of three components.

* The first term f(X) defines the systematic dependence of the response on the features and is identical for all treatments. Specifically, f is chosen to be a
Algorithm 4 NN-CTS - Contextual Treatment Selection with balanced features

Input: training data S_N = (x^1, t^1, y^1), ..., (x^N, t^N, y^N), scaling parameter α > 0, loss function L(·, ·), representation network Φ_W with initial weights W, outcome network h_V with initial weights V, function family G for the IPM, number of samples in each tree B (B ≤ N), number of trees ntree, number of variables to be considered for a split mtry (1 ≤ mtry ≤ d), the minimum number of samples required for a split min-split, the regularity factor n_reg

Training:
1. Update the representation network weights W and outcome network weights V until the convergence criterion is met.
2. Train ntree trees using the training data features transformed by the representation network.

Prediction: Given a new point x in the feature space, the new feature is the output of the representation network Φ_W, and the predicted expected response under a treatment is the average of the predictions from all the trees. The optimal treatment is the one with the largest predicted expected response.
mixture of 50 exponential functions so that it is complex enough to reflect real-world scenarios:

\[
f(x_1, \ldots, x_{50}) = \sum_{i=1}^{50} a'_i \exp\big\{-b'_{i1}|x_1 - c'_{i1}| - \cdots - b'_{i,50}|x_{50} - c'_{i,50}|\big\}, \tag{4.2}
\]

where the a'_i, b'_{ij}, and c'_{ij} are chosen randomly.
" The second term U[0, aX] is the treatment effect and is unique for each treat-
ment t. In many applications we would expect the treatment effect to be of
a lower order of magnitude of the main effect, so we set a to be 0.4 which is
roughly 5% of E[If(X)|].
" The third term E is the zero-mean Gaussian noise, i.e. c ~ N(0, a2 ). Note that
the standard deviation o- of the noise term is identical for all treatment. o is
82
set to 0.8 which is twice the amplitude of the treatment effect a.
Under this particular data model, the expected response is the same for both treatments, i.e., E[Y | T = t] = 5.18 for t = 0, 1. The expected response under the optimal treatment rule, E[Y | T = h*(X)], is 5.79.
Data under treatments 0 and 1 follow these distributions:

\[
P(X \mid T = 0) = U^{49}[0, 10] \times \mathrm{Tri}(0, 10, 0), \qquad
P(X \mid T = 1) = U^{49}[0, 10] \times \mathrm{Tri}(0, 10, 10).
\]
This is to simulate the real world situation that subjects are assigned to treatment
with different probability distribution based on their characteristics.
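To make the data-generating process concrete, here is a hedged numpy sketch. The ranges of the random constants a'_i, b'_{ij}, c'_{ij} and the choice of which coordinate follows the triangular distribution are our own assumptions (the text does not pin them down; we take the last coordinate):

```python
import numpy as np

rng = np.random.default_rng(0)
D, NCOMP, a_eff, sigma = 50, 50, 0.4, 0.8

# Mixture-of-exponentials main effect, eq. (4.2); the ranges of the
# random constants below are our own assumption.
A = rng.uniform(0.0, 1.0, size=NCOMP)          # a'_i
B = rng.uniform(0.0, 1.0, size=(NCOMP, D))     # b'_ij
C = rng.uniform(0.0, 10.0, size=(NCOMP, D))    # c'_ij

def f(x):
    return np.sum(A * np.exp(-np.sum(B * np.abs(x - C), axis=1)))

def sample(t, n):
    """Observational design: 49 coordinates are U[0, 10]; the last is
    triangular with mode 0 (t = 0) or 10 (t = 1). The response follows
    eq. (4.1): main effect plus U[0, a*X_1] or U[0, a*X_2] plus noise."""
    x = rng.uniform(0.0, 10.0, size=(n, D))
    x[:, -1] = rng.triangular(0.0, 0.0 if t == 0 else 10.0, 10.0, size=n)
    effect_dim = 0 if t == 0 else 1
    y = np.array([f(xi) for xi in x])
    y += rng.uniform(0.0, a_eff * x[:, effect_dim])  # treatment effect
    y += rng.normal(0.0, sigma, size=n)              # Gaussian noise
    return x, y
```

Under this design the last-coordinate distribution differs sharply between the treatment groups, which is exactly the kind of imbalance the representation learning step is meant to remove.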
We compare the performance of 5 different methods. They are NN-CTS, CFR,
CTS, Separate Model Approach with Random Forest (SMA-RF), Support Vector Re-
gressor with Radial Basis Kernel (SMA-SVR). CFR is implemented as a feed-forward
neural network with 3 fully-connected exponential-linear layers for the representation
network and 3 for the outcome network. Layer sizes are 100 for all layers. The model
is trained using Adam [Kingma 2014]. Layers corresponding to the outcome are regularized with a small ℓ2 weight decay. These algorithms are tested under increasing
training data size, specifically 10000, 20000, 30000, 40000, 50000 and 60000 samples
per treatment. For each size, 10 training data sets are generated so that we can com-
pute the margin of error of the results. To ensure consistency in comparison, for each
data size, all the methods are tested with the same 10 datasets. The performance of
a model is evaluated using Monte Carlo simulation and the true data model.
The performance of the 5 methods is plotted in Fig. 4-3. For reference, we also plot the single-treatment expected response and the optimal expected response. The vertical bars are the 95% margins of error computed with results from 10 training datasets. Response prediction accuracy results are presented in Table 4.1. We can see from Fig. 4-3 that the specialized algorithms (NN-CTS, CFR, CTS) surpass the separate model approach, and the advantage continues to grow as the training size
(We use Tri(a, b, c) to denote a triangular distribution with lower limit a, upper limit b, and mode c.)
Figure 4-3: Averaged expected response of different algorithms on the synthetic data simulating observational studies (series: Optimal, Single Trt, SMA-RF, SMA-SVR, CTS, CFR, NN-CTS). The error bars are 95% margins of error computed with results from 10 training datasets. For each data size, all algorithms are trained with the same 10 datasets.
increases. The performance of NN-CTS is significantly better than that of CTS for all data sizes. This shows the advantage of feature representation when data distributions differ among treatments. NN-CTS also outperforms CFR, though both use a neural network to learn new features. The combination of a neural network feature representation and an ensemble tree model for response prediction is promising for handling real-world problems.
Table 4.1: Standard errors of 10 repeated experiments for datasets simulating observationalstudies
To better visualize the feature distributions, we use t-distributed stochastic neighbor embedding (t-SNE) to project the high-dimensional feature vectors to a 2-dimensional space and plot them in Fig. 4-4. Red and blue dots represent treatments 0 and 1, respectively. Note the blue scattered points spread along the outermost circle in the left two figures; they show the clear differences between feature distributions in the original data, while this difference disappears in the feature representation shown in the right two figures.
4.3.2 Synthetic Data 2
In this section, we consider a data set with the same feature space and response function as in Section 4.3.1, except that the data under the two treatments are sampled from the uniform distribution U^{50}[0, 10]. This is a randomized experiment, but we will examine next how the different models behave.
We compare the performance of 5 different methods. They are NN-CTS, CFR,
CTS, Separate Model Approach with Random Forest (SMA-RF), Support Vector
Regressor with Radial Basis Kernel (SMA-SVR). These algorithms are tested under
increasing training data size, specifically 10000, 20000, 30000, 40000, 50000 and 60000
samples per treatment. For each size, 10 training data sets are generated so that we
         ε_ITE      ε_ATE      PEHE
SMA-RF   2.4 ± 0.1  0.7 ± 0.1  4.1 ± 0.2
SMA-SVR  1.9 ± 0.2  0.6 ± 0.1  3.7 ± 0.2
CTS      1.7 ± 0.1  0.4 ± 0.1  2.9 ± 0.1
CFR      1.4 ± 0.1  0.3 ± 0.1  2.1 ± 0.1
NN-CTS   1.2 ± 0.1  0.3 ± 0.1  1.9 ± 0.0
Figure 4-4: Feature distribution after t-distributed stochastic neighbor embedding (t-SNE). Panels: original data (N=10000), original data (N=60000); feature representation (N=10000), feature representation (N=60000).
Figure 4-5: Averaged expected response of different algorithms on the synthetic data simulating randomized experiments (series: Optimal, Single Trt, SMA-RF, SMA-SVR, CTS, CFR, NN-CTS). The error bars are 95% margins of error computed with results from 10 training datasets. For each data size, all algorithms are trained with the same 10 datasets.
can compute the margin of error of the results. The performance of a model is
evaluated using Monte Carlo simulation and the true data model.
The performance of the 5 methods is plotted in Fig. 4-5. Response prediction accuracy results when the data size is 60000 are presented in Table 4.2. When data sizes are small (10000 and 20000), NN-CTS and CFR perform better than CTS. This is because, even though the data distributions under different treatments are the same, the generated data may look different when the data size is very small and the feature space is very large. Therefore, methods with feature representation help reduce this bias on small datasets. This is helpful in the real world, as large datasets may be unavailable in some industries, such as insurance, or at the beginning stage of an experiment. When data sizes are large, NN-CTS and CFR suffer from the error introduced by an unnecessary feature representation.
Table 4.2: Standard errors of 10 repeated experiments for data simulating randomized experiments
4.3.3 Simulation based on Real Data
A semi-simulated dataset based on the Infant Health and Development Program
(IHDP) was introduced by [Hill 2011]. The IHDP data studies the effect of high-
quality child care and home visits on future cognitive test scores in a real randomized
experiment. [Hill 2011] proposes an experiment which uses a simulated outcome and artificially introduces imbalance between treated and control subjects by removing a subset of the treated group. The dataset consists of 747 subjects in total, 139 in the treatment group and 608 in the control group. There are 25 covariates measuring properties of the child and their mother for each subject. See Table 4.3 for results. It shows that NN-CTS achieves lower errors than the separate model approach and CFR, though we would still need to compare their expected-response performance in a live experiment.
         ε_ITE      ε_ATE      PEHE
SMA-RF   2.1 ± 0.2  0.5 ± 0.0  3.7 ± 0.2
SMA-SVR  1.7 ± 0.2  0.5 ± 0.1  2.9 ± 0.2
CTS      1.4 ± 0.1  0.4 ± 0.1  2.2 ± 0.1
CFR      1.1 ± 0.1  0.3 ± 0.1  1.7 ± 0.1
NN-CTS   1.1 ± 0.1  0.2 ± 0.1  1.5 ± 0.0
Table 4.3: Standard errors for 10 repeated experiments on IHDP dataset
4.4 Summary
In this chapter, we present a model for uplift modeling with observational study data. As there is no automatic control for treatment selection bias, we need additional techniques to untangle the confounding factors. Seeing the success of neural networks in various feature representation applications, we adopt the feed-forward architecture with fully connected layers of [Johansson 2016] to learn a feature representation that jointly minimizes the response prediction error and the feature distribution distance between treated and control groups. CTS takes the new features as input and predicts the optimal treatment. Experimental results on synthetic data show that the combination of neural network feature representation and an ensemble tree model can handle the challenges encountered in observational studies well. However, evaluation with data collected in observational studies still remains an open question.
         ε_ITE      ε_ATE      PEHE
SMA-RF   3.2 ± 0.2  0.8 ± 0.0  4.9 ± 0.2
SMA-SVR  2.8 ± 0.2  0.5 ± 0.0  3.7 ± 0.2
CTS      1.8 ± 0.1  0.3 ± 0.0  2.1 ± 0.1
CFR      1.7 ± 0.0  0.3 ± 0.0  1.6 ± 0.1
NN-CTS   1.6 ± 0.1  0.3 ± 0.0  1.6 ± 0.0
Chapter 5
Conclusion
5.1 Summary
We have seen uplift modeling put forward initially in the fields of marketing and insurance, but it has much broader applications in many areas, such as healthcare, economics, and sociology. Data collected from randomized experiments and observational studies can both be used to analyze individualized treatment effects and achieve uplift by personalized treatment assignment.
In this thesis, we first present the formulation of uplift modeling in a more general
framework. Based on this formulation, we put forward an unbiased estimate of the
expected response for an uplift model. This evaluation metric expands the scope
of uplift modeling as we are able to evaluate a broad range of models from binary
treatments to multiple treatments, from discrete responses to continuous responses.
Based on this evaluation metric, we present our algorithm, Contextual Treatment
Selection (CTS), for uplift modeling with randomized experiment data. This fills a critical vacancy in the field of uplift modeling by handling multiple treatments and continuous responses. Experimental results on synthetic and industry-provided data demonstrate that CTS can lead to a significant increase in expected response relative to other applicable methods. To further analyze the asymptotic properties, we apply dyadic splits, which split each axis at its midpoint, and obtain the convergence rate of a generic CTS algorithm.
5.2 Future Work
As machine learning is becoming a major tool for researchers and policy makers
across different fields such as healthcare and economics, contextual treatment selection
becomes a crucial issue for the practice of machine learning. In most machine learning problems, we assume that the underlying data model is stable. We also make this assumption throughout the thesis. However, things are always changing in the real world, and machine learning in non-stationary environments is a widely researched topic. This issue also arises in the uplift modeling problem. For example, in the airline
customized pricing problem, passengers' purchasing behavior can change over time. Reasons include, but are not limited to, business strategy changes at the airline and competing airlines, important holidays like Christmas/New Year, exogenous events like terrorism or natural disasters, the general economic situation, etc. Fig. 5-1 shows how the average revenue for different treatments changes over time from the start of our randomized experiment. Without complete knowledge of this information, it is usually difficult to incorporate it in models.
Figure 5-1: The change of average revenue for different treatments over time (Apr 27 to May 11), with 90% confidence bands (red: default price, blue: high price)
Although such time heterogeneity exists, the change evolves relatively slowly compared to the data collection procedure, as we can also see from Fig. 5-1. In the case of seat reservation in Chapter 2, data collected from a 3-week randomized experiment may be enough to train an uplift model that performs at a satisfactory level for three months. The common approach is to periodically retrain the model with new data so that the model always represents the latest user behavior pattern. However, choosing the appropriate data collection period and model refresh frequency needs to be treated carefully, with further investigation of the problem itself, and may require a lot of trial and error. Designing a systematic way to update models automatically is well worth exploring, which can help not only uplift modeling but also broader areas of machine learning.
Open questions that remain include how to generalize the method for observational studies to cases where more than one treatment is in question, deriving better optimization algorithms, and using richer discrepancy measures. Theoretical considerations include the choice of the IPM weight α, and the integration of our work with more complicated causal models such as those with hidden confounding or instrumental variables.
Appendix A
Parameter Selection for CTS
Here we describe the details of the parameter tuning that produced the results in Chapter 2 of this thesis.
* CTS: Fixed parameters are the number of trees ntree=100, the number of fea-
tures considered at each split mtry=25 and the regularization factor nreg=3.
The minimum number of samples required for a split min-split is selected
among [25, 50, 100, 200, 400, 800, 1600, 3200, 6400] (large values are omitted when they exceed the dataset size). For synthesized data, min-split is selected using 5-fold cross-validation when the training data size is 500, 1000, 2000, 4000 and 8000 samples per treatment. Because of time constraints, for sample sizes 16000 and 32000, min-split is selected using validation (half training/half test) on one data set and kept the same for the other nine data sets.
* RF: Fixed parameters are the number of trees nestimators=100, the number
of features considered at each split max_features=25. The minimum number
of samples in leaf node nodesize is tuned with 5-fold cross-validation among
[1,5,10,20].
* KNN: The number of neighbors nneighbors is tuned with 5-fold cross-validation
among [5,10,20,40].
" SVR: The regularization parameter C and the value of the insensitive-zone c are
95
determined analytically using the method proposed in [Cherkassky 2004]. The
spread parameter of the radial basis kernel -y is selected among [10-4, 10-3, 10-2, 10-1]
using 5-fold cross-validation.
* Ada: The number of estimators nestimators=100. Square loss.
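The 5-fold cross-validation used throughout these selections can be sketched generically; `fit` and `score` are placeholder callables standing in for training a model with the candidate parameter and scoring it on held-out data (higher is better):

```python
import numpy as np

def five_fold_cv_select(candidates, fit, score, X, y, n_folds=5, seed=0):
    """Return the candidate parameter with the best average
    validation score across n_folds folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    best_param, best_score = None, -np.inf
    for param in candidates:
        scores = []
        for k in range(n_folds):
            val = folds[k]
            train = np.concatenate(
                [folds[j] for j in range(n_folds) if j != k])
            model = fit(param, X[train], y[train])
            scores.append(score(model, X[val], y[val]))
        if np.mean(scores) > best_score:
            best_param, best_score = param, np.mean(scores)
    return best_param
```

For example, min-split for CTS would be selected by passing the candidate list above and a `score` that returns the (negated) validation error of a forest trained with that min-split.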
Appendix B
Variable Selection
Here we describe how we select variables for the real world data in Section 2.3.
Variable selection is critical to real-world problems as we have little knowledge of the
underlying data model. New variables may bring new information to the prediction,
while irrelevant, redundant, or weak variables may cause overfitting and reduce the
interpretability of the model. We first extract meaningful variables from the raw data (e.g., the transaction record of each flight ticket purchase), and then select the optimal variable set using a forward-greedy approach.
* Seat Reservation Data

  Available variables: booking hour, booking weekday, travel hour, travel weekday, number of days between the last login and the next flight, zone code, fare class, days between the last login and the closest holiday, group size, flight ticket price

  Optimal variable set: booking hour, booking weekday, travel hour, travel weekday, number of days between the last login and the next flight, zone code, fare class
Table B.1: Variable selection on seat reservation data
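The forward-greedy selection described in this appendix can be sketched as follows; `evaluate` is a placeholder for whatever validation metric (e.g., cross-validated expected response) scores a candidate variable subset:

```python
def forward_greedy_select(variables, evaluate):
    """Forward-greedy variable selection: starting from the empty set,
    repeatedly add the variable that most improves evaluate(subset)
    (higher is better); stop when no addition helps."""
    selected = []
    best = evaluate(selected)
    while True:
        gains = [(evaluate(selected + [v]), v)
                 for v in variables if v not in selected]
        if not gains:
            break
        top_score, top_var = max(gains)
        if top_score <= best:
            break
        selected.append(top_var)
        best = top_score
    return selected
```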
Appendix C
Parameter Selection for NN-CTS
We follow the approach described in C.1 of [Johansson 2016] for hyperparameter selection. See Table C.1 for a description of the hyperparameters and search ranges.
Table C.1: Parameters and ranges

Parameter                            Range
Imbalance parameter α                {10^{k/2}}
Number of representation layers      {1, 2, 3}
Number of hypothesis layers          {1, 2, 3}
Dimension of representation layers   {20, 50, 100, 200}
Dimension of hypothesis layers       {20, 50, 100, 200}
Batch size                           {100, 200, 500, 700}
Bibliography
[Agostino 1995] Ralph B. D'Agostino, and Heidy Kwan (1995). Measuring effectiveness: what to expect without a randomized control group. Medical Care, vol. 33, no. 4, pages 95-105.

[Agostino 1998] Ralph B. D'Agostino, Jr (1998). Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statistics in Medicine, 17, pages 2265-2281.

[Agrawal 1994] Rakesh Agrawal, and Ramakrishnan Srikant (1994). Fast algorithms for mining association rules. In Proceedings of the VLDB Conference, pages 487-499.

[Alemi 2009] Farrokh Alemi, Harold Erdman, Igor Griva, and Charles H. Evans (2009). Improved Statistical Methods Are Needed to Advance Personalized Medicine. The Open Translational Medical Journal, 1, 16-20.

[Alina 2011] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E. Schapire (2011). Contextual bandit algorithms with supervised learning guarantees. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR Workshop and Conference Proceedings Volume 15.

[Angrist 2008] Joshua D. Angrist and Jorn-Steffen Pischke (2008). Mostly harmless econometrics: An empiricist's companion. Princeton University Press.

[Ascarza 2016] Eva Ascarza (2016). Retention futility: Targeting high risk customers might be ineffective. Available at SSRN.

[Athey 2016] Susan Athey and Guido Imbens (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, vol. 113, no. 27, pages 7353-7360.

[Auer 2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer (2002). Finite time analysis of the multi-armed bandit problem. Machine Learning, vol. 47, pages 235-256.

[Auer 2003] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire (2003). The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, vol. 32, no. 1, pages 48-77.
[Austin 2011] Peter C. Austin (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, vol. 46, no. 3, pp. 399-424.

[Bang 2005] Heejung Bang, James M. Robins (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, vol. 61, no. 4, pp. 962-973.

[Ben 2007] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira (2007). Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, pp. 137-144.

[Bengio 2013] Yoshua Bengio, Aaron Courville, Pascal Vincent (2013). Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798-1828.

[Bottou 2013] Leon Bottou, Jonas Peters, Joaquin Quinonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, Ed Snelson (2013). Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, vol. 14, no. 1, pp. 3207-3260.

[Bielbyy 1977] William T. Bielby and Robert M. Hauser (1977). Structural equation models. Annual Review of Sociology, vol. 3, pp. 137-161.
[Breiman 1984] Leo Breiman, Jerome Friedman, Charles J. Stone, and R. A. Olshen (1984). Classification and Regression Trees. Wadsworth Statistics/Probability.

[Breiman 2001] Leo Breiman (2001). Random forests. Machine Learning, vol. 45, no. 1, pp. 5-32.

[Chen] Xin Chen, Zack Owen, Clark Pixton, and David Simchi-Levi. A Statistical Learning Approach to Personalization in Revenue Management.

[Chen 2016] Tianqi Chen and Carlos Guestrin (2016). XGBoost: A scalable tree boosting system. arXiv:1603.02754.

[Cherkassky 2004] Vladimir Cherkassky, Yunqian Ma (2004). Practical selection of SVM parameters and noise estimation for SVM regression. Neural Networks, 17(1), 113-126.

[Chickering 2000] David Maxwell Chickering and David Heckerman (2000). A decision theoretic approach to targeted advertising. Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 82-88.

[Cochran 1965] W. G. Cochran (1965). The planning of observational studies of human populations. Journal of the Royal Statistical Society, 128, pp. 134-155.

[Cornia 2011] Marco Cornia, Kristopher S. Gerardi, Adam Hale Shapiro (2011). Price Discrimination and Business-Cycle Risk. Atlanta, Federal Reserve Bank of Atlanta.
[Cortes 2014] Corinna Cortes, and Mehryar Mohri (2014). Domain adaptation andsample bias correction theory and algorithm for regression. Theoretical ComputerScience, 519:103-126.
[Delgado 2014] M. Fernandez-Delgado, Eva Cernadas, Senen Barro, and DinaniAmorim (2014). Do we Need Hundreds of Classifiers to Solve Real World Classifi-cation Problems?, Journal of Machine Learning Research, vol. 15, pp. 3133-3181.
[Eisenstaein 2007] Eric L. Eisenstein, Kevin J. Anstrom, David F. Kong, Linda K.Shaw, Robert H. Tuttle, Daniel B. Mark, Judith M. Kramer, Robert A. Harring-ton, David B. Matchar, David E. Kandzari, Eric D. Peterson, Kevin A. Schulman,Robert M. Califf (2007). Clopidogrel use and long-term clinical outcomes afterdrug-eluting stent implantation. JAMA, 297:159-168.
[Elwert 2013] Felix Elwert (2013). Graphical causal models. In Handbook of causalanalysis for social research, pp. 245-273, Springer.
[Fisher 19351 Ronald A. Fisher (1935). The Design of Experiments, London: Oliverand Boyd.
[Flaxman 2005] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMa-han (2005). Online convex optimization in the bandit setting: gradient descentwithout a gradient. In Proceedings of the Annual ACM-SIAM Symposium on Dis-crete Algorithms, pp. 385-394.
[Freedman 1997] David Freedman (1997). From association to causation via regres-sion. Advances in Applied Mathematics, vol. 18, no. 1, pp. 59-110.
[Gey 2015] Servane Gey, and Elodie Nedelec (2005). Model selection for cart regres-sion trees. IEEE Transaction on Information Theory, vol. 51, no. 2, pp. 658-670.
[Greenland 1999] Sander Greenland, Judea Pearl, and James M. Robins (1999). Causal Diagrams for Epidemiologic Research. Epidemiology, 10, pp. 37-48.
[Guelman 2014a] Leo Guelman, Montserrat Guillen, and Ana M. Perez-Marin (2014a). A survey of personalized treatment models for pricing strategies in insurance. Insurance: Mathematics and Economics, 58, pp. 68-76.
[Guelman 2014b] Leo Guelman, Montserrat Guillen, and Ana M. Perez-Marin (2014b). Optimal personalized treatment rules for marketing interventions: A review of methods, a new proposal, and an insurance case study. Technical report.
[Guelman 2014] Leo Guelman (2014). uplift: Uplift Modeling. R package version 0.3.5. http://CRAN.R-project.org/package=uplift
[Hansotia 2002] Behram Hansotia and Brad Rukstales (2002). Incremental value modeling. Journal of Interactive Marketing, 16(3):35.
[Heckerman 1995] David Heckerman (1995). A Bayesian approach to learning causal networks. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 285-295.
[Heckerman 1997] David Heckerman (1997). Bayesian networks for data mining. Data Mining and Knowledge Discovery, 1(1):79-119.
[Hill 2011] Jennifer L. Hill (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1).
[Holland 1986] Paul W. Holland (1986). Statistics and causal inference. Journal of the American Statistical Association, vol. 81, no. 396, pp. 945-960.
[Holland 1988] Paul W. Holland and Dorothy T. Thayer (1988). Differential item performance and the Mantel-Haenszel procedure. Test Validity, pp. 129-145.
[Huupponen 2013] Risto Huupponen and Jorma Viikari (2013). Statins and the risk of developing diabetes. BMJ, 346.
[Jaskowski 2012] Maciej Jaskowski and Szymon Jaroszewicz (2012). Uplift modeling for clinical trial data. In ICML Workshop on Clinical Data Analysis.
[Johansson 2016] Fredrik D. Johansson, Uri Shalit, and David Sontag (2016). Learning representations for counterfactual inference. In Proceedings of the 33rd International Conference on Machine Learning (ICML).
[Kingma 2014] Diederik P. Kingma and Jimmy Ba (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kleinberg 2005a] Robert D. Kleinberg (2005a). Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pp. 697-704.
[Kleinberg 2005b] Robert D. Kleinberg (2005b). Online Decision Problems with Large Strategy Sets. PhD thesis, Massachusetts Institute of Technology, June 2005.
[Kuusisto 2015] Finn C. Kuusisto (2015). Machine Learning for Medical Decision Support and Individualized Treatment Assignment. PhD thesis, University of Wisconsin-Madison.
[Lai 1985] T. L. Lai and Herbert Robbins (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22.
[Li 2013] Jiuyong Li, Thuc Duy Le, Lin Liu, Jixue Liu, Zhou Jin, and Bingyu Sun (2013). Mining causal association rules. 2013 IEEE 13th International Conference on Data Mining Workshops, pp. 114-123.
[Li 2015] Jiuyong Li, Thuc Duy Le, Lin Liu, Jixue Liu, Zhou Jin, Bingyu Sun, and Saisai Ma (2015). From observational studies to causal rule mining. ACM Transactions on Intelligent Systems and Technology (TIST), 7(2):14.
[Lo 2002] Victor S. Y. Lo (2002). The true lift model: A novel data mining approach to response modeling in database marketing. ACM SIGKDD Explorations Newsletter, 4(2):78-86.
[Manahan 2005] Charles Manahan (2005). A proportional hazards approach to campaign list selection. In SAS User Group International (SUGI) 30 Proceedings.
[Mani 2006] Subramani Mani, Peter L. Spirtes, and Gregory F. Cooper (2006). A theoretical study of Y structures for causal discovery. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence.
[McClellan 1994] Mark McClellan, Barbara J. McNeil, and Joseph P. Newhouse (1994). Does more intensive treatment of acute myocardial infarction in the elderly reduce mortality? Analysis using instrumental variables. JAMA, 272:859-866.
[Nassif 2013] Houssam Nassif, Finn Kuusisto, Elizabeth S. Burnside, and Jude W. Shavlik (2013). Uplift modeling with ROC: An SRL case study. In ILP (Late Breaking Papers), pp. 40-45.
[Neyman 1935] J. Neyman (1935). Statistical Problems in Agricultural Experiments. Journal of the Royal Statistical Society, vol. II, no. 2, pp. 107-180.
[Perchet 2013] Vianney Perchet and Philippe Rigollet (2013). The multi-armed bandit problem with covariates. Annals of Statistics, vol. 41, no. 2, pp. 692-721.
[Radcliffe 2007] Nicholas J. Radcliffe (2007). Using control groups to target on predicted lift: Building and assessing uplift models. Direct Marketing Analytics Journal, An Annual Publication from the Direct Marketing Association Analytics Council, pp. 14-21.
[Radcliffe 2008] Nicholas J. Radcliffe (2008). Hillstrom's MineThatData email analytics challenge: An approach using uplift modeling. Stochastic Solutions Limited, pp. 1-19.
[Radcliffe 2011] Nicholas J. Radcliffe and Patrick D. Surry (2011). Real-world uplift modeling with significance-based uplift trees. White Paper TR-2011-1, Stochastic Solutions.
[Robins 2000] James Robins, Miguel Angel Hernan, and Babette Brumback (2000). Marginal structural models and causal inference in epidemiology. Epidemiology, pp. 550-560.
[Rosenbaum 1983] Paul R. Rosenbaum and Donald B. Rubin (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, vol. 70, no. 1, pp. 41-55.
[Rosenbaum 2002] Paul R. Rosenbaum (2002). Observational studies. Springer.
[Rubin 1974] Donald B. Rubin (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.
[Rubin 2007] Donald B. Rubin (2007). The design versus the analysis of observational studies for causal effects: parallels with the design of randomized studies. Statistics in Medicine, 26:20-36.
[Rusmevichientong 2010] Paat Rusmevichientong and John N. Tsitsiklis (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35:395-411.
[Rzepakowski 2010] Piotr Rzepakowski and Szymon Jaroszewicz (2010). Decision trees for uplift modeling. 2010 IEEE International Conference on Data Mining, pp. 441-450.
[Rzepakowski 2012] Piotr Rzepakowski and Szymon Jaroszewicz (2012). Decision trees for uplift modeling with single and multiple treatments. Knowledge and Information Systems, 32(2):303-327.
[Rzepakowski 2015] Michal Soltys, Szymon Jaroszewicz, and Piotr Rzepakowski (2015). Ensemble methods for uplift modeling. Data Mining and Knowledge Discovery, 29:1531-1559.
[Sekhon 2008] Jasjeet Sekhon (2008). The Neyman-Rubin model of causal inference and estimation via matching methods. The Oxford Handbook of Political Methodology, pp. 271-299.
[Silverstein 2000] Craig Silverstein, Sergey Brin, Rajeev Motwani, and Jeff Ullman (2000). Scalable techniques for mining causal structures. Data Mining and Knowledge Discovery, 4(2-3):163-192.
[Scott 2005] Clayton Scott (2005). Tree pruning with subadditive penalties. IEEE Transactions on Signal Processing, vol. 53, no. 12, pp. 4518-4525.
[Stukel 2007] Therese A. Stukel, Elliott S. Fisher, David E. Wennberg, David A. Alter, Daniel J. Gottlieb, and Marian J. Vermeulen (2007). Analysis of observational studies in the presence of treatment selection bias: effects of invasive cardiac management on AMI survival using propensity score and instrumental variable methods. JAMA, 297:278-285.
[Su 2012] Xiaogang Su, Joseph Kang, Juanjuan Fan, Richard A. Levine, and Xin Yan (2012). Facilitating score and causal inference trees for large observational studies. The Journal of Machine Learning Research, 13(10):2955-2994.
[Wager 2017] Stefan Wager and Susan Athey (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, forthcoming.
[Wang 2005a] Chih-Chun Wang, Sanjeev R. Kulkarni, and H. Vincent Poor (2005a). Arbitrary side observations in bandit problems. Advances in Applied Mathematics, vol. 34, no. 4, pp. 903-938.
[Wang 2005b] Chih-Chun Wang, Sanjeev R. Kulkarni, and H. Vincent Poor (2005b). Bandit problems with side observations. IEEE Transactions on Automatic Control, vol. 50, no. 3, pp. 338-355.
[Wong 1978] Stanley Wong (1978). Foundations of Paul Samuelson's Revealed Preference Theory: A Study by the Method of Rational Reconstruction. Routledge, ISBN 0-7100-8643-1.
[Zaniewicz 2013] Lukasz Zaniewicz and Szymon Jaroszewicz (2013). Support vector machines for uplift modeling. IEEE 13th International Conference on Data Mining Workshops, pp. 131-138.
[Zhao 2017] Yan Zhao, Xiao Fang, and David Simchi-Levi (2017). A Practically Competitive and Provably Consistent Algorithm for Uplift Modeling. IEEE International Conference on Data Mining.