Causal data mining: Identifying causal effects at scale
Transcript of Causal data mining: Identifying causal effects at scale
1
Causal data mining: Identifying causal effects at scale
AMIT SHARMA, Postdoctoral Researcher, Microsoft Research New York
http://www.amitsharma.in | @amt_shrma
2
A tale of two questions
Q1: How much activity comes from the recommendation system?
Q2: How much activity comes because of the recommendation system?
3
How much activity comes because of the recommendation system?
A causal question.
With recommender
Without recommender
Real world Counterfactual world
2. Evaluating and improving systems
1. Modeling user behavior
Understanding causal relationships from data
Distinguishing between personal preference and homophily in online activity feeds. Sharma and
Cosley (2016).
Studying and modeling the effect of social explanations in recommender systems. Sharma
and Cosley (2013).
Amit and Dan like this.
SOME MUSICAL ARTIST
Understanding causal relationships from data
Averaging Gone Wrong: Using Time-Aware Analyses to Better Understand Behavior. Barbosa, Cosley,
Sharma, Cesar (2016)
Auditing search engines for differential satisfaction across demographics. Mehrotra, Anderson, Diaz, Sharma,
Wallach (2016)
7
A core problem across the sciences
Understanding causal relationships from data
Code profiling, static analysis [Berger et al.]
Debugging machine learning [Chakarov et al.]
Decision-making in robotics
8
Why is it hard?
Without: recommender system algorithm, any code change, social policy, medical treatment.
Observed data comes from the real world; there is no data from the counterfactual world.
Without a randomized experiment, hard to estimate.
9
Difference between prediction and causation
Cause (X) → Outcome (Y)
Unobserved Confounders (U)
y = f(x, u)
Hofman, Sharma, and Watts (2017). Science, 355.6324
10
Prediction: y = g(x) + ε. From observed ⟨X, Y⟩, estimate ĝ.
Hofman, Sharma, and Watts (2017). Science, 355.6324
Causation: y = f(x, u). From observed ⟨X, Y⟩, estimate f̂; but u is unobserved.
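The gap between the predictive ĝ and the causal f̂ can be seen in a toy simulation (a sketch with illustrative coefficients, not numbers from the talk): when an unobserved confounder drives both X and Y, the best predictive slope is not the causal effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u = rng.normal(size=n)                       # unobserved confounder U
x = 0.8 * u + rng.normal(size=n)             # cause X, partly driven by U
y = 1.0 * x + 2.0 * u + rng.normal(size=n)   # true causal effect of X on Y is 1.0

# Best-fit regression slope of Y on X: good for prediction,
# badly biased as a causal estimate because it absorbs the U -> Y path.
slope = np.cov(x, y)[0, 1] / np.var(x)
print(f"predictive slope ~ {slope:.2f}; true causal effect = 1.0")
```

The printed slope lands well above 1.0, even though intervening on X would move Y by only 1.0 per unit.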
11
Research goal
How can we use large-scale data to infer causal estimates?
Use algorithms to find experiment-like data: quasi ("natural") experiments.
12
Prediction: y = βx + ε. From ⟨X, Y⟩, estimate β̂.
Causation: From ⟨X, Y⟩ plus a natural experiment, estimate the causal β̂.
Combine Pearl's causal graphical model framework with natural experiments.
14
Inverting the natural experiment paradigm
Hypothesize about a natural variation → argue why it resembles a randomized experiment.
Observational data → develop tests for validity of natural variation → mine for data subsets with such valid variations.
15
[Figure: many "Natural Experiment" boxes derived from a single dataset ⟨X, Y⟩; caption: "Since 1850s".]
16
⟨X, Y⟩ → Natural Experiment → ⟨X, Y⟩
Data mining for causal inference:
1. Split-door Criterion: causal effect of recommender systems
2. Bayesian Natural Experiment Test: validate past economics studies
17
Part 0: Traditional causal inference using a natural experiment
18
1854: London was having a devastating cholera outbreak
19
Causal question: What is causing cholera?
Air-borne: spreads through air ("miasma")
Water-borne: spreads through contaminated water
Polluted Air
Cholera Diagnosis
Contaminated Water
Cholera Diagnosis
Neighborhood
21
Enter John Snow. He found higher cholera deaths near a water pump, but this could be merely correlational.
22
[Map: S & V Water Company vs. Lambeth Water Company service areas.]
New idea: Two major water companies served London, one drawing water upstream and one downstream.
23
No difference in neighborhoods, yet an 8-fold increase in cholera among customers of the downstream company.
S&V and Lambeth
24
Led to a change in belief about cholera’s cause.
25
Why was Snow's study so convincing?
• Choice of water company cannot by itself cause cholera.
• Choice of water company was not related to people's neighborhood or its air quality: people receiving water from the two companies were interspersed within neighborhoods.
Probably the first application of cause-effect principles
26
Exclusion, As-if-random
Water Company → Contaminated Water → Cholera Diagnosis
Neighborhood
27
Contaminated Water (X) → Cholera Diagnosis (Y)
Other factors [e.g. neighborhood] (U): unobserved confounders
Water Company (Z): As-If-Random, Exclusion
Two assumptions central to causal inference: Exclusion and As-if-random
28
(Z ⊥⊥ Y | X, U)
Cause (X) Outcome (Y)
Unobserved Confounders
(U)
New variable
(Z)
As-If-Random
Exclusion
Two assumptions central to causal inference: Exclusion and As-if-random
1930s: Fisher introduces randomized experiment
Since then, these assumptions have formed the core of causal inference
29
Cause (X)
Outcome (Y)
Unobserved Confounders
(U)
Randomized Assignment (Z)
Exclusion: Randomized assignment should not affect the outcome except through the cause.
As-if-random: Randomized assignment should be independent of unobserved confounders.
Z is now a special observed variable, called an instrumental variable.
All studies using observational data also need to satisfy these two assumptions
Cause (X)
Outcome (Y)
Unobserved Confounders
(U)
Instrumental Variable (Z)
30
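The payoff of these two assumptions can be sketched in a small simulation (illustrative coefficients, not from the talk), using the standard Wald/IV ratio estimator: the instrument removes the confounding bias that a naive regression absorbs.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
u = rng.normal(size=n)                      # unobserved confounder U
z = rng.normal(size=n)                      # instrument Z: independent of U (as-if-random)
x = 0.5 * z + 0.8 * u + rng.normal(size=n)  # Z moves X; U confounds X and Y
y = 1.0 * x + 2.0 * u + rng.normal(size=n)  # Z affects Y only through X (exclusion)

naive = np.cov(x, y)[0, 1] / np.var(x)        # biased upward by U
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]  # Wald/IV estimator: near the true 1.0
```

The naive slope overshoots the true effect of 1.0, while the IV ratio recovers it, because Z carries only the variation in X that is unrelated to U.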
But Exclusion and As-if-random are hard to establish, because of unobserved confounders.
32
More formally…
Full dataset ⟨X, Y⟩ → subsets of the data, each resembling an experiment.
Such that: As-If-Random: Z ⊥⊥ U. Exclusion: Z ⊥⊥ Y | X, U.
Hard to verify from observed data.
33
Current methods haven't changed much from those used by John Snow in the 1850s: use rhetorical arguments to justify an instrumental variable.
1. Manually finding an instrumental variable restricts researchers to single-source events (e.g. weather or lottery).
2. Still no guarantee that either Exclusion or As-if-random is satisfied.
34
Causal data mining: Inverting the natural experiment paradigm
Hypothesize about a natural variation → argue why it resembles a randomized experiment.
Observational data → develop tests for validity of natural variation → mine for data subsets with such valid variations.
35
Part I: Split-door criterion for causal identification
36
Intuition: What if we can observe an auxiliary outcome that is unaffected by causal variable?
Cause Outcome
Unobserved
Confounders
Auxiliary
Outcome
Outcome can be separated into two observable parts:
i) Primary outcome: (possibly) affected by cause
ii) Auxiliary outcome: unaffected by cause
37
40
Simplest case: Outcome can be separated into two observable parts
i) Primary outcome: (possibly) affected by cause
ii) Auxiliary outcome: unaffected by cause
41
Such outcome data is commonly available in digital systems: recommender systems, ad systems, app notifications, any content website (such as news).
Let’s take a concrete example: recommender systems
42
Can we find such an auxiliary outcome?
43
Example: Estimating the causal impact of a recommender system (novel recommendations)
44
How much activity comes from the recommendation system?
30% of product page visits.
30% of groups joined.
80% of movies watched.
Sharma and Yan (2013), Sharma, Hofman and Watts (2015), Gomez-Uribe and Hunt (2015)
Confounding: Observed click-throughs may be due to correlated demand
45
Demand for The Road → Visits to The Road → Rec. visits to No Country for Old Men ← Demand for No Country for Old Men
(Both demands are driven by correlated demand for Cormac McCarthy.)
46
Observed activity is almost surely an overestimate of the causal effect
[Figure: of all page visits, observed activity from the recommender splits into a causal part and a convenience part; activity without the recommender is unknown ("?").]
47
Counterfactual thought experiment: What would have happened without recommendations?
48
Hypothetical experiment: Randomized A/B test
But such experiments can be costly. Can we develop an offline metric?
Treatment (A) Control (B)
49
Past work: traditional instrumental variable
Instrument → Demand for Cormac McCarthy → Visits to The Road → Rec. visits to No Country for Old Men
Carmi et al. (2012)
Data mining approach (Shock-IV): Finding valid shocks across product categories
50
Shock to demand of a product (e.g., due to Oprah) → argue why it resembles a randomized experiment → develop tests for validity of a shock → mine for shocks in observational data.
Finding an auxiliary outcome: split the outcome into recommender visits (primary) and direct visits (auxiliary).
51
All visits to a recommended product = Recommender visits + Direct visits (search visits, direct browsing)
Auxiliary outcome: Proxy for unobserved demand
52
Causal graphical model for the effect of a recommendation system:
Demand for focal product (U_X) → Visits to focal product (X) → Rec. visits (Y_R)
Demand for rec. product (U_Y) → Rec. visits (Y_R) and → Direct visits (Y_D)
Unknown ("?") dependence between U_X and U_Y.
1a. Search for any product with a shock to page visits
53
1b. Filtering out invalid natural experiments
54
55
The "split-door" criterion: test if the auxiliary outcome is independent of the cause.
Criterion: X ⊥⊥ Y_D (Exclusion)
Demand for focal product (U_X) → Visits to focal product (X) → Rec. visits (Y_R)
Demand for rec. product (U_Y) → Rec. visits (Y_R) and → Direct visits (Y_D)
56
More formally, why does it work?
Theorem 1: Barring incidental equality of parameters, statistical independence of X and Y_D guarantees unconfoundedness between X and Y_R.
Proof: Follows from properties of causal graphical models and Pearl's do-calculus [Pearl 2009].
Unobserved variables (U_X) → Cause (X) → Outcome (Y_R); Unobserved variables (U_Y) → Outcome (Y_R) and Auxiliary Outcome (Y_D)
57
Example: Assuming a linear model
Theorem 1a: Whenever the treatment X is independent of the auxiliary outcome Y_D, the unbiased causal estimate of the effect of X on the outcome Y_R is the regression coefficient ρ̂ = Cov(X, Y_R) / Var(X), despite the unobserved confounders.
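A toy linear simulation (the coefficients are illustrative assumptions, not from the paper) shows the split-door logic: when X is uncorrelated with the auxiliary outcome Y_D, the plain regression slope of Y_R on X recovers the causal effect even though U_Y also drives Y_R.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
u_x = rng.normal(size=n)                        # demand for the focal product
u_y = rng.normal(size=n)                        # demand for the rec. product (independent here)
x = u_x + 0.3 * rng.normal(size=n)              # visits to the focal product
y_d = u_y + 0.3 * rng.normal(size=n)            # direct visits: the auxiliary outcome
y_r = 0.2 * x + u_y + 0.3 * rng.normal(size=n)  # rec. visits; true effect rho = 0.2

r = np.corrcoef(x, y_d)[0, 1]                   # split-door check: should be near 0
rho_hat = np.cov(x, y_r)[0, 1] / np.var(x)      # regression slope of Y_R on X: near 0.2
```

If instead u_x and u_y were correlated (shared demand), r would move away from zero and the same subset would be rejected by the independence test.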
58
Relationship to the instrumental variable technique: both utilize naturally occurring variation in data.
Instrumental Variable: assumes Exclusion and As-if-random.
Split-door criterion: an independence test is used to find natural experiments; the only assumption is that the auxiliary outcome is affected by the causes of the primary outcome.
By testing if treatment is independent of the auxiliary outcome, split-door requires a weaker dependence assumption for validity.
59
By testing if treatment is independent of the auxiliary outcome, split-door requires a weaker dependence assumption for validity.
Instrumental Variable: Treatment → Outcome, with Unobserved Confounders; Exclusion?
Split-door criterion: Treatment → Outcome, with Unobserved Confounders and an observed Auxiliary Outcome.
Data from Amazon.com, using the Bing toolbar: anonymized browsing logs (Sept 2013 - May 2014).
• 23M pageviews
• 2M Bing Toolbar users
• 1.3M Amazon products, of which 20K products have at least 10 visits on any one day
61
Constructed sequence of visits for each user
Search page → Focal product page → Recommended product page
62
Recreating sequence of visits: Log data
Timestamp | URL
2014-01-20 09:04:10 | http://www.amazon.com/s/ref=nb_sb_noss_1?field-keywords=Cormac%20McCarthy
2014-01-20 09:04:15 | http://www.amazon.com/dp/0812984250/ref=sr_1_2
2014-01-20 09:05:01 | http://www.amazon.com/dp/1573225797/ref=pd_sim_b_1
63
Recreating sequence of visits: Log data
Timestamp | URL | Action
2014-01-20 09:04:10 | http://www.amazon.com/s/ref=nb_sb_noss_1?field-keywords=Cormac%20McCarthy | User searches for Cormac McCarthy
2014-01-20 09:04:15 | http://www.amazon.com/dp/0812984250/ref=sr_1_2 | User clicks on the second search result
2014-01-20 09:05:01 | http://www.amazon.com/dp/1573225797/ref=pd_sim_b_1 | User clicks on the first recommendation
I. Weekly and seasonal patterns in traffic, nearly tripling during holidays
65
II. 30% of pageviews come from recommendations
III. Books and eBooks are the most popular categories by far
67
Implementing the split-door criterion
⟨X, Y_D⟩
[Figure: the daily series is split into windows (x⁽¹⁾, y_D⁽¹⁾), …, (x⁽ⁿ⁾, y_D⁽ⁿ⁾); each window is tested, yielding many natural experiments from which the causal effect is estimated.]
68
Implementing the split-door criterion
1. Divide the data into t = 15-day periods.
2. For each time period:
a) Using Fisher's test, find product pairs (X, Y) such that visits to the focal product (X) are independent of direct visits to the recommended product (Y_D).
b) Compute the causal estimate on the resulting valid pairs.
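The window-scanning step can be sketched as follows. This is a simplified stand-in: the slides mention Fisher's test, but here a plain correlation threshold plays the role of the independence check, and the window length and threshold are illustrative parameters.

```python
import numpy as np

def find_split_door_windows(x, y_d, period=15, max_abs_corr=0.1):
    """Scan fixed-length windows of two daily series: visits to a focal
    product (x) and direct visits to its recommended product (y_d).
    Keep window start indices where the two look independent, i.e.
    candidate split-door natural experiments."""
    valid = []
    for start in range(0, len(x) - period + 1, period):
        xs = x[start:start + period]
        ys = y_d[start:start + period]
        if xs.std() == 0 or ys.std() == 0:
            continue                       # no variation: cannot test independence
        r = np.corrcoef(xs, ys)[0, 1]
        if abs(r) < max_abs_corr:          # no detectable dependence: keep the window
            valid.append(start)
    return valid
```

On the kept windows, the causal estimate of step 2b is then computed from the recommender click-throughs.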
69
Using the split-door criterion, we obtain 23,000 natural experiments for over 12,000 products.
1) Traditional IV method using Oprah Winfrey [Carmi et al.]: 133 natural experiments.
2) Covers more than half of all ~20K eligible products.
70
[Figure: examples of valid and invalid natural experiments.]
71
Observational click-through rate overestimates causal effect
Over half of the recommendation click-throughs would have happened anyway.
72
Can vary the confidence in validity of obtained natural experiments
73
74
Similar, more precise causal estimates than simply using shocks
75
Generalizable to all products on amazon.com?
• Distribution of products with a natural experiment is identical to the overall distribution.
• Causal estimates are consistent with experimental findings (e.g., Belluf et al. [2012], Lee and Hosanagar [2014]).
• Caveat: shocks may be due to discounts or sales.
76
Generalization to all of Amazon.com?
• Lower CTR may be due to the holiday season.
• Split-door products are not a representative sample of all products, nor are the users who participate in them.
• But the split-door criterion covers more than half of all products with at least 10 visits on any single day.
• Causal estimates are consistent with experimental findings (e.g., Belluf et al. [2012], Lee and Hosanagar [2014]).
77
78
Potential applications: whenever an auxiliary outcome is available.
Digital systems: recommender systems, ad systems, app notifications; any media website or app (such as newspapers).
Offline contexts: discount mailers sent by stores; any two marketing channels.
In the future: effect of medical treatments, teaching interventions, etc.
79
Summary: Mining natural experiments at scale
Unlike traditional natural experiments, the split-door criterion relies on fine-grained data to:
• Verify the exclusion assumption [Robustness]
• Cover a broad range of data [Generalizability]
Provides an offline metric for computing causal effects in digital systems (e.g., ad systems, media websites, app notifications). Code available for use.
Oprah [Carmi et al.]: 133 shocks, restricted to books.
Split-door criterion: 12,000 natural experiments, representative of the overall product distribution.
80
The spectrum: split-door, regression and a natural experiment
[Figure: as the cutoff for likelihood of independence increases (0 → 0.80 → 0.95 → 1), methods range from regression to the split-door criterion to a strict natural experiment, with the amount of usable data decreasing.]
81
Part 2: A general Bayesian test for natural experiments in any dataset
Cause (X)
Outcome (Y)
Unobserved Confounders
(U)
Instrumental Variable
(Z)
As-If-Random? Exclusion?
Given observed data, can we determine whether it was generated from:
a) the above model class (Invalid-IV), or
b) the model class without the assumption-violating edges (Valid-IV)?
83
Observational Data
Cause (X)
Outcome (Y)
Unobserved
Confounders (U)
I.V.(Z)
[Figure: candidate causal models over (X, Y, U, Z) in each class.]
Valid-IV: y = f(x, u)
Invalid-IV: y = f(x, z, u)
84
Necessary test: by the properties of the causal graph, a Valid-IV model imposes a testable constraint on the observed distribution P(X, Y | Z). Pearl (1993).
Cause (X), Outcome (Y), Unobserved Confounders (U), I.V. (Z)
85
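For discrete variables, one well-known constraint of this kind is Pearl's instrumental inequality: for every value x, Σ_y max_z P(X=x, Y=y | Z=z) ≤ 1. The slide does not spell out which test is meant, so this checker is an assumed stand-in for illustration:

```python
import numpy as np

def passes_instrumental_inequality(p_xy_given_z, tol=1e-9):
    """p_xy_given_z[z, x, y] = P(X=x, Y=y | Z=z).
    Any Valid-IV model must satisfy, for every x:
        sum_y max_z P(X=x, Y=y | Z=z) <= 1.
    A violation rules out every Valid-IV model."""
    inner = p_xy_given_z.max(axis=0)           # max over z, shape (n_x, n_y)
    return bool((inner.sum(axis=1) <= 1 + tol).all())

# A distribution that violates the inequality: X is always 0,
# but Y flips deterministically with Z, which no Valid-IV model allows.
bad = np.array([[[1.0, 0.0], [0.0, 0.0]],      # z = 0
                [[0.0, 1.0], [0.0, 0.0]]])     # z = 1
```

Note this is only necessary, not sufficient: passing the inequality does not certify that Z is a valid instrument, which is the motivation for the sufficiency discussion that follows.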
But we would like a sufficient test for instrumental variables.
86
A first try: Compare model classes by maximum likelihood
Every data distribution that can be generated by a ValidIV model can also be generated by an InvalidIV model.
ML_Invalid-IV = max_{m′ ∈ Invalid-IV} P(Data | m′)
Diamond represents all observable probability distributions P(X,Y|Z).
Sufficiency is almost “impossible”
87
Passes Necessary test
Both Valid and Invalid IV models can generate this data distribution.
Can attain a weaker notion: probable sufficiency
88
ValidityRatio = P(Valid-IV | Data) / P(Invalid-IV | Data)
A “probably sufficient” criterion
89
Intuition
ValidityRatio = P(Valid-IV | Data) / P(Invalid-IV | Data)
[Figure: individual models (f₁, …, f₄, g₁, …, g₄, h₃, h₄) within the Valid-IV and Invalid-IV classes, each capable of generating the observational data.]
Develop a generative meta-model of the data.
Compare marginal likelihoods of Valid versus Invalid-IV models.
Can formalize as a Bayesian model comparison
90
Data is likely to be generated from a Valid-IV model if ValidityRatio ≫ 1
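The comparison behind the ValidityRatio can be sketched generically. This is a toy: a crude prior-sampling (log-mean-exp) estimate of each model class's marginal likelihood on coin-flip data, not the response-variable framework or annealed importance sampling used in the actual NPS test.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_marginal(loglik_fn, prior_draws):
    """log P(Data | model class), estimated by averaging the likelihood
    over parameter values drawn from the class's prior (log-mean-exp)."""
    lls = np.array([loglik_fn(theta) for theta in prior_draws])
    m = lls.max()
    return m + np.log(np.mean(np.exp(lls - m)))

# Toy data: 50 heads in 100 flips. Compare a sharp class (theta fixed at
# 0.5) against a broad class (theta drawn uniformly from (0, 1)).
heads, flips = 50, 100
loglik = lambda th: heads * np.log(th) + (flips - heads) * np.log(1 - th)

lm_sharp = log_marginal(loglik, [0.5])
lm_broad = log_marginal(loglik, rng.uniform(size=20_000))
ratio = np.exp(lm_sharp - lm_broad)   # analogue of the ValidityRatio
```

The sharp class wins (ratio well above 1) because the broad class spreads its prior mass over many parameter values that fit the data poorly, which is exactly the penalty that lets a Bayes factor favor the more constrained Valid-IV class when the data supports it.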
91
Computing the Validity Ratio
Two problems:
1. Each causal model contains an unobserved variable U.
2. There are infinitely many causal models in each sub-class.
92
I. Use a response-variable framework (assumes discrete variables).
𝑦= 𝑓 (𝑥 ,𝑢)
93
II. A non-standard integral over infinitely many models.
Denominator (Invalid-IV): derived a closed-form solution using properties of the Dirichlet and hyperdirichlet distributions (via the Laplace transform).
Numerator (Valid-IV): no closed-form solution exists; approximated with Monte Carlo methods (Annealed Importance Sampling).
94
Use the NPS test to validate IV studies from the American Economic Review: collected studies from the American Economic Review (AER) with "instrumental variable" in the title or abstract.
95
Many recent studies from the American Economic Review do not pass the test.
Study (AER) | Validity Ratio
Effect of Mexican immigration on crime in United States (2015) | 0.07
Effect of subsidy manipulation on Medicare premiums (2015) | 1.02
Effect of credit supply on housing prices (2015) | 0.01
Effect of Chinese import competition on local labor markets (2013) | 0.3
Effect of rural electrification on employment in South Africa (2011) | 3.6
Expt: National Job Training Partnership Act (JTPA) Study (2002) | 3.4
Challenges the decades-long belief that causal assumptions cannot be tested from data.
Can use data mining for causal effects in large-scale data. Two recipes:
• Create new graphical structures that identify the causal effect: Split-door criterion
• Use Bayesian modeling to test instrumental variables: NPS test
Conclusion: Causal data mining enables causal inference from large-scale data.
96
97
More generally, a viable methodology for causal inference in large datasets:
⟨X, Y⟩ → develop tests for validity of natural variation → mine for such valid variations in observational data.
98
Hard-to-find variations: lottery, weather, shocks, discontinuities.
Other variations: change in access to digital services, change in medicines at a hospital, change in train stops in a city, …
More generally, a viable methodology for causal inference in large datasets
99
Future Work
[Figure: amount of data (10² to 10¹⁰) versus ability to experiment: controlled experiments (A/B tests, contextual bandits), warm start (choosing experiments), and online+offline methods on one side; split-door, the IV test, and causal algorithms where experimentation is limited.]
100
Future work: Causal inference and machine learning
Causal inference → robust prediction
Causal inference: predicted value under the counterfactual distribution P′(X, y).
(Supervised) ML: predicted value under the training distribution P(X, y).
101
Thank you!
Amit Sharma, http://www.amitsharma.in
1. Hofman, Sharma, and Watts (2017). Prediction and explanation in social systems. Science, 355.6324.
2. Sharma (2016). Necessary and probably sufficient test for finding instrumental variables. Working paper.
3. Sharma, Hofman, and Watts (2016). Split-door criterion for causal identification: An algorithm for finding natural experiments. Under review at Annals of Applied Statistics (AOAS).
4. Sharma, Hofman, and Watts (2015). Estimating the causal impact of recommendation systems from observational data. In Proceedings of the 16th ACM Conference on Economics and Computation.
102
References
1. Angrist and Pischke (2008). Mostly harmless econometrics: An empiricist's companion. Princeton Univ. Press.
2. Belluf, Xavier and Giglio (2012). Case study on the business value impact of personalized recommendations on a large online retailer. In Proc. ACM Conf. on Recommender Systems.
3. Carmi, Oestreicher-Singer and Sundararajan (2012). Is Oprah contagious? Identifying demand spillovers in online networks. SSRN 1694308.
4. Dunning (2012). Natural experiments in the social sciences: A design-based approach. Cambridge University Press.
5. Gomez-Uribe and Hunt (2015). The Netflix recommender system: Algorithms, business value and innovation. ACM Transactions on Management Information Systems.
6. Lee and Hosanagar (2014). When do recommender systems work the best? The moderating effects of product attributes and consumer reviews on recommender performance. In Proc. ACM World Wide Web Conference.
103
References (contd.)
7. Lin, Goh and Heng (2013). The demand effects of product recommendation networks: An empirical analysis of network diversity and stability. SSRN 2389339.
8. Linden, Smith and York (2003). Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing.
9. Mulpuru (2006). What you need to know about third-party recommendation engines. Forrester Research.
10. Oestreicher-Singer and Sundararajan (2012). The visible hand? Demand effects of recommendation networks in electronic markets. Management Science.
11. Pearl (2009). Causality: Models, reasoning and inference. Cambridge Univ. Press.
12. Sharma and Yan (2013). Pairwise learning in recommendation: Experiments with community recommendation on LinkedIn. In ACM Conf. on Recommender Systems.