S GraphS GraphS Graphs Specification Language Jose Domingo López López Ángel Escribano Santamarina.
Single World Intervention Graphs...
Transcript of Single World Intervention Graphs...
Single World Intervention Graphs (SWIGs):Unifying the Counterfactual and Graphical Approaches to Causality
Thomas Richardson
Department of Statistics
University of Washington
Joint work with James Robins (Harvard School of Public Health)
Therme Vals Causal Workshop
5 Aug 2013
Outline
Brief review of counterfactualsA new unification of graphs and counterfactuals vianode-splitting
I Factorization and ModularityI Contrast with Twin Network approachI Some Examples and ExtensionsI Sequentially Randomized Experiments / Time Dependent
ConfoundingI Dynamic Regimes
Experimental Testability and Independence of Errors inNPSEMs
Thomas Richardson Therme Vals Workshop Slide 1
Counterfactualsaka Potential Outcomes
Thomas Richardson Therme Vals Workshop Slide 2
The potential outcomes framework: philosophy
Hume (1748) An Enquiry Concerning Human Understanding:
We may define a cause to be an object followed by another, andwhere all the objects, similar to the first, are followed by objectssimilar to the second, . . .
. . . where, if the first object had not been the second never hadexisted.
Thomas Richardson Therme Vals Workshop Slide 3
The potential outcomes framework: crop trials
Jerzy Neyman (1923):To compare v varieties [on m plots] we will consider numbers:
U11 . . . U1m...
...Uv1 . . . Uvm
Here Uij is the crop yield that would be observed if variety i wereplanted in plot j.
Physical constraints only allow one variety to be planted in a givenplot in any given growning season.
Popularized by Rubin (1974); sometimes called the ‘Rubin causal model’.
Thomas Richardson Therme Vals Workshop Slide 4
Potential outcomes with binary treatment
For binary treatment X and response Y, we define two potentialoutcome variables:
Y(x = 0): the value of Y that would be observed for a givenunit if assigned X = 0;
Y(x = 1): the value of Y that would be observed for a givenunit if assigned X = 1;
WIll also write these as Y(x0) and Y(x1).Implicit here is the assumption that these outcomes arewell-defined. Specifically:
I Only one version of treatment X = xI No interference between units (SUTVA).
Thomas Richardson Therme Vals Workshop Slide 5
Potential Outcomes
Unit Potential Outcomes ObservedY(x = 0) Y(x = 1) X Y
1 0 12 0 13 0 04 1 15 1 0
Thomas Richardson Therme Vals Workshop Slide 6
Drug Response ‘Types’:
In the simplest case where Y is a binary outcome we have thefollowing 4 types:
Y(x0) Y(x1) Name0 0 Never Recover0 1 Helped1 0 Hurt1 1 Always Recover
Thomas Richardson Therme Vals Workshop Slide 7
Assignment to Treatments
Unit Potential Outcomes ObservedY(x = 0) Y(x = 1) X Y
1 0 1 12 0 1 03 0 0 14 1 1 15 1 0 0
Thomas Richardson Therme Vals Workshop Slide 8
Observed Outcomes from Potential Outcomes
Unit Potential Outcomes ObservedY(x = 0) Y(x = 1) X Y
1 0 1 1 12 0 1 0 03 0 0 1 04 1 1 1 15 1 0 0 1
Thomas Richardson Therme Vals Workshop Slide 9
Potential Outcomes and Missing Data
Unit Potential Outcomes ObservedY(x = 0) Y(x = 1) X Y
1 ? 1 1 12 0 ? 0 03 ? 0 1 04 ? 1 1 15 1 ? 0 1
Thomas Richardson Therme Vals Workshop Slide 10
Average Causal Effect (ACE) of X on Y
ACE(X→ Y) ≡ E[Y(x1) − Y(x0)]
= p(Helped) − p(Hurt) ∈ [−1, 1]
Thus ACE(X→ Y) is the difference in % recovery ifeveryone treated (X = 1) vs. if noone treated (X = 0).
Thomas Richardson Therme Vals Workshop Slide 11
Identification of the ACE under randomization
If X is assigned randomly then
X ⊥⊥ Y(x0) and X ⊥⊥ Y(x1) (1)
hence
E[Y(x1) − Y(x0)] = E[Y(x1)] − E[Y(x0)]
= E[Y(x1) | X = 1] − E[Y(x0) | X = 0]
= E[Y | X = 1] − E[Y | X = 0].
Thus if (1) holds then ACE(X→ Y) is identified from P(X, Y).
Thomas Richardson Therme Vals Workshop Slide 12
Inference for the ACE without randomizationSuppose that we do not know that X ⊥⊥ Y(x0) and X ⊥⊥ Y(x1).What can be inferred?
X = 0 X = 1Placebo Drug
Y = 0 200 600Y = 1 800 400
What is:
The largest number of people who could be Helped?400 + 200
The smallest number of people who could be Hurt? 0
⇒ Max value of ACE: (200 + 400)/2000 − 0 = 0.3
Similar logic:
⇒ Min value of ACE: 0 − (600 + 800)/2000 = −0.7
Thomas Richardson Therme Vals Workshop Slide 13
Inference for the ACE without randomization
Suppose that we do not know that X ⊥⊥ Y(x0) and X ⊥⊥ Y(x1).
General case:
−(P(x=0,y=1) + P(x=1,y=0)) 6 ACE(X→ Y)
ACE(X→ Y) 6 P(x=0,y=0) + P(x=1,y=1)
⇒ Bounds will always cross zero.
⇒ X ⊥⊥ Y(x0) and X ⊥⊥ Y(x1) essential for drawing non-trivialcausal inferences.
Thomas Richardson Therme Vals Workshop Slide 14
Summary of Counterfactual Approach
In our observed data, for each unit one outcome will be‘actual’; the others will be ‘counterfactual’.
The potential outcome framework allowsCausation to be ‘reduced’ to Missing Data⇒ Conceptual progress!
The ACE is identified if X ⊥⊥ Y(xi) for all values xi.
Randomization of treatment assignment implies X ⊥⊥ Y(xi).
Ideas are central to Fisher’s Exact Test; also many parts ofexperimental design.
The framework is the basis of many practical causal dataanalyses published in Biostatistics, Econometrics andEpidemiology.
Thomas Richardson Therme Vals Workshop Slide 15
Relating Counterfactuals and Structural Equations
Potential outcomes can be seen as a different notation forNon-Parametric Structural Equation Models (NPSEMs):
Example: X→ Y.
NPSEM formulation: Y = f(X, εY)
Potential outcome formulation: Y(x) = f(x, εY)
Two important caveats:
NPSEMs typically assume all variables are seen as beingsubject to well-defined interventions (not so with potentialoutcomes)
Pearl associates NPSEMs with Independent Errors(NPSEM-IEs) with DAGs (more on this later).
Thomas Richardson Therme Vals Workshop Slide 16
Relating Counterfactuals and ‘do’ notation
Expressions in terms of ‘do’ can be expressed in terms ofcounterfactuals:
P(Y(x) = y) ≡ P(Y = y | do(X = x))
but counterfactual notation is more general. Ex. Distribution ofoutcomes that would arise among those who took treatment(X = 1) had counter-to-fact they not received treatment:
P(Y(x = 0) = y | X = 1)
If treatment is randomized, so X ⊥⊥ Y(x = 0) then this equalsP(Y(x = 0) = y), but in an observational study these may bedifferent.
Thomas Richardson Therme Vals Workshop Slide 17
Graphs
Thomas Richardson Therme Vals Workshop Slide 18
Recap: Graphical Approach to Causality
X Y
No Confounding
X
H
Y
Confounding
Unobserved
Graph intended to represent direct causal relations.
Convention that confounding variables (e.g. H) are always includedon the graph.
Approach originates in the path diagrams introduced by SewallWright in the 1920s.
If X→ Y then X is said to be a parent of Y; Y is child of X.
Thomas Richardson Therme Vals Workshop Slide 19
Edges are directed, but are they causal?
X Y
P(X, Y) = P(X)P(Y | X)
No Confounding
X Y
P(X, Y) = P(Y)P(X | Y)
No Confounding
Neither factorization places any restriction on P(X, Y).
Thomas Richardson Therme Vals Workshop Slide 20
Linking the two approaches
X Y
X ⊥⊥ Y(x0) & X ⊥⊥ Y(x1)
X
H
Y
X 6⊥⊥ Y(x0) & X 6⊥⊥ Y(x1)
Unobserved
Elephant in the room:The variables Y(x0) and Y(x1) do not appear on thesegraphs!!
Thomas Richardson Therme Vals Workshop Slide 21
Node splitting: Setting X to 0
X Y
P(X= x, Y= y) = P(X= x)P(Y= y | X= x)
⇒ X x = 0 Y(x = 0)
Can now ‘read’ the independence: X ⊥⊥ Y(x=0).Also associate a new factorization:
P (X= x, Y(x=0)= y) = P(X= x)P (Y(x=0)= y)
where:P (Y(x=0)= y) = P(Y= y |X=0).
This last equation links a term in the original factorization to thenew factorization. We term this the ‘modularity assumption’.
Thomas Richardson Therme Vals Workshop Slide 22
Node splitting: Setting X to 1
X Y
P(X= x, Y= y) = P(X= x)P(Y= y | X= x)
⇒ X x = 1 Y(x = 1)
Can now ‘read’ the independence: X ⊥⊥ Y(x=1).Also associate a new factorization:
P (X= x, Y(x=1)= y) = P(X= x)P (Y(x=1)= y)
where:P (Y(x=1)=y) = P(Y=y |X=1).
Thomas Richardson Therme Vals Workshop Slide 23
Crucial point: Y(x=0) and Y(x=1) are never on the same graph.Although we have:
X ⊥⊥ Y(x=0) and X ⊥⊥ Y(x=1)
we do not haveX ⊥⊥ Y(x=0), Y(x=1)
Had we tried to construct a single graph containing both Y(x=0)and Y(x=1) this would have been impossible. (Why?)
⇒ Single-World Intervention Graphs (SWIGs).
Thomas Richardson Therme Vals Workshop Slide 24
Representing both graphs via a ‘template’
X Y
G
⇒ X x Y(x)
G(x)
Represent both graphs via a template:
Formally this is a ‘graph valued function’:
Takes as input a specific value x∗
Returns as output a SWIG G(x∗).
Each instantiation of the template is a SWIG G(x∗) that representsa different margin: P(X, Y(x∗)) with red nodes x∗ becomingconstants.
Thomas Richardson Therme Vals Workshop Slide 25
Intuition behind node splitting:(Robins, VanderWeele, Richardson 2007)
Q: How could we identify whether someone would choose to taketreatment, i.e. have X = 1, and at the same time find out whathappens to such a person if they don’t take treatment Y(x = 0)?
A: Consider an experiment in which, whenever a patient isobserved to swallow the drug have X = 1, we instantly interveneby administering a safe ‘emetic’ that causes the pill to beregurgitated before any drug can enter the bloodstream.Since we assume the emetic has no side effects, the patient’srecorded outcome is then Y(x = 0).
Thomas Richardson Therme Vals Workshop Slide 26
Harder Inferential problem
X0 Z
H X1
Y
Query: does this causal graph imply?
Y(x0, x1) ⊥⊥ X1(x0) | Z(x0),X0,
Thomas Richardson Therme Vals Workshop Slide 27
Simple solution
X0 Z
H X1
Y
X0
x0
Z(x0)
H
X1(x0)
x1
Y(x0, x1)
Query does this graph imply:
Y(x0, x1) ⊥⊥ X1(x0) | Z(x0),X0 ?
Answer: Yes – applying d-separation to the SWIG on the right wesee that there is no d-connecting path from Y(x0, x1) given Z(x0).More on this shortly...
Thomas Richardson Therme Vals Workshop Slide 28
Single World Intervention Template Construction (1)Given a graph G, a subset of vertices A = {A1, . . . ,Ak} to be intervenedon, we form G(a) in two steps:
(1) (Node splitting): For every A ∈ A split the node into a randomnode A and a fixed node a:
A
· · ·
· · ·
⇒ A
a
Splitting: Schematic Illustrating the Splitting of Node A
The random half inherits all edges directed into A in G;
The fixed half inherits all edges directed out of A in G.
Thomas Richardson Therme Vals Workshop Slide 29
Single World Intervention Template Construction (2)
(2) Relabel descendants of fixed nodes:
a ⇒A
B C
D
FE
X
T
Y
Z
· · ·
· · ·
· · ·
· · ·
a
A(. . .)
B(a, . . .) C(a, . . .)
D(a, . . .)
F(a, . . .)E(a, . . .)
X(. . .)
T(. . .)
Y(. . .)
Z(. . .)
· · ·
· · ·
· · ·
· · ·
Thomas Richardson Therme Vals Workshop Slide 30
Single World Intervention Graph
A Single World Intervention Graph (SWIG) G(a∗) is obtained fromthe Template G(a) by simply substituting specific values a∗ for thevariables a in G(a);
For example, we replace G(x) with G(x=0).
Changing the value of a fixed variable corresponds toconstructing a new graph and considering a differentpopulation, e.g. P(X, Y(x=0)) vs. P(X, Y(x=1))
It is only the instantiated graph G(x) that represents P(V(x)),not the template G(x).
Thomas Richardson Therme Vals Workshop Slide 31
Factorization and Modularity
Factorization: P(V(a)) over the counterfactual variables in G(a)factorizes with respect to G(a) (ignoring fixed nodes):
P (V(a)) =∏
Y(a)∈V(a)P(Y(a)
∣∣∣ paG(a)(Y(a)) \ a)
.
Modularity: P(V(a)) and P(V) are linked as follows:
P(Y(a)=y
∣∣∣ (paG(a)(Y(a)) \ a)= q
)= P
(Y=y
∣∣∣ (paG(Y) \A)= q,
(paG(Y) ∩A
)= apaG(Y)∩A
),
So the conditional density associated with Y(aY) in G(a) is just theconditional density associated with Y in G after substituting ai for anyAi ∈ A that is a parent of Y.
Thomas Richardson Therme Vals Workshop Slide 32
Applying d-separation to the graph G(a)
Counterfactual conditional independence relations may beobtained from the transformed graph by applying d-separation afteradding fixed nodes to the conditioning set:
Given disjoint subsets B(a), C(a) and D(a) of random vertices(where D(a) may be empty),
if B(a) is d-separated from C(a) given D(a) ∪ a in G(a) (2)
then B(a) ⊥⊥ C(a) | D(a) [P(V(a))].
In words, if in G(a) two subsets B(a) and C(a) of random nodes ared-separated by D(a) in conjunction with the fixed nodes a, then B(a) andC(a) are conditionally independent given D(a) in the associateddistribution P(V(a)).
Thomas Richardson Therme Vals Workshop Slide 33
Conditioning on fixed variables a
intuitive since these are fixed constants in the SWIG
Since vertices in a have no parents, no new paths d-connectdue to also conditioning on a.
⇒ If a d-separation holds in G(a) without conditioning on thefixed nodes, then it will continue to hold if we also condition onfixed nodes.
An alternative is simply to restrict attention to paths that donot contain fixed vertices,
e.g. remove fixed nodes from the graph before checkingd-separation.
Thomas Richardson Therme Vals Workshop Slide 34
Mediation graph (I)Intervention on Z alone.
X YZ X(z)Z z Y(z)⇒
factorization:
P(Z,X(z), Y(z)) = P(Z)P(X(z))P(Y(z) | X(z))
modularity:
P(X(z)=x) = P(X=x | Z= z),
P(Y(z)=y | X(z)=x) = P(Y=y | X=x,Z= z).
d-separation gives:Z ⊥⊥ X(z), Y(z).
Thomas Richardson Therme Vals Workshop Slide 35
Mediation graph (II)Intervention on Z and X:
X YZ X(z) xZ z Y(x, z)⇒
factorization:
P(Z,X(z), Y(x, z)) = P(Z)P(X(z))P(Y(x, z))
modularity:
P(X(z)=x) = P(X=x | Z= z),
P(Y(x, z)=y) = P(Y=y | X= x,Z= z).
d-separation gives:
Z ⊥⊥ X(z) ⊥⊥ Y(x, z)
Thomas Richardson Therme Vals Workshop Slide 36
No direct effect graph
X YZ X(z) xZ z Y(x)⇒
factorization:
P(Z,X(z), Y(x)) = P(Z)P(X(z))P(Y(x))
modularity:
P(X(z)=x) = P(X=x | Z= z),
P(Y(x)=y) = P(Y=y | X= x).
d-separation gives:
Z ⊥⊥ X(z) ⊥⊥ Y(x)
Thomas Richardson Therme Vals Workshop Slide 37
Inferential Problem (II)
X0 Z
H X1
Y
X0
x0
Z(x0)
H
X1(x0)
x1
Y(x0, x1)
Pearl (2009), Ex. 11.3.3, claims the causal DAG above does not imply:
Y(x0, x1) ⊥⊥ X1 | Z,X0 = x0. (3)
The SWIG shows that (3) does hold; Pearl is incorrect.Specifically, we see from the SWIG:
Y(x0, x1) ⊥⊥ X1(x0) | Z(x0),X0, (4)
⇒ Y(x0, x1) ⊥⊥ X1(x0) | Z(x0),X0 = x0. (5)
This last condition is then equivalent to (3) via consistency.(Pearl infers a claim of Robins is false since if true then (3) would hold).
Thomas Richardson Therme Vals Workshop Slide 38
Pearl’s twin network for the same problem
X0
Z
H
X1
Y
X0
Z
H
X1
Y
x0
Z(x0, x1)
H(x0, x1)
x1
Y(x0, x1)
UZ
UH
UY
The twin network fails to reveal that Y(x0, x1) ⊥⊥ X1 | Z,X0 = x0.This ‘extra’ independence holds in spite of d-connection because (byconsistency) when X0 = x0, then Z = Z(x0) = Z(x0, x1).Note that Y(x0, x1) 6⊥⊥ X1 | Z,X0 6= x0.
Shpitser & Pearl (2008) introduce a pre-processing step to address this.
Thomas Richardson Therme Vals Workshop Slide 39
Confounding Revisited
X x Y(x)
H
Here we can read directly from the template that X 6⊥⊥ Y(x) sincethere is a path:
X← H→ Y(x).
Thomas Richardson Therme Vals Workshop Slide 40
Adjusting for confounding
X Y
L
X x Y(x)
L
Here we can read directly from the template that
X ⊥⊥ Y(x) | L.
It follows that:
P(Y(x)=y) =∑l
P(Y=y | L= l,X= x)P(L= l). (6)
Thomas Richardson Therme Vals Workshop Slide 41
Contrast with approach via removing edges
X Y
L
X x Y(x)
L
X Y
L
This ‘explains’ why L is sufficient to control confounding under thenull (where X has no effect on Y) but not under the alternative.
Thomas Richardson Therme Vals Workshop Slide 42
Adjusting for confounding
X Y
L
X x Y(x)
L
X ⊥⊥ Y(x) | L.
Proof of identification:
P[Y(x) = y] =∑l
P[Y(x) = y | L = l]P(L = l)
=∑l
P[Y(x) = y | L = l,X = x]P(L = l) indep
=∑l
P[Y = y | L = l,X = x]P(L = l) modularity
Thomas Richardson Therme Vals Workshop Slide 43
More Examples (I)
X Y
L
H
(a-i)
X x Y(x)
L
H
(a-ii)
Here we can read directly from the template that
X ⊥⊥ Y(x) | L.
Thomas Richardson Therme Vals Workshop Slide 44
More Examples (II)
X Y
L
H
(b-i)
X x Y(x)
L
H
(b-ii)
Here we can read directly from the template that
X ⊥⊥ Y(x) | L.
Thomas Richardson Therme Vals Workshop Slide 45
Sequentially randomized experiment (I)
A B C D
H
A and C are treatments;
H is unobserved;
B is a time varying confounder;
D is the final response;
Treatment C is assigned randomly conditional on the observedhistory, A and B;
Want to know P(D(a, c)).
Thomas Richardson Therme Vals Workshop Slide 46
Sequentially randomized experiment (I)
A B C D
H
If the following holds:
A ⊥⊥ D(a, c)
C(a) ⊥⊥ D(a, c) | B(a),A
General result of Robins (1986) then implies:
P(D(a, c)=d) =∑b
P(B=b | A= a)P(D=d | A= a,B=b,C= c).
Does it??
Thomas Richardson Therme Vals Workshop Slide 47
Sequentially randomized experiment (II)
A a B(a) C(a) c D(a, c)
H
d-separation:
A ⊥⊥ D(a, c)
C(a) ⊥⊥ D(a, c) | B(a),A
General result of Robins (1986) then implies:
P(D(a, c)=d) =∑b
P(B=b | A= a)P(D=d | A= a,B=b,C= c).
Thomas Richardson Therme Vals Workshop Slide 48
Multi-network approach
A B C D
H
UHUB
UC
UD
a B(a) C(a) D(a)
H(a)
a B(a, c) c D(a, c)
H(a, c)
Thomas Richardson Therme Vals Workshop Slide 49
Another example
A B C D
H2H1
A ⊥⊥ D(a, c)
C(a) ⊥⊥ D(a, c) | B(a),A
General result of Robins (1986) then implies:
P(D(a, c)=d) =∑b
P(B=b | A= a)P(D=d | A= a,B=b,C= c).
Does it??
Thomas Richardson Therme Vals Workshop Slide 50
Another example
A B C D
H2H1
A
aB(a)
C(a)
cD(a, c)
H1 H2
A ⊥⊥ D(a, c)
C(a) ⊥⊥ D(a, c) | B(a),A
General result of Robins (1986) then implies:
P(D(a, c)=d) =∑b
P(B=b | A= a)P(D=d | A= a,B=b,C= c).
Thomas Richardson Therme Vals Workshop Slide 51
General result (Robins, 1986)Observed data:
O ≡ 〈L1,A1, . . . ,LK,AK, Y〉.
If the following holds for k = 1, . . . ,K
Y(a†) ⊥⊥ Ak(a†) | Lk(a
†),Ak−1(a†); (7)
then (under positivity):
P(Y(a†)=y |Lj(a†) = lj,Aj−1(a
†) = a†j−1)
=∑
lm+1,...,lK
p(y|lK, a†K)K∏
j=m+1
p(lj|lj−1, a†j−1). (8)
Here Aj−1(a†) ≡ 〈A1, . . . ,Aj−1(a
†j−2)〉, similarly for Lj−1(a
†).The RHS of (8) is referred to as the ‘g-formula’.
Thomas Richardson Therme Vals Workshop Slide 52
Dynamic regimes
A dynamic regime g is a policy that assigns treatment (usuallyat multiple time points) on the basis of past history;
Including conditional on the ‘natural’ value of treatment in theabsence of an intervention;
Exercise for as long as you would have done withoutintervention or twenty minutes, whichever is more.
See Young et al. (2012) for additional analysis.
Thomas Richardson Therme Vals Workshop Slide 53
Dynamic regimes
A1
a1
L(a1) A2(a1)
a2
Y(a1,a2)
H2H1
A1
A+1 (g)
L(g) A2(g)
A+2 (g)
Y(g)
H2H1
P(Y(g)) is identified.Thomas Richardson Therme Vals Workshop Slide 54
Dynamic regimes
A1
a1
L(a1) A2(a1)
a2
Y(a1,a2)
H2H1
A1
A+1 (g)
L(g) A2(g)
A+2 (g)
Y(g)
H2H1
P(Y(g)) is not identified.Thomas Richardson Therme Vals Workshop Slide 55
Joint Independence
We saw earlier that the causal DAG X→ Y implied:
X ⊥⊥ Y(x0) and X ⊥⊥ Y(x1)
However, joint independence relations such as:
X ⊥⊥ Y(x0), Y(x1)
never follow from our SWIG transformation:There is no way via node-splitting to construct a graph with bothY(x0), and Y(x1).This has important consequences for the identification of directeffects.
Thomas Richardson Therme Vals Workshop Slide 56
Assuming Independent Errors and
Cross-World Independence
Thomas Richardson Therme Vals Workshop Slide 57
Mediation graphIntervention on X and M:
M YX M(x) mX x Y(x, m)⇒
d-separation gives:
X ⊥⊥ M(x) ⊥⊥ Y(x, m)
Pearl associates additional independence relations with this graph
Y(x1,m) ⊥⊥ M(x0),X
Y(x0,m) ⊥⊥ M(x1),X
equivalent to assuming independent errors, εX ⊥⊥ εM ⊥⊥ εY .Thomas Richardson Therme Vals Workshop Slide 58
Pure Direct Effect
Pure (aka Natural) Direct Effect (PDE): Change in Y had X beendifferent, but M fixed at the value it would have taken had X notbeen changed:
PDE ≡ Y(x1,M(x0)) − Y(x0,M(x0)).
Legal motivation [from Pearl (2000)]:
“The central question in any employment-discrimination case is whetherthe employer would have taken the same action had the employee beenof a different race (age, sex, religion, national origin etc.) and everythingelse had been the same.” (Carson versus Bethlehem Steel Corp., 70FEP Cases 921, 7th Cir. (1996)).
Thomas Richardson Therme Vals Workshop Slide 59
Decomposition
PDE also allows non-parametric decomposition of Total Effect(ACE) into direct (PDE) and indirect (TIE) pieces.
PDE ≡ E [Y(1,M(0))] − E [Y(0)]
TIE ≡ E [Y (1,M(1)) − Y (1,M(0))]
TIE+ PDE ≡ E [Y(1)] − E [Y(0)] ≡ ACE(X→ Y)
Thomas Richardson Therme Vals Workshop Slide 60
Pearl’s identification claim
Pearl and others claim that under “no confounding” the PDE isidentified by the following mediation formula:
PDEmed(m) =∑m
[E[Y|x1,m] − E[Y|x0,m]]P(m|x0)
Thomas Richardson Therme Vals Workshop Slide 61
Critique of PDE: Hypothetical Case Study
Observational data on three variables:
X- treatment: cigarette cessation
M intermediate: blood pressure at 1 year, high or low
Y outcome: say CHD by 2 years
Observed data (X,M, Y) on each of n subjects.
All binary
X randomly assigned
Thomas Richardson Therme Vals Workshop Slide 62
Hypothetical Study (I): X randomized
Y = 0 Y = 1 Total P(Y=1 |m, x)
M = 0 1500 500 2000 0.25X = 0
M = 1 1200 800 2000 0.40
M = 0 948 252 1200 0.21X = 1
M = 1 1568 1232 2800 0.44
A researcher, Prof H wishes to apply the mediation formula toestimate the PDE.Prof H believes that there is no confounding, so that Pearl’sNPSEM-IE holds, but his post-doc, Dr L is skeptical.
Thomas Richardson Therme Vals Workshop Slide 63
Hypothetical Study (II): X and M Randomized
To try to address Dr L’s concerns, Prof H carries out animalintervention studies.
Y = 0 Y = 1 Total P(Y(m, x)=1)M = 0 750 250 1000 0.25
X = 0M = 1 600 400 1000 0.40
M = 0 790 210 1000 0.21X = 1
M = 1 560 440 1000 0.44
As we see: P(Y(m, x)=1) = P(Y=1 |m, x);Prof H is now convinced: ‘What other experiment could I do ?’
He applies the mediation formula, yielding PDEmed
= 0.Conclusion: No direct effect of X on Y.
Thomas Richardson Therme Vals Workshop Slide 64
Failure of the mediation formula
Under the true generating process, the true value of the PDE is:
PDE = 0.153 6= PDEmed
= 0
Prof H’s conclusion was completely wrong!
Thomas Richardson Therme Vals Workshop Slide 65
Why did the mediation formula go wrong?
Dr L was right – there was a confounder:
M YX
H
but. . . it had a special structure so that:
Y ⊥⊥ H |M,X = 0 and M ⊥⊥ H | X = 1
Thomas Richardson Therme Vals Workshop Slide 66
Why did the mediation formula go wrong?
Dr L was right – there was a confounder:
M YX
H
but. . . it had a special structure so that:
Y ⊥⊥ H |M,X = 0 and M ⊥⊥ H | X = 1
M Y
HX = 0
M Y
HX = 1
The confounding undetectable by any intervention on X and/or M.
Pearl: Onus is on the researcher to be sure there is no confounding.Causation should precede intervention.
Thomas Richardson Therme Vals Workshop Slide 67
PDE identification cannot be checked via experimentIf our only interventions are on the variables X and M then wecannot do an experiment to learn the PDE.
We could learn E [Y{x = 1,M(x = 0)}] by intervention if we couldI intervene and set X to 0 and observe M(0),I then return each subject to their pre-intervention state,I finally intervene to set X to 1 and M to M(0) and observeY(1,M(0)).
Such an intervention strategy will usually not exist because notpossible in a real-world intervention (e.g., suppose the outcome Ywere death).
Because we cannot observe the same subject under both X = 1and X = 0 (i.e. ”across worlds”,) no intervention will allow us tolearn the distribution of mixed counterfactuals such asY{x = 1,M(x = 0)} :
(In the story Dr L had to introduce a new node on the graph in order tocheck the value of the PDE via an experiment.)
Thomas Richardson Therme Vals Workshop Slide 68
Summary of critique of Independent Error AssumptionThe independent error assumption cannot be checked by anyrandomized experiment on the variables in the graph.
⇒ Connection between experimental interventions and potentialoutcomes, established by Neyman has been severed;
⇒ Theories in Social and Medical sciences are not detailed enough tosupport the independent error assumption.
What about faithfulness and causal discovery procedures?
Such inferences are explicit that they rely on faithfulness, and aredesigned to guide hypothesis formation;
I Contrast: In Pearl’s NPSEM-IE approach the simple act ofusing a DAG is viewed as automatically committing you tomaking this untestable hypothesis.
Predictions (possibly derived assuming faithfulness) regardingintervention distributions P(Y(x)) = P(Y | do(x)) can be tested byrandomized experiments.
Thomas Richardson Therme Vals Workshop Slide 69
How many experimentally untestable assumptions?Assumption of independent errors implies super-exponentiallymany ‘cross-world’ counterfactual independence assumptions:
No. Actual Vars. 2 3 4 K
Dim. P(V) 3 7 15 2K − 1No. Cnterfactual Vars. 3 7 15 2K − 1
Dim. Cnterfactual Dist. 7 127 32767 2(2K−1) − 1
Dim. SWIG 5 113 32697 (2(2K−1) − 1) −∑K−1
j=1 (4j − 2j)
Dim. NPSEM-IE 4 19 274∑K−1
j=0 (22j
− 1)
No. untestable indep. 1 94 32423 O(22K−2)constrnts in NPSEM-IE
Table: Dimensions of counterfactual models associated with completegraphs with binary variables.
Thomas Richardson Therme Vals Workshop Slide 70
SWIG Completeness Conjecture
In an NPSEM we define a counterfactual independence to belogical if it holds regardless of the distribution overcounterfactuals (equivalently error terms) e.g. for binary X
Y(x0) ⊥⊥ Y(x1) | X, Y
Completeness Conjecture There exists a distribution overcounterfactuals that is experimentally indistinguishable fromthe NPSEM that assumes independent errors but in which theonly non-logical independencies are those that may bederived from the SWIG.
Thomas Richardson Therme Vals Workshop Slide 71
Summary
A simple approach to unifying graphs and counterfactuals vianode-splitting
The approach works via linking the factorizations associatedwith the two graphs
The approach provides a language that allows counterfactualand graphical people to communicate
The approach leads to many fewer untestable independenceassumptions than in the NPSEM-IE approach of Pearl.
The approach also provides a way to combine information onthe absence of individual and population level direct effects.
Thomas Richardson Therme Vals Workshop Slide 72
Thank You!
Thomas Richardson Therme Vals Workshop Slide 73
ReferencesPearl, J. Causality (Second ed.). Cambridge, UK: CambridgeUniversity Press, 2009.
Richardson, TS, Robins, JM. Single World Intervention Graphs.CSSS Technical Report No. 128http://www.csss.washington.edu/Papers/wp128.pdf, 2013.
Robins, JM A new approach to causal inference in mortality studieswith sustained exposure periods applications to control of thehealthy worker survivor effect. Mathematical Modeling 7,1393–1512, 1986.
Robins, JM, VanderWeele, TJ, Richardson TS. Discussion of“Causal effects in the presence of non compliance a latent variableinterpretation by Forcina, A. Metron LXIV (3), 288–298, 2007.
Shpitser, I, Pearl, J. What counterfactuals can be tested. Journal ofMachine Learning Research 9, 1941–1979, 2008.
Spirtes, P, Glymour, C, Scheines R. Causation, Prediction andSearch. Lecture Notes in Statistics 81, Springer-Verlag.
Thomas Richardson Therme Vals Workshop Slide 74
Details on Pearl’s ErrorPearl correctly states that using his Twin Network method (next slide) itmay be shown that
Y(x0, x1) is not independent of X1, given Z and X0.
However, he then goes on to say (incorrectly):
In the twin network model there is a d-connected path fromX1 to Y(x0, x1). . . Therefore, [(3)] is not satisfied for Y(x0, x1)and X1.
[Ex. 11.3.3, p.353]This is actually incorrect in two ways:
Y(x0, x1) 6⊥⊥ X1 | Z,X0 does not imply Y(x0, x1) 6⊥⊥ X1 | Z,X0=x0
d-separation is not complete for Twin Networks so the presence of ad-connected path does not imply that an independence is notimplied.
Thomas Richardson Therme Vals Workshop Slide 75
T
X
Z
Y
T
X
Z
Y
UT
UX
UZ
UY
T(z)
X(z)
z
Y(z)
T
X
Z
z
Y(z)
T and Y(z) are d-connected given X in the twin-network, but in spite ofthis T ⊥⊥ Y(z) | X under the associated NPSEM-IE because X(z) = X,and T and Y(z) are d-separated given X in the twin-network.
Thomas Richardson Therme Vals Workshop Slide 76
Mediation graph (I)Intervention on Z alone.
X YZ X(z)Z z Y(z)⇒
factorization:
P(Z,X(z), Y(z)) = P(Z)P(X(z))P(Y(z) | X(z))
modularity:
P(X(z)=x) = P(X=x | Z= z),
P(Y(z)=y | X(z)=x) = P(Y=y | X=x,Z= z).
d-separation gives:Z ⊥⊥ X(z), Y(z).
Thomas Richardson Therme Vals Workshop Slide 77
Mediation graph (II)Intervention on Z and X:
X YZ X(z) xZ z Y(x, z)⇒
factorization:
P(Z,X(z), Y(x, z)) = P(Z)P(X(z))P(Y(x, z))
modularity:
P(X(z)=x) = P(X=x | Z= z),
P(Y(x, z)=y) = P(Y=y | X= x,Z= z).
d-separation gives:
Z ⊥⊥ X(z) ⊥⊥ Y(x, z)
Thomas Richardson Therme Vals Workshop Slide 78
Importance of fixed nodesCompare:
X YZ X(z)Z z Y(z)⇒
P(Z,X(z), Y(z)) = P(Z)P(X(z))P(Y(z) | X(z))
P(Y(z)=y | X(z) = x) = P(Y=y | X=x,Z= z).
versus
X YZ X(z)Z z Y(z)⇒
P(Z,X(z), Y(z)) = P(Z)P(X(z))P(Y(z) | X(z)),
P(Y(z)=y | X(z) = x) = P(Y=y | X=x)
Thomas Richardson Therme Vals Workshop Slide 79
Importance of fixed nodes: leaving them out causesproblems!
X YZ X(z)Z Y(z)⇒
P(Z,X(z), Y(z)) = P(Z)P(X(z))P(Y(z) | X(z))
P(Y(z)=y | X(z) = x) = P(Y=y | X=x,Z= z).
versus
X YZ X(z)Z Y(z)⇒
P(Z,X(z), Y(z)) = P(Z)P(X(z))P(Y(z) | X(z)),
P(Y(z)=y | X(z) = x) = P(Y=y | X=x)
Red nodes are needed in order to read off modularity property from G(a).
Thomas Richardson Therme Vals Workshop Slide 80
No direct effect graph (I)
X YZ X(z)Z z Y(z)⇒
factorization:
P(Z,X(z), Y(z)) = P(Z)P(X(z))P(Y(z) | X(z))
modularity:
P(X(z)=x) = P(X=x | Z= z),
P(Y(z)=y | X(z) = x) = P(Y=y | X=x).
d-separation gives:Z ⊥⊥ X(z), Y(z)
Thomas Richardson Therme Vals Workshop Slide 81