Single World Intervention Graphs (SWIGs): Unifying the Counterfactual and Graphical Approaches to Causality

Thomas Richardson

Department of Statistics

University of Washington

Joint work with James Robins (Harvard School of Public Health)

Therme Vals Causal Workshop

5 Aug 2013

Outline

Brief review of counterfactuals

A new unification of graphs and counterfactuals via node-splitting

- Factorization and Modularity
- Contrast with Twin Network approach
- Some Examples and Extensions
- Sequentially Randomized Experiments / Time Dependent Confounding
- Dynamic Regimes

Experimental Testability and Independence of Errors in NPSEMs

Thomas Richardson Therme Vals Workshop Slide 1

Counterfactuals, aka Potential Outcomes


The potential outcomes framework: philosophy

Hume (1748) An Enquiry Concerning Human Understanding:

We may define a cause to be an object followed by another, and where all the objects, similar to the first, are followed by objects similar to the second, . . .

. . . where, if the first object had not been, the second never had existed.


The potential outcomes framework: crop trials

Jerzy Neyman (1923): To compare v varieties [on m plots] we will consider numbers:

U_11 . . . U_1m
 ...        ...
U_v1 . . . U_vm

Here U_ij is the crop yield that would be observed if variety i were planted in plot j.

Physical constraints only allow one variety to be planted in a given plot in any given growing season.

Popularized by Rubin (1974); sometimes called the ‘Rubin causal model’.


Potential outcomes with binary treatment

For binary treatment X and response Y, we define two potential outcome variables:

Y(x = 0): the value of Y that would be observed for a given unit if assigned X = 0;

Y(x = 1): the value of Y that would be observed for a given unit if assigned X = 1.

We will also write these as Y(x0) and Y(x1).

Implicit here is the assumption that these outcomes are well-defined. Specifically:

- Only one version of treatment X = x
- No interference between units (SUTVA).


Potential Outcomes

        Potential Outcomes      Observed
Unit    Y(x = 0)   Y(x = 1)     X    Y

1       0          1
2       0          1
3       0          0
4       1          1
5       1          0


Drug Response ‘Types’:

In the simplest case where Y is a binary outcome we have thefollowing 4 types:

Y(x0)   Y(x1)   Name
0       0       Never Recover
0       1       Helped
1       0       Hurt
1       1       Always Recover


Assignment to Treatments


        Potential Outcomes      Observed
Unit    Y(x = 0)   Y(x = 1)     X    Y

1       0          1            1
2       0          1            0
3       0          0            1
4       1          1            1
5       1          0            0


Observed Outcomes from Potential Outcomes


        Potential Outcomes      Observed
Unit    Y(x = 0)   Y(x = 1)     X    Y

1       0          1            1    1
2       0          1            0    0
3       0          0            1    0
4       1          1            1    1
5       1          0            0    1


Potential Outcomes and Missing Data


        Potential Outcomes      Observed
Unit    Y(x = 0)   Y(x = 1)     X    Y

1       ?          1            1    1
2       0          ?            0    0
3       ?          0            1    0
4       ?          1            1    1
5       1          ?            0    1


Average Causal Effect (ACE) of X on Y

ACE(X→ Y) ≡ E[Y(x1) − Y(x0)]

= p(Helped) − p(Hurt) ∈ [−1, 1]

Thus ACE(X→ Y) is the difference in % recovery if everyone is treated (X = 1) vs. if no one is treated (X = 0).
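As a small illustration (not part of the original slides), the five-unit example from the preceding tables can be worked through in Python. The sketch treats those five units as the whole population, which is an assumption made only for illustration; the variable names are hypothetical.

```python
from fractions import Fraction

# Potential outcomes (Y(x=0), Y(x=1)) and assigned treatment X for the
# five units in the tables above.
units = {
    1: {"y0": 0, "y1": 1, "x": 1},
    2: {"y0": 0, "y1": 1, "x": 0},
    3: {"y0": 0, "y1": 0, "x": 1},
    4: {"y0": 1, "y1": 1, "x": 1},
    5: {"y0": 1, "y1": 0, "x": 0},
}


def observed(unit):
    """Consistency: the observed Y is the potential outcome for the assigned X."""
    return unit["y1"] if unit["x"] == 1 else unit["y0"]


n = len(units)
observed_y = {i: observed(u) for i, u in units.items()}

# ACE as the mean of the unit-level differences Y(x1) - Y(x0).
ace = Fraction(sum(u["y1"] - u["y0"] for u in units.values()), n)

# ACE as p(Helped) - p(Hurt).
p_helped = Fraction(sum(u["y0"] == 0 and u["y1"] == 1 for u in units.values()), n)
p_hurt = Fraction(sum(u["y0"] == 1 and u["y1"] == 0 for u in units.values()), n)

print(observed_y, ace, p_helped - p_hurt)
```

Both routes to the ACE agree here because the 'Never Recover' and 'Always Recover' types contribute zero to the unit-level difference.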


Identification of the ACE under randomization

If X is assigned randomly then

X ⊥⊥ Y(x0) and X ⊥⊥ Y(x1) (1)

hence

E[Y(x1) − Y(x0)] = E[Y(x1)] − E[Y(x0)]

= E[Y(x1) | X = 1] − E[Y(x0) | X = 0]

= E[Y | X = 1] − E[Y | X = 0].

Thus if (1) holds then ACE(X→ Y) is identified from P(X, Y).
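The identification argument can be checked exactly on a toy population. The sketch below is an illustration under stated assumptions: the shares of the four response types and the two assignment mechanisms are hypothetical numbers, not taken from the slides. It compares E[Y | X = 1] − E[Y | X = 0] with the ACE when X is independent of the potential outcomes and when it is not.

```python
from fractions import Fraction as F

# Hypothetical population shares of the four response types (y0, y1).
types = {(0, 0): F(1, 10), (0, 1): F(4, 10), (1, 0): F(2, 10), (1, 1): F(3, 10)}


def joint(p_treat):
    """Joint law of (X, Y(x0), Y(x1)); p_treat maps type -> P(X=1 | type)."""
    dist = {}
    for (y0, y1), p_type in types.items():
        dist[(1, y0, y1)] = p_type * p_treat[(y0, y1)]
        dist[(0, y0, y1)] = p_type * (1 - p_treat[(y0, y1)])
    return dist


def naive_contrast(dist):
    """E[Y | X=1] - E[Y | X=0], using consistency Y = Y(X)."""
    def mean_y(x):
        num = sum(p * (y1 if x == 1 else y0)
                  for (xx, y0, y1), p in dist.items() if xx == x)
        den = sum(p for (xx, _, _), p in dist.items() if xx == x)
        return num / den
    return mean_y(1) - mean_y(0)


# ACE = p(Helped) - p(Hurt) = 4/10 - 2/10.
ace = sum(p * (y1 - y0) for (y0, y1), p in types.items())

# Randomized: P(X=1) is the same for every type, so (1) holds.
randomized = joint({t: F(1, 2) for t in types})
# Not randomized: units with Y(x1) = 1 are more likely to take treatment.
confounded = joint({(0, 0): F(1, 10), (0, 1): F(9, 10),
                    (1, 0): F(1, 10), (1, 1): F(9, 10)})

print(ace, naive_contrast(randomized), naive_contrast(confounded))
```

Under the randomized mechanism the observed contrast equals the ACE exactly; under the second mechanism it does not.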


Inference for the ACE without randomization

Suppose that we do not know that X ⊥⊥ Y(x0) and X ⊥⊥ Y(x1). What can be inferred?

         X = 0 (Placebo)   X = 1 (Drug)
Y = 0    200               600
Y = 1    800               400

What is:

The largest number of people who could be Helped? 400 + 200

The smallest number of people who could be Hurt? 0

⇒ Max value of ACE: (200 + 400)/2000 − 0 = 0.3

Similar logic:

⇒ Min value of ACE: 0 − (600 + 800)/2000 = −0.7
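The arithmetic above can be reproduced with a short sketch. The function name `ace_bounds` is hypothetical; the counts are those of the placebo/drug table, and the bound expressions follow the reasoning on this slide (everyone who could be Helped is counted for the maximum, everyone who could be Hurt for the minimum).

```python
from fractions import Fraction

# Observed counts from the table above, keyed by (x, y).
counts = {(0, 0): 200, (0, 1): 800, (1, 0): 600, (1, 1): 400}


def ace_bounds(counts):
    """Bounds on the ACE when nothing is assumed about treatment assignment."""
    n = sum(counts.values())
    p = {cell: Fraction(c, n) for cell, c in counts.items()}
    lower = -(p[(0, 1)] + p[(1, 0)])   # everyone who could be Hurt is Hurt
    upper = p[(0, 0)] + p[(1, 1)]      # everyone who could be Helped is Helped
    return lower, upper


lower, upper = ace_bounds(counts)
print(float(lower), float(upper), upper - lower)
```

The width `upper - lower` is the sum of all four cell probabilities, which is why it equals 1 whatever the data.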


Inference for the ACE without randomization

Suppose that we do not know that X ⊥⊥ Y(x0) and X ⊥⊥ Y(x1).

General case:

−(P(x=0,y=1) + P(x=1,y=0)) ≤ ACE(X→ Y) ≤ P(x=0,y=0) + P(x=1,y=1)

⇒ The bounds always have width 1, so they will always cross zero.

⇒ X ⊥⊥ Y(x0) and X ⊥⊥ Y(x1) essential for drawing non-trivial causal inferences.


Summary of Counterfactual Approach

In our observed data, for each unit one outcome will be ‘actual’; the others will be ‘counterfactual’.

The potential outcome framework allows Causation to be ‘reduced’ to Missing Data ⇒ Conceptual progress!

The ACE is identified if X ⊥⊥ Y(xi) for all values xi.

Randomization of treatment assignment implies X ⊥⊥ Y(xi).

Ideas are central to Fisher’s Exact Test; also many parts of experimental design.

The framework is the basis of many practical causal data analyses published in Biostatistics, Econometrics and Epidemiology.


Relating Counterfactuals and Structural Equations

Potential outcomes can be seen as a different notation for Non-Parametric Structural Equation Models (NPSEMs):

Example: X→ Y.

NPSEM formulation: Y = f(X, εY)

Potential outcome formulation: Y(x) = f(x, εY)

Two important caveats:

NPSEMs typically assume all variables are seen as being subject to well-defined interventions (not so with potential outcomes)

Pearl associates NPSEMs with Independent Errors (NPSEM-IEs) with DAGs (more on this later).


Relating Counterfactuals and ‘do’ notation

Expressions in terms of ‘do’ can be expressed in terms of counterfactuals:

P(Y(x) = y) ≡ P(Y = y | do(X = x))

but counterfactual notation is more general. Ex. Distribution of outcomes that would arise among those who took treatment (X = 1) had, counter-to-fact, they not received treatment:

P(Y(x = 0) = y | X = 1)

If treatment is randomized, so X ⊥⊥ Y(x = 0), then this equals P(Y(x = 0) = y), but in an observational study these may be different.


Graphs


Recap: Graphical Approach to Causality

[Figure: two DAGs. Left: X → Y, labelled ‘No Confounding’. Right: X, Y and an unobserved variable H, labelled ‘Confounding’.]

Graph intended to represent direct causal relations.

Convention that confounding variables (e.g. H) are always included on the graph.

Approach originates in the path diagrams introduced by Sewall Wright in the 1920s.

If X→ Y then X is said to be a parent of Y; Y is a child of X.


Edges are directed, but are they causal?

X → Y:  P(X, Y) = P(X)P(Y | X)  (No Confounding)

X ← Y:  P(X, Y) = P(Y)P(X | Y)  (No Confounding)

Neither factorization places any restriction on P(X, Y).


Linking the two approaches

[Figure: left, DAG X → Y.]

X ⊥⊥ Y(x0) & X ⊥⊥ Y(x1)

[Figure: right, DAG on X, Y with unobserved H.]

X ⊥̸⊥ Y(x0) & X ⊥̸⊥ Y(x1)

Elephant in the room: The variables Y(x0) and Y(x1) do not appear on these graphs!!


Node splitting: Setting X to 0

[Figure: DAG X → Y ⇒ SWIG in which X is split into a random node X and a fixed node x = 0, with edge x = 0 → Y(x = 0).]

P(X= x, Y= y) = P(X= x)P(Y= y | X= x)

Can now ‘read’ the independence: X ⊥⊥ Y(x=0).

Also associate a new factorization:

P(X= x, Y(x=0)= y) = P(X= x)P(Y(x=0)= y)

where: P(Y(x=0)= y) = P(Y= y | X=0).

This last equation links a term in the original factorization to the new factorization. We term this the ‘modularity assumption’.


Node splitting: Setting X to 1

[Figure: DAG X → Y ⇒ SWIG in which X is split into a random node X and a fixed node x = 1, with edge x = 1 → Y(x = 1).]

P(X= x, Y= y) = P(X= x)P(Y= y | X= x)

Can now ‘read’ the independence: X ⊥⊥ Y(x=1).

Also associate a new factorization:

P(X= x, Y(x=1)= y) = P(X= x)P(Y(x=1)= y)

where: P(Y(x=1)= y) = P(Y= y | X=1).


Crucial point: Y(x=0) and Y(x=1) are never on the same graph.

Although we have:

X ⊥⊥ Y(x=0) and X ⊥⊥ Y(x=1)

we do not have X ⊥⊥ Y(x=0), Y(x=1)

Had we tried to construct a single graph containing both Y(x=0) and Y(x=1) this would have been impossible. (Why?)

⇒ Single-World Intervention Graphs (SWIGs).


Representing both graphs via a ‘template’

[Figure: DAG G: X → Y ⇒ template G(x): X split into a random node X and a fixed node x, with edge x → Y(x).]

Represent both graphs via a template.

Formally this is a ‘graph valued function’:

Takes as input a specific value x∗

Returns as output a SWIG G(x∗).

Each instantiation of the template is a SWIG G(x∗) that represents a different margin: P(X, Y(x∗)) with red nodes x∗ becoming constants.


Intuition behind node splitting: (Robins, VanderWeele, Richardson 2007)

Q: How could we identify whether someone would choose to take treatment, i.e. have X = 1, and at the same time find out what happens to such a person if they don’t take treatment, Y(x = 0)?

A: Consider an experiment in which, whenever a patient is observed to swallow the drug (i.e. to have X = 1), we instantly intervene by administering a safe ‘emetic’ that causes the pill to be regurgitated before any drug can enter the bloodstream. Since we assume the emetic has no side effects, the patient’s recorded outcome is then Y(x = 0).


Harder Inferential problem

[Figure: causal DAG on X0, Z, H, X1, Y.]

Query: does this causal graph imply

Y(x0, x1) ⊥⊥ X1(x0) | Z(x0), X0 ?


Simple solution

[Figure: left, the causal DAG on X0, Z, H, X1, Y; right, the SWIG on X0, x0, Z(x0), H, X1(x0), x1, Y(x0, x1).]

Query: does this graph imply

Y(x0, x1) ⊥⊥ X1(x0) | Z(x0), X0 ?

Answer: Yes. Applying d-separation to the SWIG on the right, we see that there is no d-connecting path from Y(x0, x1) to X1(x0) given Z(x0) and X0. More on this shortly...


Single World Intervention Template Construction (1)

Given a graph G, a subset of vertices A = {A1, . . . , Ak} to be intervened on, we form G(a) in two steps:

(1) (Node splitting): For every A ∈ A split the node into a random node A and a fixed node a:

[Figure: Schematic illustrating the splitting of node A into a random node A and a fixed node a.]

The random half inherits all edges directed into A in G;

The fixed half inherits all edges directed out of A in G.


Single World Intervention Template Construction (2)

(2) Relabel descendants of fixed nodes:

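[Figure: schematic of the relabeling step. Before: fixed node a and random nodes A, B, C, D, E, F, X, T, Y, Z. After: the descendants of a are relabeled B(a, . . .), C(a, . . .), D(a, . . .), E(a, . . .), F(a, . . .); the remaining nodes appear as A(. . .), X(. . .), T(. . .), Y(. . .), Z(. . .).]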


Single World Intervention Graph

A Single World Intervention Graph (SWIG) G(a∗) is obtained from the Template G(a) by simply substituting specific values a∗ for the variables a in G(a);

For example, we replace G(x) with G(x=0).

Changing the value of a fixed variable corresponds to constructing a new graph and considering a different population, e.g. P(X, Y(x=0)) vs. P(X, Y(x=1))

It is only the instantiated graph G(x∗) that represents P(V(x∗)), not the template G(x).
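The two construction steps (node splitting, then relabeling) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `swig_template`, the edge-list representation, and the convention that the fixed half of node `A` is written as lower-case `a` are all assumptions made here.

```python
def swig_template(edges, intervened):
    """Return (edges, random_nodes) of the template G(a).

    edges      -- list of (parent, child) pairs describing the DAG G
    intervened -- set of node names to split (upper-case names assumed)
    """
    nodes = sorted({v for edge in edges for v in edge})
    fixed = {a: a.lower() for a in intervened}   # fixed half of each split node
    fixed_names = set(fixed.values())

    # Step 1 (node splitting): edges out of A now leave the fixed node a,
    # while edges into A still enter the random node A.
    split_edges = [(fixed.get(p, p), c) for p, c in edges]
    parents = {v: [p for p, c in split_edges if c == v] for v in nodes}

    def fixed_ancestors(v):
        """Fixed nodes with a directed path to v in the split graph."""
        found, seen, stack = set(), set(), list(parents[v])
        while stack:
            p = stack.pop()
            if p in seen:
                continue
            seen.add(p)
            if p in fixed_names:
                found.add(p)               # fixed nodes have no parents
            else:
                stack.extend(parents[p])
        return sorted(found)

    # Step 2 (relabeling): random node V becomes V(a, ...), listing the
    # fixed nodes that are its ancestors.
    labels = {a: a for a in fixed_names}
    for v in nodes:
        anc = fixed_ancestors(v)
        labels[v] = f"{v}({', '.join(anc)})" if anc else v
    swig_edges = [(labels[p], labels[c]) for p, c in split_edges]
    return swig_edges, [labels[v] for v in nodes]


# Hypothetical usage: a graph with edges Z -> X, X -> Y and Z -> Y.
mediation = [("Z", "X"), ("X", "Y"), ("Z", "Y")]
print(swig_template(mediation, {"Z", "X"}))
```

Substituting a specific value for each fixed node (for example replacing `x` by `x=0`) would then turn the template into a SWIG; that step is omitted here.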


Factorization and Modularity

Factorization: P(V(a)) over the counterfactual variables in G(a) factorizes with respect to G(a) (ignoring fixed nodes):

\[ P(V(a)) = \prod_{Y(a) \in V(a)} P\bigl( Y(a) \,\big|\, \mathrm{pa}_{G(a)}(Y(a)) \setminus a \bigr). \]

Modularity: P(V(a)) and P(V) are linked as follows:

\[ P\bigl( Y(a)=y \,\big|\, (\mathrm{pa}_{G(a)}(Y(a)) \setminus a) = q \bigr) = P\bigl( Y=y \,\big|\, (\mathrm{pa}_{G}(Y) \setminus A) = q,\; (\mathrm{pa}_{G}(Y) \cap A) = a_{\mathrm{pa}_{G}(Y) \cap A} \bigr). \]

So the conditional density associated with Y(a_Y) in G(a) is just the conditional density associated with Y in G after substituting a_i for any A_i ∈ A that is a parent of Y.


Applying d-separation to the graph G(a)

Counterfactual conditional independence relations may be obtained from the transformed graph by applying d-separation after adding fixed nodes to the conditioning set:

Given disjoint subsets B(a), C(a) and D(a) of random vertices (where D(a) may be empty),

if B(a) is d-separated from C(a) given D(a) ∪ a in G(a) (2)

then B(a) ⊥⊥ C(a) | D(a) [P(V(a))].

In words, if in G(a) two subsets B(a) and C(a) of random nodes are d-separated by D(a) in conjunction with the fixed nodes a, then B(a) and C(a) are conditionally independent given D(a) in the associated distribution P(V(a)).


Conditioning on fixed variables a

Intuitive, since these are fixed constants in the SWIG.

Since vertices in a have no parents, no new paths d-connect due to also conditioning on a.

⇒ If a d-separation holds in G(a) without conditioning on the fixed nodes, then it will continue to hold if we also condition on fixed nodes.

An alternative is simply to restrict attention to paths that do not contain fixed vertices,

e.g. remove fixed nodes from the graph before checking d-separation.
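A d-separation check on a SWIG can be sketched with the standard ancestral-graph moralization test. The code below is a minimal sketch, not the authors' software; the function name `d_separated` is hypothetical, and the example SWIG (edges L → X, L → Y(x), x → Y(x)) assumes a DAG in which L is a common cause of X and Y.

```python
from itertools import combinations


def d_separated(edges, xs, ys, zs):
    """True if node sets xs and ys are d-separated given zs in the DAG."""
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)
        parents.setdefault(p, set())

    # 1. Restrict to xs, ys, zs and all of their ancestors.
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents.get(v, ()))

    # 2. Moralize: link each node to its parents and 'marry' co-parents.
    adj = {v: set() for v in keep}
    for v in keep:
        ps = parents.get(v, set())
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)

    # 3. Delete zs, then look for an undirected path from xs to ys.
    seen, stack = set(), list(xs)
    while stack:
        v = stack.pop()
        if v in seen or v in zs:
            continue
        seen.add(v)
        stack.extend(adj[v] - zs)
    return not (seen & ys)


# Assumed SWIG for a DAG with L -> X, L -> Y and X -> Y, after splitting X.
swig = [("L", "X"), ("L", "Y(x)"), ("x", "Y(x)")]
print(d_separated(swig, {"X"}, {"Y(x)"}, {"L", "x"}))  # conditioning on L and the fixed node
print(d_separated(swig, {"X"}, {"Y(x)"}, {"L"}))       # fixed node left out of the conditioning set
print(d_separated(swig, {"X"}, {"Y(x)"}, {"x"}))       # L not conditioned on
```

In this example the answer is the same whether or not the fixed node x is placed in the conditioning set, in line with the remark above.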


Mediation graph (I)

Intervention on Z alone.

[Figure: DAG on Z, X, Y ⇒ SWIG on Z, z, X(z), Y(z).]

factorization:

P(Z, X(z), Y(z)) = P(Z)P(X(z))P(Y(z) | X(z))

modularity:

P(X(z)=x) = P(X=x | Z= z),

P(Y(z)=y | X(z)=x) = P(Y=y | X=x, Z= z).

d-separation gives: Z ⊥⊥ X(z), Y(z).


Mediation graph (II)

Intervention on Z and X:

[Figure: DAG on Z, X, Y ⇒ SWIG on Z, z, X(z), x, Y(x, z).]

factorization:

P(Z,X(z), Y(x, z)) = P(Z)P(X(z))P(Y(x, z))

modularity:

P(X(z)=x) = P(X=x | Z= z),

P(Y(x, z)=y) = P(Y=y | X= x,Z= z).

d-separation gives:

Z ⊥⊥ X(z) ⊥⊥ Y(x, z)


No direct effect graph

[Figure: DAG on Z, X, Y ⇒ SWIG on Z, z, X(z), x, Y(x).]

factorization:

P(Z,X(z), Y(x)) = P(Z)P(X(z))P(Y(x))

modularity:

P(X(z)=x) = P(X=x | Z= z),

P(Y(x)=y) = P(Y=y | X= x).

d-separation gives:

Z ⊥⊥ X(z) ⊥⊥ Y(x)


Inferential Problem (II)

[Figure: left, the causal DAG on X0, Z, H, X1, Y; right, the SWIG on X0, x0, Z(x0), H, X1(x0), x1, Y(x0, x1).]

Pearl (2009), Ex. 11.3.3, claims the causal DAG above does not imply:

Y(x0, x1) ⊥⊥ X1 | Z, X0 = x0. (3)

The SWIG shows that (3) does hold; Pearl is incorrect.

Specifically, we see from the SWIG:

Y(x0, x1) ⊥⊥ X1(x0) | Z(x0), X0, (4)

⇒ Y(x0, x1) ⊥⊥ X1(x0) | Z(x0), X0 = x0. (5)

This last condition is then equivalent to (3) via consistency.

(Pearl infers a claim of Robins is false since if true then (3) would hold).


Pearl’s twin network for the same problem

[Figure: the causal DAG on X0, Z, H, X1, Y alongside Pearl’s twin network, with factual nodes X0, Z, H, X1, Y, counterfactual nodes x0, Z(x0, x1), H(x0, x1), x1, Y(x0, x1), and error terms UZ, UH, UY.]

The twin network fails to reveal that Y(x0, x1) ⊥⊥ X1 | Z, X0 = x0.

This ‘extra’ independence holds in spite of d-connection because (by consistency) when X0 = x0, then Z = Z(x0) = Z(x0, x1).

Note that Y(x0, x1) ⊥̸⊥ X1 | Z, X0 ≠ x0.

Shpitser & Pearl (2008) introduce a pre-processing step to address this.


Confounding Revisited

[Figure: SWIG template on X, x, Y(x) with H.]

Here we can read directly from the template that X ⊥̸⊥ Y(x) since there is a path:

X← H→ Y(x).


Adjusting for confounding

[Figure: DAG on X, Y, L ⇒ SWIG template on X, x, Y(x), L.]

Here we can read directly from the template that

X ⊥⊥ Y(x) | L.

It follows that:

P(Y(x)=y) = ∑_l P(Y=y | L= l, X= x) P(L= l). (6)


Contrast with approach via removing edges

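[Figure: three graphs: the DAG on X, Y, L; the SWIG template on X, x, Y(x), L; and a third graph on X, Y, L for the edge-removal approach.]

This ‘explains’ why L is sufficient to control confounding under the null (where X has no effect on Y) but not under the alternative.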


Adjusting for confounding

[Figure: DAG on X, Y, L ⇒ SWIG template on X, x, Y(x), L.]

X ⊥⊥ Y(x) | L.

Proof of identification:

P[Y(x) = y] = ∑_l P[Y(x) = y | L = l] P(L = l)

= ∑_l P[Y(x) = y | L = l, X = x] P(L = l)   (independence)

= ∑_l P[Y = y | L = l, X = x] P(L = l)   (modularity)
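The adjustment formula can be evaluated numerically from an observed joint distribution. The following sketch uses hypothetical conditional probabilities (all numbers and names are assumptions for illustration) to build P(L, X, Y), then applies ∑_l P(Y=y | L=l, X=x) P(L=l) and compares it with the unadjusted conditional P(Y=y | X=x).

```python
from fractions import Fraction as F

# Hypothetical ingredients for a binary L, X, Y.
p_l = {0: F(1, 2), 1: F(1, 2)}
p_x1_given_l = {0: F(1, 5), 1: F(4, 5)}
p_y1_given_lx = {(0, 0): F(1, 10), (0, 1): F(3, 10),
                 (1, 0): F(5, 10), (1, 1): F(7, 10)}   # keyed by (l, x)

# Observed joint P(L=l, X=x, Y=y).
joint = {}
for l in (0, 1):
    for x in (0, 1):
        px = p_x1_given_l[l] if x == 1 else 1 - p_x1_given_l[l]
        for y in (0, 1):
            py = p_y1_given_lx[(l, x)] if y == 1 else 1 - p_y1_given_lx[(l, x)]
            joint[(l, x, y)] = p_l[l] * px * py


def prob(**fixed):
    """Probability of the event described by keyword arguments l, x, y."""
    return sum(p for (l, x, y), p in joint.items()
               if all({"l": l, "x": x, "y": y}[k] == v for k, v in fixed.items()))


def adjusted(x, y=1):
    """Sum over l of P(Y=y | L=l, X=x) P(L=l)."""
    return sum(prob(l=l, x=x, y=y) / prob(l=l, x=x) * prob(l=l) for l in (0, 1))


def naive(x, y=1):
    """Unadjusted P(Y=y | X=x)."""
    return prob(x=x, y=y) / prob(x=x)


print(adjusted(1), adjusted(0), naive(1), naive(0))
```

With these numbers the adjusted contrast is 1/2 − 3/10, while the unadjusted contrast is noticeably larger because L raises both the chance of treatment and the chance of Y = 1.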


More Examples (I)

[Figure: (a-i) DAG on X, Y, L, H; (a-ii) SWIG template on X, x, Y(x), L, H.]


X ⊥⊥ Y(x) | L.


More Examples (II)

[Figure: (b-i) DAG on X, Y, L, H; (b-ii) SWIG template on X, x, Y(x), L, H.]


X ⊥⊥ Y(x) | L.


Sequentially randomized experiment (I)

[Figure: DAG on A, B, C, D with H.]

A and C are treatments;

H is unobserved;

B is a time varying confounder;

D is the final response;

Treatment C is assigned randomly conditional on the observed history, A and B;

Want to know P(D(a, c)).


Sequentially randomized experiment (I)

[Figure: DAG on A, B, C, D with H.]

If the following holds:

A ⊥⊥ D(a, c)

C(a) ⊥⊥ D(a, c) | B(a), A

General result of Robins (1986) then implies:

P(D(a, c)=d) = ∑_b P(B=b | A= a) P(D=d | A= a, B=b, C= c).

Does it??


Sequentially randomized experiment (II)

[Figure: SWIG on A, a, B(a), C(a), c, D(a, c) with H.]

d-separation:

A ⊥⊥ D(a, c)

C(a) ⊥⊥ D(a, c) | B(a), A


P(D(a, c)=d) = ∑_b P(B=b | A= a) P(D=d | A= a, B=b, C= c).
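The identifying formula can be checked exactly on a toy model. The sketch below assumes one particular structural model that is consistent with the description above (A randomized; B depends on A and the unobserved H; C depends only on the observed history A and B; D depends on H and C). The edge set and every number are hypothetical, chosen only so that the g-formula, the truth, and the naive conditional can be compared.

```python
from fractions import Fraction as F
from itertools import product


def bern(p1, v):
    """P(V = v) for a binary V with P(V = 1) = p1."""
    return p1 if v == 1 else 1 - p1


# Hypothetical structural model; H is unobserved.
P_H1 = F(1, 2)
P_A1 = F(1, 2)
def p_b1(a, h): return F(1 + 2 * a + 4 * h, 10)   # B depends on A and H
def p_c1(a, b): return F(2 + 3 * a + 4 * b, 10)   # C depends on observed history only
def p_d1(h, c): return F(1 + 5 * h + 3 * c, 10)   # D depends on H and C

# Observed joint P(A, B, C, D), with H summed out.
joint = {}
for h, a, b, c, d in product((0, 1), repeat=5):
    p = (bern(P_H1, h) * bern(P_A1, a) * bern(p_b1(a, h), b)
         * bern(p_c1(a, b), c) * bern(p_d1(h, c), d))
    joint[(a, b, c, d)] = joint.get((a, b, c, d), 0) + p


def prob(a=None, b=None, c=None, d=None):
    """Marginal probability of the specified values of A, B, C, D."""
    want = (a, b, c, d)
    return sum(p for key, p in joint.items()
               if all(w is None or w == k for w, k in zip(want, key)))


def g_formula(a, c, d=1):
    """Sum over b of P(B=b | A=a) P(D=d | A=a, B=b, C=c)."""
    return sum(prob(a=a, b=b) / prob(a=a)
               * prob(a=a, b=b, c=c, d=d) / prob(a=a, b=b, c=c)
               for b in (0, 1))


def truth(a, c, d=1):
    """P(D(a, c) = d), computed by intervening in the structural model."""
    return sum(bern(P_H1, h) * bern(p_d1(h, c), d) for h in (0, 1))


def naive(a, c, d=1):
    """Unadjusted P(D=d | A=a, C=c)."""
    return prob(a=a, c=c, d=d) / prob(a=a, c=c)


print(g_formula(0, 1), truth(0, 1), naive(0, 1))
```

In this model the g-formula reproduces the interventional probability for every (a, c), whereas conditioning on A and C alone does not, because C carries information about H through B.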


Multi-network approach

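[Figure: multi-network on the factual nodes A, B, C, D, H, the counterfactual nodes a, B(a), C(a), D(a), H(a) and a, B(a, c), c, D(a, c), H(a, c), with shared error terms UH, UB, UC, UD.]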


Another example

[Figure: DAG on A, B, C, D with H1 and H2.]

A ⊥⊥ D(a, c)

C(a) ⊥⊥ D(a, c) | B(a), A


P(D(a, c)=d) = ∑_b P(B=b | A= a) P(D=d | A= a, B=b, C= c).

Does it??


Another example

[Figure: DAG on A, B, C, D with H1 and H2 ⇒ SWIG on A, a, B(a), C(a), c, D(a, c) with H1 and H2.]

A ⊥⊥ D(a, c)

C(a) ⊥⊥ D(a, c) | B(a), A


P(D(a, c)=d) = ∑_b P(B=b | A= a) P(D=d | A= a, B=b, C= c).


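General result (Robins, 1986)

Observed data:

O ≡ ⟨L1, A1, . . . , LK, AK, Y⟩.

If the following holds for k = 1, . . . , K:

\[ Y(a^\dagger) \perp\!\!\!\perp A_k(a^\dagger) \mid \overline{L}_k(a^\dagger),\, \overline{A}_{k-1}(a^\dagger); \tag{7} \]

then (under positivity):

\[ P\bigl(Y(a^\dagger)=y \mid \overline{L}_m(a^\dagger) = \overline{l}_m,\, \overline{A}_{m-1}(a^\dagger) = \overline{a}^\dagger_{m-1}\bigr) = \sum_{l_{m+1},\ldots,l_K} p(y \mid \overline{l}_K, \overline{a}^\dagger_K) \prod_{j=m+1}^{K} p(l_j \mid \overline{l}_{j-1}, \overline{a}^\dagger_{j-1}). \tag{8} \]

Here \(\overline{A}_{j-1}(a^\dagger) \equiv \langle A_1, \ldots, A_{j-1}(\overline{a}^\dagger_{j-2}) \rangle\), similarly for \(\overline{L}_{j-1}(a^\dagger)\).

The RHS of (8) is referred to as the ‘g-formula’.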


Dynamic regimes

A dynamic regime g is a policy that assigns treatment (usually at multiple time points) on the basis of past history;

Including conditional on the ‘natural’ value of treatment in the absence of an intervention;

Exercise for as long as you would have done without intervention or twenty minutes, whichever is more.

See Young et al. (2012) for additional analysis.


Dynamic regimes

[Figure: top, SWIG on A1, a1, L(a1), A2(a1), a2, Y(a1, a2) with H1 and H2; bottom, dynamic-regime graph on A1, A1+(g), L(g), A2(g), A2+(g), Y(g) with H1 and H2.]

P(Y(g)) is identified.

Dynamic regimes

[Figure: top, SWIG on A1, a1, L(a1), A2(a1), a2, Y(a1, a2) with H1 and H2; bottom, dynamic-regime graph on A1, A1+(g), L(g), A2(g), A2+(g), Y(g) with H1 and H2.]

P(Y(g)) is not identified.

Joint Independence

We saw earlier that the causal DAG X→ Y implied:

X ⊥⊥ Y(x0) and X ⊥⊥ Y(x1)

However, joint independence relations such as:

X ⊥⊥ Y(x0), Y(x1)

never follow from our SWIG transformation: There is no way via node-splitting to construct a graph with both Y(x0) and Y(x1).

This has important consequences for the identification of direct effects.
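To see that the two single-world independences do not logically entail the joint one, consider a small constructed distribution. The example below is hypothetical and is not taken from the slides: Y(x0) and Y(x1) are independent fair coins and X is their exclusive-or. The helper names are also assumptions.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical joint law over (X, Y(x0), Y(x1)):
# Y(x0), Y(x1) are independent fair coins and X = Y(x0) XOR Y(x1).
joint = {(y0 ^ y1, y0, y1): F(1, 4) for y0, y1 in product((0, 1), repeat=2)}


def marginal(indices):
    """Marginal law of the coordinates listed in `indices`."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in indices)
        out[sub] = out.get(sub, 0) + p
    return out


def independent(left, right):
    """Exact check that coordinates `left` are independent of coordinates `right`."""
    pl, pr, plr = marginal(left), marginal(right), marginal(left + right)
    return all(plr.get(a + b, 0) == pl[a] * pr[b] for a in pl for b in pr)


# Coordinates: 0 is X, 1 is Y(x0), 2 is Y(x1).
print(independent((0,), (1,)))     # X vs Y(x0)
print(independent((0,), (2,)))     # X vs Y(x1)
print(independent((0,), (1, 2)))   # X vs (Y(x0), Y(x1)) jointly
```

Here X is independent of each potential outcome separately, yet it is a deterministic function of the pair, so the joint independence fails.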


Assuming Independent Errors and Cross-World Independence


Mediation graph

Intervention on X and M:

[Figure: DAG on X, M, Y ⇒ SWIG on X, x, M(x), m, Y(x, m).]

d-separation gives:

X ⊥⊥ M(x) ⊥⊥ Y(x, m)

Pearl associates additional independence relations with this graph

Y(x1, m) ⊥⊥ M(x0), X

Y(x0, m) ⊥⊥ M(x1), X

equivalent to assuming independent errors, εX ⊥⊥ εM ⊥⊥ εY.

Pure Direct Effect

Pure (aka Natural) Direct Effect (PDE): Change in Y had X been different, but M fixed at the value it would have taken had X not been changed:

PDE ≡ Y(x1, M(x0)) − Y(x0, M(x0)).

Legal motivation [from Pearl (2000)]:

“The central question in any employment-discrimination case is whether the employer would have taken the same action had the employee been of a different race (age, sex, religion, national origin etc.) and everything else had been the same.” (Carson versus Bethlehem Steel Corp., 70 FEP Cases 921, 7th Cir. (1996)).


Decomposition

PDE also allows non-parametric decomposition of Total Effect (ACE) into direct (PDE) and indirect (TIE) pieces.

PDE ≡ E[Y(1, M(0))] − E[Y(0)]

TIE ≡ E[Y(1, M(1)) − Y(1, M(0))]

TIE + PDE ≡ E[Y(1)] − E[Y(0)] ≡ ACE(X→ Y)


Pearl’s identification claim

Pearl and others claim that under “no confounding” the PDE is identified by the following mediation formula:

PDE^med = ∑_m [E[Y | x1, m] − E[Y | x0, m]] P(m | x0)


Critique of PDE: Hypothetical Case Study

Observational data on three variables:

X treatment: cigarette cessation

M intermediate: blood pressure at 1 year, high or low

Y outcome: say CHD by 2 years

Observed data (X,M, Y) on each of n subjects.

All binary

X randomly assigned


Hypothetical Study (I): X randomized

                Y = 0   Y = 1   Total   P(Y=1 | m, x)
X = 0   M = 0   1500    500     2000    0.25
        M = 1   1200    800     2000    0.40
X = 1   M = 0   948     252     1200    0.21
        M = 1   1568    1232    2800    0.44

A researcher, Prof H, wishes to apply the mediation formula to estimate the PDE.

Prof H believes that there is no confounding, so that Pearl’s NPSEM-IE holds, but his post-doc, Dr L, is skeptical.


Hypothetical Study (II): X and M Randomized

To try to address Dr L’s concerns, Prof H carries out animal intervention studies.

                Y = 0   Y = 1   Total   P(Y(m, x)=1)
X = 0   M = 0   750     250     1000    0.25
        M = 1   600     400     1000    0.40
X = 1   M = 0   790     210     1000    0.21
        M = 1   560     440     1000    0.44

As we see: P(Y(m, x)=1) = P(Y=1 | m, x).

Prof H is now convinced: ‘What other experiment could I do?’

He applies the mediation formula, yielding PDE^med = 0.

Conclusion: No direct effect of X on Y.
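Prof H's calculation can be reproduced from the counts of Hypothetical Study (I). The sketch below applies the mediation formula ∑_m [E[Y | x1, m] − E[Y | x0, m]] P(m | x0) to those counts; the helper names `e_y` and `p_m_given_x` are hypothetical.

```python
from fractions import Fraction as F

# Counts from Hypothetical Study (I): (x, m) -> (n with Y = 0, n with Y = 1).
counts = {(0, 0): (1500, 500), (0, 1): (1200, 800),
          (1, 0): (948, 252), (1, 1): (1568, 1232)}


def e_y(x, m):
    """E[Y | X = x, M = m] = P(Y = 1 | x, m)."""
    n0, n1 = counts[(x, m)]
    return F(n1, n0 + n1)


def p_m_given_x(m, x):
    """P(M = m | X = x)."""
    total = sum(sum(counts[(x, mm)]) for mm in (0, 1))
    return F(sum(counts[(x, m)]), total)


pde_med = sum((e_y(1, m) - e_y(0, m)) * p_m_given_x(m, 0) for m in (0, 1))
print(pde_med)
```

The two strata contribute −0.04 and +0.04 with equal weight 1/2 under X = 0, which is how the mediation formula arrives at zero.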


Failure of the mediation formula

Under the true generating process, the true value of the PDE is:

PDE = 0.153 ≠ PDE^med = 0

Prof H’s conclusion was completely wrong!


Why did the mediation formula go wrong?

Dr L was right – there was a confounder:

[Figure: DAG on X, M, Y with a confounder H.]

but. . . it had a special structure so that:

Y ⊥⊥ H | M, X = 0 and M ⊥⊥ H | X = 1

[Figure: two graphs on M, Y, H, one for X = 0 and one for X = 1.]

The confounding is undetectable by any intervention on X and/or M.

Pearl: Onus is on the researcher to be sure there is no confounding. Causation should precede intervention.


PDE identification cannot be checked via experiment

If our only interventions are on the variables X and M then we cannot do an experiment to learn the PDE.

We could learn E[Y{x = 1, M(x = 0)}] by intervention if we could

- intervene and set X to 0 and observe M(0),
- then return each subject to their pre-intervention state,
- finally intervene to set X to 1 and M to M(0) and observe Y(1, M(0)).

Such an intervention strategy will usually not exist because it is not possible in a real-world intervention (e.g., suppose the outcome Y were death).

Because we cannot observe the same subject under both X = 1 and X = 0 (i.e. “across worlds”), no intervention will allow us to learn the distribution of mixed counterfactuals such as Y{x = 1, M(x = 0)}.

(In the story Dr L had to introduce a new node on the graph in order to check the value of the PDE via an experiment.)


Summary of critique of Independent Error Assumption

The independent error assumption cannot be checked by any randomized experiment on the variables in the graph.

⇒ Connection between experimental interventions and potential outcomes, established by Neyman, has been severed;

⇒ Theories in Social and Medical sciences are not detailed enough to support the independent error assumption.

What about faithfulness and causal discovery procedures?

Such inferences are explicit that they rely on faithfulness, and are designed to guide hypothesis formation;

- Contrast: In Pearl’s NPSEM-IE approach the simple act of using a DAG is viewed as automatically committing you to making this untestable hypothesis.

Predictions (possibly derived assuming faithfulness) regarding intervention distributions P(Y(x)) = P(Y | do(x)) can be tested by randomized experiments.


How many experimentally untestable assumptions?

Assumption of independent errors implies super-exponentially many ‘cross-world’ counterfactual independence assumptions:

No. Actual Vars.                2     3      4        K
Dim. P(V)                       3     7      15       2^K − 1
No. Counterfactual Vars.        3     7      15       2^K − 1
Dim. Counterfactual Dist.       7     127    32767    2^(2^K − 1) − 1
Dim. SWIG                       5     113    32697    (2^(2^K − 1) − 1) − ∑_{j=1}^{K−1} (4^j − 2^j)
Dim. NPSEM-IE                   4     19     274      ∑_{j=0}^{K−1} (2^(2^j) − 1)
No. untestable indep.           1     94     32423    O(2^(2^K − 2))
constraints in NPSEM-IE

Table: Dimensions of counterfactual models associated with complete graphs with binary variables.
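The closed-form entries in the last column can be checked against the numeric columns with a few lines of Python. The function name `dimensions` and the dictionary keys are hypothetical; the formulas are those in the table, and the number of untestable constraints is taken to be the difference between the SWIG and NPSEM-IE dimensions, which matches the tabulated values.

```python
def dimensions(k):
    """Model dimensions for a complete graph on k binary variables."""
    dim_counterfactual = 2 ** (2 ** k - 1) - 1
    dim_swig = dim_counterfactual - sum(4 ** j - 2 ** j for j in range(1, k))
    dim_npsem_ie = sum(2 ** (2 ** j) - 1 for j in range(k))
    return {
        "P(V)": 2 ** k - 1,
        "counterfactual vars": 2 ** k - 1,
        "counterfactual dist": dim_counterfactual,
        "SWIG": dim_swig,
        "NPSEM-IE": dim_npsem_ie,
        "untestable constraints": dim_swig - dim_npsem_ie,
    }


for k in (2, 3, 4):
    print(k, dimensions(k))
```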


SWIG Completeness Conjecture

In an NPSEM we define a counterfactual independence to be logical if it holds regardless of the distribution over counterfactuals (equivalently error terms), e.g. for binary X

Y(x0) ⊥⊥ Y(x1) | X, Y

Completeness Conjecture: There exists a distribution over counterfactuals that is experimentally indistinguishable from the NPSEM that assumes independent errors but in which the only non-logical independencies are those that may be derived from the SWIG.


Summary

A simple approach to unifying graphs and counterfactuals via node-splitting

The approach works via linking the factorizations associated with the two graphs

The approach provides a language that allows counterfactual and graphical people to communicate

The approach leads to many fewer untestable independence assumptions than in the NPSEM-IE approach of Pearl.

The approach also provides a way to combine information on the absence of individual and population level direct effects.


Thank You!


References

Pearl, J. Causality (Second ed.). Cambridge, UK: Cambridge University Press, 2009.

Richardson, TS, Robins, JM. Single World Intervention Graphs. CSSS Technical Report No. 128, http://www.csss.washington.edu/Papers/wp128.pdf, 2013.

Robins, JM. A new approach to causal inference in mortality studies with sustained exposure periods: applications to control of the healthy worker survivor effect. Mathematical Modeling 7, 1393–1512, 1986.

Robins, JM, VanderWeele, TJ, Richardson, TS. Discussion of “Causal effects in the presence of non compliance: a latent variable interpretation” by Forcina, A. Metron LXIV (3), 288–298, 2007.

Shpitser, I, Pearl, J. What counterfactuals can be tested. Journal of Machine Learning Research 9, 1941–1979, 2008.

Spirtes, P, Glymour, C, Scheines, R. Causation, Prediction and Search. Lecture Notes in Statistics 81, Springer-Verlag.


Details on Pearl’s Error

Pearl correctly states that using his Twin Network method (next slide) it may be shown that

Y(x0, x1) is not independent of X1, given Z and X0.

However, he then goes on to say (incorrectly):

In the twin network model there is a d-connected path from X1 to Y(x0, x1). . . Therefore, [(3)] is not satisfied for Y(x0, x1) and X1. [Ex. 11.3.3, p.353]

This is actually incorrect in two ways:

Y(x0, x1) ⊥̸⊥ X1 | Z, X0 does not imply Y(x0, x1) ⊥̸⊥ X1 | Z, X0 = x0;

d-separation is not complete for Twin Networks, so the presence of a d-connected path does not rule out that the independence is implied.


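[Figure: the DAG on T, X, Z, Y; its twin network with factual nodes T, X, Z, Y, error terms UT, UX, UZ, UY and counterfactual nodes T(z), X(z), z, Y(z); and the SWIG on T, X, Z, z, Y(z).]

T and Y(z) are d-connected given X in the twin-network, but in spite of this T ⊥⊥ Y(z) | X under the associated NPSEM-IE because X(z) = X, and T and Y(z) are d-separated given X in the twin-network.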


Importance of fixed nodes: leaving them out causes problems!

[Figure: DAG on Z, X, Y ⇒ graph on Z, X(z), Y(z) with the fixed node omitted.]

P(Y(z)=y | X(z) = x) = P(Y=y | X=x, Z= z).

versus

[Figure: DAG on Z, X, Y ⇒ graph on Z, X(z), Y(z) with the fixed node omitted.]

P(Z, X(z), Y(z)) = P(Z)P(X(z))P(Y(z) | X(z)),

P(Y(z)=y | X(z) = x) = P(Y=y | X=x)

Red nodes are needed in order to read off the modularity property from G(a).


No direct effect graph (I)

factorization:

P(Z, X(z), Y(z)) = P(Z)P(X(z))P(Y(z) | X(z))

modularity:

P(X(z)=x) = P(X=x | Z= z),

P(Y(z)=y | X(z) = x) = P(Y=y | X=x).

d-separation gives: Z ⊥⊥ X(z), Y(z)

