
Int J Data Sci Anal (2017) 4:99–112 · DOI 10.1007/s41060-017-0063-0

REGULAR PAPER

Tell cause from effect: models and evaluation

Jing Song¹ · Satoshi Oyama¹ · Masahito Kurihara¹

Received: 20 July 2016 / Accepted: 30 June 2017 / Published online: 21 July 2017
© Springer International Publishing AG 2017

Abstract Causal relationships differ from statistical relationships, and distinguishing cause from effect is a fundamental scientific problem that has attracted the interest of many researchers. Among causal discovery problems, discovering bivariate causal relationships is a special case. Causal relationships between two variables (“X causes Y” or “Y causes X”) belong to the same Markov equivalence class, and the well-known independence tests and conditional independence tests cannot distinguish directed acyclic graphs in the same Markov equivalence class. We empirically evaluated the performance of three state-of-the-art models for causal discovery in the bivariate case using both simulation and real-world data: the additive-noise model (ANM), the post-nonlinear (PNL) model, and the information geometric causal inference (IGCI) model. The performance metrics were accuracy, area under the ROC curve, and time to make a decision. The IGCI model was the fastest in terms of algorithm efficiency even when the dataset was large, while the PNL model took the most time to make a decision. In terms of decision accuracy, the IGCI model was susceptible to noise and thus performed well only under low-noise conditions. The PNL model was the most robust to noise. Simulation experiments showed that the IGCI model was the most susceptible to “confounding,” while the ANM and PNL models were able to avoid the effects of confounding to some degree.

This paper is a revised and extended version of our conference paper [23].

✉ Jing Song [email protected]
Satoshi Oyama [email protected]
Masahito Kurihara [email protected]

¹ Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Japan

Keywords Causal discovery · Bivariate · Accuracy · AUC · Time to make a decision

1 Introduction

People are generally more concerned with causal relationships between variables than with statistical relationships between variables, and the concept of causality has been widely discussed [9, 12]. The best way to demonstrate a causal relationship between variables is to conduct a controlled randomized experiment. However, real-world experiments are often expensive, unethical, or even impossible. Many researchers working in various fields (economics, sociology, machine learning, etc.) are thus using statistical methods to analyze causal relationships between variables [2, 3, 16, 19, 21, 25, 31, 37, 48, 55].

Directed acyclic graphs (DAGs) have been used to formalize the concept of causality [29]. Although a conditional independence test cannot tell the full story of a causal relationship, it can be used to exclude irrelevant relationships between variables [29, 44]. However, a conditional independence test is impossible when there are only two variables. Several models have been proposed to solve this problem [20, 38, 46, 47, 49]. For two variables X and Y, there are at least six possible relationships between them (Fig. 1). The top two diagrams show the independent case and the feedback case, respectively. The middle two show the two possible causal relationships between X and Y: “X causes Y” and “Y causes X.” The bottom two show the “common cause” case and the “selection bias” case.


Fig. 1 Possible relationships between X and Y. a Independent. b Feedback. c X causes Y. d Y causes X. e “Common cause.” f “Selection bias”

Unobserved variables Z are “confounders”¹ for causal discovery between X and Y. The existence of “confounders”² creates a spurious correlation between X and Y. Distinguishing spurious correlation due to unobserved confounders from actual causality remains a challenging task in the field of causal discovery. Many models are based on the assumption that no unobserved confounders exist.³

¹ For the definition of “confounding,” please refer to [13, 51].
² The number of “confounders” is not limited to one.

In the work reported here, we experimentally compared the performance of three state-of-the-art causal discovery models: the additive-noise model (ANM) [15], the post-nonlinear (PNL) model [54], and the information geometric causal inference (IGCI) model [22]. We used three metrics: accuracy, area under the ROC curve (AUC), and time to make a decision. This paper is an extended and revised version of our conference paper [23]. It includes new AUC results and the updated time to make a decision for the PNL model. It also describes typical examples of model failure and discusses the reasons for failure. Finally, it describes new experiments on the responses of the models to spurious correlation caused by confounders using simulation and real-world data.

In Sect. 2, we discuss related work in the field of causal discovery. In Sect. 3, we briefly describe the three models. In Sect. 4, we describe the dataset we used and the implementations of the three models. In Sect. 5, we present the results and give a detailed analysis of the performances of the three models. We conclude in Sect. 6 by summarizing the strengths and weaknesses of the three models and mentioning future tasks.

2 Related work

Temporal information is useful for causal discovery modeling [30, 32]. Granger [7] proposed detecting the causal direction of time series data on the basis of the temporal ordering of the variables and used linear systems to make it more operational. He formulated the definition of causality in terms of conditional independence relations [8]. Chen et al. extended the linear stochastic systems he proposed [7] to work on nonlinear systems [1]. Shajarisales et al. [39] proposed using the spectral independence criterion (SIC) for causal inference from time series data, with a mechanism different from that used in Granger causality, and compared the two methods. Both Granger causality [7] and extended Granger causality [1] need temporal information. When discovering causal relationships from time series data, the data resolution might differ from the true causal frequency. Gong et al. [6] discussed this issue and showed that exploiting the non-Gaussianity of the data can help identify the underlying model under some conditions.

³ There have been some efforts to deal with confounders. For example, Shimizu et al. [41] extended the linear non-Gaussian acyclic model to detect causal direction when there are “common causes” [14, 40].


Shimizu et al. [41] proposed using a linear non-Gaussian acyclic model (LiNGAM for short) to detect the causal direction of variables without the need for temporal information. LiNGAM works when the causal relationship between variables is linear, the distributions of disturbance variables are non-Gaussian, and the network structure can be expressed using a DAG. Several extensions of LiNGAM have been proposed [14, 18, 40, 42].

LiNGAM is based on the assumption of linear relationships between variables. Hoyer et al. [15] proposed using an additive-noise model (ANM) to deal with nonlinear relationships. If the regression function is linear, the ANM works in the same way as LiNGAM. Zhang et al. [52–54, 56] proposed using a PNL model that takes into account the nonlinear effect of causes, inner additive noise, and external sensor distortion. The ANM and PNL model are briefly introduced in the following section.

While the above models are based on structural equation modeling (SEM), which requires structural constraints on the data generating process, another research direction is based on the assumption that independent mechanisms in nature generate causes and effects. The idea is that the shortest description of the joint distribution p(cause, effect) is given by p(cause)p(effect|cause); compared with this factorization, the factorization p(effect)p(cause|effect) has higher total complexity. Comparing total complexity is an intuitive idea, and Kolmogorov complexity and algorithmic information can be used to measure it [19].

Janzing et al. [4, 22] proposed IGCI to infer the asymmetry between cause and effect through the complexity loss of distributions. The IGCI model is briefly introduced in the following section. Zhang et al. [57] proposed using a bootstrap-based approach to detect causal direction; it is based on the assumption that the parameters characterizing the distribution of the cause are exogenous to those of the process that generates the effect from the cause. Stegle et al. [45] proposed using a probabilistic latent variable model (GPI) to distinguish between cause and effect using standard Bayesian model selection.

In addition to the above studies on the causal relationship between two variables, there have been several reviews. Spirtes et al. discussed the main concepts of causal discovery and introduced several models based on SEM [43]. Eberhardt [5] discussed the fundamental assumptions of causal discovery and gave an overview of related causal discovery methods. Several methods have been proposed for deciding the causal direction between two variables, and specific methods have been compared. However, as far as we know, there has been little discussion of how to fairly compare methods based on different assumptions. In the work described above, accuracy was usually used as the evaluation metric. Another commonly used metric for a binary classifier is the AUC. Compared with accuracy, the ROC curve can show the trade-off between the true-positive rate (TPR) and the false-positive rate (FPR) of a binary classifier. In our framework for causal discovery models, we use AUC as an evaluation metric, and we have used it to obtain several new insights. We also used the time to make a decision as an evaluation metric since it may become a performance bottleneck when dealing with big data.

3 Models

We used the ANM, the PNL model, and the IGCI model in the comparison experiments. The ANM and PNL model define how causality data are generated in nature through SEM. The assumption of additive noise is enlightening. The IGCI model finds the asymmetry between cause and effect through the complexity loss of distributions. The assumption behind IGCI is intuitive, but how well it works in practice requires further study.

3.1 ANM

The additive-noise model of Hoyer et al. [15] is based on two assumptions: (1) the observed effects (Y) can be expressed using functional models of the cause (X) and additive noise (N) (Eq. 1); (2) the cause and additive noise are independent. If f(·) is a linear function and the noise has a non-Gaussian distribution, the ANM works in the same way as LiNGAM [41]. The model is learned by performing regression in both directions and testing the independence between the assumed cause and noise (residuals) for each direction. The decision rule is to choose the direction with less dependence as the true causal direction. The ANM cannot handle the linear Gaussian case: the data can fit the model in both directions, so the asymmetry between cause and effect disappears. Gretton et al. improved the algorithm and extended the ANM to work even in the linear Gaussian case [50]. The improved model also works more efficiently in the multivariate case.

$$Y = f(X) + N \quad (1)$$
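To make the decision rule concrete, the following is a minimal sketch of the ANM procedure, assuming numpy and scikit-learn; it is not the authors' implementation (which used the GPML toolbox and the full HSIC test in MATLAB), and the biased HSIC statistic below is a simplified stand-in for the test.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def hsic(a, b):
    """Biased HSIC estimate with Gaussian kernels and a median heuristic."""
    a, b = a.reshape(-1, 1), b.reshape(-1, 1)
    m = len(a)
    def gram(v):
        d2 = (v - v.T) ** 2                      # pairwise squared distances
        med = np.median(d2[d2 > 0])
        return np.exp(-d2 / med)
    H = np.eye(m) - np.ones((m, m)) / m          # centering matrix
    return np.trace(gram(a) @ H @ gram(b) @ H) / (m - 1) ** 2

def residual_dependence(cause, effect):
    """Regress effect on cause, then measure dependence of cause and residuals."""
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel())
    gp.fit(cause.reshape(-1, 1), effect)
    return hsic(cause, effect - gp.predict(cause.reshape(-1, 1)))

def anm_direction(x, y):
    # The direction with the smaller dependence (greater independence) wins.
    return "X->Y" if residual_dependence(x, y) < residual_dependence(y, x) else "Y->X"

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
y = x ** 3 + rng.uniform(-1, 1, 200)             # nonlinear mechanism, additive noise
print(anm_direction(x, y))                       # expected: X->Y
```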

3.2 PNL model

In the post-nonlinear model of Zhang et al. [53, 54], effects are nonlinear transformations of causes with some inner additive noise, followed by external nonlinear distortion (Eq. 2). From Eq. 2, we obtain $N = f^{-1}(Y) - g(X)$, where X and Y are the two observed variables representing cause and effect, respectively. To identify the cause and effect, a particular type of constrained nonlinear ICA [17, 53] is performed to extract two components that are as independent as possible.


The two extracted components are the assumed cause and the corresponding additive noise, respectively. The identification method of the model is described elsewhere ([53], Section 4). The identifiability of the causal direction inferred by the PNL model has been proven [54]. The PNL model can identify the causal direction of data generated in accordance with the model except for the five situations described in Table 1 in [54].

$$Y = f(g(X) + N) \quad (2)$$
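As an illustration, here is a minimal sketch of generating data that obey Eq. 2; the particular choices of g, f, and the noise are ours, not from the paper (f must be invertible for the model to hold). The identification procedure itself (constrained nonlinear ICA) is too involved to sketch here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)            # cause
n = 0.1 * rng.standard_normal(500)     # inner additive noise, independent of x
g = lambda v: v ** 3                   # nonlinear effect of the cause
f = np.tanh                            # invertible post-nonlinear distortion
y = f(g(x) + n)                        # Y = f(g(X) + N), Eq. 2
```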

3.3 IGCI

The IGCI model [4, 22] is based on the hypothesis that if “X causes Y,” the marginal distribution p(x) and the conditional distribution p(y|x) are independent in a particular way. The IGCI model gives an information-theoretic view of additive noise and defines independence by using orthogonality. With the ANM [15], inference is impossible if there is no additive noise, while it is possible with the IGCI model.

The IGCI model determines the causal direction on the basis of complexity loss. According to IGCI, the choice of p(x) is independent of the choice of f for the relationship y = f(x) + n. Let ν_x and ν_y be the reference distributions⁴ for X and Y. Then

$$D(P_x \,\|\, \nu_x) := \int \log \frac{P(x)}{\nu(x)} \, P(x) \, dx$$

is the KL-distance between P_x and ν_x, and it works as a feature of the complexity of the distribution. The complexity loss from X to Y is given by

$$V_{X \to Y} := D(P_x \,\|\, \nu_x) - D(P_y \,\|\, \nu_y).$$

The decision rule of the IGCI model is: if VX→Y < 0, infer “X causes Y,” and if VX→Y > 0, infer “Y causes X.” This rule is rather theoretical; entropy-based IGCI and slope-based IGCI provide applicable, explicit forms for particular reference measures.

1. Entropy-based IGCI:

$$S(P_X) := \psi(m) - \psi(1) + \frac{1}{m-1} \sum_{i=1}^{m-1} \log |x_{i+1} - x_i| \quad (3)$$

where ψ(·) is the digamma function⁵, m is the number of data points, and the x_i are sorted in ascending order.

⁴ Reference distributions are used to measure the complexity of P_x and P_y. In [22], non-informative distributions like uniform and Gaussian ones are recommended.
⁵ https://en.wikipedia.org/wiki/Digamma_function

$$V_{X \to Y} := S(P_Y) - S(P_X) = -V_{Y \to X} \quad (4)$$

2. Slope-based IGCI:

$$V_{X \to Y} := \frac{1}{m-1} \sum_{i=1}^{m-1} \log \left| \frac{y_{i+1} - y_i}{x_{i+1} - x_i} \right| \quad (5)$$

These explicit forms are simpler, and we can see that the two calculation methods coincide. The calculation does not take much time even when dealing with a large amount of data. However, the IGCI model assumes that the causal process is noiseless and may perform poorly under high-noise conditions. We discuss the performance of the three models in Sect. 5.
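A minimal sketch of the two estimators, assuming numpy and scipy; it applies the preprocessing described in Sect. 4.2 (normalization to [0, 1] and filtering of repetitive values) but is not the authors' code.

```python
import numpy as np
from scipy.special import psi          # digamma function

def _normalize_unique(v):
    v = np.unique(v)                   # sort ascending and drop repetitive values
    return (v - v.min()) / (v.max() - v.min())

def entropy_estimate(v):
    """Eq. 3: psi(m) - psi(1) + mean of log spacings of the sorted data."""
    v = _normalize_unique(v)
    return psi(len(v)) - psi(1) + np.mean(np.log(np.diff(v)))

def igci_entropy(x, y):
    """Eq. 4: V_{X->Y} = S(P_Y) - S(P_X); a negative value favors 'X causes Y'."""
    return entropy_estimate(y) - entropy_estimate(x)

def igci_slope(x, y):
    """Eq. 5: mean log slope after sorting by x; compare the sign as above."""
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    idx = np.argsort(x)
    dx, dy = np.diff(x[idx]), np.diff(y[idx])
    keep = (dx != 0) & (dy != 0)       # filter repetitive data
    return np.mean(np.log(np.abs(dy[keep] / dx[keep])))

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500)
y = x ** 2                             # noiseless nonlinear mechanism
print(igci_entropy(x, y), igci_slope(x, y))   # both negative: infer "X causes Y"
```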

4 Experiments

Here we describe the dataset used in our experiments and the implementation of each model.

4.1 Dataset

We used the cause effect pairs (CEP) [27] dataset, which contains 97 pairs of real-world causal variables with the cause and effect labeled for each pair. The dataset is publicly available online [27]. Some of the data were collected from the UCI machine learning repository [24]. The data come from various fields, including geography, biology, physics, and economics. The dataset also contains time series data. Most of the data are noisy. An appendix in [28] contains a detailed description of each pair of variables.

We used 91 of the pairs in our experiments since some of the data (e.g., pair0052) contain multi-dimensional variables.⁶ The 91 pairs are listed in Table 4 in “Appendix.” Some contain the same variables collected in different countries or at different times.⁷ The sample sizes of the variables range from 126 to 16,382.⁸ The variety of data types in the CEP dataset makes causal analysis using real-world data challenging.

4.2 Implementation

We implemented the three models following the original work [15, 22, 53]. A brief introduction is given below.

ANM Using the reported experimental settings [15], we performed regression using the Gaussian Processes for Machine Learning (GPML) framework [33–35]. We then used the Hilbert–Schmidt Independence Criterion (HSIC) [10] to test the independence between the assumed cause and the residuals. The dataset used had been labeled with the true causal direction for each pair, with no cases of independence or feedback. Using the decision rule of the ANM, we determined that the direction with the greater independence was the true causal direction.

⁶ The three models we evaluated cannot deal with multi-dimensional data.
⁷ Country and time information is not included in the table.
⁸ To avoid overfitting, we limited the size to 500 or less.

PNL Model We used a particular type of constrained nonlinear ICA to extract the two components that would be the cause and noise if the model had been learned in the correct direction. The nonlinearities of g(·) and f⁻¹(·) in Eq. 2 were modeled using multilayer perceptrons. By minimizing the mutual information between the two output components, we made the output as independent as possible. After extracting two independent components, we tested their independence by using the HSIC [10, 11]. Finally, in the same way as for the ANM, we determined that the direction with the greater independence was the correct one.

IGCI (entropy, uniform) Compared with the first two models, the implementation of the IGCI (entropy, uniform) model was simpler. We used Eqs. 3 and 4 to calculate VX→Y and VY→X and determined that the direction in which entropy decreased was the correct direction. If VX→Y < 0, the inferred causal direction was “X causes Y”; otherwise, it was “Y causes X.” For the IGCI model, the data should be normalized before calculating VX→Y and VY→X. In accordance with the reported experimental results, we used the uniform distribution as the reference distribution because of its good performance. For the repetitive data in the dataset, we set log 0 = 0.

IGCI (slope, uniform) The implementation of the IGCI (slope, uniform) model was similar to that of the IGCI (entropy, uniform) one. We used Eq. 5 to calculate VX→Y and VY→X and determined that the direction with a negative value was the correct one. For the same reason as above, we normalized the data to [0, 1] before calculating VX→Y and VY→X. To make Eq. 5 meaningful, we filtered out the repetitive data.

5 Results

Here, we first compare model accuracy for different decision rates.⁹ We changed the threshold and calculated the corresponding decision rate and accuracy for each model. The accuracy of the models for different decision rates was compared in the original study [4]. Compared with [4], we used more real-world data in our experiments and show how accuracy changed under different decision rates. Our results are consistent with those of Mooij et al. [26]. The performance of the models for different decision rates is discussed in Sect. 5.1.

⁹ Since all three models have two outputs (e.g., VX→Y and VY→X, corresponding to the two possible causal directions), we set thresholds based on the absolute difference between them, |VX→Y − VY→X|, for use in deciding each decision rate and model accuracy.

Since causal discovery models in the bivariate case make a decision between two choices, we can regard these models as binary classifiers and evaluate them using AUC. We previously divided the data into two groups (inferred as “X causes Y” and inferred as “Y causes X”) and evaluated the performance of each model for each group [23]. Here we give the results for the entire (undivided) dataset.

Finally, we compare model efficiency by using the average time needed to make a decision. This is described in Sect. 5.3.

5.1 Accuracy for different decision rates

We calculated the accuracy of each model for different decision rates using Eqs. 6 and 7. The results are plotted in Fig. 2. The decision rate changed when the threshold was changed: the larger the threshold, the more stringent the decision rule. In an ideal situation, accuracy decreases as the decision rate increases, with the starting point at 1.0. However, the results with real-world data were not perfect because the data were noisy.

As shown in Fig. 2, the accuracy started from 1.0 for the ANM and IGCI and from 0.0 for the PNL model. This means that the PNL model made an incorrect decision when it had the highest confidence. Although the accuracies of the IGCI models started from 1.0, they dropped sharply when the decision rate was between 0 and 0.2. The reasons for this are discussed in detail in Sect. 5.4.

Fig. 2 Accuracy of the three models for different decision rates. The decision rate changed when the threshold was changed: the larger the threshold, the smaller the decision rate. In an ideal case, the accuracy of each model should improve with a decrease in the decision rate


After reaching a minimum, the accuracies increased almost continuously and stabilized. The accuracy of the ANM was more stable than those of the other models. When all decisions had been made, the model accuracies were ranked IGCI > ANM > PNL.

$$\text{Decision rate} := \frac{N_{\text{Decision}}}{N_{\text{Data}}} \quad (6)$$

$$\text{Accuracy} := \frac{N_{\text{True Decision}}}{N_{\text{Decision}}} \quad (7)$$
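A minimal sketch of tracing the accuracy/decision-rate curve, assuming numpy: pairs are sorted by confidence |VX→Y − VY→X|, and at each threshold only the pairs above it count as decisions (Eq. 6), with accuracy (Eq. 7) computed over those decisions. The input values are illustrative placeholders, not our experimental results.

```python
import numpy as np

def accuracy_vs_decision_rate(confidence, correct):
    """confidence: |V_xy - V_yx| per pair; correct: whether the decision was right."""
    order = np.argsort(confidence)[::-1]          # most confident pairs first
    correct = np.asarray(correct, dtype=float)[order]
    k = np.arange(1, len(correct) + 1)
    decision_rate = k / len(correct)              # Eq. 6: N_decision / N_data
    accuracy = np.cumsum(correct) / k             # Eq. 7, over decided pairs only
    return decision_rate, accuracy

conf = np.array([0.9, 0.7, 0.5, 0.3, 0.1])        # illustrative confidences
right = np.array([True, True, False, True, False])
for r, a in zip(*accuracy_vs_decision_rate(conf, right)):
    print(f"decision rate {r:.1f}: accuracy {a:.2f}")
```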

5.2 Area under ROC curve (AUC)

Besides calculating the accuracy of the three models for different decision rates, we used the AUC to evaluate their performance. Some of the experimental results were presented in our conference paper [23], for which the dataset was divided into two groups: inferred as “X causes Y” and inferred as “Y causes X.” Here we present updated experimental results for the entire dataset.

The following steps were taken to get the experimental results:

1. Set X as the cause and Y as the effect in the input data.
2. Set VX→Y and VY→X to be the outputs.
3. Calculate the absolute value of the difference between VX→Y and VY→X (Eq. 8) and map Vdiff to [0, 1].
4. Assign a positive label to the pairs inferred as “X causes Y” and a negative one to the pairs inferred as “Y causes X.”
5. Use Vdiff and the labels assigned in step 4 to calculate the true-positive rate (TPR) and false-positive rate (FPR) for different thresholds.
6. Plot the ROC curve and calculate the corresponding AUC value.

$$V_{\text{diff}} = |V_{X \to Y} - V_{Y \to X}| \quad (8)$$

In step (1), instead of dividing the data into two groups as done previously [23], we set the input matrix so that the first column was the cause and the second column was the effect. Then, if the inferred causal direction for a pair was “X causes Y,” a positive label was assigned to that pair; otherwise, a negative label was assigned.

In step (3), we used the absolute value of the difference between VX→Y and VY→X as the “confidence” of the model when making a decision. The larger the Vdiff, the greater the confidence. We did not use division because, if one of VX→Y and VY→X was very small, the result would be very large. We mapped Vdiff to [0, 1] to make the calculation more convenient. In this way, Vdiff could be used in the same way as the output of a binary classifier. For causal discovery, the larger the Vdiff, the greater the confidence in the decision; at the same time, an incorrect decision made with high confidence should be punished more.

In step (4), we labeled the data in accordance with the inferred causal direction. Since the correct label for all the pairs was “X causes Y,” if the inferred result for a pair was “Y causes X,” it was assigned a negative label.

In step (5), we used the normalized Vdiff and the label assigned in step (4) to calculate TPR and FPR for different thresholds. We plotted TPR against FPR to get the ROC curve and calculated the corresponding AUC value.
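A minimal sketch of steps 1–6, assuming numpy and scikit-learn. Because the true direction of every input pair is “X causes Y,” a pair's label is positive exactly when the model inferred that direction; the normalized Vdiff serves as the classifier score. The sign convention below follows IGCI, and the numbers are made up for illustration, not actual model results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_for_model(v_xy, v_yx):
    v_diff = np.abs(v_xy - v_yx)                                      # Eq. 8
    v_diff = (v_diff - v_diff.min()) / (v_diff.max() - v_diff.min())  # map to [0, 1]
    labels = (v_xy < v_yx).astype(int)  # IGCI convention: infer X->Y when V_{X->Y} < V_{Y->X}
    return roc_auc_score(labels, v_diff)

# Five illustrative pairs; the incorrect decisions carry high confidence,
# so the AUC is punished accordingly.
v_xy = np.array([-0.8, -0.3, 0.4, -0.6, 0.9])
v_yx = -v_xy                            # IGCI: V_{Y->X} = -V_{X->Y}
print(auc_for_model(v_xy, v_yx))
```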

The ROC results are plotted in Fig. 3, and the corresponding AUC values are shown in Table 1. In contrast to the results shown in Fig. 2, both IGCI models performed poorly in terms of AUC. The AUC values for IGCI were smaller than 0.5, which means their performance was even worse than that of a random classifier. However, as described in Sect. 5.1, when we used different decision rates, the IGCI models had better performance.

We checked the decisions made by the IGCI models and found that they made several incorrect decisions when the threshold was large. Such decisions are punished severely by the AUC metric. As shown in Fig. 2, although the accuracies of the IGCI models started from 1.0, they dropped sharply when the decision rate was between 0 and 0.2. An incorrect decision at a low decision rate is not punished much when evaluating accuracy for different decision rates. For the AUC, however, an incorrect decision made when the threshold was large is punished more than one made when the threshold was small. For these reasons, the starting point of the ROC curve for the IGCI models in Fig. 3 was shifted to the right, making the AUC less than 0.5. In Sect. 5.4, we discuss why the IGCI models failed.

5.3 Algorithm efficiency

Besides comparing the accuracy and ROC of the three models, we also compared the average time for each algorithm to make a decision.¹⁰ We performed the experiment on the MATLAB platform with an Intel Core i7-4770 3.40 GHz CPU and 8 GB of memory. From Table 2, we can see that the IGCI models were the most efficient, while the PNL model was the least efficient; the ANM was in the middle. The longer time to make a decision for the PNL model was due to the procedure for estimating f⁻¹ and g in Eq. 2.

¹⁰ Compared to our previous report [23], we reduced the program output so that the PNL model works faster. We have updated the results for the time to make a decision for the PNL model accordingly.


Fig. 3 ROC curves of the three models. Four graphs are shown because IGCI has two explicit forms

Table 1 AUC of the three models

  Model            AUC
  ANM              0.6263
  PNL              0.6120
  IGCI (entropy)   0.4935
  IGCI (slope)     0.4271

Table 2 Time to make a decision

  Model            Mean (s)   Variance
  ANM              10.7       7.4
  PNL              60.0       20.3
  IGCI (entropy)   0.0014     0.0019
  IGCI (slope)     0.0014     0.0017

5.4 Typical examples of model failure

5.4.1 Discretization

In Sect. 5.1, we explained that the PNL model gives an incorrect decision when the threshold is set the highest, i.e., its accuracy for different decision rates starts from 0.0. We investigated the reason for the PNL model making an incorrect decision when the threshold was the highest and found that it is due to the discretization of data. A scatter plot for a pair of variables (pair0070) is shown in Fig. 4. The data have two variables. Variable x1 is a value between 0 and 14 reflecting the features of an artificial face; it is used to decide whether the face is that of a man or a woman. A value of 0 means very female, and a value of 14 means very male. Variable x2 is a value of 0 (female) or 1 (male) reflecting the gender decision. Since variable x2 has only two values, no matter what nonlinear transformation is applied to x2, the result is two discretized values. According to the mechanism of the PNL model, x2 with two discretized values is inferred to be the cause since the independence is greater if x2 is the cause. In fact, for this pair of variables, all three models made incorrect decisions in our experiments. For the ANM, the discretization of the data makes regression analysis difficult, and the poor regression results negatively affect the HSIC test of independence between the assumed cause and the residuals. For IGCI, to make Eqs. 3 and 5 meaningful, the repetitive data have to be filtered out, which means that only a few data points are actually used in the final IGCI calculation.

5.4.2 Noisiness

In Sect. 5.1, we showed that IGCI had good performance in general. However, its accuracy dropped sharply when the decision rate was between 0 and 0.2.


Fig. 4 Example of discretized data in the CEP dataset. Variable x1 is between 0 (very female) and 14 (very male), reflecting features of an artificial face. Variable x2 is 0 (female) or 1 (male), reflecting the gender decision

Fig. 5 Example scatter plot of noisy data in the CEP dataset. Variable x1 is female life expectancy at birth for 2000–2005. Variable x2 is the latitude of the birth country’s capital (data for China, Russia, and Canada were removed)

The reason for this is that it made incorrect decisions with high confidence when dealing with pair0056–pair0063. These eight pairs contain much noise, which degraded model performance. Moreover, there were outliers for the eight pairs, which greatly affected the decision results. A scatter plot for one example pair (pair0056) is shown in Fig. 5. It shows that the two variables have a relatively small correlation and that there are outliers in the data. The calculation method used in IGCI is such that these kinds of outliers affect the inference result more than the other data points. The incorrect decisions that IGCI made about pair0056–pair0063 account for the small AUC values for the IGCI models given in Sect. 5.2. The ANM also made incorrect inferences about these pairs because noise and outliers make overfitting more likely to occur for these variables. For these noisy pairs, the PNL model had the best performance.

5.5 Response to spurious correlation caused by “confounding”

A causal relationship differs from a statistical one, and a statistical relationship is usually not enough to explain a causal one. Even if we observe that two variables are highly correlated, we cannot say that they have a causal relationship.

As shown in Table 4 in “Appendix,” the causal direction of most pairs in the CEP dataset, which was collected for evaluating causal discovery models, is obvious from common sense. However, the relationships between variables of general interest in the real world are usually more controversial, e.g., smoking and lung cancer, for which the existence of confounding is usually a bone of contention. A good causal discovery model for two variables should be able to avoid the effect of “confounding” to some degree. To test how the ANM, the PNL model, and IGCI perform when dealing with spurious correlation caused by confounding, we first simulated the “common cause” case shown in Fig. 1, controlling the data generating process to simulate different degrees of confounding. In addition to simulation, we used real-world data from the CEP dataset to evaluate model performance.

5.5.1 Simulation

We conducted simulation experiments of the “common cause” confounding case shown in Fig. 1. We generated data using two equations, $x = a z^3 + b n_1$ and $y = a z + b n_2$, where $z, n_1, n_2 \sim U(0, 1)$. We used the quotient a/b to control the degree of confounding. There was no direct causal relationship between variables x and y; their association was due entirely to the various degrees of confounding. Scatter plots of the data generated using five different quotients are shown in Fig. 6.
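A minimal sketch of this data-generating process, assuming numpy; the correlation printout merely illustrates how the quotient a/b strengthens the spurious association between x and y.

```python
import numpy as np

def generate_confounded(a, b, m=500, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.uniform(0, 1, m)                   # unobserved common cause
    n1, n2 = rng.uniform(0, 1, m), rng.uniform(0, 1, m)
    x = a * z ** 3 + b * n1                    # no direct causal link x -> y
    y = a * z + b * n2
    return x, y

for ratio in (0.1, 1, 10, 100, 1000):
    x, y = generate_confounded(a=ratio, b=1.0)
    print(ratio, np.corrcoef(x, y)[0, 1])      # correlation grows with a/b
```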

We used the generated data to test the performance of the three models. We performed ten random experiments for each quotient and calculated the average of the inferred results. The experimental results are shown in Figs. 7 and 8 and Table 3. For IGCI, when the degree of confounding was low, the mean of the estimated results¹¹ was almost zero; each estimated result was positive or negative at random. As the degree of confounding increased, the estimated results tended to favor “Y causes X.” For the PNL model, the independence assumptions for both directions were accepted when a/b = 0.1, 1, 10 (α = 0.01), and the means of the test statistics (Equation 4 in [11]) were almost equal. When a/b = 100, the PNL model rejected the independence assumption for the direction “X causes Y” and accepted that for “Y causes X.” When a/b was even larger, the independence assumptions were rejected for both directions, especially that of “X causes Y.” For the ANM, when the degree of confounding was low, the independence assumptions were accepted for both directions.

¹¹ We used the difference between VX→Y and VY→X (Eqs. 4, 5) as the estimated result. For IGCI, VX→Y is the opposite of VY→X if there are no repetitive data. Thus, we infer that the correct causal direction is X → Y if the estimated result is negative and that Y → X is correct if it is positive.


Fig. 6 Scatter plots of generated data. a a/b: 0.1, b a/b: 1, c a/b: 10, d a/b: 100, e a/b: 1000

When the degree was high, the independence assumptions were rejected for both directions, especially “X causes Y.”

Fig. 7 Estimated results of IGCI for generated data. The inference was “Y causes X” when the estimated result was larger than 0

Fig. 8 Test statistics for PNL model for generated data

From these results, we can see that IGCI is the most susceptible of the three models tested to confounding, while the PNL model and the ANM can avoid the effect of confounding to a certain extent.


Table 3 P values for the ANM for the two directions

  a/b     x → y assumed   y → x assumed
  0.1     0.7044          0.7109
  1       0.2837          0.1038
  10      1.03E−49        1.47E−16
  100     4.32E−103       7.24E−18
  1000    4.46E−179       7.70E−21

Fig. 9 Results for the “common cause” case with real-world data. Variable x1: “length”; variable x2: “diameter”

5.5.2 Real-world data

In addition to the simulation experiments described above, we conducted experiments using real-world data. The generation of real-world data can be very complex, which increases the difficulty of the causal discovery task. Here we describe the “common cause” and “selection bias” cases (Fig. 1).

Common cause case Although data for “common cause” are not included in the CEP dataset, some CEP pairs share the same cause, as shown in “Appendix.” We combined data containing the same cause to obtain pairs of variables, such as “length and diameter.”¹² A scatter plot for the pair “length, diameter” is shown in Fig. 9. For the ANM, the p value for the forward direction, “length causes diameter,” was 8.28 × 10⁻⁵, while that for the backward direction was 1.05 × 10⁻³. For the PNL model, the independence test statistic was 1.11 × 10⁻³ for the forward direction and 9.50 × 10⁻⁴ for the backward direction. For IGCI, the result estimated by calculating the entropy was 0.2197, while that estimated by calculating the slope was 0.056. For this pair, although the tendency was not strong, the three models tended to favor “diameter is the cause of length.”

Selection bias case Although there is no causal relationship between X and Y in the selection bias case, independence does not hold between X and Y when conditioned on variable Z. This is the well-known Berkson paradox.¹³

¹³ https://en.wikipedia.org/wiki/Berkson's_paradox

¹² The pair “length, diameter” was created from “cause: rings (abalone), effect: length” and “cause: rings (abalone), effect: diameter” using data for abalone [24].

Fig. 10 Results for the “selection bias” case with real-world data. Variable x1: “altitude”; variable x2: “longitude”

We used the variables “altitude and longitude” (Fig. 10) contained in the CEP dataset to test how the three models perform when dealing with the “selection bias” case. The variables were obtained by combining “cause: altitude, effect: temperature” and “cause: longitude, effect: temperature.” The data came from 349 weather stations in Germany [27]. A scatter plot is shown in Fig. 10. For the ANM, the p value for the forward direction, “altitude causes longitude,” was 4.21 × 10⁻², while that for the backward direction was 8.73 × 10⁻². The independence assumptions were accepted for both directions, although “longitude causes altitude” was favored. For the PNL model, the test statistic was 2.46 × 10⁻³ for the forward direction and 3.30 × 10⁻³ for the backward direction. The independence assumptions were accepted for both directions, and the independence test results were similar. For IGCI, the result estimated by calculating the entropy was 1.2742 and that estimated by calculating the slope was 1.9032. Both results were positive, so the estimated causal direction was “longitude causes altitude.” Although there should be no causal relationship between “altitude” and “longitude,” it was hard for the three models to determine that from the limited amount of observational data available.
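As a small numerical illustration of the Berkson effect (our example, not from the paper): two independent variables become dependent once samples are selected through their common effect.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = rng.standard_normal(100_000)
z = x + y                                            # common effect (collider)
selected = z > 1.0                                   # selection depends on z
print(np.corrcoef(x, y)[0, 1])                       # ~0: independent overall
print(np.corrcoef(x[selected], y[selected])[0, 1])   # clearly negative after selection
```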

For most cases of causal discovery in the real world, only limited observational data can be obtained, and in some cases the data are incomplete. Moreover, for the case of two variables, the causal sufficiency assumption [36] is easily violated if there is an unobserved common cause. The limited amount of data and unobserved confounders make causal discovery in the bivariate case challenging.

6 Conclusion

We compared three state-of-the-art models (ANM, PNL model, IGCI) for causal discovery in the bivariate case using simulation and real-world data.


Testing using different decision rates showed that the three models had similar accuracies when all the decisions were made. To check whether the decisions made were reasonable, we used a binary classifier metric: the area under the ROC curve (AUC). The IGCI model had a small AUC value because it made several incorrect decisions when the threshold was high. Compared with those of the other models, the accuracy of the ANM was relatively stable. A comparison of the time to make a decision showed that IGCI was the fastest even when the dataset was large, while the PNL model took the most time to make a decision.

Of the three models, the IGCI one had the best performance when there was little noise and the data were not discretized much. Improving the performance of the IGCI model when there is much noise and dealing with discretized data are future tasks. Although the performance of the ANM was relatively stable, overfitting should be avoided for the ANM because it negatively affects the subsequent independence test. Of the three models, the PNL model is the most generalized one as it takes into account the nonlinear effect of causes, inner additive noise, and external sensor distortion. However, the estimation of g(·) and f⁻¹(·) is a lengthy procedure. Finally, testing the responses of the models to “confounding” showed that the ANM and the PNL model can avoid the effect of “confounding” to some degree, while IGCI is the most susceptible to confounding.

Acknowledgements We thank the anonymous reviewers for their helpful comments to improve the paper. The work was supported in part by a Grant-in-Aid for Scientific Research (15K12148) from the Japan Society for the Promotion of Science.

Appendix: Pairs of variables used in experiment

See Table 4.

Table 4 Pairs of causal variables

Cause → Effect

Altitude → Temperature
Altitude → Precipitation
Longitude → Temperature
Altitude → Sunshine
Rings (abalone) → Length
Rings (abalone) → Shell weight
Rings (abalone) → Diameter
Rings (abalone) → Height
Rings (abalone) → Whole weight
Rings (abalone) → Shucked weight
Rings (abalone) → Viscera weight
Age → Body mass index
Age → 2-hour serum insulin
Age → Diastolic blood pressure
Age → Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Age → Height
Age → Weight
Age → Heart rate
Stock return of Hang Seng Bank → Stock return of HSBC Hldgs
Stock return of Hutchison Holdings → Stock return of Cheung Kong Holdings
Stock return of Cheung Kong Holdings → Stock return of Sun Hung Kai Properties
Open http connections during that minute → Bytes sent during that minute
Parameter between 0 and 14 → Gender decision
Superplasticizer → Compressive strength
Coarse aggregate → Compressive strength
Water → Compressive strength
Age (day) → Compressive strength
Alcohol consumption → mcv (mean corpuscular volume)
Alcohol consumption → alkphos (alkaline phosphatase)
Alcohol consumption → sgpt (alanine aminotransferase)
Alcohol consumption → sgot (aspartate aminotransferase)
Alcohol consumption → gammagt (gamma-glutamyl transpeptidase)
Displacement → mpg
Outdoor temperature → Indoor temperature
Temperature → Ozone concentration
Sunspot area → Global mean temperature anomalies
Energy use → CO2 emissions
GNI (gross national income) → Life expectancy
GNI (gross national income) → Under-5 mortality rate
Average annual rate of change in population → Average annual rate of change in total dietary consumption for total population
Solar radiation → Daily average temperature
PPFD (photosynthetic photon flux density) → NEP (net ecosystem productivity)
PPFDdif (photosynthetic photon flux density, diffusive) → NEP (net ecosystem productivity)
PPFDdir (photosynthetic photon flux density, direct) → NEP (net ecosystem productivity)
Horsepower → Acceleration
Age → Dividends from stock
Age of child in years → Concentration of GAG (glycosaminoglycan)
Duration of eruption in minutes → Time to next eruption in minutes
Latitude → Temperature
Longitude → Precipitation
Cement → Compressive strength
Blast furnace slag → Compressive strength
Temperature → CO2 flux at night
Population → Employment
Time to take weekly measurements → Protein content of milk produced by each cow at time X
Size of room → Rent
Mean temperature → Total snow
Age → Relative spinal bone mineral density
Mass loss April → Mass loss October
Day of the year → Mean daily temperature in Furtwangen
Weight → mpg
Clay content in soil → Soil moisture at 10 cm depth
Hour of the day → Temperature in degrees Celsius
Clay content → Organic C content in soil
Hour of the day → Load (total electricity consumption in a region of Turkey)
Fine aggregate → Compressive strength
Initial speed → Subsequent speed
Average precipitation from 1948 to 2004 in mm/day → Average runoff from 1948 to 2004 in mm/day
(Mean daily air temperature) year 2000, day 50 → (Mean daily air temperature) year 2000, day 51
(Mean daily pressure at surface) year 2000, day 50 → (Mean daily pressure at surface) year 2000, day 51
(Mean daily sea-level pressure) year 2000, day 50 → (Mean daily sea-level pressure) year 2000, day 51
(Mean daily relative humidity) year 2000, day 50 → (Mean daily relative humidity) year 2000, day 51
Categorical (1: Sundays+Holidays or 2: working days) → Number of cars
Latitude of birth country’s capital → Life expectancy at birth
Population with sustainable access to improved drinking water sources → Infant mortality rate
Age → Wage per hour

References

1. Chen, Y., Rangarajan, G., Feng, J., Ding, M.: Analyzing multiple nonlinear time series with extended Granger causality. Phys. Lett. A 324(1), 26–35 (2004)
2. Chen, Z., Zhang, K., Chan, L.: Causal discovery with scale-mixture model for spatiotemporal variance dependencies. In: Advances in Neural Information Processing Systems, pp. 1727–1735 (2012)
3. Chen, Z., Zhang, K., Chan, L., Schölkopf, B.: Causal discovery via reproducing kernel Hilbert space embeddings. Neural Comput. 26(7), 1484–1517 (2014)
4. Daniusis, P., Janzing, D., Mooij, J., Zscheischler, J., Steudel, B., Zhang, K., Schölkopf, B.: Inferring deterministic causal relations. arXiv preprint arXiv:1203.3475 (2012)
5. Eberhardt, F.: Introduction to the foundations of causal discovery. Int. J. Data Sci. Anal., 1–11 (2017)
6. Gong, M., Zhang, K., Schölkopf, B., Tao, D., Geiger, P.: Discovering temporal causal relations from subsampled data. In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015) (2015)
7. Granger, C.W.: Investigating causal relations by econometric models and cross-spectral methods. Econom. J. Econom. Soc. 37, 424–438 (1969)
8. Granger, C.W.: Testing for causality: a personal viewpoint. J. Econ. Dyn. Control 2, 329–352 (1980)
9. Granger, C.W.: Some recent development in a concept of causality. J. Econom. 39(1), 199–211 (1988)
10. Gretton, A., Herbrich, R., Smola, A., Bousquet, O., Schölkopf, B.: Kernel methods for measuring independence. J. Mach. Learn. Res. 6, 2075–2129 (2005)
11. Gretton, A., Fukumizu, K., Teo, C.H., Song, L., Schölkopf, B., Smola, A.J., et al.: A kernel statistical test of independence. NIPS 20, 585–592 (2008)
12. Halpern, J.Y.: A modification of the Halpern–Pearl definition of causality. arXiv preprint arXiv:1505.00162 (2015)
13. Howards, P.P., Schisterman, E.F., Poole, C., Kaufman, J.S., Weinberg, C.R.: “Toward a clearer definition of confounding” revisited with directed acyclic graphs. Am. J. Epidemiol. 176(6), 506–511 (2012)
14. Hoyer, P.O., Shimizu, S., Kerminen, A.J., Palviainen, M.: Estimation of causal effects using linear non-Gaussian causal models with hidden variables. Int. J. Approx. Reason. 49(2), 362–378 (2008)
15. Hoyer, P.O., Janzing, D., Mooij, J.M., Peters, J., Schölkopf, B.: Nonlinear causal discovery with additive noise models. In: Advances in Neural Information Processing Systems, pp. 689–696 (2009)
16. Huang, B., Zhang, K., Schölkopf, B.: Identification of time-dependent causal model: a Gaussian process treatment. In: The 24th International Joint Conference on Artificial Intelligence, Machine Learning Track, pp. 3561–3568. Buenos Aires, Argentina (2015)
17. Hyvärinen, A., Oja, E.: Independent component analysis: algorithms and applications. Neural Netw. 13(4), 411–430 (2000)
18. Hyvärinen, A., Zhang, K., Shimizu, S., Hoyer, P.O.: Estimation of a structural vector autoregression model using non-Gaussianity. J. Mach. Learn. Res. 11, 1709–1731 (2010)
19. Janzing, D., Schölkopf, B.: Causal inference using the algorithmic Markov condition. IEEE Trans. Inf. Theory 56(10), 5168–5194 (2010)
20. Janzing, D., Hoyer, P.O., Schölkopf, B.: Telling cause from effect based on high-dimensional observations. arXiv preprint arXiv:0909.4386 (2009)
21. Janzing, D., MPG, T., Schölkopf, B.: Causality: objectives and assessment (2010)
22. Janzing, D., Mooij, J., Zhang, K., Lemeire, J., Zscheischler, J., Daniušis, P., Steudel, B., Schölkopf, B.: Information-geometric approach to inferring causal directions. Artif. Intell. 182, 1–31 (2012)
23. Jing, S., Satoshi, O., Haruhiko, S., Masahito, K.: Evaluation of causal discovery models in bivariate case using real world data. In: Proceedings of the International MultiConference of Engineers and Computer Scientists 2016, pp. 291–296 (2016)
24. Lichman, M.: UCI machine learning repository. http://archive.ics.uci.edu/ml (2013)
25. Lopez-Paz, D., Muandet, K., Schölkopf, B., Tolstikhin, I.: Towards a learning theory of cause-effect inference. In: Proceedings of the 32nd International Conference on Machine Learning. JMLR: W&CP, Lille, France (2015)
26. Mooij, J.M., Janzing, D., Schölkopf, B.: Distinguishing between cause and effect. In: NIPS Causality: Objectives and Assessment, pp. 147–156 (2010)
27. Mooij, J.M., Janzing, D., Zscheischler, J., Schölkopf, B.: Cause effect pairs repository. https://webdav.tuebingen.mpg.de/cause-effect/ (2014a)
28. Mooij, J.M., Peters, J., Janzing, D., Zscheischler, J., Schölkopf, B.: Distinguishing cause from effect using observational data: methods and benchmarks. arXiv preprint arXiv:1412.3773 (2014b)
29. Pearl, J.: Causality: Models, Reasoning and Inference. Cambridge University Press, Cambridge (2000)
30. Peters, J., Janzing, D., Gretton, A., Schölkopf, B.: Detecting the direction of causal time series. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 801–808. ACM, New York (2009)
31. Peters, J., Mooij, J., Janzing, D., Schölkopf, B.: Identifiability of causal graphs using functional models. arXiv preprint arXiv:1202.3757 (2012)
32. Peters, J., Janzing, D., Schölkopf, B.: Causal inference on time series using restricted structural equation models. In: Advances in Neural Information Processing Systems, pp. 154–162 (2013)
33. Rasmussen, C.E.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)
34. Rasmussen, C.E., Nickisch, H.: Gaussian processes for machine learning (GPML) toolbox. J. Mach. Learn. Res. 11, 3011–3015 (2010a)
35. Rasmussen, C.E., Nickisch, H.: GPML code. http://www.gaussianprocess.org/gpml/code/matlab/doc/ (2010b)
36. Scheines, R.: An introduction to causal inference (1997)
37. Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., Mooij, J.: On causal and anticausal learning. arXiv preprint arXiv:1206.6471 (2012)
38. Sgouritsa, E., Janzing, D., Hennig, P., Schölkopf, B.: Inference of cause and effect with unsupervised inverse regression. In: AISTATS (2015)
39. Shajarisales, N., Janzing, D., Schölkopf, B., Besserve, M.: Telling cause from effect in deterministic linear dynamical systems. arXiv preprint arXiv:1503.01299 (2015)
40. Shimizu, S., Bollen, K.: Bayesian estimation of causal direction in acyclic structural equation models with individual-specific confounder variables and non-Gaussian distributions. J. Mach. Learn. Res. 15(1), 2629–2652 (2014)
41. Shimizu, S., Hoyer, P.O., Hyvärinen, A., Kerminen, A.: A linear non-Gaussian acyclic model for causal discovery. J. Mach. Learn. Res. 7, 2003–2030 (2006)
42. Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., Hoyer, P.O., Bollen, K.: DirectLiNGAM: a direct method for learning a linear non-Gaussian structural equation model. J. Mach. Learn. Res. 12, 1225–1248 (2011)
43. Spirtes, P., Zhang, K.: Causal discovery and inference: concepts and recent methodological advances. Appl. Inform. 3, 3 (2016)
44. Spirtes, P., Glymour, C.N., Scheines, R.: Causation, Prediction, and Search. MIT Press, Cambridge (2000)
45. Stegle, O., Janzing, D., Zhang, K., Mooij, J.M., Schölkopf, B.: Probabilistic latent variable models for distinguishing between cause and effect. In: Advances in Neural Information Processing Systems, pp. 1687–1695 (2010)
46. Sun, X., Janzing, D., Schölkopf, B.: Causal inference by choosing graphs with most plausible Markov kernels. In: ISAIM (2006)
47. Sun, X., Janzing, D., Schölkopf, B.: Distinguishing between cause and effect via kernel-based complexity measures for conditional distributions. In: ESANN, pp. 441–446 (2007a)
48. Sun, X., Janzing, D., Schölkopf, B., Fukumizu, K.: A kernel-based causal learning algorithm. In: Proceedings of the 24th International Conference on Machine Learning, pp. 855–862. ACM, New York (2007b)
49. Sun, X., Janzing, D., Schölkopf, B.: Causal reasoning by evaluating the complexity of conditional densities with kernel methods. Neurocomputing 71(7), 1248–1256 (2008)
50. Tillman, R.E., Gretton, A., Spirtes, P.: Nonlinear directed acyclic structure learning with weakly additive noise models. In: Advances in Neural Information Processing Systems, pp. 1847–1855 (2009)
51. Weinberg, C.R.: Toward a clearer definition of confounding. Am. J. Epidemiol. 137(1), 1–8 (1993)
52. Zhang, K., Chan, L.W.: Extensions of ICA for causality discovery in the Hong Kong stock market. In: International Conference on Neural Information Processing, pp. 400–409. Springer, Berlin (2006)
53. Zhang, K., Hyvärinen, A.: Distinguishing causes from effects using nonlinear acyclic causal models. In: Journal of Machine Learning Research, Workshop and Conference Proceedings (NIPS 2008 Causality Workshop), vol. 6, pp. 157–164 (2008)
54. Zhang, K., Hyvärinen, A.: On the identifiability of the post-nonlinear causal model. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 647–655. AUAI Press, Corvallis (2009)
55. Zhang, K., Schölkopf, B., Janzing, D.: Invariant Gaussian process latent variable models and application in causal discovery. arXiv preprint arXiv:1203.3534 (2012)
56. Zhang, K., Wang, Z., Schölkopf, B.: On estimation of functional causal models: post-nonlinear causal model as an example. In: 2013 IEEE 13th International Conference on Data Mining Workshops, pp. 139–146. IEEE (2013)
57. Zhang, K., Zhang, J., Schölkopf, B.: Distinguishing cause from effect based on exogeneity. arXiv preprint arXiv:1504.05651 (2015)
