Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal...

19
Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess Simon G. Thompson * December 16, 2010 Abstract Mendelian randomization studies using genetic instrumental variables are now being commonly used to estimate the causal association of a phenotype on an outcome. Even when the necessary underlying assumptions are valid, estimates from analyses using instrumental variables are biased in finite samples. The source and nature of this bias appear poorly understood in the epidemiological field. We explain why the bias is in the direction of the confounded observational association, with magnitude relating to the statistical strength of association between the instrument and phenotype. We comment on the size of the bias, from simulated data, showing that when multiple instruments are used, although the variance of the instrumental variable estimator decreases, the bias increases. We discuss ways to analyse Mendelian randomization studies to alleviate the problem of weak instrument bias. Keywords: Mendelian randomization; instrumental variables; weak instruments; causal inference 1 Introduction In traditional observational epidemiological studies, associations between an exposure or phenotype and outcome are often biased by reverse causation and unmeasured confounding [1]. Instrumental variable (IV) techniques have been proposed as a way to assess the true causal association where direct experiment is not possible, and using genetic markers as IVs has been coined “Mendelian randomization” [2]. Mendel’s Second Law states that “unlinked or distantly linked segregating gene pairs assort independently at meiosis” [3]. Mendelian randomization studies therefore seek to use gene variants associated with the phenotype of interest to define ‘randomized’ subgroups of the population that have different levels of the phenotype, but do not differ systematically in any other factor. Methods based on genetic IVs are now being * Both authors are supported by the UK Medical Research Council (grant U.1052.00.001) 1

Transcript of Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal...

Page 1: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

Bias in causal estimates from Mendelianrandomization studies with weak instruments

Stephen Burgess Simon G. Thompson ∗

December 16, 2010

Abstract

Mendelian randomization studies using genetic instrumental variables are now beingcommonly used to estimate the causal association of a phenotype on an outcome.Even when the necessary underlying assumptions are valid, estimates from analysesusing instrumental variables are biased in finite samples. The source and nature of thisbias appear poorly understood in the epidemiological field. We explain why the bias isin the direction of the confounded observational association, with magnitude relatingto the statistical strength of association between the instrument and phenotype. Wecomment on the size of the bias, from simulated data, showing that when multipleinstruments are used, although the variance of the instrumental variable estimatordecreases, the bias increases. We discuss ways to analyse Mendelian randomizationstudies to alleviate the problem of weak instrument bias.

Keywords: Mendelian randomization; instrumental variables; weak instruments;causal inference

1 Introduction

In traditional observational epidemiological studies, associations between an exposureor phenotype and outcome are often biased by reverse causation and unmeasuredconfounding [1]. Instrumental variable (IV) techniques have been proposed as a wayto assess the true causal association where direct experiment is not possible, and usinggenetic markers as IVs has been coined “Mendelian randomization” [2]. Mendel’sSecond Law states that “unlinked or distantly linked segregating gene pairs assortindependently at meiosis” [3]. Mendelian randomization studies therefore seek touse gene variants associated with the phenotype of interest to define ‘randomized’subgroups of the population that have different levels of the phenotype, but do notdiffer systematically in any other factor. Methods based on genetic IVs are now being

∗Both authors are supported by the UK Medical Research Council (grant U.1052.00.001)

1

Page 2: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

extensively used in practice. For example, they have been used for estimating thecausal relationship of adiposity on bone mass [4], of lipoprotein(a) on coronary heartdisease [5], and of alcohol intake on blood pressure [6].

By finding a genotype (G) that is associated only with the phenotype of interest(X), we can form genotypic subgroups in the population which are analogous to treat-ment groups in a randomized trial [7]. These genotypic subgroups have different levelsof the phenotype, but do not have systematically different levels of any confounders(U). As genotype is determined at conception, any difference in outcome (Y) betweengenetic groups is due to long-term changes in phenotype levels and not to reversecausation [8]. Formally, an IV is a variable G which is [9]:

a) independent of any possible confounders (ie. G is independent of U),

b) associated with the exposure (ie. G is not independent of X) and

c) independent of the outcome given the exposure and confounders (ie. Y is inde-pendent of G given X and U).

These assumptions underlying an IV analysis are depicted in the directed acyclicgraph in Figure 1 [2, 8, 9]. There are many ways in which these assumptions maybe violated, which have been discussed at length elsewhere [2, 3]. Throughout thispaper, we assume that the assumptions are satisfied by the given instrument.

G X

U

Y1

2

1

2

Figure 1: Directed acyclic graph of Mendelian randomization assumptions

Several methods can be used to estimate the causal effect of change in outcomefor unit change in phenotype. In this paper, we focus on the ratio of coefficients (orWald) method [8, 10] and the two-stage least squares (2SLS) method [11]. In theratio of coefficients method, the causal estimate for the effect of X on Y is derived asthe ratio of the regression coefficient of Y on G to that of X on G. In 2SLS, first X isregressed on G to give fitted values X|G, and then the causal estimate is derived asthe regression coefficient of Y on these fitted values X|G. Both 2SLS and the ratiomethod are applicable with one instrument, in which case the causal estimates areidentical, but 2SLS can also be used with multiple instruments.

A ‘weak instrument’ is defined as an instrument for which the statistical evidenceof association with the phenotype (X) is not strong [8]. An instrument can be weakif it explains a small amount of the variation of the phenotype, where the amountdefined as ‘small’ depends on the sample size. The F statistic in the first stage

2

Page 3: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

regression of X on G is usually quoted as a measure of the strength of an instrument[12]. Although IV methods are asymptotically unbiased, they demonstrate systematicfinite sample bias. This bias, known as ‘weak instrument bias’, is in the direction of theconfounded observational association between phenotype and outcome, and dependson the strength of the instrument [13]. Weak instruments are also associated withunderestimated confidence intervals and poor coverage properties [14]. A generallyquoted criterion is that an instrument is weak if the F statistic in the G-X regressionis less than 10 [8, 15]. However, using instruments with F > 10 only reduces bias toless than a certain level, and problems with weak instrument bias still occur [12].

While the topic of weak instrument bias has been discussed for some years in econo-metrics [16, 17, 18], it is not well understood in relation to Mendelian randomization.This paper aims to translate some of the econometrics literature on weak instrumentbias into the language and context of epidemiology [19]. While economists typicallyuse tens or even hundreds of instruments [12], Mendelian randomization studies usu-ally use less than five and often only one instrument. With one instrument, the biasof the ratio IV estimator is undefined, as the estimator does not have a finite mean[13]. Economists often deal with time-series data comprising up to millions of obser-vations, but using the asymptotic distributions of IV estimators may not be valid inan epidemiological context [13].

In this paper, we demonstrate the direction and magnitude of the bias in IVestimation from simulated data, showing that it can be an important issue in practice(section 2). We explain why this bias comes about, why it is in the direction of theconfounded observational association and why it is related to instrument strength(section 3). We quantify the size of this bias and describe how important it wouldbe expected to be in a given application (section 4). When multiple genetic variantsare available, we show how the choice of IV affects the variance and bias of the IVestimator (section 5). We conclude by discussing alternative IV estimation methods(section 6) and analysis strategies for Mendelian randomization studies that reduceweak instrument bias.

2 Demonstrating the bias from IV estimators

We take a simple example of a confounded association, as in Figure 1. Phenotypexi for individual i is a linear combination of a genetic component gi which can takevalues 0 or 1, normally distributed confounder ui, and error εxi terms. Outcome yi isa linear combination of xi and ui with normally distributed error εyi. The true causalassociation of X on Y is represented by β1. To simplify, we have set the constantterms in the equations to be zero:

xi = α1 gi + α2 ui + εxi (1)

yi = β1 xi + β2 ui + εyi

ui ∼ N (0, σ2u); εxi ∼ N (0, σ2

x); εyi ∼ N (0, σ2y) independently

3

Page 4: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

We simulated 50 000 datasets from this model, each with 200 individuals dividedequally between the two genotypic subgroups, for a range of values of α1. We setβ1 = 0, α2 = 1, β2 = 1, σ2

u = σ2x = σ2

y = 1, corresponding to a true null causal associa-tion, but simply regressing Y on X yields a strong positive confounded observationalassociation of close to 0.5. We took 6 different values of α1 from 0.05 to 0.55, thusvarying the strength of the G-X association, corresponding to mean F statistic valuesbetween 1.07 and 8.65.

Causal estimates are calculated using the ratio method, although with a singlelinear instrument the estimates from the ratio and 2SLS methods are the same. Theresulting distributions for the estimate of the causal parameter β1 are shown in Fig-ure 2 and Table 1. Because the IV estimate is the ratio of two normally distributedrandom variables, it does not have a finite mean or variance; so we have expressedresults using quantiles. For smaller values of α1, there is a marked bias in the causalestimate in the positive direction. For the smallest value α1 = 0.05, the mean F statis-tic is barely above its null expectation of 1 and the median IV estimate is close to theconfounded observational estimate. For large values of α1, the causal estimates have askew distribution, with median close to zero but with more extreme causal estimatestending to take negative values. The F statistics vary greatly between simulationsfor each given α1, with an interquartile range of similar size to the mean value of thestatistic (Table 1). In practical applications therefore the F statistic from a singleanalysis is not necessarily a reliable guide to the underlying mean F statistic.

α1 Mean F statistic Quantiles: 2.5% 25% 50% 75% 97.5%(Observed IQ range)

0.05 1.07 (0.11 - 1.41) -10.5393 -0.3859 0.4686 1.3159 10.89180.15 1.58 (0.18 - 2.17) -9.2289 -0.4436 0.2870 0.9819 9.34050.25 2.59 (0.44 - 3.73) -6.4495 -0.4672 0.1296 0.5983 5.82670.35 4.10 (1.17 - 5.94) -4.0480 -0.4124 0.0456 0.3838 2.87760.45 6.12 (2.49 - 8.55) -2.4233 -0.3423 0.0108 0.2806 0.91670.55 8.65 (4.27 - 11.81) -1.5435 -0.2849 0.0002 0.2247 0.6417

Table 1: Quantiles of IV estimates of causal association β1 = 0 using weak instrumentsfrom simulated data

3 Explaining the bias from IV estimators

This bias can be explained in several ways. Firstly, there is a dependence betweenthe numerator and denominator in the ratio estimator. In the zero error case (σ2

x =σ2

y = 0) with true causal association of X on Y, and confounded association throughU, model (1) reduces to:

xi = α1 gi + α2 ui (2)

yi = β1 xi + β2 ui

ui ∼ N (0, σ2u)

4

Page 5: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

α1 = 0.05, F = 1.07

Den

sity

−2 −1 0 1 2

0.0

0.1

0.2

0.3

α1 = 0.15, F = 1.58

Den

sity

−2 −1 0 1 20.

00.

20.

4

α1 = 0.25, F = 2.59

Den

sity

−2 −1 0 1 2

0.0

0.2

0.4

0.6

α1 = 0.35, F = 4.10

Den

sity

−2 −1 0 1 2

0.0

0.2

0.4

0.6

α1 = 0.45, F = 6.12

Den

sity

−2 −1 0 1 2

0.0

0.4

0.8

α1 = 0.55, F = 8.65

Den

sity

−2 −1 0 1 2

0.0

0.4

0.8

Figure 2: Histograms of IV estimates of causal association β1 = 0 using weak instru-ments from simulated data; average F statistics for each value of α1 are shown.

5

Page 6: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

If uj is the average confounder level for genotypic subgroup j, an expression forthe causal association from the ratio method is

βIV = β1 +β2 ∆u

α1 + α2 ∆u

where ∆u = u1 − u0 is normally distributed with expectation zero. When the instru-ment is strong, α1 is large compared to α2 ∆u, and the expression βIV will be close toβ1. When the instrument is weak, α1 may be small compared to α2 ∆u, and the biasβIV − β1 close to β2

α2, which is the confounded observational association. This is true

whether ∆u is positive or negative. Figure 3 (left panel), reproduced from Nelsonand Startz [18], shows how the IV estimate bias varies with ∆u. Although for anynon-zero α1 the IV estimator will be an asymptotically consistent estimator as samplesize increases and ∆u → 0, a bias in the direction of the confounded association willbe present in finite samples. From Figure 3 (left panel), we can see that the medianbias will be positive, as it is positive when ∆u > 0 or ∆u < −α1

α2, which happens

with probability greater than 0.5. When the instrument is weak, the IV is measuringnot the systematic genetic variation in the phenotype, but the chance variation in theconfounders [13]. If there is independent error in x and y, then the picture is similar,but more noisy, as seen in Figure 3 (right panel). Under model (1), the expression forthe IV estimator is

βIV = β1 +β2 ∆u + ∆εy

α1 + α2 ∆u + ∆εx

where ∆εx = εx1 − εx0 and ∆εy = εy1 − εy0 defined as above.

−1.0 −0.5 0.0 0.5 1.0

−6

−4

−2

02

46

8

Difference in confounder: ∆u

Bia

s: β

IV−

β 1

−1.0 −0.5 0.0 0.5 1.0

−6

−4

−2

02

46

8

Difference in confounder: ∆u

Bia

s: β

IV−

β 1

Figure 3: Bias in instrumental variable estimator as a function of the difference inmean confounder between groups (α1 = 0.25, α2 = β2 = 1). Horizontal dottedline is at the confounded association β2

α2, and the vertical dotted line at ∆u = −α1

α2

where βIV is not defined. Left panel: no independent error in x or y, right panel:∆εx, ∆εy ∼ N (0, 0.12) independently.

This also explains the heavier negative tail in the histograms in Figure 2. Theestimator takes extreme values when the denominator α1 + α2 ∆u is close to zero.

6

Page 7: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

Taking parameters α1, α2 and β2 as positive, as in the example of Section 2, this isassociated with a negative value of ∆u, where the numerator of the ratio estimatorwill be negative. As ∆u has expectation zero, the denominator is more likely to besmall and positive than small and negative, giving more negative extreme values ofβIV than positive ones.

As a second explanation, we can think of the bias as a violation of the first IVassumption in a finite sample. Although a valid instrument will be asymptoticallyindependent from all confounders, in a finite sample there will be a non-zero correlationbetween the instrument and confounders. As before, this correlation biases the IVestimator towards the confounded association. This shows that even stochastic (ie.non-systematic) violation of the IV assumptions causes bias.

Thirdly, we can explain the bias graphically. We take model (1) with a negativecausal association between phenotype and outcome (β1 = −0.4), but with positiveconfounding (α2 = 1, β2 = 1, σ2

x = σ2y = 0.2, σ2

u = 1) giving a strong positive observa-tional association between phenotype and outcome. We performed 1000 simulationswith 600 subjects divided equally into 3 genotypic groups (gi ∈ {0, 1, 2}). We tookα1 = (0.5, 0.2, 0.1, 0.05), corresponding to mean F values of (100, 16, 4.7, 2.0). Themean levels of phenotype and outcome for each genotypic group are plotted (Figure 4),giving simulated density functions for each group. In each simulation, we effectivelydraw one point at random from each of these distributions; the gradient of the linethrough these three points is the 2SLS IV estimate. When the instrument is strong,the large phenotypic differences between the groups due to genotypic variation willgenerally lead to estimating a negative effect of phenotype on outcome, whereas whenthe instrument is weak the phenotypic differences between the groups due to geneticvariation are small and the original confounded positive association is more likely tobe recovered.

Weak instrument bias reintroduces the problem that IVs were developed to solve.Weak instruments may convince a researcher that the observational association whichthey have measured is a causal association [13]. The reason for the bias is that thevariation in the phenotype explained by the IV is too small compared to the variationin the phenotype caused by chance correlation between the IV and confounders.

4 Quantifying the bias from IV estimators

Bound et al. [13] use a power-series expansion to show that the bias in the IV es-timator is related to the F statistic (F ) from the G-X regression. They show thatas F decreases, the bias of the IV estimator approaches the bias of the confoundedassociation. Staiger and Stock [15] consider “weak instrument asymptotics”, whereas the sample size increases, the coefficients in the G-X regression tend to zero andspecifically are O(n−

12 ), where n is the sample size. Then, as the sample size tends to

infinity, the F statistic from the G-X regression tends to a finite limit. They considerthe relative mean bias B, which is the ratio of the bias of the IV estimator to the bias

7

Page 8: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

0.0 0.5 1.0

−0.

5−

0.3

−0.

10.

1

Strong instrument (F=100)

Phenotype

Out

com

e

−0.2 0.0 0.2 0.4 0.6

−0.

3−

0.1

0.0

0.1

Moderate instrument (F=16)

Phenotype

Out

com

e

−0.2 0.0 0.1 0.2 0.3 0.4

−0.

20−

0.10

0.00

0.10

Weak instrument (F=4.7)

Phenotype

Out

com

e

−0.2 0.0 0.1 0.2 0.3

−0.

15−

0.05

0.05

0.15

Very weak instrument (F=2.0)

Phenotype

Out

com

e

Figure 4: Simulation of mean outcome and mean phenotype level in three genotypicgroups for various strengths of instrument

8

Page 9: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

of the confounded association βOBS found by linear regression of Y on X:

B =E(βIV )− β1

E(βOBS)− β1

(3)

This measure has the advantage of invariance under change of units in Y. The relativemean bias in this case from the 2SLS method is asymptotically approximately equalto 1/F [15].

Stock and Yogo [14] have tested the accuracy of this approximation by tabulatinga series of critical values derived from simulations of the required F statistic to ensure,with 95% confidence, a relative mean bias of less than 5%, 10%, 20% and 30% with agiven number of instruments. They show that, while this approximation is reasonablefor a large number of instruments, it is less accurate when there are few instruments,as there typically are in an epidemiological context. Indeed, they do not estimate therelative mean bias when the number of instruments is less than 3, since the 2SLS IVestimator and hence B has no finite kth moment when the number of instruments isless than or equal to k [20].

With a single instrument, the mean of the IV estimator is not defined, so tocompare bias in this setting, we instead consider the relative median bias. This isa novel measure formed by replacing the expectations in equation (3) with mediansacross simulations [21].

B∗ =median(βIV )− β1

median(βOBS)− β1

(4)

To investigate the size of the bias when there are few instruments, we take bothmodel (1) with one genetic variable and a similar model except with three geneticvariables g1, g2 and g3:

xi =3∑1

α1k gik + α2 ui + εxi (5)

yi = β1 xi + β2 ui + εyi

ui ∼ N (0, σ2u); εxi ∼ N (0, σ2

x); εyi ∼ N (0, σ2y) independently

In model (5), each IV is taken as dichotomous, giving 8 possible genotype combina-tions. We simulated 100 000 datasets from this model for each set of parameters with200 individuals divided equally between the 8 genotypic subgroups, meaning that theinstruments are uncorrelated. Model (1) was treated similarly, except the 200 individ-uals were divided into 2 genotypic subgroups. We considered four scenarios coveringa range of typical situations, with σ2

x = σ2y = σ2

u = 1 throughout:

a) null causal effect, moderate positive confounding (β1 = 0, α2 = 1, β2 = 2);

b) null causal effect, strong positive confounding (β1 = 0, α2 = 1, β2 = 4);

9

Page 10: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

c) negative causal effect, moderate positive confounding (β1 = −1, α2 = 1, β2 = 2);

d) negative causal effect, moderate negative confounding (β1 = −1, α2 = 1, β2 =−2).

We took six values of α1 = α11 = α12 = α13 from 0.1 to 0.6, corresponding to differentstrengths of instrument with mean F3,196 values from 1.3 to 10.1. For each sample we

calculated the IV estimator βIV using the 2SLS method, and the confounded estimateβOBS by linear regression.

Table 2 shows how the relative mean and median bias across simulations vary fordifferent strengths of instrument. For three IVs, especially for stronger instruments,the relative median bias is larger than the relative mean bias. This is because the IVestimator has a negatively skewed distribution, as shown in Figure 2, and the skewnessis more marked as the instrument becomes stronger. We can see that 1/F seems tobe a good, if slightly conservative, estimate for the relative median bias, agreeingwith Staiger and Stock [15]. A mean F statistic of 10 would on average limit theIV estimator bias to 10% of the bias of the confounded association. For a single IV,the relative median bias is lower than for three IVs, and substantially so for strongerinstruments. Although the distribution of the IV estimator for a single instrument isskew and heavy-tailed, the relative median bias is around 5% or less for even fairlyweak instruments with F = 5 [21].

α1 0.1 0.2 0.3 0.4 0.5 0.6Mean F statistic 1.26 2.02 3.29 5.06 7.33 10.1

1/F 0.79 0.49 0.30 0.198 0.136 0.099a) Null causal effect, moderate positive confounding

Relative mean bias with 3 IVs 0.78 0.42 0.19 0.101 0.063 0.044Relative median bias with 3 IVs 0.79 0.46 0.26 0.163 0.112 0.080Relative median bias with 1 IV 0.78 0.38 0.15 0.038 0.010 0.002

b) Null causal effect, strong positive confoundingRelative mean bias with 3 IVs 0.79 0.41 0.19 0.099 0.062 0.042

Relative median bias with 3 IVs 0.79 0.46 0.26 0.162 0.109 0.079Relative median bias with 1 IV 0.77 0.38 0.14 0.041 0.011 0.002

c) Negative causal effect, moderate positive confoundingRelative mean bias with 3 IVs 0.79 0.42 0.20 0.100 0.061 0.042

Relative median bias with 3 IVs 0.79 0.47 0.27 0.164 0.110 0.081Relative median bias with 1 IV 0.77 0.40 0.15 0.040 0.011 0.002

d) Negative causal effect, moderate negative confoundingRelative mean bias with 3 IVs 0.79 0.41 0.19 0.098 0.062 0.044

Relative median bias with 3 IVs 0.80 0.46 0.26 0.160 0.110 0.081Relative median bias with 1 IV 0.80 0.39 0.15 0.035 0.011 -0.000

Table 2: Relative mean and median bias of the 2SLS IV estimator across 100 000simulations for different strengths of instrument using three IVs and one IV. MeanF3,196 and F1,198 statistics are equal to 2 decimal places.

Although these results show that for a mean F value of 10 we have a relative

10

Page 11: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

median bias of less than 10%, there is no guarantee that if we have observed an Fstatistic of 10 or greater from data that the mean value is 10 or greater. From Table 1,for a mean F value of 4.10, we observe an F value greater than 10 in 8% of simulations,and for a mean F value of 6.12 in 18% of simulations.

5 Choosing a suitable IV estimator

5.1 Multiple candidate IVs

When there are several valid instruments, the choice of IVs used as the instrumentaffects the bias and variability of the IV estimator. Typically, genetic IVs are singlenucleotide polymorphisms (SNPs), which have three different levels. Using a modelidentical to (5) except with six instruments representing SNPs taking values in {0,1, 2} with independent minor allele frequencies of 0.3, we simulated 100 000 datasetsfrom this model with 200 individuals, and as in case a) setting β1 = 0, α2 = 1, β2 =2, σ2

x = σ2y = σ2

u = 1. We sampled random values of α1(= α1k for k = 1, . . . , 6)uniformly from 0.4 to 0.6, corresponding to mean values of F1,199 between 7 and 12.We calculated the causal estimate using each of the IVs individually, all combinationsof two, three, four and five IVs and all IVs together by the 2SLS method. We alsocalculated the causal estimate using only the one IV with the greatest F statistic thatexplained most of the variation in the phenotype in each simulation, and similarlyusing the IV with the lowest F statistic.

Estimate using Quantiles: 25% 50% 75% Mean a Error (MAB) Mean F

1 IV -0.3841 0.0040 0.3008 - 0.3336 9.42 IVs -0.1938 0.0490 0.2553 0.0006 0.2305 9.83 IVs -0.1281 0.0627 0.2313 0.0355 0.1895 10.24 IVs -0.0931 0.0694 0.2159 0.0507 0.1664 10.75 IVs -0.0704 0.0737 0.2053 0.0594 0.1516 11.26 IVs -0.0540 0.0766 0.1976 0.0651 0.1414 11.7

IV with greatest F 0.0404 0.2452 0.4442 - 0.2834 18.1IV with lowest F -1.3392 -0.5864 -0.0558 - 0.7024 3.0

Table 3: Median and quartiles of bias, mean bias and median absolute bias (MAB)of the 2SLS IV estimate of β1 = 0 across 100 000 simulations using combinations ofsix uncorrelated instruments

aMean bias is reported only when it is not theoretically infinite.

Table 3 shows that the more instruments that are used, the greater the medianand mean bias of the IV estimator in the direction of the confounded observationalestimate, even though the mean F statistic is marginally increasing. This is trueeven when, as in this case, the instruments are not correlated and are explainingindependent variation in phenotype levels. The reason is that, with more IVs, thedata are being subdivided in more different ways, and so there is more chance of oneof these divisions giving genotypic groups with different average levels of confounders.

11

Page 12: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

However, the more instruments that are used, the smaller the variability of the IVestimator. This is because a greater proportion of the variance in the phenotype isbeing modelled.

We therefore have a situation analogous to a bias–variance trade-off [22]. As analternative to the mean squared error, we suggest using the median absolute bias(MAB) (median |βIV − β1|) as a criterion for how many instruments should be used.Table 3 shows that in this case, despite the increase in the bias, the estimate usingall six IVs is preferred.

In the simulation, each IV in truth explains the same amount of variation inthe phenotype. If however the IVs used are chosen because they explain a largeproportion of the variation in the phenotype in the data under analysis, then theestimate using these IVs is additionally biased. This is because the IVs explainingthe most variation will be overestimating the proportion of true variation explained,due to chance correlation with confounders overestimating the underlying differencebetween genotypic groups in both phenotype and outcome, leading to an overestimateof the causal association in the direction of the confounded observational association.In the notation of Section 3, ∆u is large and, having the same sign as α1, leads to anestimate biased in the direction of β2

α2. Similarly, if the IV with the least F statistic

is used as an instrument, the IV estimator will be biased in the opposite directionto the observational association. These characteristics are evident in the simulations(Table 3). A commonly used rule for the validity of an IV is that the measured Fstatistic is greater than 10. However, if this rule is used to choose between instruments,this rule itself introduces a selection bias [23]. In this simulation, if we only considerthe causal estimates using 1 IV (first row of Table 3) and restrict attention to thoseanalyses with F > 10, the median estimate is 0.1874 (IQ range: -0.0303 to 0.3988).

When multiple instruments are used, a common econometric tool is an overiden-tification test [11], such as the Sargan test [24]. This is a test for incompatibility ofestimates based on different instruments, and can be used to test validity of the IVassumptions in a dataset. While this can be useful in indicating possible bias fromviolation of the underlying IV assumptions, it does not identify bias from the finite-sample violation of the IV assumptions due to weak instruments. For the data inTable 3, 6% of the simulations failed the Sargan test at p < 0.05, consistent with avalid instrument. While the median estimate using all six IVs in simulations whichfailed the Sargan test was 0.1779, the median estimation in simulations which passedthe Sargan test was 0.0693, close to the overall median of 0.0766. Overidentifica-tion tests are omnibus tests, where the alternative hypothesis includes failure of IVassumptions for one IV, failure for all IVs, and non-linear association between pheno-type and outcome. Hence, while the test can recognize problems with the model, ithas limited use to combat weak instruments.

5.2 Model of genetic association

As the magnitude of weak instrument bias depends on the F statistic, models forthe G-X association which give larger F statistics would be preferred. A model ofgenetic association with one parameter (for example a dominant/recessive model or

12

Page 13: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

per-allele model) will typically have a greater F statistic than model with a separatecoefficient for each level of the SNP (here called a categorical model). However, if thetrue model is misrepresented by the simpler model, then the latter will introduce biasdue to model misspecification.

To investigate this we modified model (5) with three instruments and model (1)with one instrument, so that the genetic association is not necessarily additive:

xi =K∑1

(α1k gik + dk 1(gik = 2) + α2 ui + εxi (6)

where 1(. . .) is an indicator function, gik ∈ {0, 1, 2} for all i, k and K = 1 or 3. Weconducted 100 000 simulations using the same parameters as Section 5.1, with α1k

fixed at 0.5 for all k and dk taking values 0 (true additive model), +0.2 (superad-ditive model) and −0.2 (subadditive model). With three instruments, the geneticinstruments divide the population of size 243 into 27 equally sized subgroups. Withone instrument, the population is divided into subgroups of size 108, 108 and 27,corresponding to a SNP with minor allele frequency 1

3.

3 IVs 1 IVMedian bias Error (MAB) Mean F Median bias Error (MAB) Mean F

Additive: categorical 0.038 0.087 11.2 0.056 0.218 7.8Per-allele 0.016 0.086 21.4 0.000 0.228 14.6

Superadditive: categorical 0.027 0.072 15.9 0.045 0.200 9.9Per-allele 0.011 0.072 30.3 0.000 0.207 18.5

Subadditive: categorical 0.056 0.109 7.7 0.068 0.240 6.3Per-allele 0.024 0.108 14.0 0.002 0.253 11.3

Table 4: Median bias and median absolute bias (MAB) of 2SLS IV estimate of β1 = 0 and meanF statistic across 100 000 simulations using per-allele and categorical modelling assumptions for trueadditive, superadditive and subadditive models

We analysed data generated assuming additivity between instruments and eithera per-allele model (1 instrument per SNP) or a categorical model (2 instruments perSNP) for each IV. Table 4 shows that the per-allele model has lower median biasthan the categorical model even when the underlying genetic model is misspecified.The median absolute bias is similar in each model, with a slight preference for thecategorical model with a single instrument. The categorical model suffers from greaterweak instrument bias because the mean F statistic is smaller. This indicates that,where the genetic model is approximately additive, the more parsimonious per-allelemodel should be preferred over a categorical model, as the gain in precision would notseem to justify the increase in bias.

13

Page 14: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

6 Bias using different IV methods

There are several methods for calculating IV estimates, some of which are more robustto weak instruments than others. We here comment on the 2SLS, limited informa-tion maximum likelihood (LIML) [25] and the Fuller(1) methods [26] as they can becalculated using the ivreg2 command in Stata [11].

The LIML estimator is close to median unbiased for all but the weakest instrumentsituations [15, 27]. With one IV, the estimate from LIML coincides with that fromthe ratio and 2SLS methods. However, Hahn, Hausman and Kuersteiner [28] stronglydiscourage the use of the LIML estimator, as it does not have defined moments for anynumber of instruments, as opposed to the 2SLS estimator, which has a finite variancewhen there are three or more instruments. The Fuller(1) estimator is an adaption ofthe LIML method [28], which again has better weak instrument properties than 2SLS[12], and is designed to have all moments, even with a single instrument [12, 27].

To investigate these properties, we conduct a simulation using the same parametersas in Section 4, analysing 100 000 simulations with 1 and 3 instruments using the 2SLS,LIML and Fuller(1) methods for instruments with α1 = 0.2, 0.4, 0.6. Table 5 showshow, with three IVs, the median bias is close to zero for LIML with instrumentswith mean F statistic greater than 5, whereas it is large for the 2SLS and Fuller(1)methods. For instruments with F close to 10, the mean bias of the Fuller(1) estimatoris close to zero. With one IV, as before, the 2SLS / LIML estimator is approximatelymedian unbiased with a mean F statistic of 10, whereas the Fuller(1) estimate stillshows considerable median and mean bias with a mean F statistic of 10.

This simulation shows a trade-off amongst IV methods between asymptotic andfinite sample properties. The LIML method performs best overall in terms of medianbias, even though mean bias is always undefined. However, methods with finite meanbias perform badly in terms of median bias. Absence of a finite mean would seem tobe more of a mathematical curiosity than a practical problem, as extreme values of thecausal estimate would generally be discarded due to implausibility and finite-samplefailure of the second IV assumption (non-zero G-X association) in the dataset.

7 Discussion

In this paper, we have seen that the IV estimator of causal association of phenotypeon outcome is biased in the direction of the confounded observational associationof phenotype and outcome. We have shown by simulation and using a variety ofexplanations that the magnitude of this bias depends on the statistical strength ofthe association between instrument and phenotype. When multiple instruments areused, we found in our simulations that the median size of the bias of the IV estimatoris approximately 1/F of the bias in the observational association, where F is the meanF statistic from the regression of phenotype on instrument. So a mean F statistic of10 limits the median relative bias to less than 10%. When a single instrument is used,a mean F statistic of 5 would seem to be sufficient to ensure median relative bias isabout 5%, and a mean F statistic greater than 10 ensures negligible bias from weak

14

Page 15: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

3 IVs 1 IV a

α1 = 0.2 α1 = 0.4 α1 = 0.6 α1 = 0.2 α1 = 0.4 α1 = 0.6a) Null causal effect, moderate positive confounding

Median bias2SLS 0.4556 0.1526 0.0702 }

0.3872 0.0401 0.0022LIML 0.1617 0.0033 -0.0017

Fuller(1) 0.3888 0.0858 0.0353 0.7324 0.2863 0.1159

Mean bias b 2SLS 0.4129 0.0935 0.0374 - - -Fuller(1) 0.4248 0.0338 0.0006 0.7091 0.2661 0.0673

b) Null causal effect, strong positive confounding

Median bias2SLS 0.9081 0.3121 0.1414 }

0.7692 0.0850 0.0041LIML 0.2899 0.0127 -0.0004

Fuller(1) 0.7675 0.1757 0.0718 1.4530 0.5678 0.2260

Mean bias2SLS 0.8217 0.1916 0.0761 - - -

Fuller(1) 0.8376 0.0721 0.0034 1.4235 0.5347 0.1297c) Negative causal effect, moderate positive confounding

Median bias2SLS 0.4571 0.1531 0.0715 }

0.3908 0.0399 0.0019LIML 0.1601 0.0028 0.0014

Fuller(1) 0.3915 0.0858 0.0376 0.7319 0.2860 0.1145

Mean bias2SLS 0.4096 0.0927 0.0391 - - -

Fuller(1) 0.4223 0.0339 0.0030 0.7083 0.2682 0.0643d) Negative causal effect, moderate negative confounding

Median bias2SLS -0.4555 -0.1545 -0.0706 }

-0.3842 -0.0413 -0.0020LIML -0.1580 -0.0035 0.0004

Fuller(1) -0.3858 -0.0862 -0.0360 -0.7297 -0.2882 -0.1158

Mean bias2SLS -0.4076 -0.0930 -0.0385 - - -

Fuller(1) -0.4214 -0.0339 -0.0019 -0.7086 -0.2694 -0.0659

Table 5: Median and mean bias across 100 000 simulations using 2SLS, LIML andFuller(1) methods for a range of strength of 3 and 1 instruments

aWith 1 IV, the estimates from 2SLS and LIML coincide.bMean bias is reported only when it is not theoretically infinite.

15

Page 16: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

instruments. A limitation of this conclusion is that, unlike for the relative mean bias,there is no theoretical basis for this approximation and we have undertaken only alimited simulation exercise.

As the magnitude of the bias depends on the mean F statistic, strategies thatincrease the mean F statistic should reduce weak instrument bias. F values are de-pendent on the sample size, so an IV can be strengthened by increasing sample size.For a given number of instruments k and coefficient of determination R2, a samplesize n gives F = n−k−1

kR2

1−R2 ; for constant R2, F thus increases approximately linearlywith sample size [29]. Larger sample sizes will also increase the precision of the IVestimator.

Similarly, unless substantially more of the variation in the phenotype is explainedby the additional IVs, the mean F statistic will decrease as more parameters are usedto model the genetic association with the phenotype. Provided the more parsimoniousmodels do not seriously misrepresent the data, their corresponding IV estimators willbe less biased. Using the example of a SNP with three levels as the IV, we sawthat even under moderate model misspecification, the more parsimonious model hadreduced bias.

If there are measured covariates that explain variation in the phenotype, which arenot correlated with the instrument and not on the causal pathway between phenotypeand outcome, then we can incorporate such covariates in our G-X model. This willboth increase precision in the causal estimate [21] and reduce weak instrument bias, asthe relevant F statistic (now a partial F statistic [30]) will increase. Precision will befurther increased if these covariates can be used to explain variation in the outcome.

When multiple instruments are used, the IV estimator reduces in variability com-pared to using any one of the instruments, but increases in bias. Hence there is atrade-off between bias and variability when choosing the number of instruments touse. Choosing instruments on the basis of strength in the data, such as by a measuredF statistic, increases bias. However, using instruments which have low true mean Fvalues will also lead to biased estimates. If available, we recommend determiningsample size for a Mendelian randomization investigation using the expected per-alleledifference in phenotype and the minor allele frequency for the IV from an independentlarge sample, in order to calculate the expected F statistic for the study.

Finally, we emphasize that the use of a genetic instrument in Mendelian random-ization relies on certain assumptions. In this paper, we have assumed, although thesemay fail in finite samples, that they hold asymptotically. If these assumptions donot hold, for example if there were a true correlation between the instrument and aconfounder, then IV estimates can be entirely misleading and “the cure can be worsethan the disease” [31].

We conclude with the following advice to practitioners:

• Genetic instruments used in Mendelian randomization studies should be pre-specified prior to data collection and analysis. The underlying IV assumptionsshould be tested carefully, as far as is possible.

• Sample sizes should be chosen where possible to ensure that the expected valueof F is greater than 10. F statistics should be quoted when reporting study

16

Page 17: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

results, but F statistics or other measures of instrument strength in the datashould not determine the analysis used.

• Where there are potential problems with weak instruments, results should beassessed for bias by sensitivity analyses, using various assumptions in the modelfor genetic association and methods of IV analysis such as LIML.

References

[1] Davey Smith G, Ebrahim S. Mendelian randomization: prospects, potentials,and limitations. Int. J. Epidemiol. 2004; 33(1):30–42, doi:10.1093/ije/dyh132.

[2] Didelez V, Sheehan N. Mendelian randomization as an instrumental variableapproach to causal inference. Statistical Methods in Medical Research 2007;16(4):309–330, doi:10.1177/0962280206077743.

[3] Davey Smith G, Ebrahim S. ‘Mendelian randomization’: can genetic epidemiol-ogy contribute to understanding environmental determinants of disease? Int. J.Epidemiol. 2003; 32(1):1–22, doi:10.1093/ije/dyg070.

[4] Timpson N, Sayers A, Davey-Smith G, Tobias J. How does body fat influencebone mass in childhood? A Mendelian randomization approach. Journal of Boneand Mineral Research 2009; 24:522–533, doi:10.1359/jbmr.081109.

[5] Kamstrup P, Tybjaerg-Hansen A, Steffensen R, Nordestgaard B. Genetically el-evated lipoprotein(a) and increased risk of myocardial infarction. JAMA 2009;301(22):2331–2339, doi:10.1001/jama.2009.801.

[6] Chen L, Davey Smith G, Harbord R, Lewis S. Alcohol intake and blood pressure:a systematic review implementing a Mendelian randomization approach. PLoSMed 2008; 5(3):e52, doi:10.1371/journal.pmed.0050052.

[7] Davey Smith G, Ebrahim S. What can Mendelian randomisation tell usabout modifiable behavioural and environmental exposures? BMJ 2005;330(7499):1076–1079, doi:10.1136/bmj.330.7499.1076.

[8] Lawlor D, Harbord R, Sterne J, Timpson N, Davey Smith G. Mendelian random-ization: using genes as instruments for making causal inferences in epidemiology.Statistics in Medicine 2008; 27(8):1133–1163, doi:10.1002/sim.3034.

[9] Greenland S. An introduction to instrumental variables for epidemiologists. Int.J. Epidemiol. 2000; 29(4):722–729, doi:10.1093/ije/29.4.722.

[10] Wald A. The fitting of straight lines if both variables are subject to error. TheAnnals of Mathematical Statistics 1940; 11(3):284–300.

[11] Baum C, Schaffer M, Stillman S. Instrumental variables and GMM: Estimationand testing. Stata Journal 2003; 3(1):1–31.

17

Page 18: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

[12] Stock J, Wright J, Yogo M. A survey of weak instruments and weak identificationin generalized method of moments. Journal of Business and Economic StatisticsOctober 2002; 20(4):518–529, doi:10.1198/073500102288618658.

[13] Bound J, Jaeger D, Baker R. Problems with instrumental variables estimationwhen the correlation between the instruments and the endogenous explana-tory variable is weak. Journal of the American Statistical Association 1995;90(430):443–450.

[14] Stock J, Yogo M. Testing for weak instruments in linear IV regression. SSRNeLibrary 2002; 11:T0284.

[15] Staiger D, Stock J. Instrumental variables regression with weak instruments.Econometrica 1997; 65(3):557–586.

[16] Richardson D. The exact distribution of a structural coefficient estimator. Journalof the American Statistical Association 1968; 63(324):1214–1226.

[17] Sawa T. The exact sampling distribution of ordinary least squares and two-stageleast squares estimators. Journal of the American Statistical Association 1969;64(327):923–937.

[18] Nelson C, Startz R. The distribution of the instrumental variables estimatorand its t-ratio when the instrument is a poor one. Journal of Business 1990;63(1):125–140.

[19] Lawlor D, Windmeijer F, Davey Smith G. Is Mendelian randomization ‘lost intranslation?’: Comments on ‘Mendelian randomization equals instrumental vari-able analysis with genetic instruments’ by Wehby et al. Statistics in Medicine2008; 27(15):2750–2755, doi:10.1002/sim.3308.

[20] Donald S, Newey W. Choosing the number of instruments. Econometrica 2001;69(5):1161–1191.

[21] Angrist J, Pischke J. Mostly harmless econometrics: an empiricist’s companion,Chapter 4 - Instrumental variables in action: sometimes you get what you need.Princeton Univ Pr, 2009.

[22] Zohoori N, Savitz D. Econometric approaches to epidemiologic data: Relating en-dogeneity and unobserved heterogeneity to confounding. Annals of Epidemiology1997; 7(4):251–257, doi:10.1016/S1047-2797(97)00023-9.

[23] Hall A, Rudebusch G, Wilcox D. Judging instrument relevance in instrumentalvariables estimation. International Economic Review 1996; 37(2):283–298.

[24] Sargan J. The estimation of economic relationships using instrumental variables.Econometrica 1958; 26(3):393–415.

18

Page 19: Bias in causal estimates from Mendelian randomization studies … · 2014. 3. 6. · Bias in causal estimates from Mendelian randomization studies with weak instruments Stephen Burgess

[25] Davidson R, MacKinnon J. Estimation and Inference in Econometrics. OxfordUniversity Press, USA, 1993.

[26] Fuller W. Some properties of a modification of the limited information estimator.Econometrica 1977; 45(4):939–953.

[27] Hansen C, Hausman J, Newey W. Estimation with many instrumental vari-ables. Journal of Business and Economic Statistics 2008; 26(4):398–422, doi:10.1198/073500108000000024.

[28] Hahn J, Hausman J, Kuersteiner G. Estimation with weak instruments: accu-racy of higher-order bias and MSE approximations. Econometrics Journal 2004;7(1):272–306, doi:10.1111/j.1368-423X.2004.00131.x.

[29] Dobson A. An introduction to generalized linear models. Chapman & Hall, 2001,doi:10.1201/9781420057683.

[30] Shea J. Instrument relevance in multivariate linear models: A simplemeasure. Review of Economics and Statistics 1997; 79(2):348–352, doi:10.1162/rest.1997.79.2.348.

[31] Bound J, Jaeger D, Baker R. The cure can be worse than the disease: A cau-tionary tale regarding instrumental variables. Technical Report 137, NBER June1993.

19