Exploratory quantile regression with many covariates: An application to adverse birth outcomes
Lane F. Burgette1, Marie Lynn Miranda2, and Jerome P. Reiter1
1Department of Statistical Science, Duke University
2Nicholas School of the Environment and Department of Pediatrics, Duke University
April 21, 2011
Abstract
Covariates may affect continuous responses differently at various points of the
response distribution. For example, some exposure might have minimal impact on
conditional means, whereas it might lower conditional 10th percentiles sharply. Such
differential effects can be important to detect. In studies of the determinants of birth
weight, for instance, it is critical to identify exposures like the one above, since low
birth weight is a risk factor for later health problems. Effects of covariates on the
tails of distributions can be obscured by models that estimate conditional means like
linear regression; however, they can be detected via quantile regression. We present
two approaches for exploring high-dimensional predictor spaces to identify impor-
tant predictors for quantile regression. These are based on the lasso and elastic net
penalties. We apply the approaches to a prospective cohort study of adverse birth
outcomes that includes a wide array of demographic, medical, psychosocial, and en-
vironmental variables. While tobacco exposure is known to be associated with lower
birth weights, the analysis suggests an interesting interaction effect not previously re-
ported: tobacco exposure may act more strongly on the 20th and 30th percentiles of
birth weight when mothers have high levels of lead in their blood compared to when
mothers have low blood lead levels.
1 Introduction
In large epidemiological cohort studies, researchers often record data on dozens or even
hundreds of variables that are potentially relevant for regression models. When interac-
tion effects are suspected, the number of potentially relevant predictors can even exceed
the sample size. With such high-dimensional covariate spaces, specifying the predictors to
include in models can be daunting. Often, analysts use automated model selection proce-
dures to aid in finding potentially important predictors, such as stepwise regression and its
variants,1 regression trees,2 and penalized regression approaches.3,4
Most of these techniques are designed for estimating conditional means of the response
variable. Often, however, the effects of covariates on percentiles of the response distribu-
tion are relevant. For example, if some exposure decreases the birth weights of babies who
otherwise would have average to high birth weights by some modest amount, but it does
not decrease the birth weights of babies who otherwise would have low birth weights, it
might not be considered deleterious. On the other hand, an exposure would be troubling
if it lowers the birth weights of babies who already would have low birth weights, even if
it does not have a large impact on babies who otherwise would have average to high birth
weights.
Figure 1 illustrates this scenario graphically using artificial data. In each panel, the
mean response decreases by 300 units as the covariate moves from zero to one. But, at the
fifth response percentile, the effect of the covariate could be nearly absent (panel A) or
exaggerated (panel B). For birth weights, the impact of the covariate in panel B would be
more problematic than that of panel A.
Analysts can estimate such differential effects with quantile regression.5 Unlike least
squares regression, quantile regression lets covariates affect more than the mean structure.
However, analysts still must decide which among many possible covariates and interactions
to include in the model. We present two approaches for identifying potentially important
predictors in quantile regression that are particularly useful for high dimensional covari-
[Figure 1: two scatter plots of Response versus Covariate, panels A and B]
Figure 1: Artificial examples with the same true mean structure. The least squares fits are similar (solid lines). Quantile regressions of the 5th percentile reveal very different tail behavior (dashed lines).
ate spaces. First, we estimate quantile regression coefficients subject to the lasso penalty.4
This penalty forces many coefficients to equal zero, leaving the most important ones to
be nonzero. Second, we implement a quantile regression version of the elastic net.6 Like
lasso, the elastic net encourages coefficients to equal zero. But, it more effectively identifies
groups of important predictors that are highly correlated with each other.
Our exploratory framework complements existing methodology for penalized quantile
regression.7,8,9,10,11 Specifically, we adapt Zhao and Yu’s12 boosting technique to provide
computationally expedient and theoretically justified algorithms for penalized quantile re-
gression. Using both simulated data and data from a prospective study on the determi-
nants of birth weights, we demonstrate how epidemiologists can use these algorithms to
guide model specification for quantile regression in high dimensional settings.
2 Methods
Standard linear regressions are of the form
yi = xi′β + εi,  i = 1, . . . , n    (1)

where εi are independent and identically distributed errors with mean zero, β is a vector of regression coefficients with length p, and xi′ is a row vector of covariates for the ith individual. The linear regression model implies that the distribution of y shifts in its mean but not in its shape for different xi.
Quantile regression allows for both the mean and shape of the distribution of y to change
with x, without assuming a particular error distribution. Hence, it is more flexible than
linear regression. Quantile regression uses the same basic model as (1), but assumes that
εi are independent errors whose τth quantile is equal to zero, i.e., Pr(εi ≤ 0) = τ. Thus,
xi′β is interpreted as the conditional τth quantile of y given xi.
Standard quantile regression does not require a formal probability model.5 Instead, coefficients are estimated by minimizing the empirical loss function

L(β) = Σ_{i=1}^n ρτ(yi − xi′β),    (2)

where

ρτ(a) = τ|a| if a ≥ 0,  and  ρτ(a) = (1 − τ)|a| if a < 0.    (3)
Techniques exist for estimating standard errors of the estimated coefficients, testing hy-
potheses, and constructing interval estimates. Koenker and Hallock22 provide an accessible
introduction to the topic with several case studies, including an application to birth weight
data. However, they do not provide guidance on navigating the high dimensional covariate
spaces that motivate this article.
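The check loss in (2)–(3) can be explored numerically. The sketch below (plain NumPy; the data and variable names are our own illustration, not from the paper) verifies that, for an intercept-only model, minimizing the empirical check loss over a constant recovers the τth sample quantile:

```python
import numpy as np

def rho(a, tau):
    """Check loss of equation (3): tau*|a| if a >= 0, else (1 - tau)*|a|."""
    return np.where(a >= 0, tau * np.abs(a), (1 - tau) * np.abs(a))

def empirical_loss(c, y, tau):
    """Empirical loss of equation (2) for an intercept-only model."""
    return rho(y - c, tau).sum()

rng = np.random.default_rng(1)
y = rng.normal(loc=3200, scale=500, size=2000)   # synthetic "birth weights" in grams

tau = 0.25
grid = np.linspace(y.min(), y.max(), 4001)
losses = [empirical_loss(c, y, tau) for c in grid]
c_hat = grid[np.argmin(losses)]

# The minimizer of the empirical check loss is (up to grid resolution)
# the 25th sample percentile.
print(c_hat, np.quantile(y, tau))
```

With a covariate matrix in place of the constant, the same loss yields the regression quantiles of (2).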
Quantile regression has advantages for modeling birth outcomes over other regression
approaches. Categorizing birth outcomes into low and high values23 or a small number of
ordered categories24 is common in this field. But, using a logistic regression on an indica-
tor for low birth weight (less than 2500g) treats a birth at 2499g as fundamentally differ-
ent from a birth at 2501g; it also treats a birth at 2499g like one at 2000g. Neither treat-
ment is scientifically or clinically defensible.
Relatedly, least squares regression assumes that changing covariates merely shifts the
conditional distribution. Although the marginal distribution of birth weights is nearly
Gaussian, its conditional distribution is known to do more than shift with covariates. For
example, baby boys tend to weigh more than girls, but this difference attenuates in the
lower quantiles: small baby boys and girls are much closer in weight than a linear regres-
sion would suggest.22 Quantile regression can illuminate such changing effects.
We now present two approaches for exploring high dimensional quantile regression co-
variate spaces. We emphasize that these are exploratory techniques for identifying useful
models and do not provide formal inference.
2.1 Boosted lasso for quantile regression
Model selection techniques operate by penalizing large models. For example, in linear re-
gression, one may select the minimum Akaike Information Criterion (AIC)25 or Bayesian
Information Criterion (BIC)26 model. Adding a predictor penalizes these log likelihood-
based criteria values by two and log(n), respectively. Thus, the penalty increases with the
number of included predictors p.
The lasso method uses a penalty related to the sum of the absolute values of the re-
gression coefficients.4 Using simplified notation, lasso solutions for β minimize the function
Γ(β; λ1) = L(β) + λ1 Σ_{j=1}^p |βj|,    (4)

for an empirical loss function L(β). For large values of λ1, many components of β are estimated to equal zero. As λ1 shrinks toward zero, the estimates of β move towards an unpenalized estimate. Equivalently, one can solve (4) by minimizing L(β) subject to the restriction Σ_j |βj| ≤ t1 for a value of t1 that corresponds to λ1.4 We follow the convention of displaying lasso solutions using this representation.
Thus, lasso minimizes a loss function subject to a constraint on the length of the esti-
mated coefficients. AIC and BIC minimize the loss function subject to a constraint on the
number of nonzero coefficients, and then choose an appropriate number of nonzero compo-
nents. Although these methods are similar in spirit, enumerative minimization of AIC or
BIC is impractical in high dimensional applications. For example, our motivating analy-
sis equates to more than 600 possible predictors. Even restricting a search to models with
10 or fewer predictors results in more than 10^21 possible models. In contrast, there exist
efficient methods to find lasso solutions, even in high dimensions.27
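The model count cited above is simple to verify: with 600 candidate predictors, the number of distinct models with at most 10 predictors is Σ_{k=0}^{10} C(600, k). A quick check using only Python's standard library (the figure of 600 predictors is taken from the text):

```python
import math

p = 600            # candidate predictors in the motivating analysis
max_size = 10      # restrict the search to models with 10 or fewer predictors

# Count all subsets of the predictors with size 0 through max_size.
n_models = sum(math.comb(p, k) for k in range(max_size + 1))
print(f"{n_models:.3e}")   # more than 10^21 models
```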
Before fitting a lasso or elastic net model, we center and scale covariates so that the
penalties exert their force equally on all components of β. Further, we work with a cen-
tered version of the response to penalize the intercept minimally. We also add any poten-
tially relevant interactions among the covariates to xi.
To adapt the lasso to exploratory quantile regression, we set L(β) to be the loss func-
tion in (2). Rather than use one specific t1, we investigate the solutions of (4) for a wide
range of t1, beginning with zero and ending at a large value. We find the lasso solutions
via a boosting algorithm;12 see the online supplement.
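Zhao and Yu's boosted lasso alternates small forward and backward coefficient steps; a stripped-down, forward-only sketch conveys the idea (our own illustrative code under simplifying assumptions, not the authors' implementation): repeatedly take the small coefficient step that most reduces the check loss, and record the order in which coefficients leave zero.

```python
import numpy as np

def check_loss(resid, tau):
    """Empirical check loss of equation (2)."""
    return np.where(resid >= 0, tau * np.abs(resid), (1 - tau) * np.abs(resid)).sum()

def forward_stagewise_path(X, y, tau, eps=0.05, max_steps=500):
    """Greedy epsilon-step approximation to the lasso solution path.

    At each step, nudge the single coefficient (by +/- eps) that most
    reduces the empirical check loss; stop when no step helps.
    """
    n, p = X.shape
    beta = np.zeros(p)
    entry_order = []                        # order in which coefficients leave zero
    current = check_loss(y - X @ beta, tau)
    for _ in range(max_steps):
        best_j, best_step, best_loss = None, 0.0, current
        for j in range(p):
            for step in (eps, -eps):
                trial = beta.copy()
                trial[j] += step
                loss = check_loss(y - X @ trial, tau)
                if loss < best_loss:
                    best_j, best_step, best_loss = j, step, loss
        if best_j is None:                  # no step reduces the loss: done
            break
        if beta[best_j] == 0.0 and best_j not in entry_order:
            entry_order.append(best_j)
        beta[best_j] += best_step
        current = best_loss
    return beta, entry_order

rng = np.random.default_rng(7)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)   # only the first two covariates matter

beta, entry_order = forward_stagewise_path(X, y, tau=0.25)
print(entry_order, np.round(beta, 2))
```

With a signal this strong, the two truly relevant coefficients leave zero early in the path, which is exactly the behavior the solution-path plots are read for.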
The solution paths, i.e., plots of estimated β versus t1, can be used to identify impor-
tant variables in the quantile regression. In particular, we search for variables whose esti-
mates take on nonzero values early in the solution path. To illustrate with simulated data,
consider the first panel of Figure 2, which shows the lasso solution paths for a 25th per-
centile regression with ten variables, five of which have nonzero true coefficients. The es-
timates of four parameters quickly take on nonzero values as we move from left to right,
suggesting that these are important variables. The other estimated coefficients take much
longer to rise above zero, suggesting they are less important. Hence, if seeking to fit a par-
simonious quantile regression, the first four variables would be a starting point for model
building.
Although this illustrative example has a modest number of predictors, lasso works in
settings with more predictors than observations; see the online supplement for examples.
We simply follow the solution path until the penalty is weak enough that the penalized
and unpenalized estimates are equal, at which point the algorithm stops. With numer-
ous predictors, we typically are interested in models with significantly fewer nonzero β ele-
ments than observations, so we can stop the algorithm early to save computational time.
It is possible to select one value of t1, e.g., via cross validation, and use the fitted model
at that point in the solution path as a final model. When analysts’ primary goal is making
out-of-sample predictions, or when there is scant domain knowledge to guide model selec-
tion, this approach can provide useful models. However, as an exploratory tool, we prefer
to use penalized quantile regression as a process for ranking the importance of groups of
covariates. We then combine domain knowledge with the information gleaned from the
[Figure 2: lasso and elastic net solution paths (β versus t1) for λ∗2 = 1, 10, 100, 500, and 1000]
Figure 2: Start of lasso and elastic net solution paths for a quantile regression of the 25th percentile of simulated data. Three of the covariates are very strongly correlated and have an associated true regression parameter of 0.2 (solid lines). The other covariates are less strongly correlated with each other. Two of these have a true β value of 0.3 (dashed lines) and the rest have a true value of zero (dotted lines). Lasso is equivalent to a λ∗2 = 0 elastic net fit. The abscissa variable is t1 = Σ_{j=1}^p |βj|.
exploratory fits in the formal model building process. This process is described further in
Section 2.3 and in the online supplement.
2.2 Boosted elastic net for quantile regression
As an exploratory tool, lasso can be sub-optimal when there are highly correlated groups
of predictors, as it will tend to grant only one of them a nonzero β coefficient. Here, the
elastic net6 may better identify groups of important predictors. It penalizes the empirical
loss function by adding a term of λ2 Σ_j βj² to the lasso criterion. Thus, we seek the value of β that minimizes

ΓEN(β; λ1, λ2) = L(β) + λ1 Σ_{j=1}^p |βj| + λ2 Σ_{j=1}^p βj²    (5)
for given λ1 and λ2. As with lasso, larger values of λ1 encourage more zeros in the solu-
tion for β. With a large value of λ2, components of β are encouraged to take on small but
nonzero values, since the penalty disfavors large coefficients.
To see why the elastic net penalty encourages groups of correlated predictors to enter
the model simultaneously relative to lasso, consider covariates X1 and X2 that are nearly
identical for all subjects. The lasso penalty cannot easily distinguish between a solution
with (β1, β2) = (0, φ) and (β1, β2) = (φ/2, φ/2) for some φ ≠ 0, so that the solution may
present only one nonzero predictor. The elastic net penalty with λ2 > 0 prefers (φ/2, φ/2),
so that both predictors take on nonzero values at similar points in the solution path.
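This grouping argument can be checked directly: both candidate solutions carry the same lasso penalty (|0| + |φ| = |φ/2| + |φ/2|), but the quadratic term breaks the tie in favor of the split solution (φ² versus φ²/2). A small numeric check with illustrative values of our own choosing:

```python
phi = 1.0
concentrated = (0.0, phi)        # all weight on one of the twin covariates
split = (phi / 2, phi / 2)       # weight shared equally between them

def lasso_pen(beta):
    """Lasso penalty: sum of absolute coefficients."""
    return sum(abs(b) for b in beta)

def ridge_pen(beta):
    """Quadratic term added by the elastic net."""
    return sum(b * b for b in beta)

# Lasso alone cannot distinguish the two solutions...
print(lasso_pen(concentrated), lasso_pen(split))     # 1.0 1.0
# ...but the elastic net's quadratic term prefers the split one.
print(ridge_pen(concentrated), ridge_pen(split))     # 1.0 0.5
```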
To adapt the elastic net to exploratory quantile regression, we let L(β) be the loss
function in (2). Rather than select one value of (λ1, λ2), we investigate the solutions for
a range of values to identify important predictors. To facilitate the exploration, we replace
λ2 with λ1λ∗2, where λ∗2 is a positive constant. For a fixed λ∗2, we fit the solution path as λ1
shrinks toward zero. As with lasso, we look for groups of coefficients that move away from
zero early and fast in the solution path. We again use boosting to fit the solution paths
given λ∗2; see the online supplement.
We examine the solution paths for several values of λ∗2, searching for a value of λ∗2 large
enough to allow more variables to enter the model early in the solution path (compared to
the lasso fit), but small enough that some parsimony is maintained early in the solution
path. Appropriate values of λ∗2 depend on the data. We recommend starting with λ∗2 =
1 and moving up or down by factors of five or ten until these goals are achieved. It may
be helpful to inspect intermediate values of λ∗2, but we have experienced only moderate
sensitivity to λ∗2 settings.
To illustrate, consider Figure 2, which shows the fits for λ∗2 between zero and 1000.
The fit with λ∗2 = 1 is nearly unchanged from the lasso (λ∗2 = 0) fit. When λ∗2 is 10 or 100,
a fifth variable enters the model early, thus appearing as an important predictor. When
we increase λ∗2 to 500 or 1000, it becomes difficult to discern exactly which variables are
likely to be most important since most of them take on nonzero values early in the solu-
tion path; we argue that these fits are not useful for an exploratory analysis. Thus, in this
simulated dataset, fits with λ∗2 equal to 10 or 100 satisfy our goals.
2.3 A framework for penalized exploratory quantile regression
The lasso and elastic net quantile regressions can be combined to improve exploratory
analyses for quantile regression. We recommend the following steps.
First, we fit the lasso regression solution path at the quantile of interest. We isolate
variables with coefficients that take on nonzero values early in the solution path. These
variables are the starting point for formal model building. We can rank the importance
of the remaining variables by the order in which their corresponding coefficients take on
nonzero values. We can choose to include some of these or additional variables in the final
model based on scientific considerations and formal hypothesis testing.
Second, as a check on the lasso results, we fit the elastic net at several values of λ∗2, as
outlined in Section 2.2. Here, we look for variables that appear early in the elastic net fit
but are not prominent in the lasso. These variables can be considered for inclusion based
on scientific merits and formal hypothesis tests. The elastic net is especially useful in set-
tings where interaction effects are suspected, since interaction variables are often collinear
with main effects and thus may be excluded by the lasso.
This is an exploratory framework designed to highlight prominent groups of predictors.
Like all exploratory model building techniques, it should supplement — not replace — sci-
entific rationales for model building. Analysts should add or subtract variables accord-
ingly. Nonetheless, in high dimensional settings with many predictors, exploratory quantile
regression tools can help analysts find important predictors that might otherwise be diffi-
cult to uncover, as we now illustrate on genuine data.
3 Application
We apply the exploratory framework to the Healthy Pregnancy, Healthy Baby Study (HPHBS),
an observational prospective cohort study in Durham, NC, focused on the etiology of ad-
verse birth outcomes. These adverse outcomes, including low birth weight and preterm
birth, have been linked to many problems13 including blindness,14 deafness,15 and behav-
ioral problems.16 Unfortunately, the causes of adverse birth outcomes are poorly under-
stood, although predictors cited as important include smoking,17 lead exposure,18 environ-
mental exposures more generally,13 and psychological stress.19,20,21
We focus on non-Hispanic black mothers, resulting in a sample size of 881 births. We
seek models for birth weight and gestational age, with particular emphasis on low quan-
tiles of these distributions, e.g., the 10th, 20th and 30th percentiles. The data comprise
35 covariates, including maternal demographics like age, education and income; mater-
nal medical history variables like previous birth outcomes; maternal environmental expo-
sures to cadmium, nicotine, cotinine, mercury, and lead as measured in blood; and ma-
ternal psychosocial factors like scores on the Interpersonal Support Evaluation List, the
CES-D depression scale, the NEO Five Factor Personality Inventory, perceived racism,
and availability of social support. We believe that interactions among these variables may
be important predictors of birth outcomes. Therefore, we include in x the main effects of
all predictors (using indicator variables for multi-category covariates), linear and squared
terms for continuous predictors, and all two-way interactions, resulting in over 600 co-
variates for consideration. Summary statistics for the covariates are in Table 1, and a his-
togram of birth weights is provided in the online supplement. Missing data in x (birth out-
comes are complete) were multiply imputed using regression trees.28
With lasso quantile regressions, measures of tobacco smoke exposure take on primary
importance, particularly for birth weight. When modeling the 10th percentile of birth
weight, four of the first nine covariates to enter the model are interactions that include a
tobacco measure (see Figure 3); at the 20th percentile, four of the first seven are tobacco-
related; at the 30th percentile, five of the first six are interactions involving tobacco. These
results are consistent with the literature that extensively documents the impact of smoking
on birth weight. Across the quantiles, a lead/tobacco interaction is consistently flagged.
Other important variables from the lasso 10th percentile regression for birth weight
include the mother’s age selected as a squared term and in an interaction with environ-
mental tobacco smoke exposure. The former suggests that the association between age
and birth weight is a U-shaped curve, which is again consistent with the literature.29 In
addition, the ISEL appraisal score, which is a measure of the availability of someone with
whom to discuss problems, was selected in an interaction with a variable indicating per-
ceived social standing and an interaction with having a “visiting” relationship with the
father of the child. At the 20th percentile, prominent variables (other than those involving
tobacco) include an interaction between negative paternal support and NEO-conscientiousness
(a measure of organization and persistence) and an interaction between negative pater-
nal support and the perceived self-efficacy score, which measures a woman’s belief that
she can steer her own life’s trajectory.30 Qualitatively, the results are similar at the 30th
percentile of birth weight. Smoking measures are prominent, and environmental measures
other than a tobacco/lead interaction are slow to enter the model.
In the presented results, we focus on birth weight as the response because lasso fits of
gestational age reveal fewer useful explanatory variables, with less consistency across the
response quantiles. This may be evidence that the covariates poorly capture the determi-
nants of gestational age.
[Figure 3: lasso and elastic net solution paths (β versus t1) for the 10th percentile of birth weight]
Figure 3: Lasso and elastic net solution paths for a quantile regression of the 10th percentile of birth weight. Tobacco-related variables and interactions including tobacco-related variables are solid lines; all others are dashed. The elastic net fit uses λ∗2 = 0.05. Many components are overplotted on the β = 0 line. The abscissa variable is t1 = Σ_{j=1}^p |βj|, and estimates are scaled in grams.
With the elastic net, the fits are essentially identical to the lasso fit when λ∗2 = 0.01.
When λ∗2 = 0.1, dozens of variables take on nonzero values very early in the fitting process,
which makes identifying the important variables difficult. Hence, we focus on the λ∗2 =
0.05 elastic net. This fit generally selects the same variables as lasso, but it also identifies
other potentially important covariates: interactions between blood lead levels and mother’s
self-reported tobacco use at the start of the pregnancy, and between blood lead levels and
mother’s tobacco use during the pregnancy, are among the early variables with nonzero
coefficients.
Building on the exploratory analyses, we estimate regressions for the 10th, 20th and
30th percentiles of birth weight. Although not selected in the exploratory stage, we in-
clude standard controls known to be predictors of birth weight, including the baby’s gen-
der and an indicator for first birth.29,30 Since the mother’s age appeared important as a
squared term, we include age and (age)2. We also consider the tobacco use at prenatal in-
take/lead interaction that was the first variable to appear in our 10th percentile lasso and
elastic net fits, and add in the main effects to comply with the hierarchy principle. We
also fit models including parental relationship status, ISEL appraisal scores, and their in-
teraction; however, statistically insignificant F-tests of these effects suggested that they
could be removed from the models without much sacrifice. We similarly remove the mater-
nal age/tobacco interaction after an F-test.
The results are summarized in Table 2, along with a median regression for complete-
ness. While age does not approach significance in any of the presented quantiles, (cen-
tered) age squared is consistently negative. This suggests a U-shaped effect of maternal
age on birth weight; i.e., women at the bottom and top end of the age distribution will
tend to have lower birth weights. It is not until the 30th percentile that infant sex nearly
reaches significance, and parity (first or higher order birth) never reaches significance in
the quantiles presented. Since infant sex and parity are well-established predictors of birth
weight in traditional mean regression analyses, we might reasonably conclude that at lower quantiles, other processes overwhelm the effects that dominate at the
mean.
The coefficients on the environmental variables (the continuous blood lead measure-
ment, tobacco use at prenatal care intake, and their interaction) are especially interesting.
The combined main and interaction effects of lead and tobacco are essentially a wash at
the 10th percentile. At the 20th percentile, women who both use tobacco and who have
lead exposure that is one standard deviation above the mean have a net decrease in birth
weight of 174g. However, at the 10th and 20th percentiles, these coefficient estimates are
not significant. At the 30th percentile, the combined main and interaction effects are as-
sociated with a 168g decrement in birth weight. Here, the interaction effect is significant,
and the two main effects approach significance.
The meta-analysis of Navas et al.32 found that “evidence is sufficient to infer a causal
relationship between lead exposure and high blood pressure” (p. 478). Concurrently, hy-
pertension is associated with poorer birth outcomes.29 We also know the anomalous but
persistent result that smoking during pregnancy reduces the risk for preeclampsia33,34,35
(although among preeclamptics, smoking increases the risk for perinatal mortality and fe-
tal growth restriction). Both preeclampsia and maternal hypertension are associated with
lower birth weights. Because hypertension can be either the cause or the result of prob-
lems in pregnancy (especially if it progresses to preeclampsia), this exploratory analysis
suggests that a nuanced treatment of the lead-hypertension-smoking nexus may be critical
to understanding adverse birth outcomes. Such modeling is part of our ongoing research
agenda.
A recent report23 noted that while the relationships between environmental exposures
and preterm birth are in general poorly understood, “possible exceptions are lead and en-
vironmental tobacco smoke, for which the weight of evidence suggests that maternal expo-
sure to these pollutants increases the risk for preterm birth” (p. 229). These variables are
also important contributors to low birth weights.13 The potential lead/smoking interaction
— while apparently unreported in the birth outcomes literature — has been highlighted in
other contexts. For example, the interaction has been found to be important for predicting
mortality due to cancer36 and attention deficit/hyperactivity disorder.37
4 Summary
Quantile regression facilitates identifying covariates that differentially affect the tails of
continuous response distributions, which often are of clinical and scientific relevance. We
presented an exploratory framework for identifying potentially important covariates in
quantile regression for high dimensional predictor spaces. The underlying algorithms, based
on the lasso and elastic net penalties, are computationally fast and simple to use, thereby
offering analysts methods for investigating the importance of hundreds of potential regres-
sors on noncentral features of the response distribution. We applied the framework to a
study of adverse birth outcomes and identified an interaction between maternal lead expo-
sure and smoking that was not previously discussed in the literature.
The methods presented here have wider applicability beyond studies of adverse birth
outcomes. For example, quantile regression has been used to study effects of tobacco in
relation to sleep patterns,38 degradation of mental acuity in multiple sclerosis patients,39
and determinants of obesity.40 In each case, quantile regression provides a fuller picture of
covariate effects than standard linear regression could. As quantile regression gains pop-
ularity and datasets become richer, we believe that our exploratory techniques will be a
useful addition to the epidemiologist’s toolbox.
5 Acknowledgements
The authors wish to thank Geeta Swamy, two anonymous referees and Editors Wacholder
and Wilcox for comments that improved the manuscript.
References
1. RB Bendel, AA Afifi. Comparison of stopping rules in forward “stepwise” regression. J
Am Stat Assoc. 1977;72:46–53.
2. L Breiman, JH Friedman, RA Olshen, CJ Stone. Classification and Regression Trees.
Boca Raton: Chapman & Hall/CRC; 1984.
3. AE Hoerl, RW Kennard. Ridge regression: Biased estimation for nonorthogonal prob-
lems. Technometrics. 2000;42:80–86.
4. R Tibshirani. Regression shrinkage and selection via the lasso. J R Stat Soc Series B
Stat Methodol. 1996;58:267–288.
5. R Koenker, G Bassett Jr. Regression quantiles. Econometrica. 1978;46:33–50.
6. H Zou, T Hastie. Regularization and variable selection via the elastic net. J R Stat Soc
Series B Stat Methodol. 2005;67:301–320.
7. N Fenske, T Kneib, T Hothorn. Identifying risk factors for severe childhood malnutri-
tion by boosting additive quantile regression. Technical Report, University of Munich
Department of Statistics, 2009.
8. Y Wu, Y Liu. Variable selection in quantile regression. Stat Sin. 2009;19:801–817.
9. R Koenker. Quantile regression for longitudinal data. J of Multivar Anal. 2004;91:74–
89.
10. Y Li, J Zhu. L1-norm quantile regression. J Comput and Graph Stat. 2008;17:163–185.
11. Q Li, R Xi, N Lin. Bayesian regularized quantile regression. Bayesian Anal. 2010;5:533–
556.
12. P Zhao, B Yu. Boosted lasso. In: H Lui, R Stine, L Auslender, chairs. Workshop on
Feature Selection for Data Mining. Newport Beach, CA: SIAM; 2005:35–44.
13. ML Miranda, D Kim, JP Reiter, MA Overstreet Galeano, P Maxson. Environmental
contributors to the achievement gap. Neurotoxicology. 2009;30:1019–1024.
14. BJ Crofts, R King, A Johnson. The contribution of low birth weight to severe vision loss in a geographically defined population. Br J Ophthalmol. 1998;82:9–13.
15. JM Lorenz, DE Wooliever, JR Jetton, N Paneth. A quantitative review of mortality
and developmental disability in extremely premature newborns. Arch Pediatr Adolesc
Med. 1998;152:425–435.
16. MC McCormick, SL Gortmaker, AM Sobol. Very low birth weight children: Behavior
problems and school difficulty in a national sample. J Pediatr. 1990;117:687–693.
17. DA Savitz, N Dole, JW Terry Jr, H Zhou, JM Thorp Jr. Smoking and pregnancy out-
come among African-American and white women in central North Carolina. Epidemiol-
ogy. 2001;12:636–642.
18. T Gonzalez-Cossio, KE Peterson, LH Sanin, et al. Decrease in birth weight in relation
to maternal bone-lead burden. Pediatrics. 1997;100:856–862.
19. RW Newton, LP Hunt. Psychosocial stress in pregnancy and its relation to low birth
weight. BMJ. 1984;288:1191–1194.
20. ST Orr, JP Reiter, DG Blazer, SA James. Maternal prenatal pregnancy-related anxi-
ety and spontaneous preterm birth in Baltimore, Maryland. Psychosom Med. 2007;69:566–
570.
21. ST Orr, DG Blazer, SA James, JP Reiter. Depressive symptoms and indicators of ma-
ternal health status during pregnancy. J Womens Health. 2007;16:535–542.
22. R Koenker, KF Hallock. Quantile regression. J Econ Perspect. 2001;15:143–156.
23. RE Behrman, AS Butler. Preterm Birth: Causes, Consequences and Prevention. Wash-
ington, DC: National Academy Press; 1997.
24. GC Windham, B Hopkins, L Fenster, SH Swan. Prenatal active or passive tobacco
smoke exposure and the risk of preterm delivery or low birth weight. Epidemiology.
2000;11:427–433.
25. H Akaike. Information theory and an extension of the maximum likelihood principle. In: BN Petrov, F Csaki, eds. Second International Symposium on Information Theory. Budapest: Akademiai Kiado; 1973:267–281.
26. G Schwarz. Estimating the dimension of a model. Ann Stat. 1978;6:461–464.
27. B Efron, T Hastie, I Johnstone, R Tibshirani. Least angle regression. Ann Stat. 2004;32:407–
499.
28. LF Burgette, JP Reiter. Multiple imputation for missing data via sequential regression
trees. Am J Epidemiol. 2010;172:1070–1076.
29. GK Swamy, S Edwards, A Gelfand, SA James, ML Miranda. Maternal age, birth or-
der and race: Differential effects on birthweight. J Epidemiol Community Health. Forthcoming;1–
23.
30. A Bandura. Self-efficacy. In: Corsini Encyclopedia of Psychology. Wiley Online Library; 2010.
31. P Magnus, K Berg, T Bjerkedal. The association of parity and birth weight: Testing
the sensitization hypothesis. Early Hum Dev. 1985;12:49–54.
32. A Navas-Acien, E Guallar, EK Silbergeld, SJ Rothenberg. Lead exposure and cardio-
vascular disease—A systematic review. Environ Health Perspect. 2007;115:472–482.
33. S Cnattingius, JL Mills, J Yuen, O Eriksson, H Salonen. The paradoxical effect of
smoking in preeclamptic pregnancies: Smoking reduces the incidence but increases the
rates of perinatal mortality, abruptio placentae, and intrauterine growth restriction.
Am J Obstet Gynecol. 1997;177:156–161.
34. A Castles, EK Adams, CL Melvin, C Kelsch, ML Boulton. Effects of smoking during
pregnancy: Five meta-analyses. Am J Prev Med. 1999;16:208–215.
35. KY Lain, RW Powers, MA Krohn, RB Ness, WR Crombleholme, JM Roberts. Urinary
cotinine concentration confirms the reduced risk of preeclampsia with tobacco expo-
sure. Am J Obstet Gynecol. 1999;181:1192–1196.
36. M Lustberg, E Silbergeld. Blood lead levels and mortality. Arch Intern Med. 2002;162:2443–
2449.
37. TE Froehlich, BP Lanphear, P Auinger, et al. Association of tobacco and lead expo-
sures with attention deficit/hyperactivity disorder. Pediatrics. 2009;124:e1050–e1063.
38. L Zhang, J Samet, B Caffo, NM Punjabi. Cigarette smoking and nocturnal sleep ar-
chitecture. Am J Epidemiol. 2006;164:529–537.
39. RA Marrie, NV Dawson, A Garland. Quantile regression and restricted cubic splines
are useful for exploring relationships between continuous variables. J Clin Epidemiol.
2009;62:511–517.
40. K Kan, WD Tsai. Obesity and risk knowledge. J Health Econ. 2004;23:907–934.
Variable                       Description                                   Mean               SD
Income                         Ordered categorical reported income           < $5000*
Maternal education             Ordered categorical                           Some high school*
Maternal age                   In years                                      25                 6
Private insurance              Private health insurance                      13%
Parity                         Is this the mother’s first child?             38%
Cotinine                       Blood cotinine measurement (ng/mL)            21                 58
Nicotine                       Blood nicotine measurement (ng/mL)            0.6                2
Cadmium                        Blood cadmium measurement (ng/mL)             0.4                0.6
Mercury                        Blood mercury measurement (ng/mL)             0.5                1
Manganese                      Blood manganese measurement (ng/mL)           1                  1
Lead                           Blood lead measurement (µg/dL)                0.4                0.8
ISEL Belong                    ISEL belonging mean                           9                  2
ISEL Self                      ISEL self-esteem score                        9                  2
ISEL Appr                      ISEL appraisal score                          10                 3
ISEL Tang                      ISEL tangible score                           9                  3
ISEL Total                     ISEL total                                    38                 8
Paternal visit                 Visiting relationship with father             22%
Not romantically involved      Not romantically involved with father         21%
PS Score                       Mean of paternal support                      2.7                0.3
Pos support                    Score of positive paternal support            2.5                0.5
Neg support                    Score of negative paternal support            1.2                0.3
NEO-N                          NEO Neuroticism score                         20                 7
NEO-A                          NEO Agreeableness score                       31                 6
NEO-E                          NEO Extroversion score                        29                 6
NEO-O                          NEO Openness score                            25                 5
NEO-C                          NEO Conscientiousness score                   35                 6
Perceived racism               Sum of measures of experienced racism         0.8                1.3
JHAC score                     John Henryism                                 52                 6
Ladder                         Position relative to peers                    7                  2
CESD score                     Depression score                              16                 10.6
PSS score                      Perceived stress score                        2.6                0.7
SE score                       Self-efficacy                                 3.3                0.5
Environmental tobacco smoke    Self-report environmental tobacco exposure    68%
TobUse                         Tobacco use at delivery                       20%
Tobacco                        Tobacco use at prenatal care intake           19%

Table 1: Variables (along with interactions and squared terms) included in exploratory analyses, with means and standard deviations. Starred entries are modal responses.
                     10th percentile       20th percentile       30th percentile       Median
                     Estimate (Interval)   Estimate (Interval)   Estimate (Interval)   Estimate (Interval)
Intercept            2271 (1977, 2564)     2644 (2497, 2791)     2802 (2715, 2890)     3038 (2928, 3147)
Age (centered)       -5 (-31, 20)          -1 (-16, 14)          1 (-11, 14)           -5 (-16, 6)
Age^2 (centered)     -3 (-6, 1)            -3 (-5, -1)           -2 (-4, -0)           -1 (-2, 1)
Male                 -21 (-261, 219)       37 (-92, 166)         98 (-4, 201)          80 (-10, 170)
2nd baby or later    -19 (-266, 228)       54 (-85, 192)         0 (-113, 115)         60 (-43, 163)
Blood lead           206 (48, 365)         109 (-11, 230)        81 (-7, 170)          88 (21, 156)
Tobacco use          106 (-187, 399)       -86 (-240, 68)        -91 (-230, 47)        -102 (-212, 9)
Lead/Tobacco         -303 (-716, 110)      -197 (-368, 29)       -158 (-280, -37)      -180 (-305, -56)

Table 2: Quantile regression of 10th, 20th, 30th, and 50th percentiles of birth weight, using the lead/tobacco interaction suggested by exploratory analyses, with parameter estimates and 95% confidence bounds. Estimates are in grams.