Spatial risk estimation of adverse pregnancy outcomes due to pollution
Brian Reich (NC State)
M. Fuentes (VCU), J. Warren (Yale), A. Herring (UNC), P. Langlois (Texas SHE)
TIES 2016
Scientific Motivation: Why Preterm Birth?
• Preterm birth (delivery before 37 completed weeks of gestation) a major cause of infant morbidity and mortality
• Cause of roughly 50% of preterm births unknown
• Institute of Medicine (IOM) estimates average first-year medical costs are 10 times greater for a preterm relative to a term birth
• IOM estimates total cost over $26 billion annually (over $50K per preterm birth) including hospital costs, special education, etc.
• US national incidence around 12%; around 1.6 times higher in African-American than in white and/or Hispanic women
References: March of Dimes Prematurity Campaign (2010); Martin et al. (2010).
NC State University 2 / 45TIES 2016
Brian Reich
Scientific Motivation: Preterm birth and air pollution
Specific Results of Literature Review linking preterm birth and air pollution:
• Preterm Birth: Current studies about the impact of air pollution on preterm birth are not conclusive, but indicate a potential association and the need for further research.
• More information is needed to examine:
  • Effect of different pollutants
  • Most vulnerable periods of the pregnancy
• Significant pollutants include particulate matter.
References: Sram et al. (2005) and the U.S. EPA PM and CO Integrated Science Assessments (2009 and 2010).
Risk Factors for Preterm Birth
• Black mothers at higher risk (16-18%)
• Low socioeconomic and educational status
• Low and high maternal ages
• Single marital status
• Working long hours
• Frequent hard physical labor
• Mechanisms by which maternal demographics are related to preterm birth are still unknown
• Source: Romero (2008)
Statistical Motivation: Challenges
• Incorporate spatial analysis of environmental health data, typically not considered in classical birth outcome epidemiological studies
• Develop new statistical methodology to uncover true relationships between ambient air pollution exposure and preterm birth
• Create statistical models to identify the specific critical windows during the pregnancy when high exposures to pollutants more negatively affect the birth outcomes
• Characterize the uncertainty in models and data in the risk assessment
Data Sources and Information
Texas Vital Records Data:
• Data provided by the Birth Defects Epidemiology and Surveillance Branch of the Texas Department of State Health Services in collaboration with Dr. Peter Langlois
• Full birth records from all births in Texas where
  1. Mother resided at delivery in a region and time period covered by the birth defect registry
     • 1997-2004
  2. Demographic information
Data Sources and Information
Figure: Birth Weight (grams) and Gest. Age (weeks) Histograms
Variable               Mean      SD       Min     Max       N
Birth Weight (grams)   3286.73   579.79   0.00    8098.00   2625678
Gest. Age (weeks)      38.59     2.17     12.00   48.00     2588704
Table: Summary Statistics for Texas Vital Records Data, 1997-2004
Data Sources and Information
Infant/Fetus and Pregnancy Information Included:
• Case/Control Status
• Sex
• Birth Weight
• Pregnancy Outcome
• Plurality of Pregnancy
• Gestational Age Estimates
• Number of Previous Live Births
Mother/Father Information Included:
• Age
• Birthplace
• Race/Ethnicity
• Education
Data Sources and Information
Geocoding Information:
• Each birth is geocoded to the residence at delivery
• Included Information:
  • Latitude/Longitude
  • City, County
  • Street Address, ZIP Code
  • Geocoding Accuracy
• Texas (1995-2002) study found that 68% of pregnant women did not move between date of conception and date of birth (Nuckols et al. (2004))
• Of the women who did move, 49% stayed very close geographically (same water supply source)
Data Sources and Information
Figure: Birth Locations in Texas, 1997-2004
Data Sources and Information
Pollution Data Sources:
• Fused Air and Deposition Surfaces Data (FSD): 2001-2006
  • EPA product which calibrates Community Multiscale Air Quality (CMAQ) data using observed monitoring data
  • Nation-wide daily data available on the CMAQ grids (12km x 12km and 36km x 36km spatial resolution)
  • Ozone: Maximum daily 8-hour average (parts per billion (ppb))
  • PM2.5: Daily average (micrograms per cubic meter (ug/m3))
• Air Quality System (AQS) Data (monitoring data): 2000-2004
  • Ozone: Maximum daily 8-hour average (parts per million (ppm))
  • PM2.5: Daily average (micrograms per cubic meter (ug/m3))
Weather Data (2000-2004): Daily temperature, dew point, cloud cover, and wind speed.
FSD Model
• $Y(s)$: Monitoring data for site s
• $Q(A_j)$: CMAQ numerical model output for grid cell $A_j$
• $Z(s)$: True pollution value for site s
  • Latent process, unobservable without measurement error
Model:
• $Y(s) = Z(s) + \varepsilon(s)$
  • $\varepsilon(s) \overset{iid}{\sim} N(0, \sigma_\varepsilon^2)$; measurement error
• $Q(s) = a(s) + b(s)Z(s) + \gamma(s)$
  • $a(s)$: Additive bias associated with CMAQ data
  • $b(s)$: Multiplicative bias associated with the CMAQ data
  • $\gamma(s) \overset{iid}{\sim} N(0, \sigma_\gamma^2)$; random error
FSD Model
• $Q(A_j) = \int_{A_j} a(s)\,ds + b\int_{A_j} Z(s)\,ds + \int_{A_j} \gamma(s)\,ds$
  • CMAQ data given as areal estimates
• $Z(s) = \mu(s) + \nu(s)$
  • $\mu(s)$: Large scale spatial trend
  • $\nu(s)$: Gaussian process with specified spatial covariance structure
  • $b(s)$ constant over space based on recommendations by EPA
• Values simulated from posterior predictive distribution (ppd) for spatial prediction
  • ppd: $P(Z(s') \mid Y, Q)$
• EPA adopted this model (Fuentes and Raftery (2005)) and extended it to space-time to create their FSD product (McMillan et al. (2010))
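To make the change of support concrete, the sketch below simulates the point-level model on a one-dimensional transect and averages it over grid cells, approximating each integral over A_j by a cell mean. The trend, the bias values a and b, the noise levels, and the names `simulate_fusion` and `cell_size` are illustrative assumptions; the correlated term ν(s) is left out for brevity, so this is not the EPA implementation.

```python
import random

def simulate_fusion(n_points=40, cell_size=10, a=2.0, b=0.8,
                    sd_eps=1.0, sd_gamma=0.5, seed=1):
    """Simulate monitor data Y(s) and cell-level CMAQ output Q(A_j) on a transect."""
    rng = random.Random(seed)
    # True pollution Z(s) = mu(s); the Gaussian process nu(s) is omitted here.
    z = [30.0 + 0.25 * s for s in range(n_points)]
    # Monitoring data: Y(s) = Z(s) + eps(s)
    y = [zs + rng.gauss(0.0, sd_eps) for zs in z]
    # Point-level numerical model output: a + b * Z(s) + gamma(s), constant a and b
    q_point = [a + b * zs + rng.gauss(0.0, sd_gamma) for zs in z]
    # Areal values: average the point-level quantities within each grid cell
    starts = range(0, n_points, cell_size)
    q_cell = [sum(q_point[i:i + cell_size]) / cell_size for i in starts]
    z_cell = [sum(z[i:i + cell_size]) / cell_size for i in starts]
    return y, q_cell, z_cell

y, q_cell, z_cell = simulate_fusion()
for qc, zc in zip(q_cell, z_cell):
    print(f"cell mean of Z = {zc:6.2f}, CMAQ cell value = {qc:6.2f}")
```

With the random error switched off, each cell value reduces to a + b times the cell mean of Z, which is the calibration relationship the fusion model exploits.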
Data Sources and Information
Pollution Data Sources (Texas Only):
• Air Quality System (AQS) Data: 2000-2004
  • Ozone: Maximum daily 8-hour average (parts per million (ppm))
  • PM2.5: Daily average (micrograms per cubic meter (ug/m3))
Weather Data: 2000-2004
• Daily weather data from the National Climatic Data Center (NCDC)
Active Texas Pollution Monitors: 2000-2004
Figure: Ozone Monitor Locations (Red), PM2.5 Monitor Locations (Blue)
Previous Work
• Exposure to pollution during pregnancy typically handled through trimester, monthly, and pregnancy averages; fit separately using multiple models for different exposure windows, including separate models for different pollutants (Xu et al. (1995); Bobak (2000); Lin et al. (2001))
• Method is inefficient and does not allow us to jointly identify specific periods across the entire pregnancy in a continuous manner
• Empirical analysis indicated that increased exposure to PM2.5 and ozone during the pregnancy leads to statistically significant decreases in gestational age and birth weight
Empirical Analysis
• Jointly modeled birth weight and gestational age in Harris County (includes Houston), 2002-2003
• Fit a mixture of bivariate normal distributions with independent components
• Used first trimester averages of PM2.5 and ozone as the pollution exposure (models fit separately)
• Modeled the mean using mother/father covariate information
Empirical Analysis Results
• Found that a mixture of two bivariate normal distributions provided the best (based on AIC/BIC) fit for the data
• Distribution 1: About 8% of the population was from this distribution
  • This group of births had much lower birth weights and gestational ages on average than a typical birth
• Distribution 2: About 92% of the population was from this distribution
  • This group of births had typical birth weights and gestational ages on average
Empirical Analysis Results
Groups Identified by Mixture Model Analysis
[Scatter plot: Weight (Grams) versus Gest. Age (Weeks)]
Figure: Identified Births from Distribution 1 (Red) and Distribution 2 (Black)
Distribution 1, Birth Weight Results (grams)
Variable                                Estimate (PM2.5 Model)   p-value
Intercept                               2611.81                  <0.0001
Female vs. Male Baby                    -73.28                   0.0484
Mother ≥ 40 vs. Mother 10-19            -558.44                  0.0015
Black vs. White Mother (Non-Hispanic)   -547.25                  <0.0001
First Trimester Pollution Average       -162.46                  <0.0001
Table: PM2.5 Model Results.

Variable                                Estimate (Ozone Model)   p-value
Intercept                               2659.17                  <0.0001
Female vs. Male Baby                    -86.81                   0.0185
Mother ≥ 40 vs. Mother 10-19            -548.23                  0.0012
Black vs. White Mother (Non-Hispanic)   -587.28                  <0.0001
First Trimester Pollution Average       -228.13                  <0.0001
Table: Ozone Model Results.
Distribution 1, Gest. Age Results (weeks)
Variable                                Estimate (PM2.5 Model)   p-value
Intercept                               35.79                    <0.0001
Mother ≥ 40 vs. Mother 10-19            -2.58                    0.0035
Black vs. White Mother (Non-Hispanic)   -2.28                    <0.0001
First Trimester Pollution Average       -0.92                    <0.0001
Table: PM2.5 Model Results.

Variable                                Estimate (Ozone Model)   p-value
Intercept                               35.99                    <0.0001
Mother ≥ 40 vs. Mother 10-19            -2.61                    <0.0001
Black vs. White Mother (Non-Hispanic)   -2.47                    0.0023
First Trimester Pollution Average       -1.26                    <0.0001
Table: Ozone Model Results.
Empirical Analysis Results
• High number of observations in distribution 2 caused most effects to be found significant
• Pollution exposure effects were found to be significantly negative
• An increased exposure to PM2.5 and ozone in the first trimester appears to be associated with a decrease in birth weight and gestational age for both distributions
Preterm Birth Model: Trimester Average Approach
$Y_i \mid \beta, \theta \overset{ind}{\sim} \text{Bern}(p_i(\beta, \theta))$,
$p_i(\beta, \theta)$ = probability birth i results in preterm birth,
$\text{logit}(p_i(\beta, \theta)) = x_i^T\beta + \sum_{j=1}^{2}\sum_{k=1}^{2} \theta(j, k)\, Z_j(t_i(k), s(i))$,
• k: Trimester of the pregnancy
• $Z_j(t_i(k), s(i))$: Trimester average of pollutant j at location s(i) according to calendar date range $t_i(k)$
• Pollution exposures based on woman-specific dates of pregnancy and calculated using observations from the nearest pollution monitor
• Similar to models used by Bobak (2000), Ritz et al. (2007), and Liu et al. (2003)
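As a rough illustration of the trimester average approach, the sketch below collapses weekly exposures into two trimester averages per pollutant and passes the linear predictor through the inverse logit. The week ranges (1-13 and 14-26), the function names, and the input layout are assumptions made for this example only; each weekly series is assumed to cover at least 26 weeks.

```python
import math

def trimester_averages(weekly, windows=((1, 13), (14, 26))):
    """Average a weekly exposure series over each trimester window (inclusive weeks)."""
    return [sum(weekly[a - 1:b]) / (b - a + 1) for a, b in windows]

def preterm_probability(x_beta, theta, exposures):
    """Inverse-logit of x_i'beta + sum_j sum_k theta(j, k) * Z_j(k).

    theta[j][k] is the effect of pollutant j in trimester k;
    exposures[j] is the weekly exposure series for pollutant j.
    """
    eta = x_beta
    for theta_j, weekly_j in zip(theta, exposures):
        for theta_jk, z_jk in zip(theta_j, trimester_averages(weekly_j)):
            eta += theta_jk * z_jk
    return 1.0 / (1.0 + math.exp(-eta))

# Two pollutants, 26 weeks each (made-up values)
pm25 = [12.0] * 13 + [15.0] * 13
ozone = [40.0] * 26
print(preterm_probability(-2.0, [[0.01, 0.02], [0.0, 0.005]], [pm25, ozone]))
```

Only four pollution coefficients enter the model here, which is why this formulation cannot locate exposure windows finer than a trimester.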
Preterm Birth Model: Trimester Average Approach
θ(j, k): Effect of the concentration of air pollutant j in trimester k on the probability of PTB for woman i
• Model often fit in the frequentist setting with the θ parameters considered as fixed, ignoring the correlation that exists between the trimester averages and across pollutants (Ritz et al. (2000); Ritz et al. (2007); Xu et al. (1995))
• Changes in the pollution effects across space most often ignored
• To avoid this multicollinearity, multiple models are usually fit:
  • One pollutant at a time
  • One pollution exposure measure at a time
New Methodology
• Incorporating the spatial aspect into the modeling procedure to allow the pollution exposure effects to change over the geographic domain of interest
• Allowing a more continuous form of exposure over the entire pregnancy by introducing weekly exposures
• Allowing multiple pollutants to be modeled simultaneously while accounting for the correlation between the effects
• Bayesian framework allows for flexible solution to correctly characterize the uncertainty associated with the parameter estimates
General Framework
Weather Model:
$W(s, t) = \mu(s, t) + e_1(s, t) + \varepsilon_1(s, t)$
• $W(s, t)$: Vector of weather observations at various locations and times
• $\mu(s, t)$: Large scale spatial and temporal trend of the data
• $e_1(s, t)$: Spatially and temporally correlated zero-mean Gaussian process
• $\varepsilon_1(s, t) \overset{iid}{\sim} N(0, \sigma_1^2)$
  • Measurement error associated with the weather monitors
General Framework
Pollution Model:
• $Z_j^{\text{obs}}(s, t)$: AQS monitor observation for pollutant j at location s and time t
$Z_j^{\text{obs}}(s, t) = Z_j(s, t) + \varepsilon_2(s, t)$
• $Z_j(s, t)$: True pollution process for pollutant j at location s and time t
  • Unobservable without measurement error
• $\varepsilon_2(s, t) \overset{iid}{\sim} N(0, \sigma_2^2)$
  • Measurement error associated with the pollution monitors
General Framework
True Pollution Process:
$Z_j(s, t) = W(s, t)^T\delta + e_2(s, t)$
• $e_2(s, t)$: Spatially and temporally correlated zero-mean Gaussian process
• Posterior distribution of the weather covariates used as prior in this stage (Gelman 2004)
  • “Directional” Bayesian approach provides computational benefit and prevents the final health stage of modeling from providing information about the initial weather stage
• Values of $Z_j(s, t)$ simulated from the ppd, $P(Z_j \mid Z_j^{\text{obs}})$, at each woman’s location during the relevant pregnancy window
New Methodology
Preterm Birth Model:
$Y_i \mid \beta, \theta \overset{ind}{\sim} \text{Bern}(p_i(\beta, \theta))$,
$p_i(\beta, \theta)$ = probability birth i results in preterm birth,
$\Phi^{-1}(p_i(\beta, \theta)) = x_i^T\beta + \sum_{j=1}^{2}\sum_{w=1}^{\min(ga_i, 36)} \theta(j, w, s_i)\, Z_j(t_i(w), s_i)$,
• $\Phi^{-1}(\cdot)$: Inverse cumulative distribution function of the standard normal distribution
• $ga_i$ is the gestational age (weeks) for birth i
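A minimal sketch of the weekly probit predictor, for a single woman at a fixed location, may help show how the upper summation limit min(ga_i, 36) works. The function name and the list-of-lists layout for θ and the exposures are assumptions for illustration; the spatial index is dropped because only one location is considered.

```python
from statistics import NormalDist

def preterm_probability_weekly(x_beta, theta, exposure, ga_weeks, max_week=36):
    """Probit probability using weekly exposures up to min(ga_weeks, max_week).

    theta[j][w] and exposure[j][w] hold the coefficient and exposure for
    pollutant j in pregnancy week w + 1 (lists are zero-indexed).
    """
    n_weeks = min(ga_weeks, max_week)
    eta = x_beta
    for theta_j, z_j in zip(theta, exposure):
        for w in range(n_weeks):
            eta += theta_j[w] * z_j[w]
    # Phi maps the linear predictor back to a probability
    return NormalDist().cdf(eta)

theta = [[0.0] * 36, [0.0] * 36]
theta[0][35] = 1.0                      # only week 36 of pollutant 0 matters
exposure = [[1.0] * 36, [1.0] * 36]
print(preterm_probability_weekly(0.0, theta, exposure, ga_weeks=30))  # week 36 never reached
print(preterm_probability_weekly(0.0, theta, exposure, ga_weeks=40))  # week 36 included
```

A birth at 30 weeks contributes only its first 30 weekly exposures, so coefficients for later weeks have no influence on its likelihood.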
Modeling the Pollution Coefficients
• θ(j, w, s): Pollutant-specific spatially and temporally varying coefficients
• Represent the effects of the concentration of air pollutant j at location s in pregnancy week w on the probability of PTB for woman i
• $Z_j(t_i(w), s_i)$ represents the pollution exposure for pollutant j in calendar week $t_i(w)$ at location $s_i$
• Ozone and PM2.5 used in the analysis
New Methodology
Prior Process of θ Parameters:
$\theta \sim \text{MVN}(0, \phi_0\Sigma)$, where the entries of $\phi_0\Sigma$ are given by
$\text{cov}(\theta(j, w, s), \theta(j', w', s')) = \phi_0 \exp\{-\phi_1\|s - s'\| - \phi_2|w - w'| - \phi_3 I(j \neq j')\}$
• Exponential covariance structure provides relatively simple parameterization which allows separate degrees of shrinkage across air pollutants j, pregnancy week w, and location s
• Allows us to overcome multicollinearity introduced by using weekly exposures
• Computationally feasible because Σ is the Kronecker product of 3 smaller dimension correlation matrices
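The separability behind the Kronecker structure can be checked numerically: the exponent is a sum of a spatial, a temporal, and a pollutant term, so the covariance factors into three correlation functions. The sketch below is illustrative only; the function names and the parameter values are assumptions, and locations are taken to be coordinate tuples.

```python
import math

def theta_cov(j1, w1, s1, j2, w2, s2, phi0, phi1, phi2, phi3):
    """cov(theta(j, w, s), theta(j', w', s')) from the exponential prior."""
    dist = math.dist(s1, s2)            # Euclidean distance ||s - s'||
    diff_pollutant = 1.0 if j1 != j2 else 0.0
    return phi0 * math.exp(-phi1 * dist - phi2 * abs(w1 - w2) - phi3 * diff_pollutant)

def theta_cov_factored(j1, w1, s1, j2, w2, s2, phi0, phi1, phi2, phi3):
    """Same covariance written as a product of three correlation terms."""
    space = math.exp(-phi1 * math.dist(s1, s2))
    week = math.exp(-phi2 * abs(w1 - w2))
    pollutant = math.exp(-phi3) if j1 != j2 else 1.0
    return phi0 * space * week * pollutant

args = (1, 5, (0.0, 0.0), 2, 9, (3.0, 4.0), 2.0, 0.1, 0.3, 0.5)
print(theta_cov(*args), theta_cov_factored(*args))
```

Because each factor depends on only one index, the full matrix Σ can be assembled from a location matrix, a week matrix, and a 2×2 pollutant matrix rather than stored entry by entry.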
Application
Harris County, Texas 2002-2004. Mother’s first birth. Single births. Birth defect births included (live births only).
Harris County, TX:
• Third largest county in the US as of July, 2009
  • 4,070,989 estimated people in the county during this time
  • Los Angeles County, CA and Cook County, IL only larger counties
• Includes Houston, TX, which provides a large amount of heterogeneity to our study population
• Excellent pollution and weather monitor coverage in the county
Harris County Pollution Monitors: 2002-2004
Figure: Harris County, TX (Left); Active Harris County and Surrounding Area Ozone (Red) and PM2.5 (Blue) Pollution Monitors, 2002-2004 (Right)
Significant Parental Covariate Results
                                   Percentiles
Covariate               Mean       0.025      0.50       0.975
Maternal Race
  Black vs. White       0.1498     0.0187     0.1495     0.2827
Paternal Race
  Other vs. White       -0.1887    -0.3369    -0.1886    -0.0410
Maternal Age Group
  20-24 vs. 10-19       -0.0742    -0.1382    -0.0742    -0.0103
  35-39 vs. 10-19       0.1108     0.0093     0.1108     0.2114
  ≥ 40 vs. 10-19        0.2959     0.1176     0.2968     0.4701
Female vs. Male Baby    -0.0675    -0.1080    -0.0677    -0.0265
Table: Significant covariate results for the predicted AQS data, 2002-2004. The MC error for the means ranged from 0.0001 to 0.0007 with an average value of 0.0004.
Model Comparisons
• Fit a simplified version of the model to compare results with the newly introduced model
• Simplified model represents a basic multiple regression model which ignores the correlation that exists between the pollution coefficients
• Does not have the ability to identify the windows of critical exposure throughout the pregnancy for either pollutant
• Fit trimester average model
• Both models fit in the Bayesian setting
Trimester Average and Basic Multiple Regression Model Graphical Results
[Four panels: Effect versus Trimester for PM2.5 and Ozone; Effect versus Week for PM2.5 and Ozone]
Figure: Susceptible windows of exposure for the trimester average and basic multiple regression models using predicted AQS Data, Harris County, Texas, 2002-2004. Posterior means and 95% credible intervals are displayed.
Graphical Results of Exposure Windows
[Two panels: Effect versus Week for PM2.5 and Ozone]
Figure: Susceptible windows of exposure using predicted AQS Data, Harris County, Texas, 2002-2004. Posterior means and 95% credible intervals are displayed.
Model Comparison Results
Model                       pD      DIC
Newly Introduced            41.84   17226.00
Trimester Average           26.01   17268.06
Basic Multiple Regression   93.96   17267.03
Table: DIC results. pD is the effective number of parameters for the given model. Differences of more than 10 in DIC rule out the model with a higher value (Lunn et al. 2000).
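Applying the "difference of more than 10" rule to the reported DIC values is simple arithmetic, sketched below. The dictionary layout and variable names are illustrative; the numbers are copied from the table.

```python
# DIC values as reported in the table above
dic = {
    "Newly Introduced": 17226.00,
    "Trimester Average": 17268.06,
    "Basic Multiple Regression": 17267.03,
}

best = min(dic, key=dic.get)                       # model with the lowest DIC
delta = {m: round(v - dic[best], 2) for m, v in dic.items()}
ruled_out = [m for m, d in delta.items() if d > 10]

for model, d in delta.items():
    print(f"{model:26s} DIC difference = {d:6.2f}")
print("Ruled out:", ruled_out)
```

Both competing models sit roughly 41 to 42 DIC units above the newly introduced model, well past the threshold of 10.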
Sensitivity Analysis Results
Model                pD      DIC
2 Uniforms           41.84   17226.00
1 Gamma, 1 Uniform   41.87   17226.10
2 Gammas             41.96   17226.50
Table: DIC results. pD is the effective number of parameters for the given model. The table summarizes the fit of the newly introduced model using different combinations of prior distributions for the covariance hyper-parameters.
Extension to Space
Each County Fit Separately: Uniforms for Covariance Parameter Priors
[Six panels: Effect versus Week for PM2.5 and Ozone in each county]
Figure: Dallas County (left), Travis County (center), and Harris County (right) critical windows of exposure.
Results across Space
Fully Spatial Model: Uniforms for Covariance Parameter Priors
[Six panels: Effect versus Week for PM2.5 and Ozone in each county]
Figure: Dallas County (left), Travis County (center), and Harris County (right) critical windows of exposure.
Biological Justification
• Exposure to air pollution early in the first trimester may interfere with the delivery of oxygen and nutrients to the fetus
• Exposure may affect the placental development during the early stages of pregnancy
• Exposure may trigger inflammation, leading to PTB
• Exact explanation for how pollution affects the fetus not yet identified
• Some studies show that ultrafine particles can enter the mother’s lungs and penetrate the lung barriers, entering the bloodstream
• Particles can then travel to the organs, such as the brain and placenta, and may cause problems for the fetus
References: Ritz and Wilhelm (2008).
Conclusions
• More accurately identified periods during pregnancy where increased exposure to pollution leads to increased probability of preterm birth
• Multiple pollutants handled jointly
• Increased evidence linking harmful air pollution and birth outcomes while extending the results for preterm births
• More information uncovered and made available to the pregnant population which could potentially decrease the rate of preterm birth and lead to healthier birth outcomes in general
Part II
Spatial variable selection
Brian Reich
Department of Statistics, North Carolina State University
July 21, 2016
Joint work with Jian Kang and Ana-Maria Staicu
Brian Reich Spatial variable selection
Potential applications of spatial variable selection
• Let β(s) be the regression coefficient at location s
• There are many examples where you might want a separate slope for each location:
  • β(s) is the climate change effect at s
  • β(s) is the time trend in air pollution level at s
  • β(s) is the health effect of particulate matter at s
• In all cases, we might assume β is smooth and sparse:
  • Smooth: the spatial process β(s) is continuous in s
  • Sparse: β(s) = 0 in many locations
Example 1: Time trends in tropospheric ozone
• The EPA uses a monitoring network to regulate ozone
• Our objective is to identify areas with changing ozone
• We perform a linear regression at each site and study the time-trend estimates
• The next slide has the first-stage estimates (z-scores)
Example 1: Time trends in tropospheric ozone
[Map: first-stage estimates (z-scores) by site; axes Longitude and Latitude]
Example 2: Health effects of PM
• Fine particulate matter (PM) has been linked with several adverse health outcomes and is regulated by the EPA
• Literature has shown a spatially-varying effect
• This may be due to variation in PM composition or susceptibility of the population
• We analyze data for 22 chemical species of PM (elemental carbon, nitrates, etc)
• The objective is to identify which species are most harmful, and to study spatial variation in the effects
Example 2: Health effects of PM
• Let $Y_{it}$ be the number of Medicare CVD hospital admissions on day t in county i
• We analyze the 117 largest US counties and have data from 2000-2008
• Our model is
$\log[E(Y_{it})] = \text{confounders} + \sum_{j=1}^{p} X_{itj}\beta_j(s_i)$, with p = 22,
where $X_{itj}$ is the value of pollutant j and $\beta_j(s_i)$ is its effect
• We want to find j and s for which $\beta_j(s) \neq 0$
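The log-linear model can be read as a multiplicative effect on the expected count, as in the sketch below for one county and day. The function names, the `tol` argument, and the toy coefficient values are hypothetical; in practice the βs would come from the fitted model.

```python
import math

def expected_admissions(confounder_term, x, beta):
    """E(Y_it) = exp(confounders + sum_j X_itj * beta_j(s_i)) for one county-day."""
    return math.exp(confounder_term + sum(xj * bj for xj, bj in zip(x, beta)))

def active_pollutants(beta, tol=0.0):
    """Indices j for which the effect beta_j at this site is non-zero."""
    return [j for j, bj in enumerate(beta) if abs(bj) > tol]

beta_site = [0.0, 0.03, 0.0, -0.01]      # sparse effects at one site (made up)
x_today = [1.2, 0.8, 2.0, 0.5]           # pollutant values for one day (made up)
print(expected_admissions(math.log(10.0), x_today, beta_site))
print(active_pollutants(beta_site))
```

Pollutants with a zero coefficient at a site drop out of the sum entirely, which is the sense in which variable selection happens separately at each location.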
Example 3: EEG study of alcoholism
• Goal: Study the relationship between the brain activity as measured through EEG signals and genetic predisposition to alcoholism
• Data: 77 alcoholic subjects + 45 controls
• 64 electrodes sampled at 128Hz
• We use spatial variable selection to identify regions most predictive of alcoholism
Scalar-on-image regression framework
• Observed data for the i-th subject
  • $Y_i$ is a scalar outcome
  • $X_i = (X_{i1}, \dots, X_{ip})^T$ is an image/array
Model: $Y_i = \sum_{j=1}^{p} \beta(s_j)X_{ij} + \varepsilon_i$
• β is the coefficient image
• Assumption: β is a sparse and piecewise smooth function
• $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
• Goal: Estimation and inference of β(s)
Literature on scalar-on-image regression
• Frequentist approaches: Tibshirani (JRSSB 1996); Tibshirani et al. (JRSSB05); Tibshirani & Taylor (AoS11); Reiss & Ogden (Bcs10); Wang & Zhu (Bka15)
  • No inferential methods available using this approach
• Bayesian approaches: Goldsmith et al. (JCGS14); Li et al. (AoAS15)
  • Stability issues due to using two latent processes to model the coefficient image
  • No smooth transition between the zero areas and non-zero parts
Soft Thresholded Gaussian Process (STGP)
• Bayesian approach: assume $\beta(s) = g_c\{Z(s)\}$
• Z(s) is a latent Gaussian process with zero mean and continuous covariance function
• $g_c$ is a real-valued function with $g_c(z) = \text{sign}(z)(|z| - c)_+$ for some specified threshold c, so that $g_c(z) = 0$ whenever $|z| \leq c$
[Plot: $g_c(z)$ versus z]
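The soft-thresholding map is short enough to write out directly. The sketch below sets values inside the band [-c, c] to exactly zero and shrinks the rest toward zero by c, which is what produces sparse yet continuous coefficients; the function name is illustrative.

```python
import math

def soft_threshold(z, c):
    """g_c(z): zero when |z| <= c, otherwise sign(z) * (|z| - c)."""
    if abs(z) <= c:
        return 0.0
    return math.copysign(abs(z) - c, z)

for z in (-3.0, -1.0, 0.5, 1.0, 2.5):
    print(z, soft_threshold(z, c=1.0))
```

Because the function is continuous at ±c, a smooth latent surface Z(s) yields a coefficient surface that moves gradually from zero to non-zero regions.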
Low-rank spatial model representation
Higdon et al. (1999):
$Z(s) = K_h(s - \tau_1)a_1 + \dots + K_h(s - \tau_L)a_L$
• $K_h$ is a local kernel function with kernel bandwidth h (e.g. tapered Gaussian kernels with bandwidth h)
• The $\tau_l$’s are fixed spatial knots
• Dimension is reduced from p to L, which makes it possible to handle large images
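A small sketch of the kernel convolution shows how L knot coefficients generate a value at each of p locations. It uses a plain (untapered) Gaussian kernel for simplicity, which is an assumption of this example; the function names and the toy grid are likewise illustrative.

```python
import math

def gaussian_kernel(d, h):
    """Untapered Gaussian kernel evaluated at distance d with bandwidth h."""
    return math.exp(-0.5 * (d / h) ** 2)

def low_rank_field(locations, knots, a, h):
    """Z(s) = sum_l K_h(s - tau_l) * a_l at each location s."""
    return [
        sum(gaussian_kernel(math.dist(s, tau), h) * a_l for tau, a_l in zip(knots, a))
        for s in locations
    ]

# p = 25 pixel locations driven by L = 4 knot coefficients
locations = [(float(i), float(j)) for i in range(5) for j in range(5)]
knots = [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0)]
a = [1.5, -0.5, 0.0, 2.0]
z = low_rank_field(locations, knots, a, h=1.0)
print(len(z), round(z[0], 3))
```

Only the L entries of a need to be sampled during model fitting; the p values of Z(s) follow deterministically.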
CAR model
• Use conditionally autoregressive (CAR) prior to account for large-scale spatial dependence in the $a_l$ (Nychka et al., JCGS15)
• The CAR prior can be defined by the full conditional distribution of one site given all others:
$a_l \mid a_k \text{ for all } k \neq l \sim N(\rho\bar{a}_l, \sigma_a^2/m_l)$,
• $\bar{a}_l$ is the mean of a at site l’s $m_l$ neighbors
• $\rho \in (0, 1)$ controls the strength of spatial dependence
• $\sigma_a^2$ controls the variance
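The full conditional can be computed directly from a neighbor list, as sketched below. The adjacency dictionary, the function name, and the toy values are assumptions for illustration.

```python
def car_full_conditional(l, a, neighbors, rho, sigma2_a):
    """Mean and variance of a_l given all other sites under the CAR prior."""
    nbrs = neighbors[l]
    m_l = len(nbrs)                                # number of neighbors of site l
    a_bar = sum(a[k] for k in nbrs) / m_l          # neighbor average
    return rho * a_bar, sigma2_a / m_l

# Three knots on a line: 0 - 1 - 2
neighbors = {0: [1], 1: [0, 2], 2: [1]}
a = [1.0, 0.0, 3.0]
mean, var = car_full_conditional(1, a, neighbors, rho=0.9, sigma2_a=2.0)
print(mean, var)
```

Sites with more neighbors get a smaller conditional variance, so interior knots are smoothed more strongly than knots on the boundary.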
Illustration - Kernel smoothing (a→ Z )
[Two image panels: $a_l$ over (t1, t2) and $Z_j$ over (s1, s2)]
Illustration - Soft thresholding (Z → β)
[Two image panels: $Z_j$ and $\beta_j$ over (s1, s2)]
Illustration - Sparsity
[Two image panels: $\beta_j$ and $I(\beta_j \neq 0)$ over (s1, s2)]
Full Model
The full model can be written:
$Y \mid X, \beta, \sigma^2 \sim N(X\beta, \sigma^2 I_n)$
$\beta(s) = g_c\{Z(s)\}$
$Z(s) = K_h(s - \tau_1)a_1 + \dots + K_h(s - \tau_L)a_L$
$a \sim N_L(0, \sigma_a^2(M - \rho A)^{-1})$
CAR prior: $M = \text{diag}(m_1, \dots, m_L)$ and A is the adjacency matrix with (k, l) element equal to 1 if k ∼ l and zero otherwise
Advantages/novelties
• The proposed method uses a single spatial process to control both sparsity and smoothness
• As a result there is a gradual transition between zero and non-zero regions
• Allows full inference and stable computations
• It allows us to study theoretical properties
• Easily extended to incorporate additional covariates or generalized responses
Theoretical properties
Proof of large support: Assume the true signal $\beta_0(s)$ is (i) piecewise smooth, (ii) sparse, and (iii) continuous. If there exists a latent process Z(s) such that $\beta_0(s) = g_c\{Z(s)\}$, then the STGP β(s) satisfies
$\Pi(\|\beta(s) - \beta_0(s)\|_\infty < \varepsilon) > 0$ for all $\varepsilon > 0$
Posterior consistency: Assume regularity conditions for the design matrix of the $X_i$’s and for the kernel K, and that the true signal $\beta_0$ is as above. Suppose the number of spatial locations p is such that log(p) = o(n). Then as $n \to \infty$, the posterior distribution satisfies
$\Pi[\|\beta(s) - \beta_0(s)\|_\infty < \varepsilon \mid Y, X] \to 1$
Simulation study set-up
• Model: $Y_i \sim \text{Normal}\left(\sum_{j=1}^{p} \beta(s_j)X_{ij}, \sigma^2\right)$
• Images are generated on a 30×30 grid so p = 900
• True signal β(s) is either:
[Two image panels on the 30×30 grid: "Five peaks" and "Triangle"]
Set-up (cont’d)
The covariates are generated at the p locations using either an exponential correlation or a structure shared (“SS”) with β:
A1. $X_i \sim GP(0, \text{Exp}(\rho_X))$; ‘Exp(3)’ or ‘Exp(6)’
A2. $X_i(s_j) = e_i\beta(s_j) + 0.5U_i(s_j)$, with $e_i \sim N(0, \tau^2)$ and $U_i \sim GP(0, \text{Exp}(3))$; ‘SS(2)’ or ‘SS(4)’
[Three image panels over (s1, s2)]
Performance evaluation
• Assess performance using MSE and computing time
• We compare STGP with:
  • Lasso (Tibshirani, JRSSB 1996)
  • Fused lasso (Tibshirani et al., JRSSB05; Tibshirani & Taylor, AoS11)
  • fPCR: smoothing the image $X_i$ first (Xiao et al., JRSSB13) and then doing functional principal component regression
  • Ising: Bayesian scalar-on-image regression (Goldsmith et al., JCGS14)
  • GP: the non-thresholded GP prior
Results: MSE (× 1,000)
Sample size n = 100, standard deviation of conditional response σ = 5. Results based on 100 simulations.
True β     Cov(X)   Fused lasso   fPCR   Ising   GP     STGP
5 peaks    Exp(3)   18.48         3.67   4.44    2.63   1.65
           Exp(6)   2.66          3.33   4.14    2.07   1.93
Triangle   Exp(3)   18.08         1.83   2.75    1.80   0.82
           Exp(6)   4.32          1.63   2.64    1.76   0.88
           SS(2)    70.65         0.98   2.77    3.28   1.40
           SS(4)    71.23         0.34   3.18    3.39   1.81
Results: Time (minutes) when n = 100
True β     Cov(X)   Fused lasso   fPCR   Ising   GP     STGP
5 peaks    Exp(3)   16.77         5.40   27.61   4.81   17.69
Recall EEG data
• $Y_i = 1$ for alcoholics and $Y_i = 0$ otherwise
• $X_i$ is a 60×128 image
[Two image panels: "Mean - Alcoholics" and "Mean - Non-alcoholics"; axes Time and Node]
• Goal: Study EEG correlates of genetic predisposition to alcoholism
Implementation details
• Probit regression: $\text{Prob}(Y_i = 1 \mid X_i, \beta) = \Phi\left[\sum_j X_{ij}\beta(s_j)\right]$
• We use knots in every other column and row, with a different CAR dependence parameter in each direction
• The prior for the threshold c is somewhat informative,
$c \sim \text{Uniform}(1.43, 1.96)$
• This gives about 5-15% inclusion probability
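The 5-15% range can be reproduced with a short calculation, under the assumption (made for this sketch, not stated on the slide) that the latent Z(s) is standard normal at each location a priori. A coefficient is then non-zero when |Z(s)| exceeds c.

```python
from statistics import NormalDist

def inclusion_probability(c):
    """P(|Z| > c) for Z ~ N(0, 1): prior chance that beta(s) is non-zero."""
    return 2.0 * (1.0 - NormalDist().cdf(c))

for c in (1.43, 1.96):
    print(f"c = {c:.2f}: prior inclusion probability = {inclusion_probability(c):.3f}")
```

The endpoints of the uniform prior on c therefore bracket the inclusion probability between roughly 0.05 and 0.15.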
Estimated β(s)
[Two image panels: "Lasso" and "fPCA"; axes Time and Electrode]
Estimated β(s)
[Two image panels: "GP" and "STGP"; axes Time and Electrode]
STGP estimates
[Two panels: Post prob nonzero versus Time; "Post prob non-zero, Time 43"]
STGP estimates
[Two panels: "Posterior mean, Time 43" and "Posterior mean, Time 44"]
ROC from 5-fold CV
[ROC curves: Sensitivity versus Specificity]
• Lasso (AUC = 0.77)
• fPCA (AUC = 0.797)
• GP (AUC = 0.788)
• STGP (AUC = 0.833)
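For readers less familiar with the summary, AUC is the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control. The sketch below computes it from held-out scores by pairwise comparison; the function name and toy data are illustrative and are not the study's cross-validation code.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Held-out predicted probabilities and true labels (made-up values)
scores = [0.9, 0.2, 0.8, 0.1]
labels = [1, 1, 0, 0]
print(auc(scores, labels))
```

In a 5-fold setting the scores from all held-out folds would typically be pooled before this calculation.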
Recall: Health effects of PM
• Let $Y_{it}$ be the number of Medicare CVD hospital admissions on day t in county i
• We analyze the 117 largest US counties and have data from 2000-2008
• Our model is
$\log[E(Y_{it})] = \text{confounders} + \sum_{j=1}^{p} X_{itj}\beta_j(s_i)$, with p = 22,
where $X_{itj}$ is the value of pollutant j and $\beta_j(s_i)$ is its effect.
• We want to find j and s for which $\beta_j(s) \neq 0$
Posterior mean effect by site and pollutant
[Heat map: posterior mean effect by pollutant and region. Regions: NE, Mid Atl, S Atl, ESC, WSC, ENC, WNC, Mtn, Pac. Pollutants: Sulf, Nitr, Sili, El C, Or C, Sodi, Ammo, Alum, Arse, Brom, Calc, Chlo, Chro, Copp, Iron, Lead, Magn, Nick, Pota, Tita, Vana, Zinc]
●●●●●●
[Two panels: posterior EC effects β by region: NE, Mid Atl, S Atl, ESC, WSC, ENC, WNC, Mtn, Pac, Avg]
Discussion
• Soft Thresholded Gaussian Process-based modeling for high-dimensional regression, where the signal is sparse and piecewise smooth
• Single process to control the smoothness and sparsity of the signal has computational advantages and allows us to study theoretical properties
• Low-rank representation of the latent process allows the method to be applicable for high-dimensional predictors
• We have also applied this method in applications with multiple covariates, and responses at each spatial location
THANK YOU !
For comments or questions, please contact us [email protected]
Thanks to NIH, NSF, and EPA for financial support