Model Based Geostatistics Archie Clements University of Queensland School of Population Health.
Model Based Geostatistics
Archie Clements
University of Queensland
School of Population Health
Overview
• Introduction to geostatistics
  – Assumptions
  – Variogram components
  – Variogram models
  – Kriging
  – Assumptions
• Model-based geostatistics
  – Principles
  – Building the model
  – Prediction
  – Validation
• Applications: parasitic disease control in Africa
Spatial variation
[Figure: values Z plotted over spatial coordinates X and Y]
First and second order variation
• First-order variation:
  – Trend
  – Large-scale variation
  – Can be due to large-scale environmental drivers (e.g. temperature for vector-borne diseases)
• Second-order variation:
  – Localised variation: clustering
  – Modelled using geostatistics
Spatial dependence
• Observations close in space are more similar than observations far apart
• The variance of pairs of observations that are close together (small h) tends to be smaller than the variance of pairs far apart (large h)
• Basis of the semivariogram
  – Spatial decomposition of the sample variance
Semivariance: statistical notation
Function of distance (and direction); distance in bins, direction in sectors of compass – “azimuth”
Semivariance is half the average squared difference of values observed at locations separated by a given distance (and direction)
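In symbols, the definition above is commonly written γ(h) = (1 / 2N(h)) Σ [z(s_i) − z(s_j)]², summed over the N(h) pairs of locations whose separation falls in lag bin h. A minimal Python sketch of the omnidirectional case follows. The function name, the binning rule and the example transect are illustrative assumptions, not taken from the slides.

```python
import math

def empirical_semivariogram(coords, values, bin_width, max_lag):
    """Return (bin midpoint, semivariance, number of pairs) for each non-empty distance bin."""
    n_bins = int(math.ceil(max_lag / bin_width))
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            h = math.dist(coords[i], coords[j])
            if h >= max_lag:
                continue  # ignore pairs separated by the maximum lag or more
            b = min(int(h // bin_width), n_bins - 1)
            sums[b] += (values[i] - values[j]) ** 2
            counts[b] += 1
    # Semivariance is half the average squared difference within each bin
    return [((b + 0.5) * bin_width, sums[b] / (2.0 * counts[b]), counts[b])
            for b in range(n_bins) if counts[b] > 0]

# Hypothetical transect of four sampled locations and their observed values
transect = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
observed = [1.0, 2.0, 4.0, 7.0]
semivariogram = empirical_semivariogram(transect, observed, bin_width=1.0, max_lag=3.0)
for lag, gamma, n_pairs in semivariogram:
    print(f"bin midpoint {lag:.1f}: semivariance {gamma:.3f} from {n_pairs} pairs")
```

A directional version would additionally keep only the pairs whose azimuth falls inside the chosen compass sector.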
Modelling spatial correlation: semivariogram
[Figure: semivariogram; x-axis: lag (h); y-axis: semivariance; annotated with the nugget, partial sill and sill]
Nugget
• Random variation (white noise); non-spatial measurement error
• Microvariation (spatial variation at a scale smaller than the smallest bin)
• If no spatial correlation:
  – Nugget = sill (flat semivariogram)
Semivariogram: decisions to be made
• How many/what sized bins?
  – Depends on density of data points
  – For regular-spaced (grid-sampled) data, bin size = size of cells in the grid
  – For irregular sampling, modify according to range of spatial correlation (big range, big bins; small range, small bins)
• What maximum lag (h) to use?
  – Should be estimated up to half the length of the shortest side of study area
• Which parametric model to use?
  – Visual fit
  – Statistical fit
Variogram models
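Two commonly used parametric models are the spherical and the exponential. The sketch below assumes their usual textbook forms with a nugget, a partial sill and a range. The exponential uses the "practical range" convention (the curve reaches about 95% of the partial sill at the range), which is only one of several conventions in use. The sum-of-squares helper illustrates one simple statistical fit criterion; the lags, semivariances and parameter values are hypothetical.

```python
import math

def spherical(h, nugget, partial_sill, range_):
    """Spherical model: rises to nugget + partial_sill exactly at the range."""
    if h <= 0:
        return 0.0
    if h >= range_:
        return nugget + partial_sill
    r = h / range_
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)

def exponential(h, nugget, partial_sill, range_):
    """Exponential model: approaches the sill asymptotically (practical-range form)."""
    if h <= 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-3.0 * h / range_))

def sum_squared_error(model, lags, semivariances, nugget, partial_sill, range_):
    """Unweighted squared misfit between a model and an empirical semivariogram."""
    return sum((model(h, nugget, partial_sill, range_) - g) ** 2
               for h, g in zip(lags, semivariances))

# Hypothetical empirical semivariogram (lag in km, semivariance)
lags = [5.0, 10.0, 20.0, 40.0, 60.0]
semivariances = [0.045, 0.052, 0.061, 0.069, 0.070]
for model in (spherical, exponential):
    sse = sum_squared_error(model, lags, semivariances, 0.04, 0.03, 40.0)
    print(model.__name__, sse)
```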
Schistosoma mansoni, Uganda
Omnidirectional semivariograms
Anisotropy
• Spatial dependence is different in different directions
  – Semivariogram calculated in one direction is different from semivariogram calculated in another direction
  – Should check for anisotropy and, if present, accommodate it in interpolation
  – Range or sill (or both) can differ
Schistosoma mansoni, Uganda: directional semivariograms
Direction          Range (km)   Sill   Nugget
Omni-directional   43.4         7E-2   4E-2
0°                 39.4         1E-1   -3E-3
45°                43.6         7E-2   2E-2
90°                35.8         8E-2   3E-2
135°               39.5         1E-1   2E-2
Schistosoma haematobium, Northwestern Tanzania
Direction          Range (km)   Sill   Nugget
Omni-directional   36.0         5E-2   0
0°                 260.1        2E-2   3E-2
45°                163.9        6E-3   3E-2
90°                56.2         5E-2   0
135°               97.7         3E-2   7E-3
Schistosoma haematobium, Northwestern Tanzania
Trended and skewed data
• Data should be de-trended
  – Polynomials (regression on XY coordinates)
  – Generalised linear models (regression on covariates)
  – Generalised additive models (can over-fit)
  – If directional variograms are calculated and the range in one direction is >3 × the range in the perpendicular direction, this is a sign of trend
• If skewed, consider transformation (e.g. log transformation, normal score transformation)
  – Otherwise, extreme values overly influence interpolated map
  – Have to back-transform interpolated values
  – Called “disjunctive Kriging”
Non-stationarity
• Spatial correlation structure cannot be generalised to the whole study area
• Why does it occur?
  – Different factors may operate in different parts of the study area
  – Different ecological zones with different disease epidemiology
• Need to estimate the spatial correlation structure separately in each homogeneous zone
Kriging
\hat{Z}(s_0) = \sum_{i=1}^{N} \lambda_i Z(s_i)
Z(s_i) is the measured value at the ith location
λ_i is the weight attributed to the measured value at the ith location (calculated using the semivariogram)
s_0 is the prediction location
N is the number of measured values
For formulae on how the weights are estimated using the variogram: http://en.wikipedia.org/wiki/Kriging
Prediction standard error/variance gives an indication of precision of the prediction
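As a rough illustration of how the weights λ_i and the prediction variance can be obtained, the sketch below solves the ordinary Kriging system in its covariance form using only the Python standard library. It assumes an exponential covariance model with no nugget. The function names, the covariance parameters and the two example sites are hypothetical and are not taken from the slides.

```python
import math

def exponential_covariance(h, sill, scale):
    """Assumed covariance model C(h) = sill * exp(-h / scale), with no nugget."""
    return sill * math.exp(-h / scale)

def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x

def ordinary_kriging(coords, values, s0, sill, scale):
    """Return (prediction, Kriging variance, weights) at prediction location s0."""
    n = len(coords)
    # Covariances between data locations, bordered by the unbiasedness
    # constraint (weights sum to 1) through a Lagrange multiplier.
    system = [[exponential_covariance(math.dist(coords[i], coords[j]), sill, scale)
               for j in range(n)] + [1.0] for i in range(n)]
    system.append([1.0] * n + [0.0])
    c0 = [exponential_covariance(math.dist(c, s0), sill, scale) for c in coords]
    solution = solve_linear_system(system, c0 + [1.0])
    weights, lagrange = solution[:n], solution[n]
    prediction = sum(w * z for w, z in zip(weights, values))
    variance = sill - sum(w * c for w, c in zip(weights, c0)) - lagrange
    return prediction, variance, weights

# Two hypothetical survey sites; predict midway between them and at a data location
sites = [(0.0, 0.0), (2.0, 0.0)]
prevalence = [0.1, 0.3]
mid_pred, mid_var, mid_weights = ordinary_kriging(sites, prevalence, (1.0, 0.0), 1.0, 1.0)
site_pred, site_var, _ = ordinary_kriging(sites, prevalence, (0.0, 0.0), 1.0, 1.0)
print(mid_pred, mid_var, mid_weights)
```

With no nugget the predictor is exact at sampled locations, so the variance there is zero; the square root of the variance is the prediction standard error.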
Geostatistics summary
• Geostatistics involves 3 steps:
  – Exploratory data analysis
  – Definition of a variogram
  – Using the variogram for interpolation (Kriging)
• Technique applicable for:
  – Point-referenced data
  – Spatially continuous processes:
    • Disease risk
    • Rainfall, elevation, temperature, other climate variables
    • Wildlife, vegetation, geology (mineral deposits)
Bayesian model-based geostatistics
Seminal paper: Diggle, Tawn and Moyeed (1998). Model-based geostatistics. Appl. Stat. 47(3): 299-350
Observed a need for addressing non-Gaussian observational error
Idea is “to embed linear Kriging methodology within a more general distributional framework”
Generalised linear models with an unobserved Gaussian process in the linear predictor
Implemented in a Bayesian framework
Advantages of the Bayesian approach
• Natural framework for incorporation of parameter uncertainty into spatial prediction
  – Can build uncertainty into parameters using priors
    • Non-informative
    • Informative (based on exploratory analysis, additional sources of information)
• Convenient for modelling hierarchical data structures
Bayesian model-based geostatistics
Predictions
• Can predict at specified validation locations (with observed outcomes for comparison)
• Can predict at non-sampled locations, e.g. a prediction grid
• Might be interested in:
  – Outcome
  – Spatial random effect
  – Standard error of predicted outcome
Validation
• Jack-knifing; sampling with replacement
  – Remove one observation, do prediction at that location and store predicted value
  – Repeat for all observations
  – Compare predicted to observed using statistical measures of fit (RMSE) and discriminatory performance (AUC)
  – Not feasible with MBG other than with very small datasets
• Cross-validation; sampling without replacement
  – Set aside a subset for validation (ideally 50%)
  – Use remaining data to “train” model
  – Compare predicted and observed for the validation subset using statistical measures
  – Can then recombine the validation and training subsets for final model build
• External validation: using other prospective or retrospective dataset
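The hold-out comparison described above can be sketched in a few lines of Python. The helper names, the random split and the example numbers are illustrative assumptions. The AUC is computed here as the proportion of (positive, negative) pairs that the prediction ranks correctly, with ties counted as one half.

```python
import math
import random

def split_indices(n, validation_fraction=0.5, seed=0):
    """Randomly split indices 0..n-1 into (training, validation) without replacement."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_validation = int(n * validation_fraction)
    return indices[n_validation:], indices[:n_validation]

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def auc(labels, scores):
    """Probability that a randomly chosen positive scores higher than a randomly chosen negative."""
    positives = [s for l, s in zip(labels, scores) if l]
    negatives = [s for l, s in zip(labels, scores) if not l]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical observed and predicted prevalence at four validation schools
observed = [0.1, 0.6, 0.3, 0.8]
predicted = [0.2, 0.5, 0.4, 0.7]
above_half = [o > 0.5 for o in observed]  # outcome to discriminate: prevalence >50%
training, validation = split_indices(10)
print(rmse(observed, predicted), auc(above_half, predicted))
```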
Model-based geostatistics summary
Model-based geostatistics involves:
1. Visual and exploratory data analysis
2. Variography (to determine if there is second-order spatial variation)
3. Variable selection (for deterministic component)
4. Building model (e.g. in WinBUGS)
5. Model selection (e.g. using DIC)
6. Prediction and validation
Application: Schistosomiasis in Sub-Saharan Africa
Schistosomiasis
779 million people at risk
207 million infected
Most in Africa
Significant illness and mortality
Two main forms in Africa:
Urinary schistosomiasis caused by Schistosoma haematobium
Intestinal schistosomiasis caused by S. mansoni
Life cycle of Schistosoma haematobium
[Figure: life cycle diagram; labelled stages: adult worm in human bladder wall, eggs in urine, miracidia, sporocysts in snail, cercariae released]
Diagnosis of infection
S. haematobium:
Microscopic examination of urine slides: Presence of eggs and egg counts
Macrohaematuria (visible blood)
Microhaematuria (invisible blood) – tested using chemical reagent strips
Blood in urine questionnaire
S. mansoni and soil-transmitted helminths:
microscopic examination of stool samples
School-based control programmes
• School-aged children have highest prevalence (proportion infected) and intensity (severity) of infection
• Education system is convenient for control; central location to access target population
How do we determine which schools should be targeted?
• No surveillance
• Need to do surveys
• World Health Organisation guidelines: treat communities every two years where prevalence in school-age children is >10% and annually where prevalence is >50%
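The prevalence thresholds quoted above translate into a simple decision rule. The sketch below is only an illustration: the function name and the returned labels are hypothetical, and the slide does not say what applies at or below 10% prevalence, so that case is returned as "below treatment threshold".

```python
def treatment_frequency(prevalence):
    """Map prevalence in school-age children (a proportion, 0 to 1) to a treatment frequency."""
    if prevalence > 0.50:
        return "annual"
    if prevalence > 0.10:
        return "every two years"
    return "below treatment threshold"

# Hypothetical school-level prevalence estimates
for school, prevalence in [("School A", 0.62), ("School B", 0.25), ("School C", 0.04)]:
    print(school, treatment_frequency(prevalence))
```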
Field survey: northwest Tanzania
Lake Victoria
153 schools surveyed
60 children per school
What about non-sampled locations? Need to predict (interpolate) values
MBG model for S. haematobium prevalence
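Y_i \sim \mathrm{Binomial}(n_i, p_i)
\mathrm{logit}(p_i) = \alpha_i
\alpha_i = \alpha + \beta_1 \mathrm{rain}_i + \beta_2 \mathrm{LST}_{1i} + \beta_3 \mathrm{LST}_{2i} + \theta_i
f(d_{ij}; \phi, \kappa) = \exp[-(\phi d_{ij})^{\kappa}]
where \theta_i is the spatial random effect at location i and d_{ij} is the distance between locations i and j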
Variable             Coefficient          Odds Ratio
Intercept            1.9 (-2.3 - 10.3)    –
LST >35-39 °C        0.4 (-0.3 - 1.1)     1.5 (0.8 - 2.9)
LST >39 °C           0.3 (-1.5 - 2.2)     1.4 (0.2 - 8.6)
Rainfall >1050 mm    -1.1 (-3.4 - 1.1)    0.3 (3.3 × 10^-2 - 3.1)
κ                    0.9 (0.6 - 1.3)      –
φ                    0.2 (0.1 - 1.0)      –
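Two pieces of arithmetic connect this table to the model. An odds ratio is the exponential of a logit-scale coefficient, so exponentiating the rounded coefficients reproduces the odds-ratio column up to rounding. The sketch also evaluates a powered exponential correlation function, exp[-(φd)^κ], at the posterior means of φ and κ. That functional form is an assumption here, and the slides do not state the distance units, so the distances used are purely illustrative.

```python
import math

# Logit-scale coefficients as reported (rounded) in the table above
coefficients = {"LST >35-39C": 0.4, "LST >39C": 0.3, "Rainfall >1050mm": -1.1}
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

def spatial_correlation(distance, phi, kappa):
    """Assumed powered exponential correlation: exp(-(phi * distance) ** kappa)."""
    return math.exp(-((phi * distance) ** kappa))

for name, odds_ratio in odds_ratios.items():
    print(f"{name}: odds ratio {odds_ratio:.2f}")

# Correlation at a few illustrative distances (units not stated in the slides)
correlations = [spatial_correlation(d, phi=0.2, kappa=0.9) for d in (0.0, 1.0, 5.0, 10.0)]
print(correlations)
```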
S. haematobium model results
Clements et al. TMIH 2006
Uncertainty
Lower bound: 95% PI
Upper bound: 95% PI
Probability that prevalence is >50%
Clements et al. EID 2008
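In a Bayesian analysis, quantities such as the probability that prevalence exceeds 50% and the bounds of a 95% interval are typically summarised from posterior draws at each prediction location. A minimal sketch follows; the function names and the five draws are hypothetical (a real analysis would use thousands of MCMC samples), and the interval uses a simple nearest-rank rule.

```python
import math

def exceedance_probability(draws, threshold):
    """Proportion of posterior draws above the threshold."""
    return sum(1 for p in draws if p > threshold) / len(draws)

def percentile_interval(draws, lower=0.025, upper=0.975):
    """Nearest-rank interval taken from the sorted draws."""
    ordered = sorted(draws)
    last = len(ordered) - 1
    return ordered[int(math.floor(lower * last))], ordered[int(math.ceil(upper * last))]

# Hypothetical posterior draws of predicted prevalence at one location
draws = [0.30, 0.45, 0.55, 0.60, 0.70]
probability_above_half = exceedance_probability(draws, 0.50)
interval = percentile_interval(draws)
print(probability_above_half, interval)
```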
Co-ordinated surveys in 3 contiguous countries
• 418 schools
• >26,000 children
Variable                                       Mean (95% CI)        SD
Sex: Female                                    0.70 (0.65, 0.76)    0.03
Age: 9–10 years                                1.16 (1.00, 1.33)    0.08
Age: 11–12 years                               1.51 (1.31, 1.73)    0.10
Age: 13–16 years                               1.79 (1.53, 2.06)    0.14
Distance to perennial water body               0.34 (0.21, 0.54)    0.08
Land surface temperature                       0.80 (0.51, 1.21)    0.18
Land surface temperature²                      1.10 (0.85, 1.40)    0.14
Rate of decay of spatial correlation           2.03 (1.48, 2.74)    0.32
Variance of the spatial random effect (sill)   7.03 (5.36, 9.31)    1.01
[Figure: map of survey locations in Kenya, Tanzania and Uganda around Lake Victoria and Lake Albert, with a 100-300 km scale bar; legend: large water bodies, country borders, and infection status (no infection, S. mansoni monoinfection, hookworm monoinfection, coinfection)]
[Figure: histogram; x-axis: % infected (0-100); y-axis: percentage of schools (0-50); series: S. mansoni mono-infection, hookworm mono-infection, co-infection]
East Africa: Brooker and Clements, Int. J. Parasitol., in press
S. mansoni mono-infection: 7.9%
Hookworm mono-infection: 40.5%
Co-infection: 8.1%
Other outcomes: co-infection
Model for co-infection
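Y_{ijk} \sim \mathrm{Multinomial}(p_{ijk}, n_{ijk})
f(d_{ij}; \phi_k) = \exp[-(\phi_k d_{ij})]
[Equation: linear predictor on the log scale for p_{ijk} in terms of covariates x; not recoverable from the transcript]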
Posterior mean (95% posterior CI) for each outcome
Variable                  S. mansoni mono-infection   Hookworm mono-infection   S. mansoni/hookworm co-infection
Intercept                 -3.8 (-4.7 - -2.9)          -0.6 (-1.1 - -0.3)        -4.4 (-5.0 - -3.7)
OR: Elevation             0.35 (0.22 - 0.58)          0.77 (0.65 - 0.89)        0.30 (0.20 - 0.47)
OR: DPWB                  0.23 (0.10 - 0.45)          0.94 (0.76 - 1.15)        0.30 (0.18 - 0.58)
OR: Rural vs urban        0.43 (0.21 - 0.79)          0.98 (0.68 - 1.37)        0.61 (0.36 - 1.02)
OR: Ext. rural vs urban   0.62 (0.23 - 1.44)          1.16 (0.82 - 1.81)        0.75 (0.31 - 1.62)
OR: LST                   0.88 (0.62 - 1.25)          0.60 (0.50 - 0.72)        0.57 (0.31 - 0.87)
OR: Female                0.86 (0.76 - 0.96)          0.91 (0.86 - 0.97)        0.70 (0.63 - 0.77)
OR: Age (9-10 years)      1.67 (1.37 - 2.06)          1.17 (1.04 - 1.30)        1.82 (1.52 - 2.21)
OR: Age (11-13 years)     2.44 (2.06 - 2.89)          1.55 (1.39 - 1.71)        2.99 (2.55 - 3.52)
OR: Age (≥14 years)       2.87 (2.19 - 3.71)          1.88 (1.63 - 2.14)        3.83 (3.01 - 4.86)
Phi (rate of decay)       3.52 (1.73 - 7.21)          4.98 (3.38 - 7.33)        3.76 (2.10 - 7.36)
Sill                      6.39 (3.52 - 11.78)         1.31 (0.98 - 1.76)        6.34 (3.98 - 9.95)
Co-infection
[Figure panels: S. mansoni monoinfection; hookworm monoinfection; S. mansoni-hookworm coinfection]
Other outcomes: Intensity of infection
Prevalence is used (currently) for disease control planning
Intensity of infection (eggs/ml urine or /g faeces) is more indicative of:
  – Morbidity (anaemia, urinary tract, hepatic pathology)
  – Transmission
Model for intensity of infection
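Y_{ij} \sim \mathrm{NegBin}(\mu_{ij})
\log(\mu_{ij}) = \alpha_j + \beta_1 \mathrm{girl}_{ij}
\alpha_j = \alpha + \beta_2 \mathrm{elev}_j + \beta_3 \mathrm{dist}_j + \theta_j
f(d_{jk}; \phi) = \exp[-(\phi d_{jk})]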
Intensity of S. mansoni infection, East Africa
Clements et al. Parasitol 2006
Variable Posterior Mean (95% CI)
Intercept 10.06 (5.77 - 13.22)
Female -0.41 (-0.72 - -0.11)
Elevation (m) -0.007 (-0.01 - -0.004)
DPWB (dec deg) -5.36 (-7.51 - -3.30)
Sill 23.96 (19.06 - 32.07)
Range 0.134 (0.09 - 0.20)
Overdispersion 0.06 (0.058 - 0.062)
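Because the intensity model uses a log link, the posterior means in the table combine additively on the log scale and are exponentiated to give an expected intensity. The sketch below works through that arithmetic for a hypothetical school at 1100 m elevation and 0.1 decimal degrees from a perennial water body, with the spatial random effect set to zero. Those three values and the function name are illustrative assumptions.

```python
import math

# Posterior means from the table above
INTERCEPT = 10.06
FEMALE = -0.41
ELEVATION_PER_M = -0.007
DPWB_PER_DEC_DEG = -5.36

def expected_intensity(female, elevation_m, dpwb_dec_deg, spatial_effect=0.0):
    """Expected intensity exp(log mu) for one child, given covariates and a spatial effect."""
    log_mu = (INTERCEPT
              + FEMALE * female
              + ELEVATION_PER_M * elevation_m
              + DPWB_PER_DEC_DEG * dpwb_dec_deg
              + spatial_effect)
    return math.exp(log_mu)

boy = expected_intensity(female=0, elevation_m=1100.0, dpwb_dec_deg=0.1)
girl = expected_intensity(female=1, elevation_m=1100.0, dpwb_dec_deg=0.1)
print(boy, girl, girl / boy)
```

Whatever the other covariates, the ratio of girls' to boys' expected intensity is exp(-0.41), roughly two thirds.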
Conclusions
• In disease control we need an evidence-based framework for deciding where to allocate limited control resources
• Maps are useful tools for highlighting sub-national variation; targeting interventions; advocacy (national and local); integrated control programmes; estimating heterogeneities in disease burden
• Model-based geostatistics enables rich inference from spatial data; uncertainty