Environ Ecol Stat
DOI 10.1007/s10651-014-0297-0

Combining principal component analysis with parameter line-searches to improve the efficacy of Metropolis–Hastings MCMC

David A. Kennedy · Vanja Dukic · Greg Dwyer

Received: 13 May 2013 / Revised: 23 April 2014
© Springer Science+Business Media New York 2014

Abstract When Markov chain Monte Carlo (MCMC) algorithms are used with complex mechanistic models, convergence times are often severely compromised by poor mixing rates and a lack of computational power. Methods such as adaptive algorithms have been developed to improve mixing, but these algorithms are typically highly sophisticated, both mathematically and computationally. Here we present a nonadaptive MCMC algorithm, which we term line-search MCMC, that can be used for efficient tuning of proposal distributions in a highly parallel computing environment, but that nevertheless requires minimal skill in parallel computing to implement.

Handling Editor: Pierre Dutilleul.

Electronic supplementary material The online version of this article (doi:10.1007/s10651-014-0297-0) contains supplementary material, which is available to authorized users.

D. A. Kennedy · G. Dwyer (B)
Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA
e-mail: [email protected]

D. A. Kennedy
e-mail: [email protected]

D. A. Kennedy
Center for Infectious Disease Dynamics, Pennsylvania State University, University Park, PA, USA

D. A. Kennedy
Fogarty International Center, National Institutes of Health, Bethesda, MD, USA

V. Dukic
Department of Applied Mathematics, University of Colorado - Boulder, Boulder, CO, USA
e-mail: [email protected]

http://dx.doi.org/10.1007/s10651-014-0297-0

We apply this algorithm to make inferences about dynamical models of the growth of a pathogen (baculovirus) population inside a host (gypsy moth, Lymantria dispar). The appeal of line-search MCMC rests on its ease of implementation and on its potential for efficiency improvements over classical MCMC in a highly parallel setting, which makes it especially useful for ecological models.

Keywords Birth–death model · MCMC · Parameter line-search · Survival-time data · Within-host model

1 Introduction

Advances in computing power have increased the utility of Metropolis–Hastings Markov-chain Monte Carlo (MCMC), making parameter estimation possible even for highly complex models (van den Berg et al. 2006). In many cases, however, either the number of model parameters or the computational cost of calculating the likelihood is still too high for the basic MCMC algorithm to be practically useful. The complexity of the likelihood and the associated posterior density often hampers the statistician’s ability to find efficient proposal densities, and consequently leads to long burn-in periods and long run times. This limitation could be substantially mitigated with well-designed proposal distributions, but designing such distributions is often challenging. Parallel computing resources are becoming increasingly common, as the per-unit costs of computer processors have drastically declined over the last decade and clusters of large numbers of linked processors are now widely available (Fuller and Millet 2011; Geer 2005). The basic MCMC algorithm, however, cannot easily take advantage of these computing resources.

The construction of MCMC algorithms suitable for parallel environments is therefore a major focus of current research in statistical computing (Brockwell 2006; Craiu et al. 2009; Jacob et al. 2011; Rosenthal 2000; Strid 2010; Solonen et al. 2012; Wilkinson 2005; Yan et al. 2007; Miller 2010). Existing parallel MCMC algorithms, however, typically do not use parallel environments for proposal design, and require investigator-guided tuning of the proposal. Notable exceptions are the algorithms of Craiu et al. (2009), Miller (2010), and Solonen et al. (2012). Nevertheless, in practice, implementing and trouble-shooting complex parallel MCMC algorithms can be quite difficult (Wilkinson 2005). Here we present an automated and easily implemented algorithm that makes use of parallel environments to design independent Metropolis–Hastings sampler proposal distributions. As only a single tuning phase is required, our algorithm is expected to reduce the computational cost of running multiple MCMC chains. In an example using a stochastic dynamical model from ecology, we show that the samples from the line-search MCMC satisfy convergence tests after fewer iterations than for the popular adaptive algorithm of Haario et al. (2001). This is important because stochastic, dynamic models are widely used in ecology (Kot 2001), but likelihoods for such models can often only be calculated through computationally costly simulation. Our algorithm provides a step towards automated tuning of MCMC proposals in a way that can be implemented in a highly parallel environment without requiring a sophisticated approach to parallel computing. We therefore argue that our algorithm could be generally useful to the biological and ecological modeling communities.

Our algorithm combines MCMC with an alternative model-fitting algorithm known as parameter line search (hereafter referred to as “line search”). In line search, a single parameter is allowed to vary while all other parameters are held fixed. The target parameter is then set to the value that yields the highest likelihood or posterior density, and the process is repeated until each parameter has been varied a pre-defined number of times (Luenberger and Ye 2008). Although line search is generally less useful than MCMC, it can be easily implemented in parallel. Moreover, as line search is a slope-climbing algorithm, it will often return different local optima for different starting conditions (Press et al. 1992). Within the context of line search, these inconsistent results are hard to interpret, but in MCMC multiple peaks or multiple modes have a clear interpretation. By combining the two algorithms, we attempted to make use of the advantages of each.
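To make the line-search idea concrete, the following is a minimal Python sketch of a single coordinate-wise line search on a grid. It is an illustration only: the function name `line_search`, the grid-based maximization, and the toy target `toy_log_density` are assumptions of this sketch and are not taken from our supplementary code.

```python
def line_search(log_density, start, grids, n_sweeps=3):
    """Coordinate-wise line search (illustrative sketch).

    One parameter at a time is varied over its grid while the others are
    held fixed; the parameter is set to the best value found, and the
    sweep over all parameters is repeated n_sweeps times.
    """
    current = list(start)
    best = log_density(current)
    for _ in range(n_sweeps):
        for i, grid in enumerate(grids):
            for value in grid:
                trial = current[:i] + [value] + current[i + 1:]
                score = log_density(trial)
                if score > best:
                    best, current = score, trial
    return current, best


# Hypothetical toy target with a single peak at (1.0, -2.0).
def toy_log_density(theta):
    return -((theta[0] - 1.0) ** 2) - 0.5 * (theta[1] + 2.0) ** 2


grid = [x / 10.0 for x in range(-50, 51)]  # -5.0, -4.9, ..., 5.0
estimate, score = line_search(toy_log_density, [4.0, 4.0], [grid, grid])
```

Because each call depends only on its own starting point, many such searches can be launched as separate jobs with no communication between them.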

The line-search MCMC algorithm is thus implemented as follows. In the first stage, we implement a large number of line searches to obtain a very rough map of the likelihood surface. Because each line search is independent of the others, this initial step can be easily implemented in a highly parallel environment. We then use the rough map to construct better informed proposal distributions for the Metropolis–Hastings steps of the global MCMC algorithm. To do this, we note first that although it is expected that the large number of line search results will identify regions with relatively high likelihoods, the parameter vectors in those regions will often be highly correlated. In the second stage of the algorithm, we therefore use a principal components analysis (PCA) to transform sets of model parameters from the first stage into uncorrelated linear combinations, or principal components. In the final stage of the algorithm, we use the distribution of principal components as the proposal distributions in the Metropolis–Hastings steps of an MCMC algorithm. Because the algorithm uses line search to construct proposal distributions for MCMC, we refer to it as “line-search MCMC”.

2 Overview: fitting ecological models of population growth to data

In Bayesian inference for biological systems, efforts to improve MCMC efficiency frequently rely on explorations of the posterior density. This can be a non-trivial task in cases for which estimating likelihoods is computationally expensive, as they often are in ecological modeling. Ecological models are often based either on deterministic nonlinear dynamical models, such as coupled nonlinear differential equations that can only be integrated numerically (Braun 1983), or stochastic nonlinear dynamical models, which usually can only be solved via simulation (Karlin and Taylor 1975). Calculating likelihoods is therefore often computationally expensive.

For ecological models, this problem is often exacerbated by correlations between parameter estimates, which can lead to poor mixing in MCMC chains. Such correlations are perhaps best known in conservation biology, when models of density-independent growth are fit to time series of population data. In such cases, a particular population growth rate may be explained either by high fecundity and high mortality or by low fecundity and low mortality (Doak and Morris 2010). Because of the importance of this problem, in demonstrating the usefulness of our algorithm, we first apply it to a linear birth–death model.

Strong correlations between model parameters are also often encountered in modeling interactions between species. In such cases, the data used to fit the model typically provide information about only one species, and so a goal of the model-fitting approach is often to infer the effects of other species (Turchin 2003). For example, infectious disease data typically consist of observations of the fraction of hosts infected, which researchers use to make inferences about models that include interactions between healthy, infected and recovered hosts, as well as rates of transition between these infection categories (Ionides et al. 2006). The problem of highly correlated parameters is then often worse than in conservation biology, because a larger number of parameters are being fit to very limited data. In our second example of the usefulness of our algorithm, we therefore apply our algorithm to a species interaction model.

Both of our examples were originally motivated by our work on the population growth of infectious pathogens inside hosts (Kennedy et al. 2014), but the particular models that we use are effectively general ecological models. For baculovirus pathogens of insects in particular, infection often leads to death (Cory and Myers 2003), and data on the time between infection and death are widely available. Models of the time when a population hits an upper threshold are useful for understanding pathogen population growth because disease symptoms often appear only after an incubation period, during which a pathogen population has grown from a low population size to a high population size (Antia et al. 1994; Armenian and Lilienfeld 1983; Saaty 1961). If the population size reaches zero, the host recovers, whereas if the population size reaches a pre-specified threshold, the host begins to show symptoms at the time the threshold is crossed. We emphasize, however, that the problem of estimating the time when a population reaches an upper threshold is not specific to studies of pathogen growth. To the contrary, the ability to predict when populations hit upper thresholds is of crucial importance in conservation biology for developing species recovery plans (Brigham et al. 2002), and in integrated pest management for assessing when pest populations rise to so-called “economic thresholds”, at which the damage due to the pest outweighs the cost of control (Feng et al. 2010).

By using both a linear birth–death model and a species-interaction model, we allow for a wide range of biological complications. Moreover, an important contrast between the two models is that, for the linear birth–death model, it is possible to derive an expression for the time at which the model population first crosses the threshold in terms of a Bessel function (Shortley 1965), which allows for highly accurate calculation of the likelihood function. For the species interaction model in contrast, we can only generate a distribution of model predictions by repeatedly simulating the model, which in turn means that we can only approximate the likelihood, and getting this approximation is computationally expensive.

We therefore use the linear birth–death model to test our algorithm using both the highly accurate likelihood computation provided by numerical calculation of a Bessel function and an approximate likelihood generated using a stochastic sampling algorithm, to show that the estimated likelihood has only a very slight bias. We then further compared our line-search MCMC algorithm to another MCMC-based algorithm, the “adaptive Metropolis MCMC” or “AM” algorithm (Haario et al. 2001), which is designed to improve MCMC performance by instead iteratively adjusting the proposal variance and covariance as the MCMC routine proceeds. For the case in which the likelihood is estimated using simulations, the AM algorithm yields a posterior that is considerably more biased than that produced by the line-search MCMC algorithm, while additionally producing less efficient mixing.

Given the success of the line-search MCMC algorithm when fitting the linear birth–death model, we then fit a species interaction model to data. In addition to having more than twice as many parameters as the birth–death model (5 vs. 2), for the species interaction model we can only generate a distribution of model outcomes using simulation. We then fit the model to previously published data on the time between infection and death in baculovirus-infected gypsy moth larvae (Kennedy et al. 2014), such that the virus is similar to a prey population in our model, and the immune cells are similar to a predator population, following standard models in this area (King et al. 2009). The success of our algorithm at fitting this more complicated model to data suggests that the algorithm could be of general usefulness in ecological modeling.

3 Linear birth–death model

3.1 Model structure

To explain the processes in the model, it is useful to write the model down in terms of transition probabilities from time $t$ to time $t + \Delta t$, where $\Delta t$ is taken to be small enough that, while possible, observing more than one event in the time interval $\Delta t$ is far less likely than observing no events or a single event.

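$$P(x_{t+\Delta t} = x_t + 1 \mid x_t) = \lambda x_t \Delta t + o(\Delta t), \tag{1}$$

$$P(x_{t+\Delta t} = x_t - 1 \mid x_t) = \mu x_t \Delta t + o(\Delta t), \tag{2}$$

$$P(x_{t+\Delta t} = x_t \mid x_t) = 1 - (\lambda + \mu) x_t \Delta t + o(\Delta t). \tag{3}$$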

Here, $x_t$ is the number of organisms at time $t$, $\lambda$ and $\mu$ are the birth and death rates, and $o(\cdot)$ is the “little-o” Landau notation. The probability of a birth $P(x_{t+\Delta t} = x_t + 1 \mid x_t)$ or a death $P(x_{t+\Delta t} = x_t - 1 \mid x_t)$ increases nearly linearly with either the population size or the size of the time interval $\Delta t$. In the context of infectious pathogens, Shortley (1965) showed that it is convenient to re-parameterize the model according to $\alpha = \lambda - \mu$ and $\gamma = \frac{\lambda \ln(2)}{\lambda - \mu}$. With this new parameterization, $\alpha$ is the net replication rate of the pathogen, and $\gamma$ is the population size (“dose”) of pathogen required to kill 50 % of hosts, often referred to as the 50 % lethal dose or LD50.
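The re-parameterization can be checked numerically with a short Python sketch. The function names are hypothetical. The helper `dose_response` rests on an assumption of this sketch rather than on a result stated here: with a Poisson-distributed dose and a distant upper threshold, each founding lineage is taken to escape extinction with probability $(\lambda - \mu)/\lambda$, which gives a kill probability of $1 - 2^{-D/\gamma}$ at mean dose $D$.

```python
import math


def to_alpha_gamma(lam, mu):
    """Map birth and death rates (lam > mu > 0) to (alpha, gamma)."""
    alpha = lam - mu
    gamma = lam * math.log(2) / (lam - mu)
    return alpha, gamma


def to_lambda_mu(alpha, gamma):
    """Invert the mapping: lam = alpha * gamma / ln(2), mu = lam - alpha."""
    lam = alpha * gamma / math.log(2)
    return lam, lam - alpha


def dose_response(dose, gamma):
    """Kill probability at mean dose `dose` (assumption: far upper threshold)."""
    return 1.0 - 2.0 ** (-dose / gamma)


alpha, gamma = to_alpha_gamma(0.60, 0.55)
lam, mu = to_lambda_mu(alpha, gamma)
half = dose_response(gamma, gamma)  # a dose equal to gamma kills half the hosts
```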

To mimic the variation in initial pathogen numbers typical of real data, we assume that the initial population size follows a Poisson distribution with a known mean. We also allow for lower and upper threshold population sizes. In the context of conservation biology, the lower threshold corresponds to extinction while the upper threshold may correspond to the cessation of strict legal protection, which in the US means that the organism has been removed from the endangered species list (Brigham et al. 2002). In the context of integrated pest management, the upper threshold corresponds to the so-called “economic threshold”, at which the cost of pest management is less than the damage that the pest causes to the crop (Feng et al. 2010). In the context of baculovirus experiments, the upper threshold corresponds to the death of the insect, while the lower threshold corresponds to complete recovery of the host. The boundary conditions are then:

$$x_0 \sim \mathrm{Poisson}(D), \tag{4}$$

$$x_{t_{\mathrm{kill}}} = N. \tag{5}$$

Here $D$ is the mean initial population size, and $N$ is the upper threshold population size, while the lower threshold is 0. We assume that $N$ is fixed, because in applications it is typically either dictated by conservation considerations, or estimated from other data.

3.2 Fitting routine

We then generated a simulated dataset using the well-known Gillespie algorithm (Doob 1945; Gillespie 1977). Because our initial motivation came from baculovirus experiments, we used particular initial population sizes that correspond, for example, to initial doses in an experiment (Kennedy et al. 2014). More generally, however, we are assuming that the data are generated by a range of initial population sizes, as is often the case in conservation biology (Doak and Morris 2010) and pest management (Bogich and Shea 2008), although in the latter cases the initial population sizes are unlikely to be found at the regular intervals used in experiments. To calculate the model prediction of the time to hit the upper threshold, for each initial population size, we generated 2,000 realized population trajectories. For any trajectory in which the population size did not hit the upper threshold within 25.5 time units, corresponding to days in baculovirus experiments or years in conservation biology, the trajectory was assigned to a separate end bin. At the end of this binning procedure, we had data on the number of outcomes that fell into each of K = 52 bins for each initial population size, with the 52nd bin representing censored observations. We then generated times to hit the upper threshold, which can be calculated to an arbitrary degree of precision using sampling from a Bessel function derived by Shortley (1965), or using simulation. Because the simulations are computationally expensive, we generated model realizations using a sampling-importance-resampling algorithm (see Appendix 1).
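The simulation and binning steps can be illustrated with a plain Gillespie simulation in Python. This sketch does not reproduce the sampling-importance-resampling scheme of Appendix 1. It also makes several assumptions that are not stated in the text: the 51 uncensored bins are taken to have equal width (25.5/51 = 0.5 time units), trajectories that go extinct are counted in the censored bin, and all function names and parameter values are hypothetical.

```python
import math
import random


def sample_poisson(mean, rng):
    """Poisson draw by Knuth's multiplication method (adequate for small means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def hitting_time(lam, mu, x0, upper, t_max, rng):
    """Gillespie simulation of the linear birth-death process.

    Returns the first time the population reaches `upper`, or None if it
    goes extinct or has not reached `upper` by `t_max`.
    """
    x, t = x0, 0.0
    while 0 < x < upper:
        t += rng.expovariate((lam + mu) * x)  # waiting time to the next event
        if t > t_max:
            return None
        x += 1 if rng.random() < lam / (lam + mu) else -1
    return t if x >= upper else None


def bin_counts(lam, mu, dose, upper, n_traj, rng, n_bins=52, t_max=25.5):
    """Count realizations per time bin; the last bin holds censored outcomes."""
    width = t_max / (n_bins - 1)
    counts = [0] * n_bins
    for _ in range(n_traj):
        t = hitting_time(lam, mu, sample_poisson(dose, rng), upper, t_max, rng)
        if t is None:
            counts[-1] += 1
        else:
            counts[min(int(t / width), n_bins - 2)] += 1
    return counts


rng = random.Random(1)
counts = bin_counts(lam=1.0, mu=0.5, dose=5.0, upper=50, n_traj=200, rng=rng)
```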

Because the data are binned into $K = 52$ time bins, and because the trajectories are independent, we can use a multinomial distribution to describe the distribution of outcomes. The simulations can then be used to estimate the parameters $p_i$ of the multinomial density, representing the probability of an observation landing in bin $i$, as sample proportions of realizations falling in bin $i$:

$$p_i = \frac{2 r_i + 1}{K + 2 \sum_{j=1}^{K} r_j}. \tag{6}$$

Here, $p_i$ is the probability that a speed-of-kill observation falls into bin $i$ ($i = 1, \ldots, 52$), $r_i$ is the number of realizations in bin $i$, and $K$ is the number of bins. Note that this equation includes a slight adjustment to ensure that $p_i > 0$ even if no realizations fell into bin $i$. As the number of realizations becomes large, however, the effect of this adjustment becomes small. The likelihood from the multinomial distribution is then:

$$L(\alpha, \gamma \mid d_1, d_2, \ldots, d_K) = \frac{n!}{d_1!\, d_2! \cdots d_K!}\; p_1^{d_1} p_2^{d_2} \cdots p_K^{d_K}. \tag{7}$$

Here $\alpha$ and $\gamma$ are the model parameters, $d_i$ is the number of deaths in bin $i$, and $n$ is the total number of host insects. Note that the multinomial parameters $p_i$ are calculated from the simulation output, which in turn depends on the model parameters $\alpha$ and $\gamma$. The parameters $p_i$ are thus implicit functions of the model parameters, which is why we write the likelihood $L$ as a function of $\alpha$ and $\gamma$.
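Equations (6) and (7) translate directly into a few lines of Python. The sketch below is illustrative: the function names are hypothetical, and the likelihood is evaluated on the log scale with `math.lgamma` to avoid overflow in the factorials, a choice made here rather than one described in the text.

```python
import math


def bin_probabilities(realization_counts):
    """Eq. (6): smoothed sample proportions, strictly positive in every bin."""
    k = len(realization_counts)
    total = sum(realization_counts)
    return [(2 * r + 1) / (k + 2 * total) for r in realization_counts]


def multinomial_log_likelihood(deaths, probs):
    """Logarithm of Eq. (7) for observed deaths per bin and bin probabilities."""
    n = sum(deaths)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(d + 1) for d in deaths)
    return log_coef + sum(d * math.log(p) for d, p in zip(deaths, probs))


# Toy example with three bins; the middle bin received no realizations.
probs = bin_probabilities([3, 0, 1])          # [7/11, 1/11, 3/11]
log_lik = multinomial_log_likelihood([2, 1, 0], probs)
```

The numerators $2 r_i + 1$ sum to $2 \sum_j r_j + K$, which is the denominator of Eq. (6), so the smoothed probabilities sum to one.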

We then use Bayes’ theorem to find the posterior density:

$$P(\alpha, \gamma \mid \text{data}) \propto \pi(\alpha)\, \pi(\gamma)\, L(\alpha, \gamma \mid \text{data}). \tag{8}$$

Here, $P(\alpha, \gamma \mid \text{data})$ is the joint posterior density of the birth–death parameters, $\pi(\alpha)$ and $\pi(\gamma)$ are the respective prior densities of the two parameters, and $L(\alpha, \gamma \mid \text{data})$ is the joint likelihood of the parameters given the data from Eq. (7).

We then calculated the posterior density given in Eq. (8) using the line-search MCMC algorithm. For the initial line search, which is effectively the “tuning” stage, we first selected a wide range of biologically plausible values for each model parameter, based on our knowledge of baculovirus biology. In practice, it could be useful to set the range of the line searches using the model priors.

We then started a large number of line searches, each from a randomly chosen initial parameter set from the range used in the line searches. Within each line search, we varied each parameter in turn: we calculated the maximum conditional likelihood across that parameter’s range near its old value, while temporarily keeping all other parameters fixed (note that we could have used the maximum posterior without any meaningful change in the algorithm). Likelihoods were computed based on 300 realizations until each parameter had been varied 100 times, after which a final estimate of the likelihood was based on $10^4$ realizations. Crucially, this procedure can be easily implemented in parallel, such that in practice we ran 1,582 simultaneous line searches on a computing cluster that contained 296 processors. Moreover, an important feature of this step is that each realization was run on a separate processor, obviating the need for a message-passing interface to allow communication between processors. The code for the line search step therefore did not require advanced programming techniques, and so it can be understood quite easily (all of our code is available as supplementary material).

Because many line searches failed to find parameter sets of high likelihood, in designing a proposal to approximate the posterior, we used only the parameter sets with the 50 highest likelihoods. As is typically the case for ecological models, the resulting parameter vectors were highly correlated. In the second step in our algorithm, we therefore carried out PCA (Jolliffe 1986) on the 50 best line-search vector results, after first log-transforming parameter values to reduce non-normality. The PCA routine was implemented in the R statistical language (Development Core Team 2009) using the function ‘prcomp’, and using parameter values centered at 0, and variances scaled to 1. The means and variances calculated from the observed principal components then provided the means and variances for the univariate-normal proposals that we used as proposal distributions for the Metropolis–Hastings algorithm.
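For the two-parameter birth–death model, the PCA step has a closed form that can be written in a few lines of Python. This sketch is not the ‘prcomp’ call itself. It relies on the fact that two centered, unit-variance variables with sample correlation $r$ have correlation matrix $[[1, r], [r, 1]]$, with eigenvectors $(1, 1)/\sqrt{2}$ and $(1, -1)/\sqrt{2}$. The function names and the five example vectors are hypothetical, and models with more parameters would need a general eigen-decomposition.

```python
import math
import statistics


def pca_two_params(vectors):
    """PCA of log-transformed, centered, unit-variance parameter pairs."""
    logs = [(math.log(a), math.log(g)) for a, g in vectors]
    cols = list(zip(*logs))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    z = [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in logs]
    r = sum(a * b for a, b in z) / (len(z) - 1)   # sample correlation
    sign = 1.0 if r >= 0 else -1.0                # PC1 takes the larger variance
    s = math.sqrt(0.5)
    rotation = [(s, sign * s), (s, -sign * s)]    # rows: loadings of PC1, PC2
    scores = [[sum(w * x for w, x in zip(load, row)) for load in rotation]
              for row in z]
    return means, sds, rotation, scores


def proposal_from_scores(scores):
    """Mean and standard deviation of each component, one normal per component."""
    cols = list(zip(*scores))
    return [(statistics.mean(c), statistics.stdev(c)) for c in cols]


# Hypothetical (alpha, gamma) pairs standing in for the best line-search results.
best = [(0.0490, 7.3), (0.0500, 7.0), (0.0510, 6.8), (0.0495, 7.1), (0.0505, 6.9)]
means, sds, rotation, scores = pca_two_params(best)
proposal = proposal_from_scores(scores)
```

In this construction the two component variances are $1 + |r|$ and $1 - |r|$, so a strong correlation between the parameters produces one wide and one narrow proposal.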

The line-search MCMC algorithm is thus based on the idea that a posterior distribution can be usefully explored by a large number of line searches. This may seem counterintuitive, because line search is a slope-climbing algorithm, and so for cases in which likelihoods can be calculated exactly or at least with low computational expense, all line searches should end up at features such as local optima, ridges, or saddles. In such cases, the parameter sets are unlikely to provide a useful approximation to the posterior distribution, because almost all searches will find the same few features. Nevertheless, the heuristic basis of our argument, and at least part of the reason for the success of the line-search MCMC algorithm in ecological applications, is that cases for which likelihoods can be calculated with low computational expense are likely to be rare, especially in the complex models that are the focus of most current research. For such systems, likelihoods can generally only be calculated through simulation (Beaumont et al. 2002; Ionides et al. 2006), and the inevitable error associated with the likelihood-evaluation process is expected to yield line searches that end at good, but not necessarily optimal, parameter values. The line-search MCMC algorithm exploits this error in the likelihood calculation, by using the not-quite-optimal parameter sets to approximate the surface topology near features of interest. As we show in Fig. 1, the output from our line searches does indeed provide a reasonable approximation to the joint posterior distribution of the model parameters.

It is important to note that our application of PCA differs from its traditional application in biology, in which PCA is used to reduce the dimensionality of complex data sets. In traditional applications, such a reduction is achieved by retaining only the first few principal components, reducing dimensionality at the cost of information loss. In our approach in contrast, we retain all of the principal components, thereby avoiding information loss while still being able to run our MCMC chains in uncorrelated parameter space. Nevertheless, the model of course does not use the principal components as parameters, and so at each proposed jump we also back-transformed the principal components into the original mechanistic model parameters, which we then used to compute the prior and the likelihood for each parameter set. Back-transforming our parameters before calculating the prior and the likelihood allowed us to avoid what would otherwise be rather complicated transformations of the prior distributions and the likelihood distribution (Gilks and Roberts 1996). Back-transformation is also necessary to permit biological interpretation of the parameters. We note that back transformation can be computationally costly if a model has many parameters, but most ecological models in practice have relatively few parameters (Bolker 2008).
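The back-transformation and the Metropolis–Hastings step in principal-component space can be sketched in Python as follows. The sketch assumes an independence sampler, in which each proposal is drawn from fixed univariate normals regardless of the current state. The function names, the centering and scaling constants, and the toy target are hypothetical. The caller supplies `log_target` as an unnormalized log density over principal-component coordinates; how the priors are expressed under the log transform is left to that function and is not addressed here.

```python
import math
import random


def back_transform(pc, means, sds, rotation):
    """Principal components -> model parameters: rotate, unscale, uncenter, exp."""
    z = [sum(rotation[k][j] * pc[k] for k in range(len(pc)))
         for j in range(len(means))]
    return [math.exp(m + s * v) for m, s, v in zip(means, sds, z)]


def log_normal_pdf(x, mean, sd):
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd) - 0.5 * math.log(2 * math.pi)


def independence_mh(log_target, proposal, n_steps, rng):
    """Independence Metropolis-Hastings with one normal proposal per component."""
    def draw():
        return [rng.gauss(m, s) for m, s in proposal]

    def log_q(pc):
        return sum(log_normal_pdf(x, m, s) for x, (m, s) in zip(pc, proposal))

    current = draw()
    current_w = log_target(current) - log_q(current)  # log target/proposal ratio
    chain, accepted = [], 0
    for _ in range(n_steps):
        candidate = draw()
        candidate_w = log_target(candidate) - log_q(candidate)
        if math.log(1.0 - rng.random()) < candidate_w - current_w:
            current, current_w = candidate, candidate_w
            accepted += 1
        chain.append(current)
    return chain, accepted / n_steps


# Hypothetical centering, scaling, and rotation from a previous PCA step.
means, sds = [math.log(0.05), math.log(7.0)], [0.01, 0.05]
s = math.sqrt(0.5)
rotation = [(s, -s), (s, s)]


def toy_log_target(pc):
    """Toy target: independent normals on the standardized log parameters."""
    theta = back_transform(pc, means, sds, rotation)
    return -0.5 * sum(((math.log(t) - m) / d) ** 2
                      for t, m, d in zip(theta, means, sds))


chain, rate = independence_mh(toy_log_target, [(0.0, 1.5), (0.0, 1.5)],
                              5000, random.Random(42))
```

Because the proposal does not depend on the current state, the acceptance ratio compares the target-to-proposal density ratio at the candidate with that at the current point, which is why both proposal densities appear in the code.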

It is also important to point out that our Metropolis–Hastings MCMC algorithm uses an approximate likelihood, and so it will not converge exactly on the “true” posterior distribution that would be achieved with the exact likelihood, because of uncertainty in the estimate of the likelihood. This uncertainty may cause proposed parameter sets with low true likelihoods to be accepted at inflated rates, and proposed parameter sets with high true likelihoods to be accepted at deflated rates. The performance of any algorithm will be improved with more realizations per likelihood calculation, so the important question is, to what extent does our algorithm bias the posterior distribution? Most stochastic ecological models are so complicated that they can only be simulated (Hartig et al. 2011), but in this case we were able to calculate likelihoods precisely using Shortley (1965)’s Bessel function and approximately using sampling-importance-resampling (Appendix 1), and so we were able to directly test the performance of our algorithm.

Fig. 1 Relationship between joint posterior distribution and line-search outputs for the linear birth–death model. The black contour lines show the joint posterior density, while the red contour lines show the rough map of the posterior surface based on the line search step of our algorithm. The similarity between the two suggests that a surface from the line-search results can provide a reasonable approximation to the posterior distribution features, and that it can therefore inform the construction of proposal distributions (Color figure online). [Contour plot; axes: α and γ]

To finalize the inference, we ran 5 Metropolis–Hastings MCMC chains, using the proposal distribution derived from the line search routine described above. Each chain was run on a single processor, and the 5 chains were run in parallel. Convergence, meant in the traditional sense that diagnostic tests were satisfied, was assessed using the ‘coda’ package (Plummer et al. 2009) of the R statistical language (Development Core Team 2009). First, we used the Cramér–von Mises statistic (Heidelberger and Welch 1983) to test stationarity for any parameter of the MCMC chains, and we found that stationarity could not be rejected. Second, we used the Gelman and Rubin diagnostic (Gelman and Rubin 1992) to test whether the between-chain variances were close to the within-chain variances, which would indicate convergence. For this diagnostic, values of the summary statistic R close to 1 indicate convergence, while values substantially greater than 1 indicate otherwise. We estimated R ≈ 1.01 for all parameters, suggesting that each of the chains had likely converged. Third, we visually compared the density plots of each parameter across the Markov chains, which did not reveal any major differences (Fig. 2).

Fig. 2 Posterior plots for the linear birth–death model. The similarity between the posterior marginal densities from different MCMC chains (overlapping lines) suggests that the line-search MCMC algorithm has converged. [Panels: density against α; density against γ]

Fig. 3 Trace plots for the linear birth–death model. Trace plots for both model parameters suggest that the algorithm is sampling from the stationary distribution. [Panels: α against iterations; γ against iterations]
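The logic of the Gelman and Rubin diagnostic described above can be illustrated with a basic Python version of the statistic. This is a simplified form for a single parameter and equal-length chains. It is an assumption of this sketch that the basic ratio suffices for illustration; the ‘coda’ implementation applies further corrections that are omitted here. The function name and the synthetic chains are hypothetical.

```python
import random
import statistics


def gelman_rubin(chains):
    """Basic potential scale reduction factor for one parameter."""
    n = len(chains[0])
    within = statistics.mean([statistics.variance(c) for c in chains])
    # Sample variance of the chain means, i.e. between-chain variance B over n.
    between_over_n = statistics.variance([statistics.mean(c) for c in chains])
    pooled = (n - 1) / n * within + between_over_n
    return math_sqrt(pooled / within)


def math_sqrt(x):
    return x ** 0.5


rng = random.Random(0)
# Five synthetic chains drawn from the same distribution (well mixed) ...
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(5)]
# ... and five chains stuck around different values (not converged).
stuck = [[rng.gauss(5.0 * j, 1.0) for _ in range(2000)] for j in range(5)]
r_mixed = gelman_rubin(mixed)
r_stuck = gelman_rubin(stuck)
```

When the chains agree, the between-chain term is small and the ratio is close to 1; chains centered on different values inflate the pooled variance and push the ratio well above 1.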

Given these preliminary test results, we next tested for convergence of the combined chains, in two ways. First, we used the half-width test of Heidelberger and Welch (Heidelberger and Welch 1983) to test whether we had sufficient accuracy in our estimate of the mean of each parameter. For a half-width to log-mean ratio of 0.1, this test suggested that convergence was achieved for each parameter. Second, we examined the trace plots from our chains (Fig. 3), which yielded a similar conclusion. Diagnostic tests for convergence were thus satisfied in $10^5$ steps. Passing diagnostic tests is not equivalent to proving convergence (Cowles and Carlin 1996), but nonetheless, achieving apparent convergence or near convergence in this relatively small number of iterations is a direct benefit of proposing parameter jumps in principal component space.

Table 1 Mixing metrics

| Metric | Algorithm | α | γ |
| --- | --- | --- | --- |
| Effective size | LS-MCMC | [3663, 4451] | [3882, 3935] |
| | AM | [1835, 1900] | [1703, 1912] |
| Dependence factor (I) | LS-MCMC | [1.53, 1.73] | [1.57, 1.87] |
| | AM | [3.00, 3.93] | [3.10, 3.86] |
| Half-width | LS-MCMC | [$3.18 \times 10^{-6}$, $4.47 \times 10^{-6}$] | [$4.14 \times 10^{-3}$, $4.75 \times 10^{-3}$] |
| | AM | [$9.02 \times 10^{-6}$, $1.27 \times 10^{-5}$] | [$1.44 \times 10^{-2}$, $1.88 \times 10^{-2}$] |
| Autocorrelation | LS-MCMC | [0.425, 0.450] | [0.421, 0.441] |
| | AM | [0.678, 0.690] | [0.688, 0.709] |

Mixing metrics comparing the line-search MCMC algorithm (LS-MCMC) to the AM algorithm. Bracketed numbers are the lowest and highest values of the metric resulting from 5 MCMC chains for each respective algorithm. For all metrics examined, line-search MCMC outperforms AM.

To emphasize this point, we compared the line-search MCMC algorithm to the commonly used AM algorithm (Haario et al. 2001) by running 5 MCMC chains of each algorithm for $2 \times 10^4$ MCMC steps. The line-search MCMC algorithm outperformed the AM algorithm for every mixing-efficiency metric examined. These metrics included the effective size of each chain, as measured by the number of independent parameter draws in each chain; the dependence factor, as measured by the relatedness between successive points due to autocorrelation; the amount of uncertainty in the estimate of the mean, as measured by the half-width of the confidence interval on the posterior means; and the chain autocorrelation, as measured by the first-order autocorrelation between MCMC steps (Table 1) (Plummer et al. 2009). For this problem, the line-search MCMC algorithm thus seems to result in better mixing than the AM algorithm. This result supports our argument that the output of a large number of parameter line searches can provide a reasonable and useful approximation to the joint posterior surface.
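As a rough guide to how two of these metrics behave, the first-order autocorrelation of a chain can be computed directly in Python. The effective-size formula below is a crude approximation that assumes the chain behaves like a first-order autoregressive process; it is not the estimator used by the ‘coda’ package, and the function names and the synthetic chain are hypothetical.

```python
import random


def lag1_autocorrelation(chain):
    """First-order sample autocorrelation of a scalar chain."""
    n = len(chain)
    mean = sum(chain) / n
    num = sum((chain[i] - mean) * (chain[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in chain)
    return num / den


def ar1_effective_size(chain):
    """Crude effective size, n * (1 - rho) / (1 + rho), under an AR(1) assumption."""
    rho = lag1_autocorrelation(chain)
    return len(chain) * (1.0 - rho) / (1.0 + rho)


# Synthetic AR(1) chain with coefficient 0.6, standing in for MCMC output.
rng = random.Random(3)
ar_chain = [0.0]
for _ in range(19999):
    ar_chain.append(0.6 * ar_chain[-1] + rng.gauss(0.0, 1.0))
rho_hat = lag1_autocorrelation(ar_chain)
ess_hat = ar1_effective_size(ar_chain)
```

Under this approximation, a lower first-order autocorrelation translates directly into a larger effective size for the same number of steps.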

To quantify the bias introduced by the approximate likelihood calculation, we com-pared the posterior distribution produced using the appropriate Bessel function to cal-culate the likelihood, which is thus effectively the true likelihood, to the posteriorproduced by calculating the likelihood using the sampling-importance-resamplingalgorithm. We then compared the bias introduced by the line-search MCMC algo-rithm to that produced by the AM algorithm. Both the line-search MCMC algorithmand the AM algorithm produced estimates of the posterior that are, as expected, flat-tened compared to the true posterior. The line-search MCMC estimate, however, is

123

Environ Ecol Stat

Fig. 4 Comparisons of stationary distributions for line-search MCMC (dotted blue), AM (dashed red), and the true posterior distribution (solid black). Panels show the posterior density of α and of γ. Neither algorithm exactly reproduces the true posterior, because of error in the estimation of the likelihood, but the line-search MCMC algorithm is much less biased (Color figure online)

much closer to the true posterior than is the AM estimate (Fig. 4). This most likely occurred because the proposal distribution for line-search MCMC is more similar to the true posterior distribution than is the proposal distribution for AM.

4 Species interaction model

4.1 Model structure

Given the success of our algorithm as applied to the linear birth–death model, we additionally tested it on a simple model of species interactions, which we fit to data on baculovirus growth inside insect hosts (Kennedy et al. 2014). In within-host pathogen growth, the immune system can be viewed as analogous to a predator, and the pathogen can be viewed as analogous to a prey, leading to nonlinearities that are likely to have strong effects on pathogen population growth (King et al. 2009). For baculoviruses in particular, pathogen death inside a host occurs when virions are bound by immune cells (McNeil et al. 2010; Schmid-Hempel 2005), leading to encapsulation and removal of both the immune cell and the pathogen (Ashida and Brey 1998; Trudeau et al. 2001). More generally, species interaction models are an important tool for understanding ecological data, and we therefore considered one such model, to further demonstrate the usefulness of our algorithm. The transition probabilities in the model are:

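$$ P(x_{t+\Delta t} = x_t + 1,\; y_{t+\Delta t} = y_t \mid x_t, y_t) = \phi x_t \Delta t + o(\Delta t), \tag{9} $$

$$ P(x_{t+\Delta t} = x_t - 1,\; y_{t+\Delta t} = y_t - 1 \mid x_t, y_t) = \beta x_t y_t \Delta t + o(\Delta t), \tag{10} $$

$$ P(x_{t+\Delta t} = x_t,\; y_{t+\Delta t} = y_t \mid x_t, y_t) = 1 - \phi x_t \Delta t - \beta x_t y_t \Delta t + o(\Delta t). \tag{11} $$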

Here x_t is the prey population at time t, y_t is the predator population, φ is the prey reproductive rate, and β is the predation rate. Note that an important complication


is that here predation events lead to the death of the predator, whereas in standard predator-prey models predation events typically lead to predator reproduction (Kot 2001). We note that the details of species interactions in general are usually so complex that any model is necessarily limited in application, and in model fitting in particular a standard recommendation is to simply consider a reasonable nonlinearity (Turchin 2003). We therefore argue that our model provides a reasonable example of a nonlinear species interaction model, with the advantage that it is a realistic description of the dynamics of baculovirus growth inside insects (Kennedy et al. 2014).

For baculoviruses in particular, an important complication is that the process of virus establishment is likely a saturating function of initial virus population size (van Beek et al. 1988), and we model it as such. As in the birth–death model, we again include two absorbing boundaries, one at zero that represents virus extinction, and one at an upper threshold that corresponds to the death of an infected host. We note that for different systems, this upper threshold might instead correspond to the cessation of direct protection of an endangered species, or the economic threshold for control of an agricultural pest. The boundary conditions for the nonlinear model are:

$$ x_0 \sim \mathrm{Poisson}\left(\frac{c_1 D}{c_2 + D}\right), \tag{12} $$

$$ y_0 = m, \tag{13} $$

$$ x_{t_{\mathrm{kill}}} = N. \tag{14} $$

Here, D is the applied virus dose, c_1 is the upper limit for the Poisson parameter describing the number of organisms that initially establish the virus population, c_2 is the half-saturation constant for establishment of the virus and N is again the upper threshold. As with the birth–death model, if x_t = 0, the virus population goes extinct. We have also made the simplifying assumption that the initial population of the immune cells, y_0, is equal to a constant value m for every population trajectory.

4.1.1 Fitting routine

Because there is no expression for the distribution of first passage times across the upper absorbing boundary, we were forced to approximate the likelihood using simulations. Because high accuracy is useful for the MCMC phase of our algorithm but not the line search phase, we used 300 model realizations during the line search phase, and 3000 model realizations during the MCMC phase. Simulating entire trajectories using the Gillespie algorithm (Doob 1945; Gillespie 1977) would be computationally prohibitive. For large enough virus populations, however, population growth in the model is effectively deterministic. We therefore generated realizations using a two-step hybrid algorithm, in which we used the Gillespie algorithm until the number of virus particles x_t reached 10^4, after which we used a deterministic, exponential-growth model to approximate the remaining time until the upper threshold was reached. If we define t′ as the time at which the virus population first reaches 10^4, then the remaining time until death t′′ is:

$$ t'' = \frac{1}{\phi} \log\left(\frac{N}{10^4}\right). \tag{15} $$


Here, φ is the virus reproduction rate, and N is the upper population threshold. The simulated time to death is then the sum of t′ and t′′.
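The two-step hybrid scheme can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the argument `n_upper` (standing in for the threshold N, whose value is not given in this section), and the simple Poisson sampler are hypothetical, and the sketch assumes φ > 0. Birth and predation events follow Eqs. (9) and (10), the initial condition follows Eqs. (12) and (13), and the deterministic remainder follows Eq. (15).

```python
import math
import random


def poisson_draw(lam, rng):
    """Poisson variate via Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_time_to_death(phi, beta, c1, c2, m, dose, n_upper,
                           switch=10**4, rng=None):
    """Return one simulated time to death, or None if the virus goes extinct."""
    rng = rng or random.Random()
    x = poisson_draw(c1 * dose / (c2 + dose), rng)  # Eq. (12)
    y = m                                           # Eq. (13)
    t = 0.0
    # Stochastic phase: Gillespie steps until extinction or the switch point.
    while 0 < x < switch:
        birth = phi * x             # rate of x -> x + 1
        predation = beta * x * y    # rate of (x, y) -> (x - 1, y - 1)
        total = birth + predation
        t += rng.expovariate(total)
        if rng.random() * total < birth:
            x += 1
        else:
            x -= 1
            y -= 1
    if x == 0:
        return None  # lower absorbing boundary: virus extinct, host survives
    # Deterministic phase: exponential growth from `switch` up to n_upper.
    return t + math.log(n_upper / switch) / phi
```

Repeating this call many times for one parameter set gives the simulated times to death from which the time-bin probabilities in the likelihood are estimated.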

Our calculation of the likelihood is again based on the multinomial approximation, as in the case of the linear birth–death model:

$$ L(\beta, \phi, c_1, c_2, m \mid d_1, d_2, \ldots, d_K) = \frac{n!}{d_1!\, d_2! \cdots d_K!}\; p_1^{d_1} p_2^{d_2} \cdots p_K^{d_K}. \tag{16} $$

Here β, φ, c_1, c_2, and m are the parameters in our stochastic species interaction model, d_i is the number of trajectories in time bin i, n is the total number of trajectories, and p_i is the probability that the trajectory falls in time bin i, as calculated from Eq. (6). We then used line search and MCMC as described for the linear birth–death model,

Fig. 5 Parameter values from the top 50 line-search results. Pairwise comparisons of the log10 values for each parameter (β, φ, c1, c2, and m), with histograms of the marginal parameter estimates from the line-search results at the top of each column. The grey lines in the plots are regression lines, with the correlation coefficient for each parameter pair listed in the corner. Note that there are strong log-linear relationships between many of the parameters


Fig. 6 Density plots of principal components PC1 to PC5 (above) and log10 parameters β, φ, c1, c2, and m (below). Density plots are generally similar between MCMC chains, suggesting that the chains are sampling from the same stationary distribution, and can thus be combined together to improve our estimate of the posterior distribution

except that in this case, during the line-search phase, we carried out 4000 line searchesinstead of 1582. The results of our line searches are presented in Fig. 5.
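The multinomial likelihood of Eq. (16) is most safely evaluated on the log scale. The following Python sketch assumes that the bin probabilities p_i have already been estimated from the simulated trajectories (Eq. (6) is not reproduced here); the function name is hypothetical.

```python
import math


def multinomial_log_likelihood(counts, probs):
    """Log of Eq. (16): counts are d_1..d_K, probs are p_1..p_K."""
    n = sum(counts)
    log_lik = math.lgamma(n + 1)             # log n!
    for d, p in zip(counts, probs):
        log_lik -= math.lgamma(d + 1)        # minus log d_i!
        if d > 0:
            if p <= 0.0:
                return float("-inf")         # data fall in a zero-probability bin
            log_lik += d * math.log(p)       # plus d_i * log p_i
    return log_lik
```

Working with logarithms avoids overflow in the factorials and underflow in the products of small probabilities when the number of trajectories is large.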

To emphasize the strength of inference that we can achieve with this type of data, we used vague priors in the MCMC phase of the algorithm, such that the prior for each parameter was uniform on the interval (0, ∞). To show that these improper priors did not restrict our conclusions, we assessed the sensitivity of our posteriors to the priors by re-running our MCMC chains using vague but proper priors. This analysis showed that there was very little effect of changes in the priors (see Appendix 2).

We ran 7 Metropolis–Hastings MCMC chains, such that each chain started from different initial conditions. In practice, each of these chains used message passing to divide the simulations of the model trajectories over 50 computing cores, but we emphasize that this complication is not required. The chains are independent and therefore can be run on separate nodes without any message passing. As with the chains


Fig. 7 Trace plots of principal components PC1 to PC5 (above) and log10 parameters β, φ, c1, c2, and m (below). The chain appears to be mixing well, with only occasional periods of poor mixing

in the linear birth–death model, we used several diagnostics to test for apparent convergence. The Cramer-von-Mises statistic (Heidelberger and Welch 1983) suggested that we could not reject stationarity (p > 0.05), the Gelman and Rubin diagnostic (Gelman and Rubin 1992) showed that between-chain variances were very close to within-chain variances (R < 1.02 for all parameters), visual comparison of the density plots between the Markov chains did not reveal any major differences between chains (Fig. 6), and using a halfwidth to log-mean ratio of 0.1 in the half-width test again confirmed that convergence had occurred for each parameter. Examination of the trace plots (Fig. 7) supports the notion that our chains are mixing, and in comparison to the trace plots from a traditional independence sampler MCMC, mixing was greatly improved. Diagnostic tests for convergence were thus satisfied after only 3.2 × 10^5 steps.

To explore the fit of our model, 10^4 parameter sets were generated from the joint posterior distribution of the parameters, and used to simulate times between infection and death. Visual comparison of the model predictions to the baculovirus data shows that the model fits the data reasonably well (Fig. 9). For the nonlinear model we cannot directly estimate the bias in the posterior, but a comparison of posterior distributions estimated using different numbers of realizations showed that the posterior stopped changing when the number of realizations in the likelihood calculation reached 3 × 10^3 (see Appendix 3).

In Appendix 4 we discuss the implications of our parameters for the biology of baculoviruses. One point, however, is relevant both to baculovirus biology and to the general usefulness of our algorithm in cases with strong correlations between parameter estimates. Specifically, we observed a strong correlation between our estimates of the initial immune cell population size m, and the immune cell attack rate parameter β. While the resulting uncertainty is not unexpected, it limits the inferences that we can make. One way to reduce this uncertainty would be to construct informative priors on these parameters using other data, whereas here we have used only vague priors. We note that line-search MCMC could easily be extended to allow for the case


Fig. 8 Pairwise contour plots of the marginal posterior densities for the log of the model parameters (φ, c1, c2, β, and m). Strong correlations are present for multiple parameters, in particular, m and β

of informative priors, simply by weighting the likelihood by the informative prior density values in the line search step and searching for the maximum posterior rather than the maximum likelihood.

5 Discussion

Here we presented a model fitting algorithm that we expect to be useful for inference with a broad range of ecological models. The algorithm combines parameter line searches, PCA, and MCMC. Because the line searches can be run independently of each other, sophisticated parallel computing methods are not required to run the algorithm on massively parallel computing environments. For the ecological model considered here, despite being nonadaptive and easy to implement, line-search MCMC appeared to have better mixing and to have introduced less bias than the popular AM algorithm of Haario et al. (2001). The appeal of our method rests in its ease of implementation, and in its general usefulness for ecological modeling.
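The way the PCA and MCMC stages fit together can be sketched as follows, assuming the best line-search results are available as rows of an array of (log-transformed) parameter sets. This is a minimal sketch under our own assumptions, not the authors' code: the function names are hypothetical, NumPy is used for the linear algebra, and the proposal is taken to be independent normal distributions along each principal component, with variances given by the eigenvalues of the sample covariance matrix.

```python
import numpy as np


def fit_pca_proposal(line_search_results):
    """Mean, principal axes, and variances of the line-search parameter sets."""
    mean = line_search_results.mean(axis=0)
    cov = np.cov(line_search_results, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # columns are principal axes
    return mean, eigvecs, np.maximum(eigvals, 1e-12)


def log_proposal_density(theta, mean, eigvecs, eigvals):
    """Log density of independent normals in principal-component space."""
    z = eigvecs.T @ (theta - mean)               # PCA transformation
    return -0.5 * np.sum(z**2 / eigvals + np.log(2.0 * np.pi * eigvals))


def independence_mh(log_target, mean, eigvecs, eigvals, n_steps, rng):
    """Metropolis-Hastings with a fixed (nonadaptive) independence proposal."""
    theta = mean.copy()
    log_w = log_target(theta) - log_proposal_density(theta, mean, eigvecs, eigvals)
    chain = np.empty((n_steps, mean.size))
    accepted = 0
    for i in range(n_steps):
        z = rng.normal(0.0, np.sqrt(eigvals))    # draw in PC space
        cand = mean + eigvecs @ z                # back-transformation
        log_w_cand = log_target(cand) - log_proposal_density(cand, mean, eigvecs, eigvals)
        # Independence-sampler ratio: [pi(cand) / q(cand)] / [pi(theta) / q(theta)]
        if np.log(rng.uniform()) < log_w_cand - log_w:
            theta, log_w = cand, log_w_cand
            accepted += 1
        chain[i] = theta
    return chain, accepted / n_steps
```

Because the proposal is fixed before the chain starts, the same fitted proposal can be handed to any number of independent chains, each with its own random number generator.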

Khorsheed et al. (2011) proposed a related adaptive-MCMC algorithm that also uses PCA to carry out an automated transformation of model parameters, but their algorithm only uses PCA on the accepted parameter sets from the beginning of an MCMC chain. The principal components are then used as future proposals, following more traditional adaptive-MCMC algorithms. Such an approach similarly appears to improve mixing and apparent convergence times, but as with many adaptive-MCMC algorithms (Rosenthal 2000), each proposal distribution is tuned within the MCMC chain itself, and so running multiple MCMC chains requires multiple tuning stages. Accordingly, there is no obvious way to adapt the Khorsheed et al. algorithm to highly parallel computing environments that have large numbers of processors. Line-search MCMC, in contrast, tunes the proposal distribution before the MCMC step begins. The result is that the tuning phase can be easily implemented in parallel, and the resulting proposal distribution can be used to run multiple MCMC chains. This feature makes


Fig. 9 The fit of the model to the data. To qualitatively assess the fit of the model to the data, 10^4 parameter sets were generated from the joint posterior distribution of the parameters, and used to simulate speed-of-kill. The median values for each time are plotted in red as a solid line. The gray shaded region is the region between the pointwise 1st percentile and 99th percentile (red dotted lines) at each time. The overlaying black squares are the data collected during the dose-response study. The figure also shows the virus dose received (D) and the number of larvae infected (n) for each panel: D = 13500, n = 115; D = 6750, n = 115; D = 3375, n = 115; D = 844, n = 230; D = 10.4, n = 461. Axes: time to death (hours) against count (Color figure online)

it easy to implement line-search MCMC in parallel environments without requiring message passing.

Our line-search MCMC algorithm nevertheless has an important limitation in that the implementation of the PCA step relies on assumptions of independence and normality for each principal component, resulting in proposal distributions which are unimodal, multivariate normal distributions. Nevertheless, even though the values returned by the PCA step are only truly independent if the original parameters are normally distributed, the parameters will still be less correlated even when non-normal (Jolliffe 1986), and so mixing is still expected to be better for line-search MCMC than


for other MCMC-type algorithms. Moreover, additional transformations to achieve normality could easily be employed in such situations.

Line-search MCMC could also be extended so that the restrictions on the proposal distribution are eliminated, allowing the algorithm to be used efficiently even with models that have highly non-normal posteriors. In particular, the assumption that the posterior is approximately normal could be relaxed by using the mean and variance from the PCA output in some other proposal distribution that approximates the posterior more closely. Likewise, the assumption of a linear relationship between model parameters could be relaxed by using kernel PCA to extract principal components from the line-search results instead of standard PCA (Schölkopf et al. 1998). An independent component analysis might be useful even when the parameter distributions are non-Gaussian (Comon 1994). Lastly, nonlinear multidimensional scaling (NMDS) has recently been suggested as a way to explore tree distance space with MCMC (Chakerian and Holmes 2012). These possible extensions suggest that the line-search algorithm may have broader uses as an automated parallel-environment algorithm for problems other than those we considered. For example, Bayesian methods can be used to find maximum likelihood estimates (Lele et al. 2007, 2010) or to assess parameter identifiability (Ponciano et al. 2012). Nevertheless, PCA transformation and back-transformation can be computationally expensive for models with many parameters, and in such cases other approaches may be more suitable.

Another potential drawback is that a multi-modal posterior could cause line-search MCMC to miss the modes and therefore do a poor job of constructing a proposal distribution, resulting in a very low acceptance rate. For such cases, parallel tempering or other methods of exploring the local geometry of the distribution might be more suitable (Girolami and Calderhead 2011). We note, however, that the use of an independence MCMC sampler is not required to take full advantage of the line search part of our algorithm. That is, one could instead combine line search with a more complex proposal algorithm in the MCMC phase, for example by adding a random walk element to complement the independence sampler. Such extensions, however, would add complexity to the implementation of the algorithm.

Approximate Bayesian computation (ABC) (Csillery et al. 2010) is another viable solution for the problem of complex posterior distributions, especially in its sequential Monte Carlo (SMC) flavor (Liu 2001; Dukic et al. 2012). In many ecological applications, however, ABC introduces new difficulties. First, many applications require model selection, for which ABC is often unsuitable (Robert et al. 2011). Second, ABC typically requires the simulation of an entire dataset for each proposed parameter vector, which often requires non-trivial computational resources.

An additional challenge of the line-search MCMC algorithm is that the user must specify the range of parameter values for the parameter line search step, which may require some trial and error. When informative priors are being used, these priors could be used to guide the parameter range of the line searches. Nevertheless, in some cases, line search may produce poorly designed proposal distributions, and we therefore suggest three straightforward diagnostics to assess if this might have happened. First, histograms of the line-search results could be used to examine if the best parameter sets tend to be on the edge of a parameter’s range. This would indicate that the parameter range in the line searches should be adjusted. These histograms could also be used to explore whether the posterior might be multimodal, which could lead to poor mixing. Second, non-independence of the principal components due to a non-normal posterior can be detected by plotting pairwise combinations of the PCA-transformed line-search results. Third, as with any MCMC-based algorithm, poorly designed proposals will lead to acceptance rates far below the optimal rate of 0.234 (Roberts et al. 1997), and thus poor mixing. A failure on any of these tests, however, does not necessarily mean that line-search MCMC will be ineffective. Indeed, the principal components of the nonlinear model presented here were at least weakly correlated, but the algorithm was still useful.

Acknowledgments DAK was supported by an ARCS fellowship, a GAANN training grant while at the University of Chicago, and the RAPIDD program of the Science and Technology Directorate, Department of Homeland Security and Fogarty International Center, National Institutes of Health (NIH). GD and VD were supported by NIH Grant R01GM096655. VD was also supported by Grants NSF-DEB 1316334 and NSF-GEO 1211668. We thank two anonymous reviewers for comments that substantially improved the manuscript.

Appendix 1: Sampling–importance–resampling

Directly simulating many realizations of a birth–death process is computationally expensive. To avoid this cost for the linear birth–death model, we instead sample directly from the distribution of first passage times, using a sampling-importance-resampling algorithm. This method is possible because the function that describes the first passage time for a linear birth–death model can be evaluated point-wise (Shortley 1965).

We began our algorithm, for a given parameter set, by first numerically integrating the first-passage-time function in the C programming language, using the ‘gsl_integration_qag’ function from the GNU Scientific Library, over the range [0, 612], matching the range of observation times in our experiment. Because the linear birth–death model has an absorbing boundary if the population size hits zero, not every trajectory will cross the upper threshold that leads to host death, and so the integral of this function will be in the range [0, 1], with the integral value, p_d, corresponding to the probability of host death occurring by hour 612. For a given number of model trajectories ν, the number of host deaths is then a number drawn from a binomial distribution with parameters ν and p_d.

To generate these first passage times from our target distribution we used a sampling-importance-resampling algorithm. First, we generated 10^4 potential first passage times, u_i, from a uniform distribution on the interval [0, 612]. This interval was chosen so that these points would span the range of our data. Second, we calculated weights W(u_i) for each of these time points, using the density function for first passage time Q_d(·) proposed by Shortley (1965). Weights were thus calculated as

$$ W(u_i) = \frac{Q_d(u_i)}{\operatorname{argmax}_{u_i}\left(Q_d(u_i)\right)}. \tag{17} $$


Third, we generated the first passage times from our target distribution by resampling u_i according to the respective weights W(u_i).
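The three steps above, together with the binomial draw for the number of host deaths, can be sketched in Python. Several pieces are assumptions made only for illustration: `toy_density` is a hypothetical stand-in for the Shortley (1965) density Q_d, which is not reproduced here; a trapezoid rule replaces the GSL integration routine; and the weights are normalized by the largest density value among the candidates, which is our reading of Eq. (17). Because resampling depends only on relative weights, the normalizing constant does not affect the result.

```python
import math
import random

T_MAX = 612.0  # hours; upper end of the observation window


def toy_density(t):
    """Hypothetical stand-in for Q_d: a gamma-shaped curve with total mass 0.8."""
    shape, scale, mass = 4, 30.0, 0.8
    if t <= 0.0:
        return 0.0
    return mass * t ** (shape - 1) * math.exp(-t / scale) / (
        math.factorial(shape - 1) * scale ** shape)


def trapezoid(f, a, b, n=2000):
    """Composite trapezoid rule for the integral of f over [a, b]."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, n)) + 0.5 * f(b))


def sample_first_passage_times(density, nu, n_candidates=10**4, rng=None):
    """Draw times to death for nu trajectories by sampling-importance-resampling."""
    rng = rng or random.Random()
    p_d = trapezoid(density, 0.0, T_MAX)                   # P(death by T_MAX)
    n_deaths = sum(rng.random() < p_d for _ in range(nu))  # Binomial(nu, p_d)
    # Step 1: uniform candidate times spanning the data range.
    candidates = [rng.uniform(0.0, T_MAX) for _ in range(n_candidates)]
    # Step 2: importance weights, scaled by the largest density value.
    q = [density(u) for u in candidates]
    q_max = max(q)
    weights = [v / q_max for v in q]
    # Step 3: resample candidates in proportion to their weights.
    return rng.choices(candidates, weights=weights, k=n_deaths)
```

Trajectories that do not appear in the returned list correspond to hosts that survive to the end of the observation window.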

Appendix 2: Sensitivity to priors

In the main text of the paper, we ran our MCMC routine using improper priors, which can sometimes lead to improper posterior distributions. We believe that an improper posterior is unlikely in our case, because each of our multiple MCMC chains seems to have converged on the same stationary distribution. As an additional test, however, we further examined the model behavior under different sets of priors. To do this, we

Fig. 10 Sensitivity of posterior to priors. Posterior distributions for log10 values of each parameter (β, φ, c1, c2, and m) are shown above. The black line shows the posterior distribution using the improper priors from the main text. The red line shows the posterior distribution when using the vague proper priors in Eq. (18). The blue line shows the posterior distribution when using the parameter-specific variances given in the text. We achieve very similar posterior distributions using each set of priors (Color figure online)


re-ran our analysis using half-normal priors that are vague but proper. Thus for each parameter

$$ \pi(\theta) = \begin{cases} \dfrac{2}{\sqrt{2\pi \times 10^{14}}}\; e^{-\frac{\theta^2}{2 \times 10^{14}}} & \text{if } \theta \ge 0; \\ 0 & \text{otherwise,} \end{cases} \tag{18} $$

where θ is defined as β, φ, c_1, c_2, or m.

We also re-ran this analysis using a set of half-normal priors similar to the above, but with parameter-specific variances for the priors of β, φ, c_1, c_2, and m of 100, 100, 10^4, 10^6, and 10^14 respectively.

From this analysis, we observe that the resulting marginal posteriors are very similar to those achieved in our earlier analysis using improper priors (Fig. 10), providing additional evidence that the data are informative about the model parameters, and that the results are robust to the choice of priors.
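For readers who wish to repeat this sensitivity check, the half-normal prior of Eq. (18) can be written as a log density and added to the log-likelihood inside the Metropolis–Hastings acceptance step. The following Python sketch uses a hypothetical function name; the default variance of 10^14 is the value in Eq. (18).

```python
import math


def half_normal_log_prior(theta, variance=1e14):
    """Log of the half-normal density in Eq. (18); -inf for negative theta."""
    if theta < 0.0:
        return float("-inf")
    return (math.log(2.0)
            - 0.5 * math.log(2.0 * math.pi * variance)
            - theta ** 2 / (2.0 * variance))
```

With a variance this large, the log prior is nearly constant over the range of parameter values supported by the data, which is consistent with the small effect of the priors seen in Fig. 10.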

Appendix 3: Bias in posterior estimates

Although the MCMC routine used in this paper appears to converge to a stationary distribution, the distribution is not exactly equal to the posterior distribution. Proposed parameter sets can be accepted at an inflated rate, because of uncertainty in our estimate of the likelihood, and our MCMC chains tend to over-accept proposed jumps. The result of this over-acceptance is a stationary distribution that is biased towards the proposal distribution.

The uncertainty in our estimate of the likelihood depends on the number of realizations used to parameterize Eq. (6), and so an obvious way to eliminate this bias would be to increase the number of realizations. Our precision is thus directly related to computing time, and so in the face of limited computing resources, we are forced to allow for at least some bias. At the number of realizations we used (3 × 10^3), however, the bias in our realized posterior distribution is minimal. To show this, we re-ran our analyses for a range of numbers of realizations. As Fig. 11 demonstrates, increasing the number of realizations at first leads to dramatic changes in the posterior, but as we approach 3 × 10^3 realizations, further increases have essentially no effect. This suggests that our stationary distribution is probably close to the true posterior distribution.

Appendix 4: Implications of the results for the nonlinear dynamical model

Our estimates of the parameters of the nonlinear model are listed in Table 2, but here we place these estimates in the context of baculovirus biology. First, our doubling-time estimate of 3.04 h is similar to a doubling-time estimate of 2.53 h for the cabbage looper Trichoplusia ni calculated using DNA-DNA hybridization (Beek et al. 1990). We did not necessarily expect close agreement between these estimates because the


Fig. 11 Bias in the realized posterior distribution. Posterior distributions for log10 values of each parameter (β, φ, c1, c2, and m) are shown above. Each line shows the results when using a particular number of realizations to estimate the likelihood (10, 300, 1000, 2000, 2500, or 3000), with the solid black line showing the posterior distribution at the number of realizations used in the paper. By comparing the change in the posterior distribution at different numbers of realizations, we get a sense of how bias is affecting our results. While bias seems to be important at low numbers of realizations, at higher numbers the distributions seem to stabilize, suggesting that little bias remains at 3 × 10^3 realizations

Table 2 Parameter estimates. The lower, median, and upper bounds on the credible intervals of the parameters

Parameter   2.5 %           50 %            97.5 %
β           2.07 × 10^−7    2.86 × 10^−6    2.79 × 10^−5
φ           0.1517          0.2279          0.3884
c1          20.01369        35.04416        64.64295
c2          591.2821        1,085.797       2,078.756
m           5,043           72,920          1,636,990

two insects and their associated baculoviruses are not closely related, but the rough similarity suggests that our estimate is biologically reasonable.
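As a check on the doubling-time figure, and assuming that it is computed from the posterior median of φ in Table 2 under pure exponential growth (the text does not state the formula, so this is our assumption), with φ in units of per hour:

```latex
t_{\mathrm{double}} = \frac{\ln 2}{\phi} \approx \frac{0.6931}{0.2279\ \mathrm{h}^{-1}} \approx 3.04\ \mathrm{h}
```

This agrees with the 3.04 h estimate quoted above.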


Second, our estimate of the half-saturation constant of the virus-dose function is c_2 ≈ 10^3, which is much lower than the 2 × 10^9 virus particles that are produced by a virus-killed, fourth-instar gypsy-moth cadaver (Shapiro et al. 1986). It thus appears that virus doses in nature are nearly saturated, so that small changes in dose have little effect on host times of death. This is surprising because virus strains could presumably kill faster if they produced fewer virus particles. We would therefore expect that natural selection would favor virus strains with shorter speeds of kill, because the cost of producing fewer virus particles appears to be very low. In nature, however, the virus is rapidly rendered inactive by ultraviolet light (Fuller et al. 2012), and so consumed doses of infectious virus may often be quite small. The slow speed of kill of this virus may therefore be an adaptation to high virus-inactivation rates, because slow-killing virus strains produce large numbers of particles that help to reduce the risk that all particles will be inactivated (Shapiro et al. 2002). Our estimate of c_2 therefore suggests that selective forces acting within hosts may oppose selective forces acting between hosts, as has often been suggested by mathematical theories of pathogen evolution (Antia et al. 1994; Gilchrist and Sasaki 2002).

Our best estimate of the largest average number of virus particles that could initiate an infection is c_1 ≈ 35. Given that the highest virus dose used was 1.35 × 10^4 particles, our estimate of c_1 suggests that the vast majority of consumed virus particles play no role in infection, even though larvae almost certainly have many more than 35 midgut epithelial cells (Baldwin and Hakim 1991). This observation can be explained by cell sloughing, in which cells of the larval midgut are removed and subsequently replaced by new cells (Baldwin and Hakim 1991). Our estimate of c_1 thus supports previous research suggesting that cell sloughing is an important line of defense against baculovirus infection (McNeil et al. 2010; Hoover et al. 2000). Our estimate of c_1 also implies that severe population bottlenecks occur at the beginning of each new infection, in turn suggesting that genetic drift may be an important evolutionary force shaping the virus population.

Our estimate of the number of immune cells in a healthy larva is m = 7 × 10^4. Examination of the posterior distribution revealed that this estimate is actually highly uncertain, because of a strong, negative log-linear correlation with the immune-cell attack rate β (Fig. 8). The strong correlation between these two parameters might be expected given that in the deterministic version of the model, these two parameters are individually non-identifiable.

Mechanistic models of within-host pathogen growth have a long history (Antia et al. 1994; Alizon and Baalen 2008; Shortley 1965), but few of these models have been challenged with data, because of the computational difficulties associated with fitting nonlinear, dynamic models. Although fitting static or deterministic models to response data has provided useful insights into the infection process of some pathogens (Meynell 1957), including baculoviruses (Beek et al. 2000; Zwart et al. 2009), a growing literature strongly suggests that within-host pathogen population growth is stochastic (Grant et al. 2008; Kennedy et al. 2014; Vaughan et al. 2012). Incorporating this stochasticity and using the entire distribution of speeds of kill to make inference is superior to basing the inference simply on the mean quantities. Our work therefore demonstrates the usefulness of nonlinear stochastic models in understanding within-host pathogen growth. Moreover, nonlinear dynamic models are becoming increasingly popular in


ecology, highlighting the need for easy-to-implement statistical algorithms suitable for use with such models.

For baculoviruses in particular, survival-time data are widely available, but are usually used only to calibrate parametric phenomenological models such as those based on the Weibull distribution (Mudholkar et al. 1996; Morgan 1992). By instead using speed-of-kill data to fit a more mechanistic model, we have gained useful insights into the underlying biological processes, which in turn has allowed us to make inferences about virus evolution. In particular, our results suggest that genetic drift likely plays an important role in the evolution of the virus, which is important partly because drift may oppose the effects of natural selection (Kimura 1983). The occurrence of drift also has implications for the use of baculoviruses in pest control, because control programs often use only a single strain of virus (Hunter-Fujita et al. 1998). This has led to concerns that virus sprays will reduce natural diversity, and our results suggest that such reductions may be exacerbated by the drift inherent in the infection process.

References

Alizon S, van Baalen M (2008) Acute or chronic? Within-host models with immune dynamics, infection outcome, and parasite evolution. Am Nat 172:E244–E256

Antia R, Levin B, May R (1994) Within-host population-dynamics and the evolution and maintenance of microparasite virulence. Am Nat 144:457–472

Armenian H, Lilienfeld A (1983) Incubation period of disease. Epidemiol Rev 5:1–15

Ashida M, Brey P (1998) Molecular mechanisms of immune responses in insects. Chapman & Hall, London

Baldwin K, Hakim R (1991) Growth and differentiation of the larval midgut epithelium during molting in the moth, Manduca sexta. Tissue Cell 23:411–422

Beaumont M, Zhang W, Balding D (2002) Approximate Bayesian computation in population genetics. Genetics 162:2025–2035

Bogich T, Shea K (2008) A state-dependent model for the optimal management of an invasive metapopulation. Ecol Appl 18:748–761

Bolker B (2008) Ecological models and data in R. Princeton University Press, New Jersey

Braun M (1983) Differential equations and their applications, an introduction to applied mathematics, 3rd edn. Springer, New York

Brigham C, Power A, Hunter A (2002) Evaluating the internal consistency of recovery plans for federally endangered species. Ecol Appl 12:648–654

Brockwell A (2006) Parallel Markov chain Monte Carlo simulation by pre-fetching. J Comput Graph Stat 15:246–261

Chakerian J, Holmes S (2012) Computational tools for evaluating phylogenetic and hierarchical clustering trees. J Comput Graph Stat 21:581–599

Comon P (1994) Independent component analysis, a new concept. Signal Proces 36:287–314

Cory J, Myers J (2003) The ecology and evolution of insect baculoviruses. Annu Rev Ecol Evol Syst 34:239–272

Cowles M, Carlin B (1996) Markov chain Monte Carlo convergence diagnostics: a comparative review. J Am Stat Assoc 91:883–904

Craiu R, Rosenthal J, Yang C (2009) Learn from thy neighbor: parallel-chain and regional adaptive MCMC. J Am Stat Assoc 104:1454–1466

Csillery K, Blum M, Gaggiotti O, Francois O (2010) Approximate Bayesian computation (ABC) in practice. Trends Ecol Evol 25:410–418

Doak DF, Morris WF (2010) Demographic compensation and tipping points in climate-induced range shifts. Nature 467:959–962

Doob J (1945) Markoff chains: denumerable case. Trans Am Math Soc 58:455–473

Dukic V, Lopes H, Polson N (2012) Tracking epidemics with Google Flu trends data and a state-space SEIR model. J Am Stat Assoc 107:1410–1426

Feng H, Gould F, Huang Y, Jiang Y, Wu K (2010) Modeling the population dynamics of cotton bollworm Helicoverpa armigera (Hubner) (Lepidoptera: Noctuidae) over a wide area in northern China. Ecol Model 221:1819–1830

Fuller E, Elderd B, Dwyer G (2012) Pathogen persistence in the environment and insect-baculovirus interactions: disease-density thresholds, epidemic burnout and insect outbreaks. Am Nat 179:E70–E96

Fuller S, Millet L (2011) Computing performance: Game over or next level? IEEE Comput 44:31–38

Geer D (2005) Chip makers turn to multicore processors. IEEE Comput 38:11–13

Gelman A, Rubin D (1992) Inference from iterative simulation using multiple sequences. Stat Sci 7:457–472

Gilchrist M, Sasaki A (2002) Modeling host-parasite coevolution: a nested approach based on mechanistic models. J Theor Biol 218:289–308

Gilks W, Roberts G (1996) Markov chain Monte Carlo in practice, chapter Introducing Markov chain Monte Carlo. Chapman & Hall, London

Gillespie D (1977) Exact stochastic simulation of coupled chemical-reactions. J Phys Chem 81:2340–2361

Girolami M, Calderhead B (2011) Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J R Stat Soc Ser B 73:123–214

Grant A, Restif O, McKinley T, Sheppard M, Maskell D, Mastroeni P (2008) Modelling within-host spatiotemporal dynamics of invasive bacterial disease. PLoS Biol 6:757–770

Haario H, Saksman E, Tamminen J (2001) An adaptive Metropolis algorithm. Bernoulli 7:223–242

Hartig F, Calabrese JM, Reineking B, Wiegand T, Huth A (2011) Statistical inference for stochastic simulation models – theory and application. Ecol Lett 14:816–827

Heidelberger P, Welch P (1983) Simulation run length control in the presence of an initial transient. Oper Res 31:1109–1144

Hoover K, Washburn J, Volkman L (2000) Midgut-based resistance of Heliothis virescens to baculovirus infection mediated by phytochemicals in cotton. J Insect Physiol 46:999–1007

Hunter-Fujita F, Entwistle P, Evans H, Crook N (1998) Insect viruses and pest management. Wiley, Chichester

Ionides E, Breto C, King A (2006) Inference for nonlinear dynamical systems. Proc Natl Acad Sci USA 103:18438–18443

Jacob P, Robert C, Smith M (2011) Using parallel computation to improve independent Metropolis-Hastings based estimation. J Comput Graph Stat 20:616–635

Jolliffe I (1986) Principal component analysis. Springer, New York

Karlin S, Taylor H (1975) A first course in stochastic processes. Academic, New York

Kennedy DA, Dukic V, Dwyer G (2014) The mechanisms determining the within-host population dynamics of an insect pathogen. Am Nat 184:407–423

Khorsheed E, Hurn M, Jennison C (2011) Mapping electron density in the ionosphere: a principal component MCMC algorithm. Comput Stat Data Anal 55:338–352

Kimura M (1983) The neutral theory of molecular evolution. Cambridge University Press, New York

King A, Shrestha S, Harvill E, Bjørnstad O (2009) Evolution of acute infections and the invasion-persistence trade-off. Am Nat 173:446–455

Kot M (2001) Elements of mathematical ecology. Cambridge University Press, Cambridge

Lele S, Dennis B, Lutscher F (2007) Data cloning: easy maximum likelihood estimation for complex ecological models using Bayesian Markov chain Monte Carlo methods. Ecol Lett 10:551–563

Lele S, Nadeem K, Schmuland B (2010) Estimability and likelihood inference for generalized linear mixed models using data cloning. J Am Stat Assoc 105:1617–1625

Liu J (2001) Monte Carlo strategies in scientific computing. Springer, Berlin

Luenberger D, Ye Y (2008) Linear and nonlinear programming, 3rd edn. Springer Science and Business Media, New York

McNeil J, Cox-Foster D, Gardner M, Slavicek J, Thiem S, Hoover K (2010) Pathogenesis of Lymantria dispar multiple nucleopolyhedrovirus (LdMNPV) in L. dispar and mechanisms of developmental resistance. J Gen Virol 91:1590–1600

Meynell G (1957) The applicability of the hypothesis of independent action to fatal infections in mice given Salmonella typhimurium by mouth. J Gen Microbiol 16:396–404

Miller G (2010) Markov chain Monte Carlo calculations allowing parallel processing using a variant of the Metropolis algorithm. Open Numer Methods J 2:12–17

Morgan B (1992) Analysis of quantal response data. Chapman & Hall, London

Mudholkar G, Srivastava D, Kollia G (1996) A generalization of the Weibull distribution with application to the analysis of survival data. J Am Stat Assoc 91:1575–1583

Plummer M, Best N, Cowles K, Vines K (2009) coda: Output analysis and diagnostics for MCMC. R package version 0.13-4

Ponciano J, Burleigh J, Braun E, Taper M (2012) Assessing parameter identifiability in phylogenetic models using data cloning. Syst Biol 61:955–972

Press W, Teukolsky S, Vetterling W, Flannery B (1992) Numerical recipes in C. Cambridge University Press, Cambridge

R Development Core Team (2009) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0

Robert C, Cornuet J, Marin J, Pillai N (2011) Lack of confidence in approximate Bayesian computation model choice. Proc Natl Acad Sci USA 108:15112–15117

Roberts G, Gelman A, Gilks W (1997) Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann Appl Probab 7:110–120

Rosenthal J (2000) Parallel computing and Monte Carlo algorithms. Far East J Theor Stat 4:207–236

Saaty T (1961) Some stochastic-processes with absorbing barriers. J R Stat Soc Ser B Stat Methodol 23:319–334

Schmid-Hempel P (2005) Evolutionary ecology of insect immune defenses. Annu Rev Entomol 50:529–551

Schölkopf B, Smola A, Müller K (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10:1299–1319

Shapiro M, Farrar R Jr, Domek J, Javaid I (2002) Effects of virus concentration and ultraviolet irradiation on the activity of corn earworm and beet armyworm (Lepidoptera: Noctuidae) nucleopolyhedroviruses. J Econ Entomol 95:243–249

Shapiro M, Robertson J, Bell R (1986) Quantitative and qualitative differences in gypsy moth (Lepidoptera: Lymantriidae) nucleopolyhedrosis virus produced in different-aged larvae. J Econ Entomol 79:1174–1177

Shortley G (1965) A stochastic model for distributions of biological response times. Biometrics 21:562–582

Solonen A, Ollinaho P, Laine M, Haario H, Tamminen J, Jarvinen H (2012) Efficient MCMC for climate model parameter estimation: parallel adaptive chains and early rejection. Bayesian Anal 7:715–736

Strid I (2010) Efficient parallelisation of Metropolis-Hastings algorithms using a prefetching approach. Comput Stat Data Anal 54:2814–2835

Trudeau D, Washburn J, Volkman L (2001) Central role of hemocytes in Autographa californica M nucleopolyhedrovirus pathogenesis in Heliothis virescens and Helicoverpa zea. J Virol 75:996–1003

Turchin P (2003) Complex population dynamics: a theoretical/empirical synthesis. Princeton University Press, Princeton

van Beek N, Flore P, Wood H, Hughes P (1990) Rate of increase of Autographa californica nuclear polyhedrosis virus in Trichoplusia ni larvae determined by DNA-DNA hybridization. J Invertebr Pathol 55:85–92

van Beek N, Hughes P, Wood H (2000) Effects of incubation temperature on the dose-survival time relationship of Trichoplusia ni larvae infected with Autographa californica nucleopolyhedrovirus. J Invertebr Pathol 76:185–190

van Beek N, Wood H, Hughes P (1988) Quantitative aspects of nuclear polyhedrosis virus infections in Lepidopterous larvae: the dose-survival time relationship. J Invertebr Pathol 51:58–63

van den Berg S, Beem L, Boomsma D (2006) Fitting genetic models using Markov chain Monte Carlo algorithms with BUGS. Twin Res Hum Genet 9:334–342

Vaughan T, Drummond P, Drummond A (2012) Within-host demographic fluctuations and correlations in early retroviral infection. J Theor Biol 295:86–99

Wilkinson D (2005) Handbook of Parallel computing and statistics, chapter parallel Bayesian computation. Dekker/CRC Press, New York

Yan J, Cowles M, Wang S, Armstrong M (2007) Parallelizing MCMC for Bayesian spatiotemporal geostatistical models. Stat Comput 17:323–335

Zwart M, Hemerik L, Cory J, de Visser J, Bianchi F, Van Oers M, Vlak J, Hoekstra R, Van der Werf W (2009) An experimental test of the independent action hypothesis in virus-insect pathosystems. Proc R Soc Lond Ser B-Biol Sci 276:2233–2242

David A. Kennedy is a disease ecologist at the Center for Infectious Disease Dynamics at The Pennsylvania State University. He is a Research and Policy in Infectious Disease Dynamics (RAPIDD) postdoctoral fellow. He received his M.S. and Ph.D. from the University of Chicago.

Vanja Dukic is an Associate Professor of Applied Mathematics at the University of Colorado at Boulder. She works in Bayesian and computational statistics, mostly in the areas of medicine, ecology, evolution, and risk assessment. She received a PhD in Applied Mathematics from Brown University in 2001. She was a postdoctoral fellow and visiting professor in the Department of Statistics at the University of Chicago from 2000 to 2001. She was an Assistant Professor (2001–2008) and Associate Professor with tenure (2008–2010) of Biostatistics at the University of Chicago.

Greg Dwyer is an Associate Professor in the Ecology and Evolution department at the University of Chicago. He works on the ecology and evolution of infectious diseases, with a focus on how mathematical models can aid in the understanding of disease dynamics. He received his Ph.D. from the University of Washington.
