STATISTICS IN MEDICINE. Statist. Med. 2005; 24:3361–3381. Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/sim.2193

Tutorial in biostatistics: spline smoothing with linear mixed models

Lyle C. Gurrin1,∗,†, Katrina J. Scurrah2 and Martin L. Hazelton3

1 Epidemiology and Biostatistics Unit, University of Melbourne, Australia
2 Department of Physiology, University of Melbourne, Australia
3 School of Mathematics and Statistics, University of Western Australia, Australia

SUMMARY

The semi-parametric regression achieved via penalized spline smoothing can be expressed in a linear mixed models framework. This allows such models to be fitted using standard mixed models software routines with which many biostatisticians are familiar. Moreover, the analysis of complex correlated data structures that are a hallmark of biostatistics, and which are typically analysed using mixed models, can now incorporate directly smoothing of the relationship between an outcome and covariates. In this paper we provide an introduction to both linear mixed models and penalized spline smoothing, and describe the connection between the two. This is illustrated with three examples, the first using birth data from the U.K., the second relating mammographic density to age in a study of female twin–pairs and the third modelling the relationship between age and bronchial hyperresponsiveness in families. The models are fitted in R (a clone of S-plus) and using Markov chain Monte Carlo (MCMC) implemented in the package WinBUGS. Copyright © 2005 John Wiley & Sons, Ltd.

KEY WORDS: best linear unbiased prediction; genetic variance components; Markov chain Monte Carlo; mixed models; non-parametric regression; semi-parametric regression; penalized splines; twins; WinBUGS

1. INTRODUCTION

Linear mixed models [1] play a fundamental role in the practice of biostatistics since they extend the linear regression model for continuously valued outcomes to allow for correlated responses, and in particular for the analysis of longitudinal or clustered data within a hierarchical structure. Most commercial statistical software packages offer at least a basic facility

∗Correspondence to: Lyle C. Gurrin, Epidemiology and Biostatistics Unit, School of Population Health, University of Melbourne, 2/723 Swanston Street, Carlton, VIC 3053, Australia.

†E-mail: [email protected]

Contract/grant sponsor: National Breast Cancer Foundation
Contract/grant sponsor: National Health and Medical Research Council
Contract/grant sponsor: Merck, Sharp & Dohme Research Foundation
Contract/grant sponsor: Canadian Breast Cancer Research Institute

Received 27 January 2004; Accepted 3 January 2005. Copyright © 2005 John Wiley & Sons, Ltd.


in which to fit mixed models, and many have extensive suites of routines that allow models of arbitrary complexity. By incorporating random effects on the scale of the linear predictor, one can also model correlation between observations where the data are discrete or involve censoring by moving to generalized linear mixed models [1]. As such, linear mixed models are a natural starting point for developing general methods for the analysis of many of the data structures that arise in medical research.

When including continuously valued covariates in a linear mixed model, most biostatisticians take a fully parametric approach derived from linear regression. The mean of the measured outcome is taken to depend on the covariate value in a linear manner, or in a non-linear fashion that one assumes can be captured by a low-order polynomial. There will, however, be circumstances where this approach is at best a crude approximation. A scatter plot of the data may reveal that an apparently strong association between the outcome and covariate nevertheless changes in a complicated way across the full range of the covariate. Alternatively, it may be difficult to discern the form of the relationship, despite a large sample size, due to considerable residual variability in the data. In this case an appropriate parametric model will not be obvious, with low-order polynomial terms providing a poor fit to the data. Even in the case where polynomial terms provide a satisfactory fit, it is unlikely that the true relationship has a simple polynomial form.

Fortunately there is a considerable literature on semi-parametric regression and non-parametric smoothing, where the non-linear association between the outcomes and a covariate is not constrained by the parametric forms described above. Such models seek to describe the local structure of the relationship between outcome and covariate, providing a good fit to the data as we move across the range of the covariate. Such techniques include kernel smoothing, spline smoothing and generalized additive models. Many of the large statistical software packages provide access to a selection of these methods of smoothing. Such implementations are, however, unlikely to be integrated with the conventional regression-based model fitting routines in the same package. Moreover, the use of techniques such as generalized additive models raises issues of statistical inference that are outside the domain of linear regression modelling. For many biostatistical analyses it is unlikely that any attempt will be made to explore complex non-linear relationships involving secondary covariates that potentially confound the relationship between the outcome measure and a primary covariate unless such modelling can be carried out in standard software.

In this paper we seek to elucidate a recent important connection that has been made between semi-parametric regression models that achieve smoothing using penalized splines and linear mixed models. We will demonstrate how the relationship between an outcome and covariate can be modelled non-parametrically within the framework of standard parametric mixed models analysis. Not only does this allow smoothing to be carried out using any of the available software for mixed models, it also allows data analysts to incorporate semi-parametric regression while retaining the benefits of a mixed models approach to the analysis of correlated data structures that are often a feature of biostatistical data analysis.

The paper is arranged as follows. In the next section we provide a review of linear mixed models and introduce necessary notation. Spline smoothing for non-parametric regression is covered in Section 3. We begin with spline smoothing with a single covariate, illustrating the ideas using data on the annual number of live births in the U.K. over the 50 year period from 1865 to 1914. We go on to discuss extensions, including modelling with multiple explanatory variables (generalized additive models), semi-parametric regression models, and models with


correlated data structures. In Section 4, we exemplify these methods through the analysis of data on mammographic density, a known risk factor for breast cancer, measured for twin–pairs. This study focuses on the relationship between mammographic density and age, taking into account the within-pair correlation structure that appears to be due to genetic factors [2]. We find that, cross-sectionally, mammographic density tends to decrease with age after 45 years, but that the relationship is non-linear. A more substantial example is presented in Section 5, where the relationship between age and the log-hazard of bronchial hyperresponsiveness on the dose scale of an inflammatory agent is captured using a spline term in a generalized linear mixed model. In Section 6, we summarize the main points of the paper and mention some extensions to the methods described earlier. Computing code for implementing the second example in R [3] and WinBUGS [4, 5] appears in Appendix A.

2. REVIEW OF LINEAR MIXED MODELS

The standard linear mixed model has the form

y = X\beta + Zu + \varepsilon

where y is a vector of observed responses, and X and Z are design matrices associated with a vector of fixed effects β and a vector of random effects u, respectively. The random effects vector u has zero mean and covariance matrix G, and ε is a vector of residual error terms with zero mean and covariance matrix R. The dimensions of the design matrices X and Z must conform to the length of the observation vector y and the number of fixed and random effects, respectively. It is generally assumed that the elements of u are uncorrelated with the elements of ε, in which case the covariance matrix of the random effects and residual error terms is block diagonal:

\mathrm{Var}\begin{bmatrix} u \\ \varepsilon \end{bmatrix} = \begin{bmatrix} G & 0 \\ 0 & R \end{bmatrix}

The matrices Z and G will themselves be block diagonal if the data arise from a hierarchical (or ‘multilevel’) structure, where a fixed number of random effects common to observations within a single higher-level unit are assumed to vary across the units for a given level of the hierarchy. Typically we take the vector of residual errors ε to be independent and identically distributed and thus R = σ²_ε I, where σ²_ε is the residual variance, although other structures are possible [6–8]. The covariance matrix G of the random effects vector u is often assumed to have a structure that depends on a series of unknown variance component parameters that need to be estimated in addition to the residual variance σ²_ε and the vector of fixed effects β.

The universal estimators of the fixed and random effects are the best linear unbiased estimator (BLUE) β̂ of β and the best linear unbiased predictor (BLUP) û of u, which can be recovered as the solution to the mixed model equations [9]:

\begin{bmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{bmatrix} \begin{pmatrix} \hat\beta \\ \hat u \end{pmatrix} = \begin{bmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{bmatrix}


Robinson [10] presents several derivations of these equations, a key feature of which is that an explicit assumption regarding the distribution of the random effects and residual error terms is not necessary in order to make progress in the estimation of β and u. The criterion of best linear unbiased prediction is sufficient. It is, however, far more common to work explicitly with the Gaussian mixed model, where u and ε are assumed to be multivariate normal:

\begin{bmatrix} u \\ \varepsilon \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} G & 0 \\ 0 & R \end{bmatrix} \right)

This normality assumption allows the construction of a likelihood function for (β, G, R), the logarithm of which is

\ell(\beta, G, R) = -\tfrac{1}{2}\{ n \log(2\pi) + \log|V| + (y - X\beta)'V^{-1}(y - X\beta) \}

where

V = \mathrm{Var}(y) = ZGZ' + R

By assuming that the parameters defining the covariance matrices G and R are known it is straightforward to show that the maximum likelihood (ML) estimate β̂ of β is

\hat\beta = (X'V^{-1}X)^{-1} X'V^{-1} y

which, although it is not obvious algebraically, must also satisfy the mixed model equations presented earlier since one of the ways in which these equations can be derived is directly from the multivariate normality assumption. Typically G and R will not be known, and can be estimated by substituting the expression for β̂ back into ℓ(β, G, R) (generating the profile log-likelihood for the covariance matrices) and maximizing the result over the parameters defining G and R. An alternative is to use restricted maximum likelihood estimation (REML) for the variance components, which involves maximizing the likelihood of a set of linear combinations of the data vector y which, by construction, do not depend on β. The latter condition is achieved by working with error contrasts, that is, linear combinations of the form t'y where E(t'y) = 0, which is equivalent to requiring that t'X = 0. Crainiceanu and Ruppert [11] have recently demonstrated that REML estimates are not always unbiased, but the bias is less severe than for ML estimates.

Once estimates for β, G and R have been determined, we can return to the mixed model equations and determine the best linear unbiased predictor û of the random effects vector u as the vector that minimizes the expected mean squared error of prediction

E\{(\hat u - u)'(\hat u - u)\}

It is well known that the BLUP of u can be expressed as the posterior expectation of the random effects given the data

\hat u = E\{u \mid y\}

which can be solved explicitly under the normality assumption to yield

\hat u = G Z' V^{-1} (y - X\hat\beta)
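As a concrete (and deliberately naive) illustration of these formulae, the following R sketch computes the BLUE and BLUP directly from y, X, Z and known covariance matrices G and R. The function name and the direct inversion of V are illustrative choices only, not the numerical approach a mixed models routine would actually use; in practice G and R are themselves estimated, for example by REML.

mixed.model.solve <- function(y, X, Z, G, R) {
  V    <- Z %*% G %*% t(Z) + R                            # V = Var(y) = ZGZ' + R
  Vinv <- solve(V)
  beta <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y) # BLUE of beta
  u    <- G %*% t(Z) %*% Vinv %*% (y - X %*% beta)        # BLUP of u
  list(beta = drop(beta), u = drop(u))
}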


Henderson [9] showed that the covariance matrix of (β̂ − β, û − u) is

C = \begin{bmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{bmatrix}^{-1}

Although it is not possible to perform this matrix inversion explicitly for general covariance matrices G and R, it is straightforward to show that the covariance matrix of the fixed effects is Var(β̂) = (X'V⁻¹X)⁻¹. One other useful relationship is that Var(û) = Cov(û, u) = G − Var(û − u). Note that the covariance matrices Var(û) and Var(û − u) are not in general the same, due to the assumption that the components of u are random effects and thus have intrinsic variability in addition to that incurred by predicting u from the data. In practice the expressions for the estimators β̂ and û are evaluated using values of the covariance matrices G and R that are themselves estimated from the data, which results in a downward bias in the estimates of the sampling variability of the fixed and random effects; see Reference [12].

3. SEMI-PARAMETRIC REGRESSION

3.1. Spline smoothing

Suppose that we want to model the relationship between a continuous response Y and a single covariate x by

Y_i = m(x_i) + \varepsilon_i \quad (i = 1, \ldots, n) \qquad (1)

where m is an arbitrary smooth function giving the conditional mean of Y, and ε_1,...,ε_n are independent error random variables with common mean zero and variance σ²_ε. This is a non-parametric regression model because m is not constrained to be of any pre-specified parametric form (e.g. linear). A popular approach to estimating m is to use splines. The linear spline estimator is of the form

m(x) = \beta_0 + \beta_1 x + \sum_{k=1}^{K} u_k (x - \kappa_k)_+ \qquad (2)

where

(x - \kappa_k)_+ = \begin{cases} 0, & x \le \kappa_k \\ x - \kappa_k, & x > \kappa_k \end{cases}

and κ_1,...,κ_K are knots. Equation (2) describes a sequence of line segments tied together at the knots to form a continuous function. More generally, we can extend (2) to be a piecewise polynomial of degree p, defining

m_p(x; \beta, u) = \beta_0 + \beta_1 x + \cdots + \beta_p x^p + \sum_{k=1}^{K} u_k (x - \kappa_k)_+^p \qquad (3)

where β = (β_0,...,β_p)′ and u = (u_1,...,u_K)′ denote vectors of coefficients. The functions 1, x,..., x^p, (x − κ_1)_+^p,..., (x − κ_K)_+^p in (3) are basis functions, since any spline of order p with


Figure 1. Linear spline regression functions for the U.K. births data (live births, in 1000s, plotted against year). Left panel ('Fixed effects at knots'): all regression coefficients are fixed effects estimated by OLS. Right panel ('Random effects at knots'): the coefficients at the knots are random effects.

the given knots is a linear combination of these functions. Quadratic (p = 2) and cubic (p = 3) splines are common choices in practice because they ensure a certain degree of smoothness in the fitted curve.

To implement spline smoothing in practice the coefficients β and u must be estimated. In principle we could apply the usual method of least squares, but this tends to result in a rather rough function estimate. To appreciate why, consider the linear spline model. The coefficients u_1,...,u_K represent the changes in gradient between consecutive line segments. In ordinary least squares (OLS) these quantities can be of large magnitude, resulting in relatively rapid fluctuations in the estimate of m. A greater degree of smoothness can be achieved by shrinking the estimated coefficients towards zero. This is precisely what occurs when OLS estimates are replaced by BLUPs, and suggests that a smooth estimate of m might be obtained by regarding u_1,...,u_K as random coefficients, distributed independently as u_k ∼ N(0, σ²_u). The efficacy of this approach is illustrated by the non-parametric regression functions for the U.K. birth data displayed in Figure 1. The left-hand panel shows the results of estimating u_1,...,u_K in (2) by OLS; the fitted curve is rather rough. Constraining the coefficients u_1,...,u_K to come from a common distribution has the effect of damping changes in the gradient of fitted line segments from one knot point to the next, resulting in the smooth regression function displayed in the right-hand panel.

Let the n response and covariate values be accumulated in vectors y = (y_1,...,y_n)′ and

x = (x_1,...,x_n)′, respectively, and write 1 for the vector of all ones. Consider the model implied by equation (2), with the Gaussian distributional assumption governing both the error vector ε = (ε_1,...,ε_n)′ and the coefficient vector u. If we define the n × 2 fixed effects design matrix by

X = [1 \;\; x]

and the n × K random effects design matrix by

Z = [(x - \kappa_1 1)_+ \;\; \cdots \;\; (x - \kappa_K 1)_+]


then this model can be expressed as

y = X\beta + Zu + \varepsilon

where

\begin{bmatrix} u \\ \varepsilon \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_u^2 I & 0 \\ 0 & \sigma_\varepsilon^2 I \end{bmatrix} \right)

In other words, a non-parametric regression implemented using spline smoothing can be expressed as a linear mixed model.

The connection between mixed models and spline smoothing methods can also be established by considering that the estimators β̂ and û minimize the penalized least squares (PLS) function of Reference [13]

\mathrm{PLS}(\beta, u) = \| y - X\beta - Zu \|^2 + \lambda \| u \|^2 \qquad (4)

where λ is the ratio of variance components σ²_ε/σ²_u and ‖u‖ is the usual Euclidean norm of the vector u. The minimization of PLS(β, u) is penalized in the sense that the magnitudes of the random effect coefficients in u are constrained not to grow too large. The particular penalty ‖u‖² results from the use of the normal distribution; other penalties are possible [14]. This produces smooth fitted curves that can be shown to be spline smoothers [15], thus formally connecting linear mixed models to spline smoothing. One benefit of thinking about smooth regression modelling in this fashion is that (4) makes it clear that λ = σ²_ε/σ²_u is a smoothing parameter. As its value is increased, the penalty term receives greater weight and the regression becomes smoother at the expense of a less close fit to the data. In other non-parametric smoothing problems such as kernel regression (e.g. Reference [16]) the choice of smoothing parameter is crucial, but often difficult in practice. One of the attractions of fitting spline smoothers as linear mixed models is that λ can be selected in a very natural fashion using REML, although the user may choose to experiment with alternative values of this parameter.

The solution for β and u that minimizes the PLS function can also be written as

\begin{bmatrix} \hat\beta \\ \hat u \end{bmatrix} = (C'C + \lambda D)^{-1} C' y

where C = [X Z] and D = diag(0_m, 1,...,1), the vector 0_m representing the m-dimensional zero vector where m is the dimension of the vector β of fixed regression coefficients. This equation can be recovered by substituting R = σ²_ε I and G = σ²_u I into the mixed model equations above, and represents a particular example of ridge regression.
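The ridge regression form above translates directly into a few lines of R. The following sketch (not from the paper; the data vectors, knot locations and the value of λ are placeholders) builds the truncated linear basis and evaluates (C′C + λD)⁻¹C′y for a user-specified smoothing parameter, rather than estimating λ by REML as in the mixed model fits reported later.

pspline.ridge <- function(y, x, knots, lambda) {
  X <- cbind(1, x)                                    # fixed effects design [1  x]
  Z <- outer(x, knots, "-"); Z <- Z * (Z > 0)         # truncated lines (x - kappa_k)_+
  C <- cbind(X, Z)
  D <- diag(c(rep(0, ncol(X)), rep(1, ncol(Z))))      # penalize only the spline coefficients
  coef <- solve(t(C) %*% C + lambda * D, t(C) %*% y)  # (C'C + lambda D)^{-1} C'y
  list(beta = coef[1:2], u = coef[-(1:2)], fitted = drop(C %*% coef))
}

Setting lambda to zero reproduces the rough OLS fit in the left panel of Figure 1; increasing lambda shrinks the u_k towards zero and smooths the fitted curve.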

3.2. Spline smoothing in the literature

There is a very large literature on spline smoothing. For those wishing to go beyond the coverage in the present paper, the article by Wand [14] and the books by Hansen et al. [17] and Ruppert et al. [18] are recommended. The origin of smoothing splines can be traced back to the work of Whittaker [19] on graduating data. However, spline smoothing received little attention from statisticians until the utility of this technique was demonstrated by the


research efforts of Wahba and co-workers [20–23]. The monograph on spline models by Wahba [24] contains many theoretical results that prompted the recent developments that use the connection between spline smoothing and linear mixed effects models, although the first explicit mention of this link seems to be due to Speed [25] in his discussion of Robinson's 1991 paper on the estimation of random effects [10].

Different types of spline can be obtained by altering the choice of knots, and by changing the manner in which roughness in the estimated regression function is penalized. One approach is to allow a knot at each (discrete) value of the x variable which, given an appropriate choice of roughness penalty, leads to a natural cubic smoothing spline [15]. We prefer to focus on penalized splines, also called P-splines, a terminology introduced by Eilers and Marx [26] and Marx and Eilers [27]. P-splines are characterized by the use of a relatively modest, fixed, number of knots (usually K ≪ n, though we say more on knot selection later). They are very similar in spirit to the low-rank pseudo-splines proposed by Hastie [28]. See also References [29–32].

Since the end of the 1990s a body of research has developed on P-splines. Methodological developments include Brumback et al.'s [33] contribution on mixed-model representations of P-splines; Ruppert's [34] work on the choice of knots; Cai et al.'s [35] use of P-splines in hazard function estimation; and Lang and Brezger's work on Bayesian P-splines [36]. See also References [37–42]. Some examples of applications of P-splines include Greenland's paper on HIV incidence [43], Marx and Eilers' contribution on calibration and chemometrics [44], Kauermann and Ortlieb's paper on modelling patterns of sick leave [45], and Eisen et al.'s paper on occupational cohort studies [46].

3.3. Knot specification

The location of the knots must be specified in advance of fitting the model, and they are supplied implicitly to the chosen software routine via the realized values of the spline basis functions in the random effects design matrix Z. Wand [14] comments that knot specification is 'very much a minor detail for penalized splines', and Ruppert [34] notes that 'because smoothing is controlled by the penalty parameter, λ, the number of knots, K, is not a crucial parameter'. We concur with Wand's suggestion that a reasonable default rule for the location of the knots is

\kappa_k = \left( \frac{k+1}{K+2} \right)\text{th sample quantile of the unique } x_i\text{'s}, \quad 1 \le k \le K

where K = min(n/4, 35). Some additional algorithms, empirical results and further commentary on the topic of knot selection are supplied in Reference [34].
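A hedged R sketch of this default rule follows; the function name is illustrative, and n is taken here as the number of unique covariate values (the text's n could equally be read as the full sample size).

default.knots <- function(x) {
  ux <- sort(unique(x))
  K  <- max(1, min(floor(length(ux)/4), 35))         # K = min(n/4, 35)
  quantile(ux, probs = (1:K + 1)/(K + 2), names = FALSE)   # ((k+1)/(K+2))th quantiles
}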

3.4. Extensions

Several extensions to the simple non-parametric regression model in (1) are available within the linear mixed models framework. We may wish to model the relationship between a continuously valued response Y_i and multiple continuously valued covariates. Illustrating this for the case of two covariates x and w, with ε defined as before, we have

Y_i = m(x_i) + l(w_i) + \varepsilon_i \quad (i = 1, \ldots, n) \qquad (5)


where m and l are smooth functions of the corresponding covariates contributing to the conditional mean of Y_i. This is a generalized additive model as described by Hastie and Tibshirani [47], implemented in the gam() routine in R or S-plus. These types of models are easily implemented in general mixed model software by defining knot locations and spline basis functions separately for both the covariates x and w, declaring each of the corresponding coefficient vectors to have independent Gaussian distributions and estimating them using best linear unbiased prediction.

It may be that (5) is unnecessarily complex because E[Y] is (approximately) linear in w. We might then prefer a semi-parametric regression model which includes both non-parametric (spline) and linear terms:

Y_i = m(x_i) + \beta_1 w_i + \varepsilon_i \quad (i = 1, \ldots, n) \qquad (6)

Another form of semi-parametric regression occurs if extra random effects are included for modelling clustered and/or longitudinal data. We investigate an example with clustered data in some detail in the next section. For longitudinal data, if y_{ij} denotes the observation on individual i at the jth time point (j = 1,...,m) then a simple extension of (1) is

Y_{ij} = m(x_{ij}) + v_i + \varepsilon_{ij} \quad (i = 1, \ldots, n; \; j = 1, \ldots, m) \qquad (7)

where v_1,...,v_n are individual-specific random effects. These ideas may be combined and generalized to produce intricate models capable of capturing non-linear relationships between response and covariates while accounting for complicated correlation structures in the data, all within the framework of linear mixed models.
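As an illustration of how the additive model (5) might be fitted with the lme() machinery used in Appendix A, the sketch below (not the authors' code) gives each smooth term its own pdIdent() variance component inside a pdBlocked() structure; the data frame, knot choices and variable names are placeholders.

library(nlme)

fit.additive <- function(y, x, w) {
  kx <- quantile(unique(x), (1:20 + 1)/(20 + 2))      # knots for m(x); 20 chosen arbitrarily
  kw <- quantile(unique(w), (1:20 + 1)/(20 + 2))      # knots for l(w)
  Zx <- outer(x, kx, "-"); Zx <- Zx * (Zx > 0)        # truncated linear basis in x
  Zw <- outer(w, kw, "-"); Zw <- Zw * (Zw > 0)        # truncated linear basis in w
  dat <- data.frame(y = y, x = x, w = w, cons = 1)    # 'cons': one notional group
  dat$Zx <- Zx; dat$Zw <- Zw                          # keep the bases as matrix columns
  lme(y ~ x + w, data = dat,
      random = list(cons = pdBlocked(list(pdIdent(~ -1 + Zx),    # variance for m(x)
                                          pdIdent(~ -1 + Zw)))), # variance for l(w)
      method = "REML")
}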

4. APPLICATION: MODELLING MAMMOGRAPHIC DENSITY VERSUS AGE FROM TWIN DATA

In this section we use a linear mixed model to implement a semi-parametric regression model of the relationship between mammographic density and age using data from female twin–pairs.

4.1. Mammographic density, age and risk of disease

Women with extensive dense breast tissue determined by mammography are known to be at higher risk of breast cancer than women of the same age with lower mammographic density. Previous work on pairs of both identical or monozygous (MZ) and non-identical or dizygous (DZ) female twins from samples in Australia and North America provided strong evidence of a genetic effect on mammographic density [2]. Current interest focuses on studying the relationship between mammographic density and individual covariates. Age is known to influence both the risk of breast cancer and the percent mammographic density, and it is age-adjusted mammographic density that predicts breast cancer risk. Thus any covariate analysis must adjust a priori for age in order to minimize any confounding influence on the risk relationship between mammographic density and other covariates.

The relationship between age and mammographic density is not immediately apparent from inspection of a scatter plot (see Figure 2), apart from a sparsity of older women with high mammographic density, suggesting that mammographic density decreases with increasing age.


Figure 2. Fitted regression functions for the per cent mammographic density data (per cent breast density versus age at mammogram in years). The dotted line and dashed curve represent, respectively, the simple linear regression and a cubic polynomial fit, both based on OLS. The solid line displays the result of a linear spline regression with knots at each distinct value of age at mammogram.

We hope to elucidate the relationship between age and mammographic density using spline smoothing while specifying a random effects structure to account for the within twin–pair correlations.

4.2. Data and models

The data consist of measurements on per cent mammographic density, which is the ratio of dense breast tissue area to total breast tissue area determined by mammographic imaging, and age at mammogram in years recorded as an integer. Per cent mammographic density ranges from 0 to 90 per cent with mean 37 per cent; age ranges from 38 to 71 with mean 51 years. Data are available on 951 twin–pairs, 599 from Australia (353 MZ and 246 DZ) and 352 from North America (218 MZ and 134 DZ). We represent the per cent mammographic density data perc.density as the response vector y = {y_{ij}} where i = 1,...,951 indexes the twin–pair and j = 1, 2 indexes twins within pairs. The covariate age is represented by the vector x with identical structure to y. We seek to fit a model

\mathrm{perc.density}_{ij} = m(\mathrm{age}_{ij}) + h_{ij} + \varepsilon_{ij} \qquad (8)

where m is some smooth function of age, with the random effect h_{ij} capturing the within-pair correlation structure and ε_{ij} representing an uncorrelated random error term with variance σ²_ε.

We begin by specifying a model to fit a penalized spline for the relationship between mammographic density and age. We use a linear spline basis as specified in equation (2), so the fixed effects design matrix X has just two columns. There are sufficient data to warrant


locating a knot at each distinct value of age, so the random effects design matrix Z has 34 columns since the age range is 38–71 years.

We incorporate the within-pair correlation into the regression structure by including additional random effects in the model that are shared within each twin–pair 'cluster', while also allowing for the possibility that the within-pair correlations for MZ and DZ pairs are different. For the ith twin–pair we generate two random effects a_{i1} and a_{i2}, each with variance σ²_a and correlation between them of ρ. If the ith twin–pair is MZ then for both twins a_{i1} is added to the fixed effects linear predictor, that is, h_{ij} = a_{i1} for j = 1, 2 in equation (8). For DZ twins we add a_{i1} to one of the twins and a_{i2} to the other, so we have simply h_{ij} = a_{ij} for j = 1, 2. Note that the ordering of the twins within a given pair is arbitrary regardless of their zygosity. In this case the total variance in the trait for an individual is just σ²_a + σ²_ε, with covariance of σ²_a for MZ pairs and ρσ²_a for DZ pairs, implying that the within-pair correlation in DZ pairs will be less than in MZ pairs. The special case of ρ = 0.5 implies a standard additive genetic model known as the classical twin model [48, 49].

More formally, for our N = 951 twin–pairs we update the vector of random effects from u to u* = (u′, a_{11}, a_{12},..., a_{N1}, a_{N2})′ and augment the original random effects design matrix Z to Z* where

Z^* = \begin{bmatrix} Z & 0 \\ 0 & Z_{\mathrm{twin}} \end{bmatrix}

The additional design sub-matrix Z_twin is block diagonal with 2 × 2 blocks, the ith block being the 2 × 2 identity matrix if the ith pair are DZ twins, and a 2 × 2 matrix with a column of ones followed by a column of zeros if the ith pair are MZ twins. The covariance matrix for the random effects vector u must also be updated from G to G* where

G^* = \begin{bmatrix} G & 0 \\ 0 & G_{\mathrm{twin}} \end{bmatrix}

The sub-matrix G_twin is block diagonal with N identical 2 × 2 blocks S where

S = \sigma_a^2 \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}
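For readers who prefer to see these matrices built explicitly, a hedged R sketch follows (illustration only; the lme() fit in Appendix A encodes the same structure through pdCompSymm() rather than by constructing Z_twin and G_twin directly). It uses bdiag() from the Matrix package, and zyg here is a placeholder vector of per-pair zygosity labels.

build.twin.matrices <- function(zyg, sigma2.a, rho) {
  blocks <- lapply(zyg, function(z)
    if (z == "MZ") cbind(c(1, 1), c(0, 0)) else diag(2))   # MZ: ones then zeros; DZ: identity
  Ztwin <- as.matrix(Matrix::bdiag(blocks))
  S     <- sigma2.a * matrix(c(1, rho, rho, 1), 2, 2)      # 2 x 2 block S
  Gtwin <- as.matrix(Matrix::bdiag(rep(list(S), length(zyg))))
  list(Ztwin = Ztwin, Gtwin = Gtwin)
}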

The residual error covariance matrix R remains as σ²_ε times the identity matrix with appropriately expanded dimension. There are of course other ways of capturing this correlation structure using realized random effects but this is by far the easiest when we are restricted to using only two genetic random effects per twin–pair cluster, as required by some of the software routines we explored when fitting these models.

The models were fitted using the lme() routine in R and in a Markov chain Monte Carlo (MCMC) setting using the package WinBUGS; the results were similar and the relevant computing code for both implementations is presented in Appendix A. For additional examples of the use of R, WinBUGS and SAS [50] in spline smoothing see References [14, 51, 52]. The REML estimates (from the lme() routine in R) of the three variance parameters were σ²_a = 282.31, σ²_u = 0.205 and σ²_ε = 125.57, and the estimated within-pair correlation for DZ twins was ρ = 0.507, very close to the null value of ρ = 0.5 for the standard additive genetic model. Note that the total variance is now σ²_a + σ²_u + σ²_ε rather than the σ²_a + σ²_ε implied by the original model for within-pair correlation, although the magnitude of the estimated penalized spline variance component σ²_u is very modest compared to the estimated values for σ²_a and σ²_ε. In this example an adjustment for the effect of age on mammographic density would always be necessary before imposing a genetic variance component structure, so it makes sense to interpret the estimated values of σ²_a and σ²_ε conditional on the estimated value of σ²_u, in the same way that we condition implicitly on the value of estimated fixed effects when interpreting (components of) the residual variance.

The fitted spline regression is displayed in Figure 2, along with the OLS fits of linear and cubic polynomial regressions. The spline fit reveals that the mean mammographic density increases slightly with age in the range 40–45 years (the maximum fitted value is 44.3 per cent at 44 years of age) but decreases over the remaining age range. The cubic polynomial fit largely reproduces the behaviour of the semi-parametric model, albeit with a spurious increase in density after 65 years that is a consequence of the parametric form of the regression function. The cubic fit is nonetheless a significant improvement on the linear fit (p = 0.0025), indicating that the features of the data exposed by the semi-parametric regression are not artefacts.

5. SPLINE SMOOTHING AND GENERALIZED LINEAR MIXED MODELS

In the examples we have considered so far, the smoothing of the relationship between a response and a covariate has taken place directly on the scale of the continuously valued outcome variable. By appealing to the role of the smoothing term in defining a model for the expected value of the outcome, the mixed model implementation of spline smoothing can be extended to encompass regression models for outcomes such as binary and count data that are not continuously valued and hence cannot be assumed to be even approximately normally distributed. Such data can be modelled using a generalized linear mixed model (GLMM) [1], which we introduce using the notation of Section 2 as a two-parameter exponential family density f for y, where

f(y \mid u) = \exp\{ (y'(X\beta + Zu) - 1'b(X\beta + Zu))/\phi + 1'c(y, \phi) \} \qquad (9)

with b and c both scalar functions that are applied component-wise to the vector each takes as its argument. Note that c is a function of the data vector y that involves the scale parameter φ but not the fixed or random effect parameters. If the terms involving the vector u of random effects, assumed to be normally distributed with mean zero and covariance matrix G, are absent then the model reduces to a generalized linear model (GLM) [53, 54].

The term Xβ + Zu is the linear predictor containing both fixed and random effects and is related to the conditional expectation of y via the link function g such that g(E(y|u)) = Xβ + Zu. Smoothing of the relationship between the response and covariate naturally takes place on the scale of the linear predictor, and additional terms can be included in the specification of the random effects to accommodate the truncated linear basis that facilitates the mixed model implementation of spline smoothing presented in Section 3.

In principle the fixed effect and variance parameters of a generalized linear mixed model can be estimated using maximum likelihood, with the random effects estimates following from BLUP. This process, however, is much more computationally challenging than estimating the parameters of the standard linear mixed model due to the fact that the required integration


over the (typically high dimensional) distribution of the random effects cannot be performed explicitly and must be accomplished numerically. Common approaches use the Laplace approximation, which is equivalent to penalized likelihood [13, 55]. Ngo and Wand [51] note that for a user-specified value of the variance component corresponding to the smoothing parameter, fitting a GLMM reduces to iteratively reweighted least squares ridge regression. Routines to implement GLMMs are available in R, SAS and Stata [56], although these approximate methods (often based on quasi-likelihood) do not necessarily reproduce the true ML estimates to any degree of accuracy.
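One possible route in R, sketched below under stated assumptions, is penalized quasi-likelihood via glmmPQL() from the MASS package, which accepts the same pdIdent() random-effects specification used for the Gaussian fits in Appendix A. The binary outcome y01, the covariate age and the knot vector are placeholders, and this is not the route taken for the example that follows, which is fitted by MCMC in WinBUGS.

library(MASS)
library(nlme)

fit.logit.spline <- function(y01, age, knots) {
  Z   <- outer(age, knots, "-"); Z <- Z * (Z > 0)   # truncated linear basis in age
  dat <- data.frame(y01 = y01, age = age, cons = 1) # 'cons': one notional group
  dat$Z <- Z
  glmmPQL(y01 ~ age, random = list(cons = pdIdent(~ -1 + Z)),
          family = binomial, data = dat, verbose = FALSE)
}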

5.1. Application: using pedigree data to model the age- and sex-specific risk of bronchial hyperresponsiveness

In this section we illustrate the application of mixed model smoothing in GLMs using an example from asthma genetics research. The aim of the original analysis of these data was to determine if there was an association between bronchial hyper-responsiveness (BHR) and alleles of a genetic marker in the interleukin-9 (IL9) gene [57]. By accounting for the within-family correlation due to shared genetic and environmental influences it is possible to estimate the extent to which the residual variation after adjustment for IL9 depends on unmeasured genetic factors. It is also necessary to adjust for the effect of age and sex on BHR. Although we can stratify analyses for sex, the role of age in determining the risk of BHR is not known a priori and may change as the individual moves from childhood through adolescence and into adulthood. It is unclear whether fitting a linear term or some other low-order polynomial is a good approximation to the correct but unknown functional form relating age to the log-hazard of BHR with increasing doses of a bronchial agonist. A more flexible approach is to fit age as a spline term in the model for BHR, which we demonstrate here.

The data are from long-standing cohort studies in Busselton (Western Australia) and Southampton (U.K.). In the Busselton study, caucasian families with both parents alive and aged under 55 years and with at least two children over the age of 5 years were recruited. In the Southampton study, caucasian families with three or more children were recruited through contact at local general practices with the parents of 1800 children aged 11–14 years [58, 59].

Individual lung function was assessed repeatedly on a single occasion as part of a bronchial challenge [60], which is designed to simulate an asthma attack. During this procedure, forced expiratory volume in 1 s (FEV1) was measured after an initial dose of saline and after each increasing dose of a bronchoconstrictor drug (methacholine in Busselton, histamine in Southampton). The event of interest was a 20 per cent fall from post-saline FEV1, characterized by the dose at which this fall occurred and estimated by linear interpolation of the observed responses bracketing the critical fall. If a 20 per cent fall was not achieved at or before the highest permissible dose had been administered (12 μmol for methacholine and 2.45 μmol for histamine) the response was censored at the maximal dose. Such measurements can be considered as 'time' to event data, although the distance to the event is measured on the dose and not the time scale.

Complete data were available for 823 individuals (213 from Busselton, 610 from Southampton) in 199 nuclear families. A total of 101 subjects (12.3 per cent) responded (40 (18.8 per cent) in Busselton and 61 (10.0 per cent) in Southampton). Of the remaining 722 subjects, the majority were end-censored at the maximal dose.


5.2. Piecewise exponential model

Data were analysed using a piecewise exponential model. In this model, the cumulative dose is divided into N arbitrary intervals (D_1,...,D_N) of finite length (U_1,...,U_N), such that individual j in the ith family is 'at risk of response' during interval D_b up to their recorded dose d_{ijb}, where 0 < d_{ijb} < U_b. A censoring indicator (y_{ijb}) denotes the response of the ijth individual during interval D_b: y_{ijb} = 1 if the individual failed during D_b and y_{ijb} = 0 otherwise. Declaring that y_{ijb} follows a Poisson distribution within each interval produces a likelihood function that in the limit, when there are a large number of distinct dose intervals, is equivalent to Cox's proportional hazards model [54]; inferences from the two models are remarkably similar even when the number of distinct intervals is small [61]. In this example we used five dose intervals: (0, 0.4], (0.4, 1.8], (1.8, 5], (5, 10], (10, 12].
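Before writing the model out formally, a hedged R sketch of the data expansion this implies is given below: each subject contributes one Poisson observation per dose interval entered, with exposure equal to the dose accumulated in that interval, and log(d) later enters the linear predictor as an offset. The variable names dose and event are placeholders, not taken from the study data.

expand.piecewise <- function(dose, event, cuts = c(0, 0.4, 1.8, 5, 10, 12)) {
  do.call(rbind, lapply(seq_along(dose), function(i) {
    b <- which(cuts[-length(cuts)] < dose[i])                     # intervals entered
    data.frame(id       = i,
               interval = b,
               d        = pmin(dose[i], cuts[b + 1]) - cuts[b],   # exposure within interval
               y        = as.integer(event[i] & dose[i] <= cuts[b + 1]))
  }))
}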

More formally, the model may be expressed via the hazard, λ_{ijb}, for individual ij in the bth dose interval:

\log(\lambda_{ijb}) = \log(d_{ijb}) + \beta_{0b} + \beta' x_{ijb} + F_i + G_i + H_{ij} + \sum_{k=1}^{K} u_k (\mathrm{age}_{ij} - \kappa_k)_+ \quad \text{(fathers)}

\log(\lambda_{ijb}) = \log(d_{ijb}) + \beta_{0b} + \beta' x_{ijb} + F_i - G_i + H_{ij} + \sum_{k=1}^{K} u_k (\mathrm{age}_{ij} - \kappa_k)_+ \quad \text{(mothers)}

\log(\lambda_{ijb}) = \log(d_{ijb}) + \beta_{0b} + \beta' x_{ijb} + F_i + M_i + P_{ij} + \sum_{k=1}^{K} u_k (\mathrm{age}_{ij} - \kappa_k)_+ \quad \text{(children)}

where

F_i \sim N(0, \tfrac{1}{2}\sigma^2_A), \quad M_i \sim N(0, \sigma^2_{Cs})

G_i \sim N(0, \tfrac{1}{2}\sigma^2_A), \quad H_{ij} \sim N(0, \sigma^2_{Cs})

P_{ij} \sim N(0, \tfrac{1}{2}\sigma^2_A), \quad u_k \sim N(0, \sigma^2_U)

and y_ijb ∼ Poisson(λ_ijb).

This is a generalized linear mixed model with a log link, a Poisson error term, and a multivariate Normal joint distribution for the higher order random effects [53, 55]. Here β_0b is the log baseline hazard, β is a vector of regression coefficients and x_ijb is a vector of covariates for the IL9 marker, indicating whether each individual has 0, 1 or 2 copies of each of the four alleles of interest. Separate effects for each of these four alleles, and distinct log baseline hazards β_0b for each dose interval, were allowed in each population.

The linear predictor includes a spline smoothing term to capture the relationship between the log hazard and the age of the individual, age_ij. The u_k parameters, which are assumed to be uncorrelated with each other and with the remaining random effects described above, represent the coefficients for the truncated linear basis with knot points κ_1,...,κ_K. Due to the large number of data points at each age, 27 knot points were used, at every 2 years from age 6 to 60.

The parameters σ²_A and σ²_Cs are the components of variance attributable to additive genetic effects and shared sibling environment, respectively. The uncorrelated random effects F_i, G_i, H_ij, M_i and P_ij are shared in such a way that the total variance (on the linear predictor scale) is σ²_A + σ²_Cs for each individual. The covariance between two parents is zero, while the covariance between a parent and a child is ½σ²_A and the covariance between two siblings is ½σ²_A + σ²_Cs.

We took an MCMC approach to fit the models, using WinBUGS. The unconventional parameterization for the familial random effects enhances convergence of the sampled parameter values to the target posterior distribution [62]. Vague prior distributions were used for all parameters. Specifically, N(0, 10 000) distributions were used as priors for fixed effects while Pareto(0.5, 0.01) distributions, on the scale of the precision (the inverse of the variance), were used for all three variance components. This is equivalent to specifying a uniform prior on the scale of the standard deviation which, for a suitably large choice of the distribution's upper bound (here we use 100), ensures that the results are essentially invariant to the scale of the data. Note that the use of the traditional vague gamma prior with constant hyperparameters corresponding to, say, a mean of 1 and a variance of 1000, produces results that are not scale invariant. The use of such gamma priors is no longer recommended. Models were run for a burn-in of 10 000 iterations, followed by 100 000 iterations after convergence. Two chains were run in parallel for all models, and the estimates presented below are posterior means (and standard deviations) of the combined iterations from both chains. The relevant computing code for WinBUGS is available from the corresponding author on request.

5.3. Results

The alleles of IL9 exhibited only weak positive associations with the risk of BHR. The variance component σ²_A, reflecting additive polygenic effects, remained large at 2.48 (SD, 1.52; 95 per cent credible interval, 0.16–5.99) even after the inclusion of the IL9 marker, which implies the existence of other genes controlling BHR. The smaller estimate of 0.61 (SD, 0.76) for σ²_Cs, reflecting effects due to a shared sibling environment, suggests that this has little impact on bronchial hyperresponsiveness. The smoothing parameter σ²_U was estimated to be 0.0039 (SD, 0.0063; 95 per cent credible interval, 0.000025–0.0221) which, similar to the application in Section 4, was small in comparison to the other variance components. Figure 3 displays the fitted log-hazard log(λ_ijb) against age, separately for males and females, using both the semi-parametric spline smoothing estimates and a fully parametric cubic polynomial curve. Males appear to have a higher risk of response than females in childhood although both cubic and spline fits indicate that there is little relationship between age and BHR for adult males. In contrast, the risk of response rises in females from the mid-teens to about age 30, then decreases consistently. This increase is surprising but may be due to sparse data for this age group. The sharp decrease is due to the fact that no female aged over 50 responded. Clearly, the spline fit is less affected by this in comparison with the cubic curve. In males the spline fit captures local variation in the risk of response between ages 20 and 60, which is not reflected in the cubic polynomial fit.

Although the result of the spline smoothing term for age mirrors many features of the cubic fit, it was necessary to spend some considerable time during the original analysis actually discovering the cubic dependence of the log-hazard of BHR on age. In data such as these it may be difficult and time consuming to obtain a good parametric fit, and it would be easy for a biostatistician to miss the need for a cubic or higher order polynomial to provide an approximation to the complex yet unknown functional form.


Figure 3. The log-hazard of BHR with age (age–sex risk profiles, log(hazard ratio) versus age in years), fitted using both a cubic and a spline term, separately for males (M) and females (F).

The use of spline smoothing in a mixed model framework allows us to discern immediately the shape of the relationship between age and BHR, even if subsequent analyses suggest that a parametric model fit is appropriate.

6. CONCLUSIONS

The realization that penalized splines and thus a broad class of semi-parametric regression models can be cast within a linear mixed models framework affords biostatisticians the opportunity to incorporate the use of smoothing techniques into the analysis of correlated data structures. The latter are typically represented as a hierarchical model with correlation captured by random effects common to observations at a given level of the hierarchy, and are thus naturally analysed by fitting regression models using mixed model software. We saw that smoothing of the relationship between age and two separate outcomes, firstly mammographic density (in application 2) and secondly the risk of bronchial hyperresponsiveness (in application 3), could be achieved while simultaneously estimating a non-trivial genetic variance component structure implied by the paired or familial nature of the data.

Not only can linear mixed models be used to attack a very wide variety of problems, but they can be extended to meet the challenges of missing data, measurement error and non-normally distributed data via generalized linear mixed models as in application 3. Their hierarchical nature means that such models lend themselves to fitting via Markov chain Monte Carlo methods. The computing code in Appendix A demonstrates that fitting penalized splines using the WinBUGS platform is no more difficult for the biostatistician than using the more conventional maximum likelihood routines in R or S-plus. Indeed, an MCMC approach is perhaps the most attractive approach currently available for fitting spline terms and random effects in the linear predictor for generalized linear mixed models.


Within the mixed models implementation of penalized splines the amount of smoothing is controlled by the relative magnitude of the relevant variance component and the residual error variance. Typically this parameter is estimated from the data, and in the three examples in the text the amount of smoothing was dictated by the 'default' REML estimates of the variance parameters (in application 2) and posterior means using vague prior distributions (in applications 1 and 3). This gave reasonable fitted curves in the examples we consider, although it is possible that there are scenarios where the success of this method for choosing the amount of smoothing automatically will depend on the number and location of the knots. Using too few knots with the large data set in our second example drives the magnitude of the variance component associated with the spline smoothing down to an unrealistically low value, resulting in very little smoothing. This can obviously be overcome by increasing the number of knots, or fitting with user-specified values for the variance components, which is straightforward in WinBUGS. It is, however, not possible at this stage to work with user-specified parameters with the lme() routine in R (or S-plus). Wand [14] describes an algorithm based on Demmler–Reinsch orthogonalization that allows smoothing to be performed in R with user-specified parameters in the simplest scatter plot scenario such as our first example. Although this can be extended to fit more complex models such as the second or third example, the dimensions of the matrices involved in this case are large despite their sparsity, due to the large number of twin–pairs.

We noted in Section 4 that the variance component allocated to the coefficients of the spline basis functions will contribute to the total modelled variance of the observations and may encroach on the interpretability of the relationship between the remaining components if their structure results from a substantive model such as, in our second and third examples, the genetic relationships that exist within twin–pairs or between family members. In both these examples, however, an impressive degree of smoothing was achieved with a relatively small variance component that had little impact on the remaining variance parameters.

We might expect future versions of mixed modelling software to implement penalized spline smoothing automatically at the request of the user, relieving biostatisticians of the burden associated with creating spline bases and design matrices manually. This would not, of course, obviate the need for careful consideration of the problem at hand and a decision as to whether even the best semi-parametric regression model is a good fit to the data.

APPENDIX A: COMPUTING CODE

The following code implements the mammographic density example in R, using the library lmeSplines. We work with a data frame called twins that contains the following variables:

perc.density    per cent mammographic density
age             age at mammogram in years as integer
zyg             zygosity: 1 = MZ, 2 = DZ
pairnum         twin–pair ID number 1–951
twinnum         within-pair twin number 1 or 2

The relevant R computing code is:

library(nlme)        # provides lme(), pdIdent() and pdCompSymm()
library(lmeSplines)  # spline utilities referred to in the text

attach(twins)                     # attach the data frame "twins" in position 1
knots <- sort(unique(age))
y <- perc.density                 # set outcome "y" to percent density
cons <- rep(1,length(y))          # set "cons" to a vector of ones
dz2 <- rep(0,length(y))
dz2[(zyg == 2) & (twinnum == 2)] <- 1
mzdz1 <- cons - dz2
X <- cbind(rep(1,length(y)),age)  # create fixed effects design matrix X
Z <- outer(age,knots,"-")         # create random effects design matrix Z
Z <- Z*(Z>0)
twins.fit <- lme(y ~ -1 + X,
                 random = list(cons = pdIdent(~ -1 + Z),
                               pairnum = pdCompSymm(~ -1 + mzdz1 + dz2)),
                 method = "REML")

twins.fitted.values <- twins.fit$fitted[,2]

The vector mzdz1 contains a "1" for all MZ twins and for the (arbitrarily labelled) first twin of each DZ twin–pair; the vector dz2 contains a "1" for the second twin in each DZ pair, so that the sum of the vectors mzdz1 and dz2 is identically "1". By declaring the (random) coefficients associated with the vectors mzdz1 and dz2 within each twin–pair to have a compound symmetry correlation structure we can reproduce the additive genetic model described in Section 4. The coefficients are declared to vary across twin–pairs by using the pairnum twin–pair factor to name the "pdCompSymm" structure in the above call to lme(). Since lme() requires all random coefficients to vary over higher-level units, the above syntax uses the dummy variable cons of all ones to form a single unit containing all of the data. The "pdIdent" correlation structure then constrains the random coefficients associated with the penalized spline to have the same variance, namely σ²_u. The fitted values for the penalized spline are contained in the second column of the "fitted" object since we must include the random effects in the linear predictor; the first column of this object contains the fitted values based on the single fixed linear effect only and in this case is meaningless without adding the random effects associated with the penalized spline fit.

The relevant WinBUGS computing code is:

model{
  for (i in 1:951){
    A1[i] ~ dnorm(0.0, tau.A)    # random effects for MZ/DZ
    A2[i] ~ dnorm(0.0, tau.A)
    A3[i] ~ dnorm(0.0, tau.A)
  }
  for (i in 1:34){
    u[i] ~ dnorm(0.0, tau.u)     # random effects for spline
  }
  for (j in 1:34){               # loop to set up the basis functions
    smeanY1[j,1] <- a + b1*(j+37)
    for (k in 1:Nknots){
      smeanY1[j,(k+1)] <- smeanY1[j,k] +
        u[k] * step(j+37-knot[k])*(j+37-knot[k])
    }
    smeanY[j] <- smeanY1[j,(Nknots+1)]
  }
  for (j in 1:1902){             # loop to assign mean and distribution to observations
    meanY[j] <- smeanY[age[j]-37] +
      mz[j]*(sqrt(rho)*A1[pairnum[j]] + sqrt(1-rho)*A2[pairnum[j]]) +
      (1-mz[j])*( sqrt(rho)*A1[pairnum[j]] +
        sqrt(1-rho)*((1-twin1[j])*A2[pairnum[j]] + twin1[j]*A3[pairnum[j]]) )
    perc.density[j] ~ dnorm(meanY[j], tau.Y)
  }
  a ~ dnorm(0.0, 1.0E-8)
  b1 ~ dnorm(0.0, 1.0E-8)
  tau.A <- 1/pow(sig.A,2)
  tau.u <- 1/pow(sig.u,2)
  tau.Y <- 1/pow(sig.Y,2)
  rho ~ dbeta(1,1)
  sig.A ~ dunif(0,100)
  sig.u ~ dunif(0,100)
  sig.Y ~ dunif(0,100)
}

The variables are defined as for the R code, except that mz is an indicator for an MZ twin, and twin1 is an indicator for twinnum = 1. The prior distributions employed here and in application 3 are discussed in more detail in Section 5.2; see also Reference [63, Section 5.7] for a recent comprehensive discussion of prior distributions for variance parameters.

ACKNOWLEDGEMENTS

The authors thank the anonymous referees for their comments which have greatly improved the presentation of the paper. We thank Professor Norman F. Boyd, Professor John L. Hopper and Ms Gillian S. Dite for permission to use the mammographic density data, the generation of which was supported by grants from the National Breast Cancer Foundation (Australia), the National Health and Medical Research Council (Australia), the Merck, Sharp & Dohme Research Foundation (Australia), and the Canadian Breast Cancer Research Institute. We also thank Professor Lyle J. Palmer, Professor Newton Morton and Professor Bill Cookson for permission to use the bronchial hyperresponsiveness data. The authors acknowledge several helpful discussions with Professor John Hopper and Ms Gillian S. Dite.

REFERENCES

1. McCulloch CE, Searle SR. Generalized, Linear, and Mixed Models. Wiley: New York, 2000.2. Boyd NF, Dite GS, Stone J et al. Heritability of mammographic density, a risk factor for breast cancer. NewEngland Journal of Medicine 2002; 347:886–894.

3. Ihaka R, Gentleman R. R: a language for data analysis and graphics. Journal of Computational and GraphicalStatistics 1996; 5:299–314.

4. Gilks WR, Thomas A, Spiegelhalter DJ. A language and program for complex Bayesian modelling. TheStatistician 1994; 43:169–178.

5. Spiegelhalter DJ, Thomas A, Best NG, Lunn D. WinBUGS Version 1.4 User Manual. MRC Biostatistics Unit,Cambridge, U.K., 2003. www.mrc-bsu.cam.ac.uk/bugs/winbugs

6. Goldstein H, Healy MJR, Rasbash J. Multilevel time series models with applications to repeated measures data. Statistics in Medicine 1994; 13:1643–1655.
7. Goldstein H, Browne WJ, Rasbash J. Multilevel modelling of medical data (Tutorial in Biostatistics). Statistics in Medicine 2002; 21:3291–3315.
8. Browne WJ, Draper D, Goldstein H, Rasbash J. Bayesian and likelihood methods for fitting multilevel models with complex level-1 variation. Computational Statistics and Data Analysis 2002; 39:203–225.
9. Henderson CR. Best linear unbiased estimation and prediction under a selection model. Biometrics 1975; 31:423–447.
10. Robinson GR. That BLUP is a good thing: the estimation of random effects. Statistical Science 1991; 6:15–51.
11. Crainiceanu CM, Ruppert D. Likelihood ratio tests in linear mixed models with one variance component. Journal of the Royal Statistical Society, Series B 2004; 66:165–185.
12. Kackar RN, Harville DA. Approximations for standard errors of fixed and random effects in mixed linear models. Journal of the American Statistical Association 1985; 79:853–862.
13. Green PJ. Penalized likelihood for general semi-parametric regression models. International Statistical Review 1987; 55:245–259.
14. Wand MP. Smoothing and mixed models. Computational Statistics 2003; 18:223–249.
15. Green PJ, Silverman BW. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman & Hall: London, 1994.
16. Wand MP, Jones MC. Kernel Smoothing. Chapman & Hall: London, 1995.
17. Hansen MH, Huang JZ, Kooperberg C, Stone CJ, Truong YK. Statistical Modelling with Spline Functions: Methodology and Theory. Springer: New York, 2003.
18. Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge University Press: Cambridge, U.K., 2003.
19. Whittaker ET. On a new method of graduation. Proceedings of the Edinburgh Mathematical Society 1923; 41:63–75.
20. Kimeldorf G, Wahba G. A correspondence between Bayesian estimation of stochastic processes and smoothing by splines. Annals of Mathematical Statistics 1970; 41:495–502.
21. Wahba G. Improper priors, spline smoothing and the problem of guarding against model errors in regression. Journal of the Royal Statistical Society, Series B 1978; 40:364–372.
22. Craven P, Wahba G. Smoothing noisy data with spline functions. Numerische Mathematik 1979; 31:377–403.
23. Wahba G. Bayesian confidence intervals for the cross-validated smoothing spline. Journal of the Royal Statistical Society, Series B 1983; 45:133–150.
24. Wahba G. Spline Models for Observational Data. Wiley: New York, 1990.
25. Speed TP. Comment in discussion of Robinson GR. That BLUP is a good thing: the estimation of random effects. Statistical Science 1991; 6:15–51.
26. Eilers PHC, Marx BD. Flexible smoothing with B-splines and penalties. Statistical Science 1996; 11:89–121.
27. Marx BD, Eilers PHC. Direct generalized additive modelling with penalized likelihood. Computational Statistics and Data Analysis 1998; 28:193–208.
28. Hastie T. Pseudosplines. Journal of the Royal Statistical Society, Series B 1996; 58:379–396.
29. O'Sullivan F. A statistical perspective on ill-posed inverse problems (with discussion). Statistical Science 1986; 1:505–527.
30. O'Sullivan F. Fast computation of fully automated log-density and log-hazard estimators. SIAM Journal on Scientific and Statistical Computing 1988; 9:363–379.
31. Kelly C, Rice J. Monotone smoothing with applications to dose-response curves and the assessment of synergism. Biometrics 1990; 46:1071–1085.
32. Gray RJ. Spline-based tests in survival analysis. Biometrics 1992; 50:640–652.
33. Brumback B, Ruppert D, Wand MP. Comment on variable selection and function estimation in additive nonparametric regression using data-based prior by Shively TS, Kohn R, Wood S. Journal of the American Statistical Association 1999; 94:794–797.
34. Ruppert D. Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics 2002; 11:735–757.
35. Cai T, Hyndman RJ, Wand MP. Mixed model-based hazard estimation. Journal of Computational and Graphical Statistics 2002; 11:784–798.
36. Lang S, Brezger A. Bayesian P-splines. Journal of Computational and Graphical Statistics 2004; 13:183–212.
37. Aerts M, Claeskens G, Wand MP. Some theory for penalized spline generalized additive models. Journal of Statistical Planning and Inference 2002; 103:455–470.
38. Berry SM, Carroll RJ, Ruppert D. Bayesian smoothing and regression splines for measurement error problems. Journal of the American Statistical Association 2002; 97:160–169.
39. Yu Y, Ruppert D. Penalized spline estimation for partially linear single-index models. Journal of the American Statistical Association 2002; 97:1042–1054.
40. Durban M, Currie ID. A note on P-spline additive models with correlated errors. Computational Statistics 2003; 18:251–262.

41. Crainiceanu CM, Ruppert D. Likelihood ratio tests in linear mixed models with one variance component. Journal of the Royal Statistical Society, Series B 2004; 66:165–185.
42. Fahrmeir L, Kneib T, Lang S. Penalized structured additive regression for space-time data. Statistica Sinica 2004; 14:731–761.
43. Greenland S. Historical HIV incidence modelling in regional subgroups: use of flexible discrete models with penalized splines based on prior curves. Statistics in Medicine 1996; 15:513–525.
44. Marx BD, Eilers PHC. Multivariate calibration stability: a comparison of methods. Journal of Chemometrics 2002; 16:129–140.
45. Kauermann G, Ortlieb R. Temporal pattern in number of staff on sick leave: the effect of downsizing. Journal of the Royal Statistical Society, Series C-Applied Statistics 2004; 53:353–367.
46. Eisen EA, Agalliu I, Thurston SW, Coull BA, Checkoway H. Smoothing in occupational cohort studies: an illustration based on penalised splines. Occupational and Environmental Medicine 2004; 61:854–860.
47. Hastie TJ, Tibshirani RJ. Generalized Additive Models. Chapman & Hall: London, 1990.
48. Hopper JL, Mathews JD. A multivariate normal model for pedigree and longitudinal data and the software Fisher. Australian Journal of Statistics 1994; 36:153–167.
49. Williams CJ. On the covariance between parameter estimates in models of twin data. Biometrics 1993; 49:557–568.
50. SAS Institute Inc. SAS/STAT Software Version 8. SAS Institute Inc., Cary, NC, USA, 2000. www.sas.com
51. Ngo L, Wand MP. Smoothing with mixed model software. Journal of Statistical Software 2004; 9:1–54.
52. Crainiceanu CM, Ruppert D, Wand MP. Bayesian analysis for penalized spline regression using WinBUGS. Journal of Statistical Software 2004, under revision, available at www.people.cornell.edu/pages/cmc59/
53. McCullagh P, Nelder J. Generalised Linear Models. Chapman & Hall: London, 1989.
54. Aitkin M, Anderson D, Francis B, Hinde J. Statistical Modelling in GLIM. Oxford University Press: Oxford, 1989.
55. Breslow N, Clayton DG. Approximate inference in generalised linear mixed models. Journal of the American Statistical Association 1993; 88:9–25.
56. StataCorp. Stata: Release 8.2. Stata Corporation, College Station, TX, USA, 2003. www.stata.com
57. Hall I. b2-adrenoreceptor polymorphisms and asthma. Clinical and Experimental Allergy 1999; 29:1151–1154.
58. Daniels S, Bhattacharrya S, James A, Leaves N, Young A, Hill M, Faux J, Ryan G, le Souef P, Lathrop M et al. A genome wide search for quantitative trait loci underlying asthma. Nature 1996; 383:247–250.
59. Doull I, Lawrence S, Watson M, Begishvili T, Beasley R, Lampe F, Holgate S et al. Allelic association of gene markers on chromosomes 5q and 11q with atopy and bronchial hyperresponsiveness. American Journal of Respiratory and Critical Care Medicine 1996; 153:1280–1284.
60. Yan K, Salome C, Woolcock A. Rapid method for measurement of bronchial hyperresponsiveness. Thorax 1983; 38:760–765.
61. Scurrah KJ, Palmer LJ, Burton PR. Variance components analysis for pedigree-based censored survival data using generalized linear mixed models (GLMMs) and Gibbs sampling in BUGS. Genetic Epidemiology 2000; 19:127–148.
62. Burton P, Tiller K, Gurrin L, Cookson W, Musk A, Palmer L. Genetic variance components analysis for binary phenotypes using generalized linear mixed models (GLMMs) and Gibbs sampling. Genetic Epidemiology 1999; 17:118–140.
63. Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Randomised Trials and Healthcare Evaluation. Wiley: Chichester, 2004.
