the Association Between Polymorphic Markers and ...


Transcript of the Association Between Polymorphic Markers and ...

Page 1: the Association Between Polymorphic Markers and ...

Genetic Epidemiology 4: 193-201 (1987)

Testing the Association Between Polymorphic Markers and Quantitative Traits in Pedigrees

Varghese T. George and Robert C. Elston

Department of Biometry and Genetics, Louisiana State University Medical Center, New Orleans

A statistical model that uses an iterative maximum likelihood estimation procedure is proposed for measuring and testing the association between polymorphic genetic markers and quantitative traits in human pedigrees, after adjusting for covariates such as age and sex. The model allows the quantitative trait to have a familial correlation structure among the individuals in the sample and to follow one of a broad class of skewed or kurtotic underlying distributions. The use of the model is illustrated, and the results are compared to those using models that assume normality without any transformation and do not incorporate familial correlations.

Key words: familial correlation, generalized modulus power transformation, kurtosis, maximum likelihood, normality, skewness

INTRODUCTION

Many disease risk factors appear to be familial, and it is therefore of interest to determine whether a significant portion of the variability of such quantitative traits can be attributed to segregation at a known polymorphic marker locus. It may be of interest, for example, to determine if any part of the population variability of blood pressure can be attributed to segregation at the ABO locus. If this is the only purpose of an investigation, then a random sample of individuals from the population can be studied. But if we also wish to analyse the genetic mechanism underlying the trait, extended pedigrees provide more genetic information than a random sample compris-

Received for publication December 1, 1986; revision accepted January 15, 1987.

Address reprint requests to Dr. Varghese T. George, Department of Biometry and Genetics, L.S.U. Medical Center, 1901 Perdido Street, New Orleans, LA 70112-1393.

© 1987 Alan R. Liss, Inc.

Page 2: the Association Between Polymorphic Markers and ...
Page 3: the Association Between Polymorphic Markers and ...

an error or residual term. The function $f$ is assumed to be known. For example, in the next section we assume the regression function

where $\mu$ is the intercept, $\beta$, $\gamma_1$, $\gamma_2$, and $\gamma_3$ are the usual regression coefficients, and $g_i$ takes on the value 1 if a particular marker is present, -1 if absent. The genetic marker value can also be the presence of a particular genotype, or, more generally, the term $\beta g_i$ can be replaced by a set of terms reflecting the complete set of genotypes at a marker locus.
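The coefficient list above, together with the covariates named in the example section (sex and centred age) and the columns of Table II, suggests a linear predictor with one marker term, one sex term, and linear and quadratic age terms. The following is a minimal sketch under that assumption; the function name, the argument names, the pairing of $\gamma_1$, $\gamma_2$, $\gamma_3$ with sex, age, and age squared, and the coefficient values are all illustrative and are not taken from the paper.

```python
def regression_mean(marker_present, is_male, age, coef, mean_age=31.45):
    """Expected trait value under an assumed form of the regression function."""
    g = 1 if marker_present else -1    # marker code: 1 = present, -1 = absent
    c1 = 1 if is_male else -1          # sex code: 1 = male, -1 = female
    c2 = age - mean_age                # age as a deviation from the sample mean
    mu, beta, gamma1, gamma2, gamma3 = coef
    # Assumed pairing: gamma1 with sex, gamma2 with age, gamma3 with age squared.
    return mu + beta * g + gamma1 * c1 + gamma2 * c2 + gamma3 * c2 ** 2


# Hypothetical coefficients, chosen only to make the arithmetic easy to follow.
example_coef = (200.0, -1.0, -2.0, 1.5, -0.01)
value_at_mean_age = regression_mean(True, True, 31.45, example_coef)
print(value_at_mean_age)
```

At the mean age both age terms vanish, so the value is just the intercept plus the marker and sex effects.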

To induce the assumed normality of the residuals, we use the generalized modulus power transformation (GEMPT), proposed by George and Elston [1987], which is defined as

Here $y$ is the random variable whose residuals need to be normalized, $h(y)$ the transformed random variable, and $\lambda$ and $\delta$ are the transformation parameters to be estimated simultaneously with the other parameters of the model. This transformation is basically a generalization of several existing power transformations such as those proposed by Box and Cox [1964] and John and Draper [1980]. GEMPT can normalize a much broader class of distributions, with varying degrees of skewness and kurtosis, than can other power transformations.

The model before transformation is as given in equation 1. If the transformation $h$ is applied only to $y$, the function $f$ cannot be preserved. So, it is necessary to apply the transformation to $f$ as well as to $y$, as proposed by Carroll and Ruppert [1984]. Hence, the model, after transformation, is given by

$$h(y_i) = h[f(g_i, c_{1i}, \ldots, c_{mi})] + \epsilon_i^*,$$

where $\epsilon_i^*$ is assumed to be normally distributed with mean zero, ie, it is assumed that the conditional distribution of $h(y_i)$ is normal with mean $h[f(g_i, c_{1i}, \ldots, c_{mi})]$.

Transforming $y_i$ to $h(y_i)$ involves a change in scale, and so the likelihood of the transformed data will not be comparable to the likelihood of the original untransformed data; similarly, the variances of the residuals $\epsilon_i$ in equation 1 and $\epsilon_i^*$ in equation 4 will not be comparable. Comparability is achieved, however, if the residual $\epsilon_i^*$ is divided by the Jacobian of the transformation

Page 4: the Association Between Polymorphic Markers and ...


where $n$ is the total number of observations $y_i$. Hence, after applying GEMPT and dividing by $J$, the residual is given by

and is still normally distributed with mean zero but now with variance approximately equal to the variance of $\epsilon_i$ in equation 1.
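The idea of transforming both sides and then rescaling by the Jacobian can be sketched in a few lines. Because the GEMPT formula itself is not reproduced in this text, the sketch below substitutes the Box-Cox power transformation as a stand-in, and it assumes the Jacobian is the n-th root of the product of the derivatives of the transformation (the usual Box-Cox normalisation). All function names and data values are hypothetical.

```python
import math


def box_cox(y, lam):
    """Box-Cox power transformation (a stand-in for GEMPT, not GEMPT itself)."""
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam


def geometric_mean_jacobian(ys, lam):
    """n-th root of the product of dh/dy = y**(lam - 1) over the sample."""
    mean_log = sum(math.log(y) for y in ys) / len(ys)
    return math.exp((lam - 1.0) * mean_log)


def scaled_residuals(ys, fitted, lam):
    """Transform both y and the fitted value, then divide by the Jacobian."""
    j = geometric_mean_jacobian(ys, lam)
    return [(box_cox(y, lam) - box_cox(f, lam)) / j for y, f in zip(ys, fitted)]


# Hypothetical trait values and fitted values.
observed = [180.0, 210.0, 250.0]
fitted = [190.0, 200.0, 240.0]
print(scaled_residuals(observed, fitted, 1.0))   # lam = 1 leaves the scale unchanged
print(scaled_residuals(observed, fitted, 0.5))   # rescaled back towards the original units
```

With `lam = 1` the transformation is a shift, the Jacobian is 1, and the scaled residuals are simply the raw differences, which is a convenient sanity check on the rescaling.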

To allow for the familial correlations, the residual is assumed to be the sum of a familial effect and a unique random environmental effect, both being normally distributed with variances $\sigma_f^2$ and $\sigma_e^2$, respectively. As a first approximation, the familial effect is assumed to lead to the same correlations between residuals as would be found if it were due to polygenes. Thus, the residual correlation between a pair of kth degree relatives is taken to be of the form $f^2 (1/2)^k$, where

$$f^2 = \frac{\sigma_f^2}{\sigma_f^2 + \sigma_e^2},$$

and we propose to call $f^2$ the familiality. This implies that all first degree relatives (siblings, parent-offspring) have a correlation of $f^2/2$, all second degree relatives (half-siblings, uncle-nephew, and so forth) have a correlation of $f^2/4$, and so on. This definition also assumes that there is no correlation between the familial effects of two spouses and hence no assortative mating.
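The correlation rule just described (familiality halved for each degree of relationship, and zero between spouses) can be illustrated for a small nuclear family. This is a minimal sketch; the function names, the use of `None` to mark an unrelated pair, and the familiality value of 0.8 are assumptions made for the illustration.

```python
def residual_correlation(familiality, degree):
    """Correlation between kth degree relatives: familiality * (1/2)**k."""
    if degree is None:      # unrelated pair, for example spouses
        return 0.0
    return familiality * 0.5 ** degree


def correlation_matrix(degrees, familiality):
    """Build the residual correlation matrix from a table of relationship degrees."""
    n = len(degrees)
    return [
        [1.0 if i == j else residual_correlation(familiality, degrees[i][j])
         for j in range(n)]
        for i in range(n)
    ]


# Father, mother, and two children; diagonal entries are ignored.
family_degrees = [
    [0, None, 1, 1],
    [None, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]
matrix = correlation_matrix(family_degrees, 0.8)
for row in matrix:
    print(row)
```

In an extended pedigree the same rule applies, with larger degrees for more distant relatives; the Elston-Stewart algorithm avoids forming this matrix explicitly.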

A fast algorithm proposed by Elston and Stewart [1971], which incorporates this correlation structure, can be used to compute the joint likelihood of the entire pedigree. Using this algorithm, the joint likelihood of the pedigree of $n$ individuals is of the form

$$L = K \exp\left(-\tfrac{1}{2}\, N/T\right),$$

where $K$, $N$, and $T$ are functions of the two variance components $\sigma_f^2$ and $\sigma_e^2$, the transformation parameters $\lambda$ and $\delta$, the parameters of the regression function $f$, and the data. Although it is not feasible to write out explicitly what these functions are, the Elston-Stewart algorithm can be used to calculate them, given particular values of the parameters. The log likelihood of the sample is then linked to a numerical maximum likelihood subroutine to compute the estimates and their standard errors by numerical double differentiation of the log likelihood.
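Obtaining a standard error by numerical double differentiation can be shown on a one-parameter toy problem. The sketch below uses a normal log likelihood with known variance in place of the pedigree likelihood, which is an assumption made purely for illustration; the data values and function names are hypothetical, and a real analysis would differentiate with respect to all parameters jointly.

```python
import math


def log_likelihood(mu, data, sigma=1.0):
    """Normal log likelihood in mu with sigma fixed (toy stand-in)."""
    n = len(data)
    ss = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - ss / (2 * sigma ** 2)


def second_derivative(func, x, h=1e-3):
    """Central-difference approximation to the second derivative of func at x."""
    return (func(x + h) - 2.0 * func(x) + func(x - h)) / h ** 2


data = [4.1, 5.3, 4.8, 5.9, 5.0]           # hypothetical observations
mle = sum(data) / len(data)                # closed-form maximiser for this toy model
curvature = second_derivative(lambda m: log_likelihood(m, data), mle)
std_error = math.sqrt(-1.0 / curvature)    # inverse of the observed information
print(mle, std_error)
```

For this toy model the result can be checked against the known standard error, sigma divided by the square root of the sample size.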

If we can assume that the model is general enough to include the true hypothesis, then the significance of departure from any hypothesis is obtained by comparing the likelihood ratio statistic with the chi-square distribution. The statistic is 2(maximum $\log_e$ likelihood under the general model - maximum $\log_e$ likelihood under the hypothesis). The appropriate number of degrees of freedom is given by the difference between the numbers of parameters estimated when the two likelihoods are maximized. Using this test statistic, the significance of the genetic marker and, if needed, of the other covariates can be tested.
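The likelihood ratio test described here is easy to sketch for the common case of a single added parameter, such as one marker term. The log likelihood values below are hypothetical, the function names are illustrative, and the P value helper covers only 1 degree of freedom, where the chi-square tail reduces to a complementary error function.

```python
import math


def likelihood_ratio_statistic(loglik_general, loglik_hypothesis):
    """Twice the difference between the two maximised log likelihoods."""
    return 2.0 * (loglik_general - loglik_hypothesis)


def chi2_pvalue_1df(statistic):
    """Upper-tail chi-square probability for 1 degree of freedom."""
    return math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))


# Hypothetical maximised log likelihoods with and without one marker term.
stat = likelihood_ratio_statistic(-1000.0, -1001.9205)
p_value = chi2_pvalue_1df(stat)
print(stat, p_value)
```

For hypotheses that differ by more than one parameter, a general chi-square tail function would be needed in place of `chi2_pvalue_1df`.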

AN EXAMPLE

To illustrate the use of our model, we use data collected on a pedigree of 195 individuals spanning six generations, from the ongoing Bogalusa Heart Study [Rosenbaum et al, 1986], to analyse the association between the group-specific component (GC) marker locus and total serum cholesterol.

Page 5: the Association Between Polymorphic Markers and ...

The GC marker locus has two alleles (1 and 2) with three genotypes (11, 12, and 22). Preliminary analyses were conducted using the usual analysis of variance F test, assuming that 1) allele 1 is dominant (ie, the effects of 11 and 12 are equal), 2) allele 2 is dominant (ie, the effects of 12 and 22 are equal), and 3) each allele has an additive effect (ie, the effect of 12 is half-way between the effects of 11 and 22). The assumption of dominance of allele 2 yielded the most significant effect, with P = 0.015, and hence only this case is considered for illustrating our model. The regression function $f$ is as given in equation 2, where, for the ith individual: $y_i$ = value of total serum cholesterol in mg/dl; $g_i$ = allele 2 of the GC marker (1 = present, -1 = absent); $c_{1i}$ = sex (1 = male, -1 = female); and $c_{2i}$ = deviation of age from the mean (31.45), in years.
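The three preliminary hypotheses differ only in how the genotypes 11, 12, and 22 are mapped to a numeric marker score. A minimal sketch of such a coding follows; the hypothesis labels, the function name, and the -1, 0, 1 scoring for the additive case are assumptions made for the illustration, although the allele 2 dominant coding matches the 1 = present, -1 = absent convention given above.

```python
def code_genotype(genotype, hypothesis):
    """Return a numeric marker code for genotype '11', '12', or '22'."""
    n2 = genotype.count("2")                  # number of copies of allele 2
    if hypothesis == "allele1_dominant":      # 11 and 12 share one effect
        return 1 if n2 < 2 else -1
    if hypothesis == "allele2_dominant":      # 12 and 22 share one effect
        return 1 if n2 > 0 else -1
    if hypothesis == "additive":              # 12 lies half-way between 11 and 22
        return n2 - 1
    raise ValueError(f"unknown hypothesis: {hypothesis}")


for hypothesis in ("allele1_dominant", "allele2_dominant", "additive"):
    print(hypothesis, [code_genotype(g, hypothesis) for g in ("11", "12", "22")])
```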

The data were analysed using four different methods, all using an iterative maximum likelihood procedure:

Method 1: Assuming that the residuals $\epsilon_i$, as defined in equation 1, are normally distributed, and without incorporating any familial correlation structure among the individuals in the pedigree.

Method 2: Incorporating the familial correlation structure as explained in the previous section, and assuming that the $\epsilon_i$ are normally distributed.

Method 3: Without incorporating any familial correlation structure, but applying the GEMPT to $y_i$ and $f_i$ as in equation 4, and dividing both sides of equation 4 by the Jacobian $J$ given in equation 5. Here we assume that the transformation induces normality, and that the residuals defined in equation 6 are normally distributed.

Method 4: Incorporating familial correlations and the GEMPT, and dividing both sides by the Jacobian. As in method 3, we assume that the residuals defined in equation 6 are normally distributed.

Summaries of the analyses using these four methods are given in Tables I and II. It can be seen from Table I that the P value for the test for significance of GC(2) changes, and the log likelihood value increases substantially as familial correlations and/or GEMPT are incorporated. The largest likelihoods correspond to method 4, where both familial correlations and GEMPT are incorporated. The transformation parameters $\lambda$ and $\delta$ for this method are estimated to be 1.67 and 287.58, respectively, regardless of whether GC(2) is included in the model or not. The estimates of the residual familiality with and without including GC(2) in the regression model are

TABLE I. Summary of Analyses Using Methods 1 to 4*

Method  Variables included       -2 log likelihood  Chi-square statistic  P value  AIC
1a      Age, age², sex
1b      Age, age², sex, GC(2)
2a      Age, age², sex
2b      Age, age², sex, GC(2)
3a      Age, age², sex
3b      Age, age², sex, GC(2)
4a      Age, age², sex
4b      Age, age², sex, GC(2)

*Method 1: No transformation, and no familial correlations. Method 2: No transformation, but familial correlations. Method 3: GEMPT, but no familial correlations. Method 4: GEMPT, and familial correlations.
a Best model according to AIC.

Page 6: the Association Between Polymorphic Markers and ...

TABLE II. Regression Coefficients and Residual Standard Deviation, Skewness, and Kurtosis for the Analyses Summarized in Table I

                      Regression coefficients                        Residual
Method(a)  Intercept  Age - 31.45  (Age - 31.45)²  Sex     GC(2)    SD(b)   Skewness  Kurtosis
1a         200.46     1.73         -0.019          -1.82   -        41.56   0.99      5.10
1b         201.48     1.74         -0.020          -1.79   4.40     41.34   0.98      5.04
2a         198.03     1.59         -0.014          -2.17   -        44.56   1.03      5.12
2b         198.16     1.B          -0.014          -2.15   -1.15    44.40   1.45      5.15
3a         192.24     1.49         -0.011          0.67    -        36.82   0.13      3.07
3b         192.94     1.49         -0.012          0.74    -2.88    36.69   0.14      3.04
4a         19O.M      1.35         -0.M18          0.47    -        39.40   0.08      3.02
4b         190.37     1.35         -0.008          0.52    -1.M1    39.32   0.0%      3.01

(a) See first footnote to Table I.
(b) Standard deviation of the residual after division by the Jacobian of the transformation.

Page 7: the Association Between Polymorphic Markers and ...


83.6% and 84.0%, respectively. Thus, the familiality that can be attributed to the presence or absence of GC(2) is 0.4%, and it is not significant.

The best fitting hypothesis can be taken to be the one that gives rise to the largest likelihood. However, different hypotheses depend on different numbers of unknown parameters, and we should expect the likelihood to increase with the number of parameters over which it is maximized. To allow for this fact, Akaike [1977] suggested that we choose the hypothesis that leads to the smallest value of the information criterion (AIC), defined as

$$\mathrm{AIC} = -2\ell + 2P,$$

where $\ell$ is the natural logarithm of the maximum likelihood and $P$ is the number of independently estimated parameters. The last column in Table I corresponds to the AIC, and it can be seen that method 4, without the inclusion of GC(2), gives the best model according to this criterion.
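The trade-off that the AIC makes between fit and parameter count can be seen in a two-model comparison. This is a minimal sketch using the standard definition of the criterion (minus twice the maximised log likelihood plus twice the number of parameters); the log likelihoods and parameter counts are hypothetical and are not the values from Table I.

```python
def aic(max_log_likelihood, n_parameters):
    """Akaike information criterion: -2 * log likelihood + 2 * parameter count."""
    return -2.0 * max_log_likelihood + 2.0 * n_parameters


# Hypothetical fits: one extra parameter raises the log likelihood by only 0.5.
aic_without_marker = aic(-950.0, 7)
aic_with_marker = aic(-949.5, 8)
best = "without marker" if aic_without_marker < aic_with_marker else "with marker"
print(aic_without_marker, aic_with_marker, best)
```

An added parameter lowers the AIC only if it raises the maximised log likelihood by more than 1, which is why a marker term with a small effect can be left out of the preferred model.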

To verify that GEMPT removed skewness and kurtosis and induced approximate normality, we examined the residuals of the various models (Table II). The coefficients of skewness and kurtosis of the residuals when no transformation is incorporated (method 2a) are 1.03 and 5.12, respectively, and the Kolmogorov test rejects normality of the residuals with a P value less than 0.01. The residuals when GEMPT is incorporated (method 4a) have coefficients of skewness 0.08 and kurtosis 3.02, which are very close to those of a normal distribution (skewness = 0, kurtosis = 3), and the test does not reject normality of the residuals (P value greater than 0.15).
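Residual skewness and kurtosis of the kind reported in Table II can be computed from central moments. The sketch below uses the simple moment coefficients, for which a normal distribution gives skewness 0 and kurtosis 3; whether the paper used these or bias-corrected estimators is not stated, so the choice of estimator, the function name, and the sample values are assumptions.

```python
def skewness_and_kurtosis(residuals):
    """Moment coefficients: m3 / m2**1.5 and m4 / m2**2."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical residuals: one symmetric sample and one with a long right tail.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_tailed = [-1.0, -1.0, 0.0, 0.0, 6.0]
print(skewness_and_kurtosis(symmetric))
print(skewness_and_kurtosis(right_tailed))
```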

When the data were analysed using the analysis of variance F test (without incorporating a transformation or familial correlations) GC(2) was found significant with a P value of 0.015. However, when the maximum likelihood procedure is used to analyse the same data, under the same assumptions of normality and no familial correlations, the asymptotic test based on the likelihood ratio criterion resulted in a P value of 0.165 for the significance of GC(2) (method 1, Table I). This discrepancy may be attributed to the fact that the analysis of variance F test is extremely sensitive to heavily kurtotic distributions [Scheffé, 1959], whereas the chi-square test based on the likelihood ratio criterion is much more robust.

DISCUSSION

Boerwinkle et al [1986] have recently proposed an analytical method for the use of "measured genotype information" in nuclear families. They use the term "measured genotype" in the same sense that we have used the term "polymorphic marker," and their method is similar to the one we propose here for pedigrees. However, whereas we have concentrated on testing whether the polymorphic marker effect is significant, their emphasis is more on the separate estimation of both this "measured genotype" effect and the contribution of the residual polygenic effects to phenotypic variabilities. The latter they call the "residual polygenic heritability," and their estimate of it is identical to our estimate of "familiality." We prefer the term "familiality" because there is no guarantee that it is in fact genetic in origin, even though, in incorporating the correlation among the individuals in the pedigree, we

Page 8: the Association Between Polymorphic Markers and ...
Page 9: the Association Between Polymorphic Markers and ...

Elston RC, Stewart J (1971): A general model for the genetic analysis of pedigree data. Hum Hered 21:523-542.

George VT, Elston RC (1987): Generalized modulus power transformations. Submitted for publication.

John JA, Draper NR (1980): An alternative family of transformations. Appl Stat 29:190-197.

Rosenbaum PA, Amos CI, Shear CL, Elston RC, Mien TA, Srinivasan SR, Berenson GS (1986): Description of a large with an dwm lipoprotein c h o h d m p e: The Bogalusa Heart Study. Genet Epidemiol 3:241-253.

Scheffé H (1959): "The Analysis of Variance." New York: John Wiley & Sons, Inc, pp 331-369.

Sing CF, Orr JD (1976): Analysis of genetic and environmental sources of variation in serum cholesterol in Tecumseh, Michigan. III. Identification of genetic effects using 12 polymorphic genetic blood marker systems. Am J Hum Genet 28:453-464.

Edited by D.C. Rao