Statistical methods for haplotype-based matched case-control association studies


Genetic Epidemiology 31: 316–326 (2007)

Statistical Methods for Haplotype-Based Matched Case-Control Association Studies

Han Zhang,1 Hong Zhang,1,2,* Zhaohai Li,3,4 and Gang Zheng5

1 Department of Statistics and Finance, University of Science and Technology of China, PR China

2 Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, Connecticut

3 Department of Statistics, George Washington University, Washington, DC

4 Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland

5 Office of Biostatistics Research, National Heart, Lung and Blood Institute, Bethesda, Maryland

Kraft et al. [2005] proposed a method for matched haplotype-based association studies and compared the performances of six analytic strategies for estimating the odds ratio parameters using a conditional likelihood function. Zhang et al. [2006] modified the conditional likelihood and proposed a new method for matched haplotype-based association studies. The main assumptions of Zhang et al. were that the disease was rare, the population was in Hardy-Weinberg equilibrium (HWE), and the haplotypes were independent of the covariates and matching variable(s). In this article, we modify the estimation procedure proposed by Zhang et al. and introduce a fixation index so that the assumption of HWE is relaxed. Using the Wald test, we compare the current modified method with the procedure developed by Kraft et al. through simulations. The results show that the modified method is uniformly more powerful than that described in Kraft et al. Furthermore, the results indicate that the modified method is quite robust to the rare disease assumption. Genet. Epidemiol. 31:316–326, 2007. © 2007 Wiley-Liss, Inc.

Key words: bias; estimation; haplotype-analysis; matching; Wald test

Contract grant sponsor: NIH grant; Contract grant number: EY014478; Contract grant sponsor: National Natural Science Foundation of China; Contract grant number: 10471136; Contract grant sponsor: Ph.D. Program Foundation of the Ministry of Education of China; Contract grant sponsor: Special Foundations of the Chinese Academy of Science; Contract grant sponsor: University of Science and Technology of China.

*Correspondence to: Hong Zhang, Ph.D., Department of Epidemiology and Public Health, Yale University School of Medicine, 60 College St, New Haven, CT 06510. E-mail: [email protected]

Received 8 May 2006; Revised 20 November 2006; Accepted 30 December 2006

Published online 6 February 2007 in Wiley InterScience (www.interscience.wiley.com).

DOI: 10.1002/gepi.20212

INTRODUCTION

Recently developed technologies have made available very abundant single nucleotide polymorphisms (SNPs) [Chee et al., 1996; Wang et al., 1998]. Using densely distributed SNPs on chromosomes, one can accurately map genes influencing complex human diseases. SNP-based association studies for human disease mapping have been shown to be more powerful than linkage-based studies [Risch and Merikangas, 1996; Risch, 2000; Botstein and Risch, 2003]. Cordell and Clayton [2002] developed a SNP-disease association test by treating each SNP as a separate variable and using a stepwise strategy to process all SNPs systematically. However, this method may not be efficient as it does not utilize linkage disequilibrium information among SNPs. It is more appealing to study haplotypes for multiple tightly linked SNPs within candidate genes. The main difficulty in applying haplotype-based association methods is that SNP data often consist of unphased genotypes. Many statistical methods proposed for haplotype analysis have been based on the unmatched case-control design. There are two basic likelihoods that have been used in unmatched case-control studies with unphased genotypes, namely, the prospective likelihood [Schaid et al., 2002; Zaykin et al., 2002; Lake et al., 2003; Zhao et al., 2003; Stram et al., 2003b] and the retrospective likelihood [Epstein and Satten, 2003; Satten and Epstein, 2004; Chatterjee and Carroll, 2005; Lin and Zeng, 2006]. Satten and Epstein [2004] and Spinka et al. [2005] compared the two likelihoods. As pointed out by Spinka et al. [2005], the ascertainment-corrected prospective likelihood described by Stram et al. [2003b] is equivalent to the retrospective likelihood if the distributions of genotypes and covariates are completely nonparametric.

Owing to confounding variables, unmatched case-control designs may lead to spurious associations. Ignoring the confounders, for example, race and age, can increase type I error, depending on the degree of confounding. A widely used strategy to control for confounding variables is matching. Population-based matched case-control design, using unrelated individuals as controls, is an appealing approach. Haplotype-based matched case-control design has been used for genetic association studies. For example, the Nutritional Branch of the Division of Cancer Epidemiology and Genetics at the National Cancer Institute is conducting a haplotype-based matched case-control study within the ATBC (Alpha-Tocopherol, Beta-Carotene Cancer Prevention) Study [Heinonen et al., 1998]. This study will examine the association between haplotypes in vitamin E transport genes and prostate cancer risk. The second example is "A Case-Control Etiologic Study of Sarcoidosis" (ACCESS), sponsored by the National Heart, Lung, and Blood Institute [ACCESS Research Group, 1999]. Currently, ACCESS investigators are conducting haplotype analysis of 50 selected SNPs from immune response genes based on a matched case-control design. Kraft et al. [2005] analyzed a haplotype-based matched case-control study within the prospective Nurses' Health Study (NHS) cohort [De Vivo et al., 2002; Spinka et al., 2005] to test the association between progesterone receptor gene (PGR) haplotypes and endometrial cancer.

Kraft et al. [2005] applied the conditional logistic regression model to haplotype data in matched case-control studies and used six strategies to handle ambiguity of haplotype phases. On the basis of the coverage probabilities (CPs), they identified the two best approaches, referring to them as EHAP and JHAP. The EHAP procedure uses the conditional expectation of the haplotype score, given the data, to substitute the diplotype effect in the conditional logistic regression model. The JHAP method approximates the joint likelihood for haplotype risk parameters and frequencies by the conditional expectation of the conditional logistic likelihood. These two approaches had similar levels of performance in simulation studies. Assuming the rare disease hypothesis and Hardy-Weinberg equilibrium (HWE) in the population, Zhang et al. [2006] modified the usual logistic regression model for analysis of haplotype data from matched case-control studies. This approach improves the efficiency of haplotype analysis.

In this article, we introduce a fixation index [Satten and Epstein, 2004] to the procedures of Zhang et al. [2006] so that the HWE assumption is relaxed. Then, we compare the performance of this modified approach with that of EHAP by the use of a simulation study, focusing on the Wald test. In fact, Zhang et al. [2006] noticed that the performances of the Wald test and the score test were similar. The simulation results show that the proposed modified approach and the EHAP have unbiased estimates of odds ratio (OR) parameters, accurate CPs, and good control of type I error. Moreover, the new modified method has uniformly greater power than the EHAP in the situations considered in the simulation study.

METHODS

Suppose that $n$ independent cases are sampled from the study population and that $r$ ($r \ge 1$) controls are matched to each case. Each matched set consists of $r + 1$ individuals. The $j$-th individual in the $i$-th matched set is indicated by $(i, j)$, where $j = 0$ for the case and $1 \le j \le r$ for controls. The diplotype with phase information and the vector of covariates for the $(i, j)$-th individual are denoted by $h_{ij}$ and $x_{ij}$, respectively. The disease status variable $d_{ij}$ has $d_{i0} = 1$ for the $i$-th case and $d_{ij} = 0$ for the $j$-th control, $j \ge 1$, in the $i$-th matched set, and we denote the vector of matching variables by $c_i$. The corresponding genotype of the $(i, j)$-th individual without phase information is denoted by $g_{ij}$. To simplify the presentation, we use the notation $x_i = (x_{i0}, \ldots, x_{ir})$, $h_i = (h_{i0}, \ldots, h_{ir})$, $g_i = (g_{i0}, \ldots, g_{ir})$, and $d_i = (d_{i0}, \ldots, d_{ir}) = (1, 0, \ldots, 0)$. We further assume that the disease prevalence for given diplotype, covariates, and matching variables follows the logistic regression model [Breslow and Day, 1980]:

$$f_{ij} = P(d_{ij} = 1 \mid h_{ij}, x_{ij}, c_i) = \frac{e^{\alpha_i + \beta' z(h_{ij}, x_{ij})}}{1 + e^{\alpha_i + \beta' z(h_{ij}, x_{ij})}},$$

where $z(h_{ij}, x_{ij})$ is a function of $h_{ij}$ and $x_{ij}$, $\beta$ is a vector of log OR parameters, and $\alpha_i$ is an intercept dependent on the $i$-th matched set. The likelihood function for matched case-control studies can be written as

$$\prod_{i=1}^{n} P(x_i, h_i \mid d_i, c_i),$$

which contains many nuisance parameters related to matching variables and covariates. Estimating these nuisance parameters is complicated.

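A widely used approach for haplotype analysis is to consider the conditional likelihood function [Liddell et al., 1977]:

$$\prod_{i=1}^{n} P\Big(d_i \,\Big|\, x_i, h_i, c_i, \sum_{j=0}^{r} d_{ij} = 1\Big) = \prod_{i=1}^{n} \frac{e^{\beta' z(h_{i0}, x_{i0})}}{\sum_{j=0}^{r} e^{\beta' z(h_{ij}, x_{ij})}}. \qquad (1)$$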

This likelihood function does not involve the nuisance parameters $\alpha_i$ and is similar to the partial likelihood function [Cox, 1972]. Since the $h_i$'s are usually not observed, the above likelihood cannot be used directly. Kraft et al. [2005] applied six strategies and showed by simulation studies that two of them (EHAP and JHAP) had the best CPs for the estimation of OR parameters. For the EHAP method, $z(h_{ij}, x_{ij})$ was replaced by its conditional expectation, given the observed genotypes. For the JHAP procedure, Kraft et al. [2005] directly applied the conditional expectation of (1), given the genotypes. Both strategies require estimating the parameters related to diplotypes. Kraft et al. [2005] estimated the haplotype frequencies for both cases and controls by assuming HWE in the pooled sample. To estimate the variances of the estimates of ORs, they did not adjust for the extra variances in the estimates attributable to the estimated nuisance parameters. Both methods have some limitations. First, estimating haplotype frequencies from both cases and controls may result in a biased estimate, as HWE usually does not hold for cases. Second, the test statistics may be biased by directly replacing the unobserved covariates $z(h_{ij}, x_{ij})$ with their conditional expectations or by taking the conditional expectation of (1). In the following simulation study, we modify their variance estimates to accommodate the estimation of haplotype frequencies. The modification of the procedure is described briefly at the end of this section. Then we conduct a simulation study on the performance of the Wald test for EHAP, using the estimated variance with the modification. Although JHAP had a performance similar to that of EHAP in Kraft et al. [2005], it is more complicated to modify its estimated variance. Therefore we do not consider JHAP in the simulation using the Wald test.

The conditional likelihood function (1) may lose some information, as studied in Pike et al. [1980]. The following proposed likelihood, however, can reduce information loss, as it lies in between the full likelihood function and the conditional likelihood function. In fact, the full likelihood involves the largest number of parameters, whereas the conditional likelihood involves the smallest number (only the OR parameters). Thus, our modified likelihood is "in between" in the sense of the number of parameters.
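As a rough illustration of the conditional likelihood (1), the following Python sketch evaluates its logarithm for a scalar $\beta$ when the scores $z$ are already available, for instance after substituting expected scores as in EHAP. The function name, the array layout (case in column 0, controls in the remaining columns), and the toy scores are assumptions made for this illustration and are not taken from Kraft et al. [2005].

```python
import numpy as np

def conditional_loglik(beta, Z):
    """Log of the conditional likelihood (1) for a scalar beta.

    Z has shape (n, r + 1): column 0 holds the case score z(h_i0, x_i0),
    the other columns hold the scores of the r matched controls.
    (Layout and name are assumptions for this sketch.)
    """
    eta = beta * np.asarray(Z, dtype=float)
    # Stable log-sum-exp over each matched set (the denominator of (1)).
    row_max = eta.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(eta - row_max).sum(axis=1))
    # Case linear predictor minus log denominator, summed over matched sets.
    return float(np.sum(eta[:, 0] - log_denominator))

# Hypothetical scores for three 1:1 matched sets.
Z_toy = [[1, 0], [0, 0], [2, 1]]
ll_null = conditional_loglik(0.0, Z_toy)
ll_alt = conditional_loglik(np.log(2.0), Z_toy)
```

At $\beta = 0$ every member of a matched set is equally likely to be the case, so the log-likelihood reduces to $-n \log(r + 1)$, which gives a quick sanity check of the implementation.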

Zhang et al. [2006] proposed using the likelihood function

$$\prod_{i=1}^{n} P\Big(h_i, d_i \,\Big|\, x_i, \sum_{j=0}^{r} d_{ij} = 1, c_i\Big), \qquad (2)$$

which modeled the disease status and haplotype data jointly. Under the rare disease assumption, the disease probability $f_{i0}$ of the case was approximated by $\exp\{\alpha_i + \beta' z(h_{i0}, x_{i0})\}$, and the probability $1 - f_{ij}$ that a control ($j > 0$) is disease-free was approximated by 1. With the added assumption that the diplotypes are independent of the matching variable and the covariates, the likelihood function (2) was approximated by (see eq. (7) in Zhang et al. [2006])

$$\prod_{i=1}^{n} \frac{e^{\beta' z(h_{i0}, x_{i0})} \prod_{j=0}^{r} P(h_{ij})}{\sum_{h} \sum_{j=0}^{r} e^{\beta' z(h_j, x_{ij})} P(h_j)}, \qquad (3)$$

where $\sum_h$ is taken over all possible diplotype sets $h = (h_0, \ldots, h_r)$. This likelihood function does not involve the nuisance parameters $\alpha_i$ (refer to Zhang et al. [2006] for the detailed derivation). The corresponding likelihood function for the observed data $\{g_i, x_i, c_i\}_{i=1}^{n}$ and disease status was approximated by

$$\prod_{i=1}^{n} P\Big(g_i, d_i \,\Big|\, x_i, \sum_{j=0}^{r} d_{ij} = 1, c_i\Big) = \prod_{i=1}^{n} \frac{\sum_{h_0} e^{\beta' z(h_0, x_{i0})} P(h_0)\, I(h_0 \in S(g_{i0})) \prod_{j=1}^{r} P(g_{ij})}{\sum_{h} \sum_{j=0}^{r} e^{\beta' z(h_j, x_{ij})} P(h_j)}, \qquad (4)$$

where $S(g)$ is the set containing all the diplotypes that are compatible with genotype $g$. The above likelihood function contains unknown parameters related to diplotypes. It is natural to estimate the unknown parameters by maximizing the above likelihood. However, this is very computation-intensive. Instead, Zhang et al. [2006] proposed estimating the unknown parameters using the control data only, that is, by maximizing the likelihood function for control genotypes

$$\prod_{i=1}^{n} \prod_{j=1}^{r} P(g_{ij}). \qquad (5)$$

Assuming HWE in the population and jointly considering the likelihoods (4) and (5), they obtained a new Wald-type test.
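To make the set $S(g)$ and the genotype probabilities in (5) concrete, here is a small Python sketch that enumerates the diplotypes compatible with an unphased genotype and sums their probabilities under HWE. The encoding is an assumption of this sketch: haplotypes are strings of 0/1 alleles, a genotype is the per-SNP count of allele 1, and the two-SNP frequencies are made-up values rather than data from the paper.

```python
def compatible_diplotypes(genotype, haplotypes):
    """Return S(g) as unordered haplotype pairs whose allele counts equal g."""
    pairs = []
    for a in range(len(haplotypes)):
        for b in range(a, len(haplotypes)):
            h1, h2 = haplotypes[a], haplotypes[b]
            if all(int(x) + int(y) == g for x, y, g in zip(h1, h2, genotype)):
                pairs.append((h1, h2))
    return pairs

def genotype_prob_hwe(genotype, freq):
    """P(g) under HWE: sum of diplotype probabilities over S(g)."""
    total = 0.0
    for h1, h2 in compatible_diplotypes(genotype, list(freq)):
        # Homozygous pairs contribute p^2, heterozygous pairs 2 * p1 * p2.
        total += freq[h1] ** 2 if h1 == h2 else 2.0 * freq[h1] * freq[h2]
    return total

# Hypothetical two-SNP haplotype frequencies (illustrative only).
toy_freq = {"00": 0.4, "01": 0.1, "10": 0.2, "11": 0.3}
double_het = compatible_diplotypes((1, 1), list(toy_freq))
p_double_het = genotype_prob_hwe((1, 1), toy_freq)
```

The doubly heterozygous genotype (1, 1) is the phase-ambiguous case: it is compatible with both 00/11 and 01/10, so its probability is a sum of two terms.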

In the situation where HWE does not hold, we introduce a fixation index $F$ as in Satten and Epstein [2004]:

$$P([h_1, h_2]) = \begin{cases} F\,P(h_1) + (1 - F)\,(P(h_1))^2, & \text{for } h_1 = h_2, \\ (1 - F)\,P(h_1)\,P(h_2), & \text{for } h_1 \ne h_2, \end{cases} \qquad (6)$$

where $[h_1, h_2]$ is an ordered diplotype. The Wald-type test of Zhang et al. [2006] can be extended to incorporate the fixation index. The unknown parameters involved in likelihoods (4) and (5) are the OR parameters $\beta$, haplotype frequencies $p$, and fixation index $F$. Let $\theta = (\beta^T, p^T, F)^T$. Denote by $L_1(\theta)$ and $L_2(\theta)$ the likelihood functions (4) and (5), respectively. Then maximizing them is equivalent to solving the estimating equations $U(\theta) = (U_\beta^T(\theta), U_p^T(\theta), U_F^T(\theta))^T = 0$, where $U_\beta(\theta) = \partial \log L_1(\theta)/\partial \beta$, $U_p(\theta) = \partial \log L_2(\theta)/\partial p$, and $U_F(\theta) = \partial \log L_2(\theta)/\partial F$. Let the resulting estimate of $\theta$ be $\hat\theta$. The covariance matrix of $\hat\theta$ can be estimated by

$$(n - 1)^{-1} n^2 \left\{ S^{-1}(\hat\theta) \right\} \left\{ \sum_{i=1}^{n} [U_i(\hat\theta) - \bar U(\hat\theta)][U_i(\hat\theta) - \bar U(\hat\theta)]^T \right\} \left\{ S^{-1}(\hat\theta) \right\}^T,$$

where $S(\theta) = \partial U(\theta)/\partial \theta^T$, $U_i(\theta)$ is the score contribution of the $i$-th matched set, and

$$\bar U(\theta) = n^{-1} \sum_{i=1}^{n} U_i(\theta).$$

Then the usual Wald test and generalized score test [Boos, 1992] can be applied. The simulation results of Zhang et al. [2006] showed that the two tests had similar behavior, so we focus on the Wald test in the rest of this section. However, simulation results using the generalized score test are also presented at the end of the Results section.
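The covariance estimate above has a sandwich form. A minimal NumPy sketch follows, under several assumptions: the leading factor is read as $n^2/(n - 1)$, `U_i` stacks the per-set score contributions evaluated at the estimate, and `S` is the derivative matrix of the total score. The function name and the toy example (estimating a mean, with $U_i = x_i - \theta$) are illustrative and do not come from the paper.

```python
import numpy as np

def sandwich_cov(U_i, S):
    """Sandwich covariance from per-set scores and the score derivative.

    U_i: array of shape (n, k), score contribution of each matched set.
    S:   array of shape (k, k), derivative of the total score.
    The n**2 / (n - 1) factor follows the reading assumed in the lead-in.
    """
    n = U_i.shape[0]
    centered = U_i - U_i.mean(axis=0)      # U_i minus U-bar
    meat = centered.T @ centered           # sum of outer products
    S_inv = np.linalg.inv(S)
    return n ** 2 / (n - 1) * S_inv @ meat @ S_inv.T

# Toy check: estimating a mean, where U_i(theta) = x_i - theta and S = -n.
x = np.array([1.0, 2.0, 4.0, 7.0])
theta_hat = x.mean()
U_toy = (x - theta_hat).reshape(-1, 1)
S_toy = np.array([[-float(len(x))]])
cov = sandwich_cov(U_toy, S_toy)
```

In this toy case the formula returns the sample variance of the $x_i$, so with this reading of the leading factor the result is on the scale of $\sqrt{n}(\hat\theta - \theta)$ rather than of $\hat\theta$ itself.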

When conducting the Wald test, we need to calculate the derivative matrix $S(\hat\theta)$ of the score vector. In practice, many haplotype frequencies might be estimated as zero or very close to zero, under which $S(\hat\theta)$ could be singular and the Wald test would fail. To overcome this problem, Zhang et al. [2006] pooled those haplotypes as a new haplotype. Their simulation results showed that pooling haplotypes with small frequencies may result in biased results, especially when the effect of the diplotype is large. We suggest another procedure here: when calculating the derivative matrix, we simply delete those entries related to haplotypes with small estimated frequencies. This would have little impact on the estimation of the OR parameters $\beta$ and on testing haplotype effects. In addition, when the estimate of $F$ is either $< 0$ or $> 1$, we set the resulting estimate of $F$ to 0 or 1, and the other parameters are estimated using $F = 0$ or 1, respectively. The procedure for estimating the covariance matrix of the estimated parameters with the EHAP method is the same as above, except that it is based on likelihood functions (1) and (5), instead of (4) and (5).
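The fixation-index model (6) and the truncation of the estimate of $F$ to the interval [0, 1] can be sketched in a few lines of Python. The function names and the three-haplotype frequencies are hypothetical, chosen only to show that the ordered-diplotype probabilities sum to one for any admissible $F$.

```python
from itertools import product

def clip_fixation_index(F_hat):
    """Truncate an estimate of F to [0, 1], as described in the text."""
    return min(max(F_hat, 0.0), 1.0)

def diplotype_prob(h1, h2, freq, F):
    """Probability of the ordered diplotype [h1, h2] under equation (6)."""
    p1, p2 = freq[h1], freq[h2]
    if h1 == h2:
        return F * p1 + (1.0 - F) * p1 ** 2
    return (1.0 - F) * p1 * p2

# Hypothetical haplotype labels and frequencies (illustrative only).
toy_freq = {"A": 0.5, "B": 0.3, "C": 0.2}
F = clip_fixation_index(0.15)
total = sum(diplotype_prob(a, b, toy_freq, F)
            for a, b in product(toy_freq, repeat=2))
```

Setting $F = 0$ recovers the HWE probabilities, while $F = 1$ puts all mass on homozygous diplotypes, which is why estimates outside [0, 1] are truncated before the remaining parameters are estimated.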

RESULTS

A simulation study was conducted to compare the performances of the EHAP procedure of Kraft et al. [2005] and the proposed modified method (NHAP). The simulation data were generated according to the structure of the published FUSION data [Valle et al., 1998; Epstein and Satten, 2003]. Table I lists the haplotypes pertaining to five tightly linked SNPs on chromosome 22 found in the control sample of the FUSION data, together with their probabilities and certainty measures $R_h^2$ [Stram et al., 2003a]. We first focused on haplotype 01100. The underlying disease prevalence was set to $\exp(\alpha_i + \beta z(h_{ij}))/\{1 + \exp(\alpha_i + \beta z(h_{ij}))\}$ with $h_{ij} = (h_{ij}^{(1)}, h_{ij}^{(2)})$ and $z(h_{ij}) = I(h_{ij}^{(1)} = 01100) + I(h_{ij}^{(2)} = 01100) + s_0 I(h_{ij}^{(1)} = 01100, h_{ij}^{(2)} = 01100)$, where $s_0 = -1$, 0, and 1 correspond to the three modes of inheritance: recessive, additive, and dominant, respectively. For each situation, one or two controls were matched on the case value of $\alpha_i$, with the $\alpha_i$ selected from the uniform distribution $U[\log 10^{-2} - \log 2, \log 10^{-2} + \log 2]$, so that the baseline prevalence $e^{\alpha_i}$ was about $10^{-2}$. The OR parameter $e^{\beta}$ was set to 1, 1.25, or 1.5.

TABLE I. Frequencies and certainty measures of haplotypes

Haplotype   Frequency   R_h^2
01100       0.2514      0.738
00011       0.0042      0.388
00110       0.0018      0.377
01111       0.0019      0.325
10011       0.3574      0.727
10110       0.0317      0.554
11100       0.0110      0.266
00100       0.0035      0.336
01011       0.1292      0.592
01101       0.0012      0.450
10000       0.0136      0.456
10100       0.0520      0.402
11011       0.1391      0.560
11111       0.0020      0.338

R_h^2, certainty measure.

Assuming HWE in the population, we generated 5,000 replicates of 100 or 200 matched sets. The EHAP and NHAP procedures were applied to analyze the simulated data. We deleted those haplotypes with estimated frequencies of less than $10^{-4}$ when calculating the derivative matrix of the score vector. The resulting estimates of the OR parameter, type I error, and power are reported in Tables II and III for 1:1 matching and 1:2 matching, respectively. In the tables, the headings Bias, SD, MSD, and Wald stand for the mean estimates minus the true value, the mean standard deviation, the mean estimated standard deviation, and the size or power of the Wald test, respectively.
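The intercept setting used in these simulations can be illustrated with a short NumPy sketch. It draws one intercept per matched set from the uniform distribution stated in the text and evaluates the logistic prevalence. The seed, the variable names, and the example score $z = 1$ are assumptions of this sketch, and no attempt is made to reproduce the full data-generating procedure.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n_sets = 200

# alpha_i ~ U[log 1e-2 - log 2, log 1e-2 + log 2]
low = np.log(1e-2) - np.log(2.0)
high = np.log(1e-2) + np.log(2.0)
alpha = rng.uniform(low, high, size=n_sets)

# Baseline prevalence exp(alpha_i) lies between 0.005 and 0.02.
baseline = np.exp(alpha)

def prevalence(alpha_i, beta, z):
    """Logistic disease prevalence exp(eta) / (1 + exp(eta))."""
    eta = alpha_i + beta * z
    return np.exp(eta) / (1.0 + np.exp(eta))

# Example: baseline 1e-2, OR parameter 1.5, score z = 1.
example = prevalence(np.log(1e-2), np.log(1.5), 1)
```

With these values the exact prevalence is 0.015/1.015, close to the rare-disease approximation 0.015, which gives a feel for how mild the approximation error is at this baseline.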

The nominal CP is 95% and the nominal significance level is 5%. For additive and dominant models, the estimates of ORs in both approaches are nearly unbiased, with an absolute bias of no more than 0.008. For NHAP, the mean estimated standard deviations are quite close to the simulated values, with the absolute difference no more than 0.003; the CPs and type I errors are also close to their nominal values, with CP varying from 0.941 to 0.954 and the type I error varying from 0.049 to 0.053. For EHAP, the mean estimated standard deviations are slightly larger than the simulated values; for example, the difference between them is 0.024 under the dominant model with 1:1 matching, $e^{\beta} = 1.5$, and sample size = 100. The CPs, which vary from 0.948 to 0.975, are slightly larger than the nominal value; however, the type I errors, which vary from 0.046 to 0.052, are virtually the same as the nominal value. Under the recessive model, both methods perform slightly worse in terms of CP and type I error. But the performances are substantially improved by increasing the sample size from 100 to 200. For example, the type I error of NHAP increases from 0.036 to 0.043 with 1:1 matching. In terms of the standard deviation and power, the results also show that the performance of NHAP is uniformly better than that of EHAP. For example, under the recessive model with 1:1 matching, sample size = 100, and $e^{\beta} = 1.25$, the ratio of the standard deviation of NHAP to that of EHAP is 0.76. Also the relative efficiency of NHAP compared with EHAP, defined as the ratio of the power of NHAP to that of EHAP, can be as large as 1.7.

TABLE II. Summary of estimates of β and size or power of NHAP and EHAP when HWE holds, for a rare disease

        NHAP                                    EHAP
e^β     Bias    SD      MSD     CP      Wald    Bias    SD      MSD     CP      Wald

Sample size = 100
Dominant
1.00    −0.003  0.144   0.145   0.951   0.049   −0.004  0.151   0.152   0.953   0.047
1.25    0.000   0.145   0.143   0.941   0.343   0.002   0.159   0.168   0.962   0.214
1.50    −0.002  0.143   0.146   0.953   0.805   0.006   0.163   0.187   0.971   0.631
Additive
1.00    −0.000  0.239   0.239   0.950   0.050   −0.000  0.245   0.244   0.953   0.047
1.25    −0.007  0.234   0.233   0.950   0.150   −0.002  0.244   0.247   0.958   0.110
1.50    −0.006  0.229   0.229   0.949   0.414   0.003   0.243   0.254   0.959   0.324
Recessive
1.00    −0.038  0.269   0.261   0.964   0.036   −0.005  0.333   0.323   0.979   0.021
1.25    −0.028  0.231   0.226   0.956   0.169   0.011   0.304   0.295   0.972   0.086
1.50    −0.010  0.210   0.206   0.951   0.484   0.019   0.287   0.284   0.971   0.291

Sample size = 200
Dominant
1.00    −0.001  0.100   0.102   0.947   0.053   −0.001  0.106   0.106   0.949   0.051
1.25    −0.000  0.102   0.101   0.948   0.598   0.001   0.110   0.118   0.962   0.469
1.50    −0.001  0.102   0.102   0.951   0.978   0.006   0.113   0.131   0.975   0.949
Additive
1.00    −0.001  0.168   0.168   0.948   0.052   −0.001  0.171   0.169   0.950   0.050
1.25    −0.002  0.162   0.164   0.951   0.266   −0.000  0.166   0.172   0.959   0.226
1.50    −0.003  0.162   0.161   0.948   0.710   0.003   0.169   0.177   0.961   0.664
Recessive
1.00    −0.026  0.184   0.179   0.957   0.043   −0.010  0.222   0.216   0.958   0.042
1.25    −0.010  0.157   0.157   0.951   0.299   0.011   0.205   0.200   0.955   0.189
1.50    −0.012  0.145   0.144   0.951   0.780   0.010   0.194   0.192   0.956   0.598

Bias, mean estimates minus the true value; SD, mean standard deviation; MSD, mean estimated standard deviation; CP, 95% coverage probability; Wald, size or power of the Wald test at 0.05 level of significance; HWE, Hardy-Weinberg equilibrium. Target haplotype = 01100, e^α = 10^−2, 1:1 matching.

As shown by the simulation in Zhang et al. [2006], the Wald test was not robust to departure from HWE. With the use of a fixation index $F = 0.05$ or 0.15 when generating the diplotypes, NHAP and EHAP can be compared using the simulated data. The results with the same target haplotype 01100 and sample size = 200 are reported in Tables IV and V for 1:1 matching and 1:2 matching, respectively. It is seen that EHAP is somewhat more robust to the departure from HWE, since its CPs, which vary from 0.946 to 0.978, are only slightly larger than the nominal value. The NHAP, with the incorporation of the fixation index, has CPs much closer to the nominal value, varying from 0.944 to 0.957. Also, the NHAP is more powerful than the EHAP. The relative efficiency can be as large as 1.4 under the recessive model with $F = 0.15$, $e^{\beta} = 1.25$, and 1:1 matching.
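Two of the ratios quoted in this section can be checked directly against the tabulated values. The short Python sketch below does this arithmetic; the helper name is hypothetical, and the inputs are the SD entries of Table II (recessive model, 1:1 matching, sample size 100, $e^{\beta} = 1.25$) and the Wald power entries of Table IV (recessive model, $F = 0.15$, $e^{\beta} = 1.25$).

```python
def ratio(nhap_value, ehap_value):
    """Ratio of an NHAP table entry to the matching EHAP entry."""
    return nhap_value / ehap_value

# Table II: SD of NHAP (0.231) versus SD of EHAP (0.304).
sd_ratio = ratio(0.231, 0.304)

# Table IV: power of NHAP (0.376) versus power of EHAP (0.266).
relative_efficiency = ratio(0.376, 0.266)

print(round(sd_ratio, 2), round(relative_efficiency, 1))  # prints: 0.76 1.4
```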

Additional simulation studies were conducted to examine the performances of the two methods under a common disease situation with $e^{\alpha} = 10^{-1}$ for the same target haplotype 01100. The results of these simulation studies are given in Table VI, showing that the NHAP is still uniformly more powerful, though the superiority over the EHAP decreases compared with that for a rarer disease ($e^{\alpha} = 10^{-2}$). For the situations considered in the simulation study, the relative efficiencies for $e^{\alpha} = 10^{-1}$ are about 90.4–104.0% of those for $e^{\alpha} = 10^{-2}$. A rare haplotype was also considered.

TABLE III. Summary of estimates of β and size or power of NHAP and EHAP when HWE holds, for a rare disease

        NHAP                                    EHAP
e^β     Bias    SD      MSD     CP      Wald    Bias    SD      MSD     CP      Wald

Sample size = 100
Dominant
1.00    −0.001  0.124   0.126   0.949   0.051   −0.001  0.128   0.130   0.954   0.046
1.25    −0.001  0.125   0.125   0.952   0.429   0.001   0.131   0.140   0.963   0.342
1.50    −0.002  0.127   0.128   0.951   0.888   0.003   0.136   0.152   0.967   0.829
Additive
1.00    −0.003  0.209   0.206   0.948   0.052   −0.002  0.211   0.208   0.950   0.050
1.25    −0.008  0.198   0.199   0.949   0.197   −0.005  0.202   0.207   0.956   0.169
1.50    −0.002  0.191   0.194   0.954   0.564   0.004   0.198   0.209   0.966   0.516
Recessive
1.00    −0.035  0.248   0.241   0.970   0.030   −0.015  0.278   0.270   0.969   0.031
1.25    −0.023  0.208   0.203   0.957   0.213   −0.004  0.244   0.238   0.963   0.152
1.50    −0.018  0.182   0.181   0.959   0.594   0.006   0.223   0.222   0.964   0.476

Sample size = 200
Dominant
1.00    −0.002  0.088   0.089   0.949   0.051   −0.002  0.091   0.091   0.948   0.052
1.25    0.000   0.089   0.088   0.947   0.713   0.002   0.093   0.098   0.963   0.650
1.50    −0.000  0.090   0.090   0.950   0.995   0.005   0.096   0.106   0.972   0.992
Additive
1.00    −0.003  0.146   0.145   0.946   0.054   −0.003  0.147   0.146   0.948   0.052
1.25    −0.001  0.142   0.140   0.946   0.353   0.000   0.143   0.145   0.953   0.329
1.50    −0.002  0.137   0.137   0.948   0.837   0.001   0.142   0.146   0.957   0.817
Recessive
1.00    −0.018  0.166   0.165   0.958   0.042   −0.007  0.183   0.184   0.960   0.040
1.25    −0.013  0.142   0.141   0.954   0.356   −0.003  0.165   0.164   0.950   0.280
1.50    −0.011  0.128   0.126   0.954   0.856   0.002   0.151   0.153   0.958   0.779

Bias, mean estimates minus the true value; SD, mean standard deviation; MSD, mean estimated standard deviation; CP, 95% coverage probability; Wald, size or power of the Wald test at 0.05 level of significance; HWE, Hardy-Weinberg equilibrium. Target haplotype = 01100, e^α = 10^−2, 1:2 matching.

TABLE IV. Summary of estimates of β and size or power of NHAP and EHAP when HWE is violated, for a rare disease

                NHAP                                    EHAP
F       e^β     Bias    SD      MSD     CP      Wald    Bias    SD      MSD     CP      Wald

Dominant
0.05    1.00    0.001   0.103   0.103   0.951   0.049   −0.003  0.107   0.106   0.946   0.054
        1.25    −0.001  0.102   0.102   0.953   0.583   0.003   0.108   0.118   0.965   0.484
        1.50    −0.003  0.104   0.104   0.952   0.976   0.007   0.114   0.131   0.978   0.948
0.15    1.00    −0.002  0.103   0.104   0.952   0.048   −0.003  0.107   0.107   0.950   0.050
        1.25    −0.001  0.104   0.103   0.948   0.574   0.000   0.108   0.118   0.966   0.462
        1.50    −0.001  0.104   0.104   0.948   0.979   0.009   0.114   0.132   0.978   0.956
Additive
0.05    1.00    −0.008  0.165   0.164   0.950   0.050   −0.006  0.165   0.165   0.952   0.048
        1.25    −0.002  0.159   0.153   0.953   0.282   0.001   0.162   0.168   0.957   0.234
        1.50    −0.000  0.156   0.156   0.954   0.742   0.001   0.162   0.172   0.961   0.685
0.15    1.00    0.002   0.157   0.156   0.953   0.047   −0.008  0.160   0.157   0.948   0.052
        1.25    0.000   0.151   0.150   0.950   0.321   0.003   0.156   0.159   0.958   0.270
        1.50    −0.002  0.149   0.147   0.946   0.783   0.002   0.155   0.164   0.964   0.731
Recessive
0.05    1.00    −0.010  0.177   0.174   0.955   0.045   −0.003  0.202   0.201   0.963   0.037
        1.25    −0.008  0.158   0.155   0.955   0.308   0.003   0.191   0.186   0.956   0.213
        1.50    −0.007  0.145   0.144   0.949   0.795   0.010   0.180   0.181   0.961   0.664
0.15    1.00    −0.003  0.158   0.157   0.952   0.048   0.004   0.182   0.179   0.956   0.044
        1.25    0.004   0.146   0.142   0.944   0.376   0.008   0.169   0.167   0.954   0.266
        1.50    0.000   0.134   0.132   0.952   0.860   0.006   0.163   0.163   0.958   0.743

Bias, mean estimates minus the true value; SD, mean standard deviation; MSD, mean estimated standard deviation; CP, 95% coverage probability; Wald, size or power of the Wald test at 0.05 level of significance; HWE, Hardy-Weinberg equilibrium. Target haplotype = 01100, e^α = 10^−2, 1:1 matching, sample size = 200.

TABLE V. Summary of estimates of β and size or power of NHAP and EHAP when HWE is violated for a rare disease

| Model | F | e^β | NHAP Bias | NHAP SD | NHAP MSD | NHAP CP | NHAP Wald | EHAP Bias | EHAP SD | EHAP MSD | EHAP CP | EHAP Wald |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dominant | 0.05 | 1.00 | −0.002 | 0.088 | 0.090 | 0.953 | 0.047 | −0.002 | 0.090 | 0.091 | 0.954 | 0.046 |
| | | 1.25 | 0.001 | 0.088 | 0.089 | 0.951 | 0.714 | 0.002 | 0.092 | 0.098 | 0.965 | 0.657 |
| | | 1.50 | −0.000 | 0.091 | 0.090 | 0.949 | 0.996 | 0.003 | 0.094 | 0.106 | 0.971 | 0.992 |
| | 0.15 | 1.00 | −0.002 | 0.089 | 0.090 | 0.955 | 0.045 | −0.000 | 0.093 | 0.091 | 0.946 | 0.054 |
| | | 1.25 | −0.001 | 0.089 | 0.089 | 0.951 | 0.708 | 0.001 | 0.092 | 0.098 | 0.963 | 0.653 |
| | | 1.50 | −0.002 | 0.092 | 0.090 | 0.944 | 0.995 | 0.005 | 0.096 | 0.106 | 0.974 | 0.994 |
| Additive | 0.05 | 1.00 | −0.003 | 0.140 | 0.141 | 0.953 | 0.047 | −0.003 | 0.140 | 0.142 | 0.953 | 0.047 |
| | | 1.25 | −0.001 | 0.137 | 0.136 | 0.948 | 0.376 | 0.000 | 0.137 | 0.141 | 0.959 | 0.351 |
| | | 1.50 | −0.000 | 0.133 | 0.133 | 0.951 | 0.853 | 0.002 | 0.137 | 0.142 | 0.965 | 0.836 |
| | 0.15 | 1.00 | −0.003 | 0.135 | 0.135 | 0.949 | 0.051 | −0.003 | 0.135 | 0.135 | 0.950 | 0.050 |
| | | 1.25 | −0.001 | 0.127 | 0.128 | 0.950 | 0.417 | 0.001 | 0.130 | 0.134 | 0.958 | 0.383 |
| | | 1.50 | −0.003 | 0.127 | 0.125 | 0.945 | 0.895 | −0.001 | 0.131 | 0.135 | 0.958 | 0.878 |
| Recessive | 0.05 | 1.00 | −0.010 | 0.161 | 0.157 | 0.955 | 0.045 | −0.012 | 0.175 | 0.172 | 0.954 | 0.046 |
| | | 1.25 | −0.009 | 0.136 | 0.136 | 0.957 | 0.380 | −0.001 | 0.155 | 0.154 | 0.954 | 0.310 |
| | | 1.50 | −0.008 | 0.126 | 0.124 | 0.948 | 0.879 | 0.003 | 0.144 | 0.145 | 0.959 | 0.819 |
| | 0.15 | 1.00 | −0.008 | 0.142 | 0.142 | 0.954 | 0.046 | −0.006 | 0.154 | 0.153 | 0.956 | 0.044 |
| | | 1.25 | −0.004 | 0.125 | 0.124 | 0.951 | 0.447 | 0.000 | 0.139 | 0.139 | 0.955 | 0.377 |
| | | 1.50 | −0.004 | 0.115 | 0.114 | 0.952 | 0.926 | −0.001 | 0.133 | 0.132 | 0.947 | 0.880 |

Bias, mean estimates minus the true value; SD, mean standard deviation; MSD, mean estimated standard deviation; CP, 95% coverage probability; Wald, size or power of the Wald test at 0.05 level of significance; HWE, Hardy-Weinberg equilibrium. Target haplotype = 01100, e^α = 10^−2, 1:2 matching, sample size = 200.
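The summary columns defined in the table footnotes can be computed from simulation replicates along the following lines. This is a minimal sketch, not the authors' code: the function name `summarize`, its arguments, and the four demo replicates are hypothetical, and it assumes that SD denotes the empirical standard deviation of the estimates across replicates, that CP uses the usual normal-theory interval, and that the Wald test is two-sided for H0: β = 0.

```python
import statistics


def summarize(estimates, std_errors, true_beta, z_crit=1.959964):
    """Summarize simulation replicates of one log odds ratio estimator.

    estimates  : estimated beta from each replicate
    std_errors : estimated standard error of beta from each replicate
    true_beta  : value of beta used to generate the data
    """
    n = len(estimates)
    pairs = list(zip(estimates, std_errors))
    # CP: share of replicates whose 95% Wald interval contains the true value.
    covered = sum(abs(b - true_beta) <= z_crit * s for b, s in pairs)
    # Wald: share of replicates rejecting H0: beta = 0 at the 0.05 level.
    rejected = sum(abs(b) > z_crit * s for b, s in pairs)
    return {
        "bias": statistics.fmean(estimates) - true_beta,
        "sd": statistics.stdev(estimates),    # empirical SD of the estimates (assumed meaning of SD)
        "msd": statistics.fmean(std_errors),  # mean of the estimated standard deviations
        "cp": covered / n,
        "wald": rejected / n,
    }


# Four made-up replicates generated under the null (true beta = 0).
demo = summarize([0.1, -0.1, 0.3, -0.3], [0.1, 0.1, 0.1, 0.1], true_beta=0.0)
```

Under these assumptions the interval and the test coincide when the true β is 0, so CP and Wald sum to one; the rows of Table V with e^β = 1.00 show the same pattern (for example 0.953 and 0.047).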


The results for haplotype 10100 (with frequency 0.052 and R_h^2 = 0.402) under the additive model are in Table VII. The results for haplotype 10100 are similar to those for haplotype 01100.

TABLE VI. Summary of β and size or power of NHAP and EHAP for common disease

| Model | F | e^β | NHAP Bias | NHAP SD | NHAP MSD | NHAP CP | NHAP Wald | EHAP Bias | EHAP SD | EHAP MSD | EHAP CP | EHAP Wald |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dominant | 0.05 | 1.00 | −0.003 | 0.089 | 0.089 | 0.955 | 0.045 | −0.001 | 0.089 | 0.091 | 0.957 | 0.043 |
| | | 1.25 | −0.004 | 0.089 | 0.089 | 0.951 | 0.695 | 0.001 | 0.091 | 0.098 | 0.967 | 0.652 |
| | | 1.50 | −0.006 | 0.090 | 0.090 | 0.948 | 0.994 | 0.005 | 0.094 | 0.106 | 0.976 | 0.993 |
| | 0.15 | 1.00 | −0.002 | 0.092 | 0.090 | 0.947 | 0.053 | −0.002 | 0.091 | 0.091 | 0.949 | 0.051 |
| | | 1.25 | −0.002 | 0.088 | 0.089 | 0.955 | 0.702 | 0.001 | 0.092 | 0.098 | 0.965 | 0.649 |
| | | 1.50 | −0.004 | 0.090 | 0.090 | 0.948 | 0.993 | 0.006 | 0.095 | 0.106 | 0.969 | 0.993 |
| Additive | 0.05 | 1.00 | −0.003 | 0.140 | 0.141 | 0.951 | 0.049 | −0.004 | 0.143 | 0.142 | 0.949 | 0.051 |
| | | 1.25 | −0.002 | 0.137 | 0.137 | 0.947 | 0.372 | −0.003 | 0.139 | 0.142 | 0.953 | 0.334 |
| | | 1.50 | −0.005 | 0.133 | 0.134 | 0.950 | 0.848 | 0.002 | 0.138 | 0.144 | 0.962 | 0.826 |
| | 0.15 | 1.00 | −0.002 | 0.136 | 0.134 | 0.944 | 0.056 | −0.001 | 0.136 | 0.135 | 0.946 | 0.054 |
| | | 1.25 | −0.005 | 0.129 | 0.129 | 0.949 | 0.407 | −0.003 | 0.134 | 0.135 | 0.952 | 0.370 |
| | | 1.50 | −0.002 | 0.127 | 0.126 | 0.945 | 0.892 | 0.003 | 0.134 | 0.137 | 0.957 | 0.865 |
| Recessive | 0.05 | 1.00 | −0.010 | 0.164 | 0.158 | 0.947 | 0.053 | −0.006 | 0.177 | 0.172 | 0.951 | 0.049 |
| | | 1.25 | −0.022 | 0.138 | 0.139 | 0.954 | 0.338 | −0.000 | 0.158 | 0.157 | 0.955 | 0.305 |
| | | 1.50 | −0.028 | 0.129 | 0.128 | 0.948 | 0.826 | 0.001 | 0.151 | 0.151 | 0.954 | 0.777 |
| | 0.15 | 1.00 | −0.006 | 0.145 | 0.142 | 0.952 | 0.048 | −0.006 | 0.153 | 0.154 | 0.958 | 0.042 |
| | | 1.25 | −0.015 | 0.126 | 0.126 | 0.953 | 0.396 | −0.004 | 0.143 | 0.142 | 0.956 | 0.349 |
| | | 1.50 | −0.025 | 0.116 | 0.117 | 0.947 | 0.888 | 0.000 | 0.136 | 0.137 | 0.956 | 0.854 |

Bias, mean estimates minus the true value; SD, mean standard deviation; MSD, mean estimated standard deviation; CP, 95% coverage probability; Wald, size or power of the Wald test at 0.05 level of significance. Target haplotype = 01100, e^α = 10^−1, 1:2 matching, sample size = 200.

TABLE VII. Summary of β and size or power of NHAP and EHAP with a rare target haplotype 10100

| Sample size | F | e^β | NHAP Bias | NHAP SD | NHAP MSD | NHAP CP | NHAP Wald | EHAP Bias | EHAP SD | EHAP MSD | EHAP CP | EHAP Wald |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 | 0.05 | 1.00 | −0.007 | 0.300 | 0.301 | 0.960 | 0.040 | −0.005 | 0.313 | 0.305 | 0.954 | 0.046 |
| | | 1.25 | −0.006 | 0.282 | 0.279 | 0.951 | 0.141 | −0.000 | 0.284 | 0.286 | 0.960 | 0.121 |
| | | 1.50 | 0.009 | 0.267 | 0.263 | 0.947 | 0.375 | 0.008 | 0.275 | 0.274 | 0.957 | 0.338 |
| | 0.15 | 1.00 | −0.009 | 0.292 | 0.282 | 0.949 | 0.051 | −0.003 | 0.294 | 0.288 | 0.954 | 0.046 |
| | | 1.25 | −0.004 | 0.262 | 0.256 | 0.948 | 0.165 | 0.004 | 0.269 | 0.267 | 0.956 | 0.140 |
| | | 1.50 | −0.001 | 0.243 | 0.239 | 0.951 | 0.426 | 0.007 | 0.257 | 0.255 | 0.955 | 0.374 |
| 500 | 0.05 | 1.00 | 0.004 | 0.190 | 0.188 | 0.950 | 0.050 | −0.005 | 0.189 | 0.189 | 0.950 | 0.050 |
| | | 1.25 | 0.000 | 0.176 | 0.174 | 0.953 | 0.273 | 0.005 | 0.178 | 0.177 | 0.949 | 0.259 |
| | | 1.50 | 0.002 | 0.163 | 0.164 | 0.950 | 0.694 | 0.001 | 0.170 | 0.170 | 0.956 | 0.672 |
| | 0.15 | 1.00 | −0.005 | 0.179 | 0.176 | 0.950 | 0.050 | −0.005 | 0.177 | 0.178 | 0.954 | 0.046 |
| | | 1.25 | −0.002 | 0.162 | 0.160 | 0.948 | 0.303 | 0.004 | 0.166 | 0.165 | 0.953 | 0.291 |
| | | 1.50 | 0.001 | 0.154 | 0.149 | 0.945 | 0.768 | 0.002 | 0.159 | 0.158 | 0.953 | 0.736 |

Bias, mean estimates minus the true value; SD, mean standard deviation; MSD, mean estimated standard deviation; CP, 95% coverage probability; Wald, size or power of the Wald test at 0.05 level of significance. e^α = 10^−2, 1:2 matching, additive model.

The efficiencies of NHAP relative to EHAP for haplotype 01100 are presented in Figure 1 for 12 configurations of various underlying parameters with sample size 200. Configurations 1–6 are for 1:1 matching, whereas configurations 7–12 are for 1:2 matching; configurations 1, 3, 5, 7, 9, 11 use e^β = 1.25, while configurations 2, 4, 6, 8, 10, 12 set e^β = 1.5; configurations 1, 2, 7, 8 assume the dominant model, configurations 3, 4, 9, 10 assume the additive model, and configurations 5, 6, 11, 12 assume the recessive model. The relative efficiencies are greater for 1:1 matching than for 1:2 matching, and greater for small genetic effects than for large genetic effects. Moreover, the relative efficiency is slightly reduced when HWE does not hold. As the modified likelihood is between the conditional likelihood of Kraft et al. [2005] and the full likelihood function for the matched case-control data, it is not surprising that our method is generally more powerful than EHAP.
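As a rough illustration of what a relative efficiency above one means, the sketch below takes the ratio of the EHAP variance to the NHAP variance. This section does not state the exact definition behind Figure 1, so the variance-ratio definition and the function name `relative_efficiency` are assumptions; the inputs are the SD values reported in Table V for the dominant model with F = 0.05.

```python
def relative_efficiency(sd_nhap, sd_ehap):
    """Variance of the EHAP estimate divided by the variance of the NHAP estimate.

    Values above 1 mean the NHAP estimate is the less variable of the two.
    (Assumed definition; the figure may be based on a different measure.)
    """
    if sd_nhap <= 0 or sd_ehap <= 0:
        raise ValueError("standard deviations must be positive")
    return (sd_ehap / sd_nhap) ** 2


# Empirical SDs (NHAP, EHAP) copied from Table V: dominant model, F = 0.05, 1:2 matching.
table_v_dominant = {1.25: (0.088, 0.092), 1.50: (0.091, 0.094)}
efficiencies = {
    odds_ratio: relative_efficiency(*sds)
    for odds_ratio, sds in table_v_dominant.items()
}
```

Because the tabulated SDs are rounded to three decimals, ratios computed this way are only approximate and are not expected to reproduce the plotted values exactly.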

In the above simulations, the fixation index was assumed to be the same for all possible pairs of haplotypes. Satten and Epstein [2004] studied the situation where each fixation index (for each possible pair of haplotypes) was different; they found that their proposed test (incorporating the fixation index) was quite robust. We simulated data under a similar parameter setting; that is, the fixation index F followed a uniform distribution on [0, 1]. The corresponding summary statistics for both the Wald test and the generalized score test are presented in Table VIII. It is seen that the proposed tests are quite robust to the assumption of a common fixation index, which agrees with the finding of Satten and Epstein [2004]. For NHAP, the generalized score test is slightly less powerful than the Wald test; for EHAP, the Wald test is slightly more powerful under the recessive model and vice versa under the other two modes of inheritance.
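To make the role of a common fixation index concrete, the sketch below computes diplotype (haplotype pair) probabilities under the usual Wright parameterization, in which F = 0 recovers HWE. It is an illustration under stated assumptions rather than the authors' implementation: the function name and the haplotype frequencies other than 0.052 are hypothetical, and it covers only the common-F case, not pair-specific indexes drawn from a uniform distribution.

```python
def diplotype_frequencies(hap_freqs, fixation_index):
    """Return {(i, j): probability} for unordered haplotype pairs with i <= j.

    Homozygote:   F * p_i + (1 - F) * p_i ** 2
    Heterozygote: 2 * (1 - F) * p_i * p_j
    """
    if not 0.0 <= fixation_index <= 1.0:
        raise ValueError("this sketch only handles 0 <= F <= 1")
    if abs(sum(hap_freqs) - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    probs = {}
    for i, p_i in enumerate(hap_freqs):
        for j in range(i, len(hap_freqs)):
            p_j = hap_freqs[j]
            if i == j:
                probs[(i, j)] = fixation_index * p_i + (1.0 - fixation_index) * p_i ** 2
            else:
                probs[(i, j)] = 2.0 * (1.0 - fixation_index) * p_i * p_j
    return probs


# Hypothetical frequencies; only 0.052 (haplotype 10100) is taken from the text.
freqs = [0.6, 0.348, 0.052]
probs = diplotype_frequencies(freqs, fixation_index=0.15)
```

The probabilities sum to F·Σp_i + (1 − F)·(Σp_i)² = 1 for any F in [0, 1], so a positive F shifts mass from heterozygotes to homozygotes without changing the haplotype frequencies.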

TABLE VIII. Summary of β and size or power of NHAP and EHAP (Wald and generalized score tests) when fixation indexes follow the uniform distribution on [0, 1]

| Model | e^β | NHAP Bias | NHAP SD | NHAP MSD | NHAP CP | NHAP Wald | NHAP Score | EHAP Bias | EHAP SD | EHAP MSD | EHAP CP | EHAP Wald | EHAP Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dominant | 1.00 | −0.002 | 0.093 | 0.094 | 0.955 | 0.045 | 0.046 | −0.002 | 0.095 | 0.095 | 0.946 | 0.054 | 0.054 |
| | 1.25 | −0.001 | 0.092 | 0.091 | 0.947 | 0.681 | 0.673 | 0.000 | 0.094 | 0.099 | 0.962 | 0.630 | 0.704 |
| | 1.50 | −0.001 | 0.092 | 0.090 | 0.946 | 0.994 | 0.993 | 0.003 | 0.094 | 0.107 | 0.973 | 0.991 | 0.995 |
| Additive | 1.00 | −0.008 | 0.120 | 0.118 | 0.952 | 0.048 | 0.052 | −0.001 | 0.121 | 0.119 | 0.949 | 0.051 | 0.051 |
| | 1.25 | 0.000 | 0.114 | 0.112 | 0.946 | 0.523 | 0.500 | −0.001 | 0.114 | 0.118 | 0.957 | 0.483 | 0.515 |
| | 1.50 | 0.002 | 0.110 | 0.108 | 0.952 | 0.955 | 0.949 | 0.001 | 0.112 | 0.119 | 0.964 | 0.943 | 0.952 |
| Recessive | 1.00 | −0.005 | 0.124 | 0.120 | 0.948 | 0.052 | 0.059 | −0.002 | 0.131 | 0.127 | 0.949 | 0.051 | 0.052 |
| | 1.25 | −0.005 | 0.110 | 0.108 | 0.948 | 0.548 | 0.494 | −0.003 | 0.119 | 0.117 | 0.951 | 0.502 | 0.471 |
| | 1.50 | −0.000 | 0.102 | 0.101 | 0.952 | 0.965 | 0.950 | 0.001 | 0.111 | 0.112 | 0.955 | 0.946 | 0.939 |

Bias, mean estimates minus the true value; SD, mean standard deviation; MSD, mean estimated standard deviation; CP, 95% coverage probability; Wald, size or power of the Wald test at 0.05 level of significance. Target haplotype = 01100, e^α = 10^−2, 1:2 matching, sample size = 200.

Fig. 1. The efficiency of NHAP, relative to that of EHAP, for selected fixation indexes F = 0.0, 0.05, and 0.15.

DISCUSSION

Kraft et al. [2005] and Zhang et al. [2006] developed statistical methods for analysis of haplotype association with a disease under a matched case-control design. The methods proposed by Kraft et al. utilized conditional logistic likelihood and applied six strategies to handle the ambiguity of haplotype using pooled data from cases and controls. Kraft et al. did not adjust for the extra variance of estimated OR parameters. Their method could potentially suffer from both power loss and type I error inflation owing to overdispersion. Zhang et al. [2006] proposed to model the genotype and disease status jointly, assuming the rare disease hypothesis, HWE in the population, and the independence of covariates, matching variable(s) and haplotypes. They used a pooling strategy to handle any possible singularity of the covariance matrix of estimated parameters. Since Zhang et al. [2006] used a modified likelihood function, it is expected that their method would have improved efficiency compared with EHAP, the method proposed by Kraft et al. [2005]. In this article, we suggest simply deleting the entries in the covariance matrix of estimated parameters related to haplotypes with very small estimated frequencies. Using our proposed estimated covariance matrix, we compared the approach of Zhang et al. [2006] with one of the two best approaches in Kraft et al. [2005] by simulation studies under a broad range of situations. In addition, we modified the approach of Zhang et al. [2006] by introducing a fixation index to remove the assumption of HWE. The simulation results suggested that the modified approach of Zhang et al. [2006] has a better performance in most situations.

We have focused on a single haplotype effect in this study. For multiple haplotypes, multiple comparisons and sparse data would be serious issues. Schaid [2004] proposed a prospective general linear model, a fixed effects model, and a random effects model/variance component model for handling multiple haplotypes. This strategy can be applied to matched case-control designs. However, this strategy can only test the overall null hypothesis, namely that all haplotypes have no effect on the disease, and it cannot be used to identify specific haplotypes that are responsible for the disease. The variable selection methods, such as forward, backward, and stepwise selection based on the Akaike or Bayesian information criterion [Akaike, 1972; Schwartz, 1986], might be used, since our method is likelihood-based. Furthermore, some other techniques such as LASSO [Tibshirani, 1996] might be applied. The problems of multiple comparison and sparse data related to haplotype-based association studies need further investigation.
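Because the method is likelihood-based, candidate haplotype models could in principle be ranked by an information criterion. The sketch below shows only the standard AIC and BIC formulas applied to hypothetical maximized log-likelihoods; the model names, log-likelihood values, and the choice of the number of matched sets as the sample size in BIC are assumptions for illustration, not results or recommendations from this study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k log(n) - 2 log L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical candidates: (maximized log-likelihood, number of haplotype effects).
candidates = {
    "no haplotype effect": (-140.0, 0),
    "one haplotype effect": (-136.5, 1),
    "two haplotype effects": (-136.2, 2),
}
n_matched_sets = 200  # assumed effective sample size for BIC

best_by_aic = min(candidates, key=lambda name: aic(*candidates[name]))
best_by_bic = min(candidates, key=lambda name: bic(*candidates[name], n_matched_sets))
```

A forward, backward, or stepwise search would repeat this comparison while adding or removing one haplotype effect at a time; lower values of either criterion indicate the preferred model.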

Although matching is a sound strategy for controlling confounders, it might be imperfect in some situations. Lee [2004] used genomic control in matched case-control studies with single SNP data. In the multiple-SNP case, genomic control for matched case-control design needs further study.

ACKNOWLEDGMENTS

The work of Hong Zhang was partially supported by the National Natural Science Foundation of China (10471136), the Ph.D. Program Foundation of the Ministry of Education of China, and the Special Foundations of the Chinese Academy of Science and University of Science and Technology of China. The work of Zhaohai Li was supported in part by NIH grant EY014478. We would like to thank Dr. B. J. Stone for editorial help.

REFERENCES

ACCESS Research Group. 1999. Design of a case control etiologic study of sarcoidosis (ACCESS). J Clinical Epidemiol 52:1173–1186.

Akaike H. 1972. Information theory and an extension of maximum likelihood principle. Proceedings of the Second International Symposium on Information Theory. Suppl. to Problems of Control and Information Theory. p 267–281.

Botstein D, Risch N. 2003. Discovering genotypes underlying human phenotypes: past successes for Mendelian disease, future approaches for complex disease. Nat Genet 33:228–237.

Breslow N, Day N. 1980. Statistical Methods in Cancer Research. Lyon, France: IARC Scientific Publications.

Chatterjee N, Carroll R. 2005. Semiparametric maximum-likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika 92:399–418.

Chee M, Yang R, Hubbell E, Huang X, Stern D, Winkler J, Lockhart D, Morris M, Fodor S. 1996. Accessing genetic information with high-density DNA arrays. Science 274:610–614.

Cordell H, Clayton D. 2002. A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. Am J Hum Genet 70:124–141.

Cox D. 1972. Regression models and life-tables (with discussion). J R Stat Soc B 34:187–220.

De Vivo I, Huggins G, Hankinson S, Lescault P, Boezen M, Colditz G, Hunter D. 2002. A functional polymorphism in the promoter of the progesterone receptor gene associated with endometrial cancer risk. Proc Natl Acad Sci, USA 99:12263–12268.

Epstein M, Satten G. 2003. Inference on haplotype effects in case-control studies using unphased genotype data. Am J Hum Genet 73:1316–1329.

Heinonen O, Albanes D, Virtamo J, Taylor P, Huttunen J, Hartman A, Haapakoski J, Malila N, Rautalahti M, Ripatti S, et al. 1998. Prostate cancer and supplementation with alpha-tocopherol and beta-carotene: incidence and mortality in a controlled trial. J Natl Cancer Inst 90:440–446.

Kraft P, Cox D, Paynter R, Hunter D, DeVivo I. 2005. Accounting for haplotype uncertainty in matched association studies: a comparison of simple and flexible techniques. Genet Epidemiol 28:261–272.

Lake S, Lyon H, Tantisra K, Silverman E, Weiss S, Laird N, Schaid D. 2003. Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Hum Hered 55:56–65.


Liddell F, McDonald J, Thomas D. 1977. Methods of cohort analysis: appraisal by application to asbestos mining. J R Stat Soc A 140:469–491.

Lin D, Zeng D. 2006. Likelihood-based inference on haplotype effects in genetic association studies (with discussion). J Am Stat Assoc 101:89–104.

Pike M, Hill A, Smith P. 1980. Bias and efficiency in logistic analyses of stratified case-control studies. Int J Epidemiol 9:89–95.

Risch N. 2000. Searching for genetic determinants in the new millennium. Nature 405:847–856.

Risch N, Merikangas K. 1996. The future of genetic studies of complex human diseases. Science 273:1516–1517.

Satten G, Epstein M. 2004. Comparison of prospective and retrospective methods for haplotype inference in case-control studies. Genet Epidemiol 27:192–201.

Schaid D. 2004. Evaluating associations of haplotypes with traits. Genet Epidemiol 27:348–364.

Schaid D, Rowland C, Tines D, Jacobson R, Poland G. 2002. Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet 70:425–434.

Schwartz G. 1986. Estimating the dimension of a model. Ann Stat 6:461–464.

Spinka C, Carroll R, Chatterjee N. 2005. Analysis of case-control studies of genetic and environmental factors with missing genetic information and haplotype-phase ambiguity. Genet Epidemiol 29:108–127.

Stram D, Pearce C, Bretsky P, Freedman M, Hirschhorn J, Altshuler D, Kolonel L, Henderson B, Thomas D. 2003a. Modelling and E-M estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals. Hum Hered 55:179–190.

Stram D, Haiman C, Hirschhorn J, Altshuler D, Kolonel L, Henderson B, Pike M. 2003b. Choosing haplotype-tagging SNPs based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the multiethnic cohort study. Hum Hered 55:27–36.

Tibshirani R. 1996. Regression shrinkage and selection via the lasso. J R Stat Soc B 58:267–288.

Valle T, Tuomilehto J, Bergman R, Ghosh S, Hauser E, Eriksson J, Nylund S, Kohtamaki K, Toivanen L, Vidgren G, et al. 1998. Mapping genes for NIDDM: design of the Finland-United States Investigation of NIDDM Genetics (FUSION) study. Diabetes Care 21:949–958.

Wang D, Fan J, Siao C, Berno A, Young P, Sapolsky R, Ghandour G, Perkins N, Winchester E, Spencer J, et al. 1998. Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science 280:1077–1082.

Zaykin D, Westfall P, Young S, Karnoub M, Wagner M, Ehm M. 2002. Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum Hered 53:79–91.

Zhang H, Zheng G, Li Z. 2006. Statistical analysis for haplotype-based matched case-control studies. Biometrics 62:1124–1131.

Zhao L, Li S, Khalid N. 2003. A method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case-control studies. Am J Hum Genet 72:1231–1250.
