Statistics for association study
maulikupadhyay -
8/11/2019 Statistics for association study
1/91
POST-GRADUATE SEMINAR ON
Statistics for Association Studies
DEPT. OF ANIMAL GENETICS & BREEDING
COLLEGE OF VETERINARY SCIENCE AND ANIMAL HUSBANDRY
ANAND AGRICULTURAL UNIVERSITY
ANAND - 388 001.
Maulik Upadhyay
REG. NO. - O'-1'0(-)010
AGB-*+1
MAJOR ADVISOR
DR. C. G. Joshi
Professor and Head
Dept. of Animal Biotechnology
MINOR ADVISOR
DR. D. N. Rank
Professor and Head
Dept. of Animal Genetics & Breeding
Flow of Presentation
• Introduction: definition, need and outline; linkage and association studies
• SNP quality control; missing data; imputation
• Single SNP; multiple SNP; haplotype models; Bayesian
• Definition, need and scope; methods to control; multiple correction
• Conclusion
Increasing Trend
• Nature Reviews Genetics carried nearly 30 review articles related to association analysis in one way or another (up to 2007).
• Lancet published a series of review and introductory articles in 2005 on genetic epidemiology with association as the major component.
• Annual Review journals published many reviews that can be linked to association studies (Lee, 2007).
[Figure: Published GWA Reports, 2005 – 6/2011. Axes: Calendar Quarter against Total Number of Publications; the value 951 is marked on the plot.]
(http://www.genome.gov/gwastudies/)
Definition
• An association between a SNP and a phenotype that is present in the population from which a sample is taken (Stephens and Balding, 2009).
• Genetic association studies aim to detect association between one or more genetic polymorphisms and a trait, which might be some quantitative characteristic or a discrete attribute or disease (Cordell and Clayton, 2005).
• Genetic association studies assess correlations between genetic variants and trait differences on a population scale (Cardon and Bell, 2001).
Example of Association in Case Control Study
Controls:
0 1 1 1 1 1 0 1 0 2 1 2 2 0 1 0 0 0 1 1 2 0 1 1 1 2 0 0 0 1 0 1 1 0 1 1 0 1 0 0
2 0 1 2 2 0 1 2 1 0 0 1 1 0 1 0 0 1 1 1 1 2 1 1 2 1 1 1 1 0 1 1 1 0 0 2 2 2 0 2
Cases:
1 1 2 1 0 1 2 1 1 1 1 2 1 2 1 2 1 2 1 1 2 2 1 2 0 1 0 0 0 1 2 2 1 2 1 2 1 0 2 1 0 1 1 0 0 2 1 0 0 2 1 1 1 2 1 1 2 0 1 0 0 1 1 0 0 1 0 2 2 1 1 1 1 2 0 1 2 1 1 2
Goal: to identify the genetic basis of given phenotypes or diseases
(Ref: Oxford University website, http://www.stats.ox.ac.uk/~mcvean/gwa4.pdf)
Linkage and Association
Association differs from linkage in that the same allele (or alleles) is associated with the trait in a similar manner across the whole population, while linkage allows different alleles to be associated with the trait in different families (Cardon and Bell, 2001).
(Ref: Oxford University website, http://www.stats.ox.ac.uk/~mcvean/gwa4.pdf)
Causes of association
• the polymorphism has a causal role (direct association);
• the polymorphism has no causal role but is associated with a nearby causal variant (indirect association); or
• the association is due to some underlying stratification or admixture of the population (confounded association)
(Cordell and Clayton, 2005)
Types of genetic association
• Candidate polymorphism
• Candidate gene
• Fine mapping
• Genome wide association
(Stephens and Balding, 2009)
Designs for genetic association studies
Following are different types of designs of genetic association studies:
Design                                Statistical analysis
1) Cross sectional                    Logistic / linear regression, chi-square test
2) Cohort studies                     Survival analysis method
3) Case control                       Logistic / linear regression, chi-square test
4) Extreme value                      Linear regression & permutation approach
5) Case-parent triad                  TDT, logistic, log-linear method
6) Case-parent-grandparent septets    Log-linear methods
7) General pedigree                   PDT, family based association test, TDT
8) Case only                          Logistic regression, chi-square
(Cordell and Clayton, 2005)
Test of Association: Single SNP association
• Chi-square test
• Armitage test
• Fisher's exact test
• General linear model
• Logistic regression models
(Balding, 2006)
Test of Association: Multiple SNP association
• MDR
• SNP set association
• Logistic regression
• Haplotype based regression model
SNP Quality Control
The quality control (QC) filtering of single nucleotide polymorphisms (SNPs) is an important step, especially in genome-wide association studies, to minimize potential false findings.
SNP QC commonly uses expert guided filters based on QC variables to remove SNPs with insufficient genotyping quality, such as:
• Hardy–Weinberg equilibrium
• missing proportion (MSP)
• minor allele frequency (MAF)
Following are some of the criteria for SNP QC:
(i) percentage of SNPs excluded due to low quality
(ii) inflation factor of the test statistics (λ)
(iii) number of false associations found in the filtered dataset
(iv) number of true associations missed in the filtered dataset
(Pongpanich et al., 2010)
SNP quality control (QC) is commonly safeguarded by supervised (i.e. expert guided) filters to exclude low quality SNPs.
The supervised expert filters aim to remove SNPs that fall into the extremes of QC variables, including Hardy–Weinberg equilibrium (HWE), missing proportion (MSP) and minor allele frequency (MAF).
The rationale is clear:
• extreme deviation from HWE is typically used to identify gross genotyping error (Teo et al., 2007)
• a high MSP indicates poor genotype probe performance and low genotyping accuracy (Neale and Purcell, 2008; WTCCC, 2007)
• SNPs with low MAF are more prone to error, as fewer samples would be within a genotype cluster and most clustering based calling algorithms do not perform well with rare alleles (Neale and Purcell, 2008; Teo, 2008)
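The MSP and MAF filters can be illustrated with a minimal sketch. The function name `snp_qc`, the genotype coding (0/1/2 copies of one allele, `None` for a missing call) and both thresholds are assumptions made for this example, not values taken from the slides.

```python
def snp_qc(genotypes, max_missing=0.05, min_maf=0.01):
    """Return (passes, missing_proportion, maf) for one SNP.

    genotypes: list of 0/1/2 allele counts, with None for a missing call.
    max_missing and min_maf are illustrative thresholds (assumptions).
    """
    called = [g for g in genotypes if g is not None]
    msp = 1 - len(called) / len(genotypes)      # missing proportion (MSP)
    if not called:
        return False, msp, 0.0
    freq = sum(called) / (2 * len(called))      # frequency of the counted allele
    maf = min(freq, 1 - freq)                   # minor allele frequency (MAF)
    passes = msp <= max_missing and maf >= min_maf
    return passes, msp, maf


# Hypothetical SNP: 100 samples, one heterozygote, no missing calls.
print(snp_qc([0] * 99 + [1]))
```

An HWE filter would be applied in the same way, using a test p-value as a third criterion.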
Missing Genotypic Imputation
For single SNP analyses, if a few genotypes are missing there is not much of a problem.
For multipoint SNP analyses, missing data can be more problematic because many individuals might have one or more missing genotypes.
One convenient solution is data imputation: replacing missing genotypes with predicted values that are based on the observed genotypes at neighbouring SNPs.
Genotype imputation is the term used to describe the process of predicting or imputing genotypes that are not directly assayed in a sample of individuals.
(Balding, 2006)
There are several distinct scenarios in which genotype imputation is desirable, but the term now most often refers to the situation in which a reference panel of haplotypes at a dense set of SNPs is used to impute into a study sample of individuals that have been genotyped at a subset of the SNPs (Marchini and Howie, 2010).
Imputation methods work by combining a reference panel of individuals genotyped at a dense set of polymorphic sites (usually single nucleotide polymorphisms, or SNPs) with a study sample collected from a genetically similar population and genotyped at a subset of these sites (Howie et al., 2009).
Imputation methods either seek a best prediction of a missing genotype, such as a maximum likelihood estimate (single imputation), or randomly select it from a probability distribution (multiple imputations) (Balding, 2006).
The goal is to predict the genotypes at the SNPs that are not directly genotyped in the study sample.
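The difference between single and multiple imputation can be sketched as follows, assuming an imputation algorithm has already produced a probability for each genotype (0, 1 or 2 copies of an allele). The function names are hypothetical, and taking the most probable genotype is used here as one simple form of "best prediction".

```python
import random


def best_guess(probabilities):
    """Single imputation: pick the most probable genotype (0, 1 or 2)."""
    return max(range(3), key=lambda g: probabilities[g])


def sample_genotypes(probabilities, n_draws, seed=0):
    """Multiple imputation: draw genotypes at random from the distribution."""
    rng = random.Random(seed)   # fixed seed only to make the sketch repeatable
    return rng.choices([0, 1, 2], weights=probabilities, k=n_draws)


# Hypothetical genotype probabilities for one individual at one untyped SNP.
probs = [0.1, 0.7, 0.2]
print(best_guess(probs))
print(sample_genotypes(probs, n_draws=5))
```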
[Figure: Imputation of Genotypes. Observed genotypes and an imputation reference are passed to an imputation algorithm, which gives a posterior probability for each possible genotype at a SNP (for example AA / AC / CC, AA / AT / TT, GG / GT / TT, AA / AG / GG, CC / CG / GG) and, from these, the predicted genotypes.]
(Marchini and Howie, 2010)
Genotype Imputation Methods: How it works?
• IMPUTE v1: extension of HMM
• IMPUTE v2: more flexible than v1; SNPs divided into two sets: Set T
Uses of Imputation
• Boosting power
• Fine mapping
• Meta-analysis
• Imputation of untyped variation
• Imputation of non-SNP variation
• Correction of genotyping variation
(Marchini and Howie, 2010)
Single Locus association analysis
Pearson goodness of fit test
Categorical data may be displayed in contingency tables.
The chi-square statistic compares the observed count in each table cell to the count which would be expected under the assumption of no association between the row and column classifications.
The chi-square statistic may be used to test the hypothesis of no association between two or more groups, populations, or criteria.
For a single SNP with alleles A and B tested in a case control study, the data generated consist of six counts of the numbers of genotypes (AA, AB and BB) in cases and controls.
Observed value for AA genotypes in cases: $O_1 = a$
Expected value for AA genotypes in cases: $E_1 = (R_1 \times C_1)/N$
Chi-square statistic: $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$, summed over the six cells
            AA          AB          BB          Total
Cases       a           b           c           n_case (R1)
Controls    d           e           f           n_cont (R2)
Total       n_AA (C1)   n_AB (C2)   n_BB (C3)   N
(a) Full genotype table for a general genetic model
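As a worked illustration of the test on the full genotype table, the sketch below computes the expected counts from the row and column totals and sums (O − E)²/E over the six cells. The counts are hypothetical, and the p-value uses the closed form exp(−χ²/2), which holds for a chi-square distribution with 2 degrees of freedom.

```python
from math import exp


def chi_square_2x3(cases, controls):
    """Pearson chi-square test on a 2 x 3 genotype table (AA, AB, BB)."""
    rows = [cases, controls]
    row_tot = [sum(r) for r in rows]
    col_tot = [a + b for a, b in zip(cases, controls)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n   # E = (row total x column total) / N
            chi2 += (observed - expected) ** 2 / expected
    p_value = exp(-chi2 / 2)   # upper tail of chi-square with 2 d.f.
    return chi2, p_value


# Hypothetical counts of AA, AB, BB in cases and in controls.
print(chi_square_2x3([10, 20, 30], [30, 20, 10]))
```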
(b) Dominant model: allele B increases risk
            AA        AB+BB
Case        a         b+c
Control     d         e+f
(c) Recessive model: two copies of allele B required for increased risk
            AA+AB     BB
Case        a+b       c
Control     d+e       f
(d) Multiplicative model: r-fold increased risk for AB, r² increased risk for BB. Analysed by allele, not by genotype
            A         B
Case        2a+b      b+2c
Control     2d+e      e+2f
(Lewis, 2002)
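The three collapsed tables follow mechanically from the six genotype counts. A minimal sketch, with a hypothetical function name and hypothetical counts:

```python
def collapse_tables(cases, controls):
    """Collapse (AA, AB, BB) counts into the 2 x 2 tables for each genetic model."""
    a, b, c = cases
    d, e, f = controls
    return {
        "dominant": [[a, b + c], [d, e + f]],                       # AA vs AB+BB
        "recessive": [[a + b, c], [d + e, f]],                      # AA+AB vs BB
        "allelic": [[2 * a + b, b + 2 * c], [2 * d + e, e + 2 * f]],  # allele A vs allele B
    }


tables = collapse_tables((10, 20, 30), (30, 20, 10))
print(tables["allelic"])
```

Each 2 × 2 table can then be tested with a 1 d.f. chi-square test.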
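Exactly n_AB heterozygotes
Thus, under the assumption of HWE, the probability of observing exactly n_AB heterozygotes in a sample of N individuals with n_A minor alleles is
$P(N_{AB} = n_{AB} \mid N, n_A) = \dfrac{2^{n_{AB}}\, N!}{n_{AA}!\, n_{AB}!\, n_{BB}!} \times \dfrac{n_A!\, n_B!}{(2N)!}$
where $n_B = 2N - n_A$, $n_{AA} = (n_A - n_{AB})/2$ and $n_{BB} = (n_B - n_{AB})/2$.
This equation holds for each possible number of heterozygotes, n_AB.
(Wigginton et al., 2005)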
The expression for $P(N_{AB} = n_{AB} \mid N, n_A)$ given in the equation leads to natural tests for HWE.
One sided tests:
• Deficit of heterozygotes: $P_{low} = P(N_{AB} \le n_{AB} \mid N, n_A)$ (inbreeding, stratification)
• Excess of heterozygotes: $P_{high} = P(N_{AB} \ge n_{AB} \mid N, n_A)$ (genotyping error)
In each case, the statistic can be calculated by simply summing over the equation, to include all possible values of N_AB that are no higher (for P_low) or no lower (for P_high) than the value observed in the actual data.
(Wigginton et al., 2005)
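The one sided tests can be sketched in a few lines. The probability used below is the standard exact HWE expression, P(N_AB = n_AB | N, n_A) = 2^n_AB · N! / (n_AA! n_AB! n_BB!) × n_A! n_B! / (2N)!, written out here as an assumption about the equation the slide refers to. Function names are hypothetical, and log-gamma is used only to avoid very large factorials.

```python
from math import exp, lgamma, log


def hwe_prob(n_ab, n, n_a):
    """P(N_AB = n_ab | N = n, n_A = n_a) under HWE."""
    n_b = 2 * n - n_a
    if n_ab < 0 or n_ab > min(n_a, n_b) or (n_a - n_ab) % 2:
        return 0.0                      # impossible heterozygote count
    n_aa = (n_a - n_ab) // 2
    n_bb = (n_b - n_ab) // 2
    log_p = (n_ab * log(2) + lgamma(n + 1)
             - lgamma(n_aa + 1) - lgamma(n_ab + 1) - lgamma(n_bb + 1)
             + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))
    return exp(log_p)


def hwe_one_sided(n_ab, n, n_a):
    """Return (P_low, P_high): deficit and excess of heterozygotes."""
    n_b = 2 * n - n_a
    # Heterozygote counts must have the same parity as n_a.
    possible = range(n_a % 2, min(n_a, n_b) + 1, 2)
    p_low = sum(hwe_prob(k, n, n_a) for k in possible if k <= n_ab)
    p_high = sum(hwe_prob(k, n, n_a) for k in possible if k >= n_ab)
    return p_low, p_high


# Hypothetical sample: 100 individuals, 21 minor alleles, 13 heterozygotes.
print(hwe_one_sided(13, 100, 21))
```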
Control genotypes should be in Hardy–Weinberg equilibrium, provided the population they are selected from is random mating and is large in size.
Suppose the population frequency of allele A is p and that of allele B is q = 1 − p; then the genotypes AA, AB and BB should have frequencies p², 2pq and q².
Provided the controls are in HWE, the cases may then be tested. If the SNP has a true genetic effect that is not controlled by a multiplicative model, the cases will not be in HWE (although again, the test has little power to detect small departures from HWE). If the cases are in HWE, the data may be analysed by allele counting, as any genetic effect is consistent with a multiplicative model.
A significant result showing that controls are not in Hardy–Weinberg equilibrium (HWE) could arise because of:
• random chance
• genotyping problems
• heterogeneous population
(Lewis, 2002)
The Odds Ratio: a Measure of Association
A useful statistic for measuring the level of association in contingency tables is the odds ratio, OR. If the odds are equal, their ratio equals one. The sample estimator of the odds ratio is the ratio of the odds observed in the two groups.
Odds Ratio: a measurement of association that is commonly used in case control studies. It is defined as the odds of exposure to the susceptible genetic variant in cases compared with that in controls. If the odds ratio is significantly greater than one, then the genetic variant is associated with the disease (Wang et al., 2005).
Log odds ratio: LOD = log(OR)
Confidence Interval & Interpretation
The standard error is needed to find a confidence interval with which to assess the null hypothesis of no association.
CI for OR = OR ± 1.96 × OR × σ
CI for LOD = LOD ± 1.96 × σ
where σ is the standard error of LOD.
A SNP has no influence on disease if the 95% CI for OR includes 1 or the CI for LOD includes 0.
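A minimal sketch of the odds ratio and its 95% confidence interval for a 2 × 2 table follows. Two choices here are assumptions rather than details from the slides: the standard error is Woolf's estimate for the log odds ratio, and the interval for OR is obtained by exponentiating the interval for the log odds ratio. The counts are hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """a, b: cases with / without the variant; c, d: controls with / without it."""
    odds_ratio = (a * d) / (b * c)
    lod = log(odds_ratio)                          # log odds ratio
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)       # Woolf's standard error (assumed)
    lod_ci = (lod - z * se, lod + z * se)
    or_ci = (exp(lod_ci[0]), exp(lod_ci[1]))
    return odds_ratio, or_ci, lod, lod_ci


odds_ratio, or_ci, lod, lod_ci = odds_ratio_ci(40, 80, 80, 40)
# No association is plausible only if or_ci contains 1 (equivalently, lod_ci contains 0).
print(odds_ratio, or_ci)
```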
Armitage's Trend test
The disadvantages of population stratification and confounding factors are overcome, to some extent, by applying Armitage's trend test, as suggested by Armitage (1955), Sasieni (1997), and Schaid and Jacobsen (1999).
There are three common choices of scoring system:
1) co-dominant score: x0 = 0, x1 = 1, and x2 = 2;
2) dominant score: x0 = 0, x1 = 1, and x2 = 1;
3) recessive score: x0 = 0, x1 = 0, and x2 = 1
Here, the names of the scoring systems are in favour of the minor allele "m".
Genotypes
            MM      Mm      mm      Total
Case        n10     n11     n12     N1
Control     n00     n01     n02     N0
Total       N0      N1      N2      N
Score       X0      X1      X2
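The trend test can be sketched from the genotype table and a choice of scores. The statistic below is the usual Cochran-Armitage form, T² / Var(T), referred to a chi-square distribution with 1 d.f.; the function name and the counts are hypothetical.

```python
from math import erfc, sqrt


def armitage_trend_test(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test on counts for genotypes MM, Mm, mm."""
    n1, n0 = sum(cases), sum(controls)
    n = n1 + n0
    cols = [a + b for a, b in zip(cases, controls)]          # genotype totals
    t = sum(x * (a * n0 - b * n1) for x, a, b in zip(scores, cases, controls))
    s2 = sum(x * x * c * (n - c) for x, c in zip(scores, cols))
    cross = sum(scores[i] * scores[j] * cols[i] * cols[j]
                for i in range(3) for j in range(i + 1, 3))
    var_t = (n1 * n0 / n) * (s2 - 2 * cross)                 # variance of T under the null
    chi2 = t * t / var_t
    p_value = erfc(sqrt(chi2 / 2))                           # upper tail of chi-square, 1 d.f.
    return chi2, p_value


# Hypothetical counts; co-dominant scores by default, dominant scores as an alternative.
print(armitage_trend_test([10, 20, 30], [30, 20, 10]))
print(armitage_trend_test([10, 20, 30], [30, 20, 10], scores=(0, 1, 1)))
```

Changing `scores` to (0, 1, 1) or (0, 0, 1) gives the dominant and recessive versions of the test.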
Figure 2 | Armitage test of single-SNP association with case–control outcome.
The dots indicate the proportion of cases, among cases and controls combined, at each of three SNP genotypes (coded as 0, 1 and 2), together with their least squares line.
The Armitage test corresponds to testing the hypothesis that the line has zero slope. Here, the line fits the data reasonably well as the heterozygote risk estimate is intermediate between the two homozygote risk estimates; this corresponds to additive genotype risks.
The test has good power in this case, but power is reduced by deviations from additivity.
In an extreme scenario, if the two homozygotes have the same risk but the heterozygote risk is different (overdominance), then the Armitage test will have no power.
(Balding, 2006)
Transmission Disequilibrium Test
The TDT tests for both linkage and association in families with observed transmissions from parents to affected offspring (Spielman et al., 1993).
It was originally developed to test for linkage in the presence of association, but its most common usage is now to test for association in the presence of linkage, since it is robust against population stratification.
The TDT tests for distortion in transmission of alleles from a heterozygous parent to an affected offspring.
(Lewis, 2002)
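A minimal sketch of the TDT statistic, assuming the usual McNemar-type form (b − c)² / (b + c), where b and c count how often heterozygous parents did and did not transmit a given allele to affected offspring. The function name and the counts are hypothetical.

```python
from math import erfc, sqrt


def tdt(transmitted, not_transmitted):
    """TDT chi-square (1 d.f.) from transmissions by heterozygous parents."""
    b, c = transmitted, not_transmitted
    chi2 = (b - c) ** 2 / (b + c)
    p_value = erfc(sqrt(chi2 / 2))   # upper tail of chi-square with 1 d.f.
    return chi2, p_value


# Hypothetical data: allele transmitted 60 times, not transmitted 40 times.
print(tdt(60, 40))
```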
(Lewis, 2002)
Continuous outcomes: Linear Regression (GLM)
Linear models are used to study how a quantitative variable depends on one or more predictors or explanatory variables.
The predictors themselves may be quantitative or qualitative (Rodriguez, 2007).
$y = \beta_0 + \beta_1 x + \varepsilon$
where
y = dependent variable
x = independent variable
β0, β1 = regression parameters
ε = random error
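For a quantitative trait regressed on genotype (coded 0, 1, 2), the least squares estimates of the two regression parameters can be sketched as below. The function name and the data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta1 = sxy / sxx                  # slope: trait change per allele copy
    beta0 = mean_y - beta1 * mean_x    # intercept
    return beta0, beta1


# Hypothetical genotypes and trait values.
genotype = [0, 1, 2, 1, 0, 2]
trait = [2.1, 2.4, 3.0, 2.6, 1.9, 3.1]
print(fit_line(genotype, trait))
```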
Software used: SAS 8.02 (PROC GLM)
Generalized Linear Model
Generalized linear models (GLMs) are a large class of statistical models for relating responses to linear combinations of predictor variables, including many commonly encountered types of dependent variables and error structures as special cases (Jackman, 2002).
Advantages of using GLMs:
• No need to transform the data into normality
• GLMs unify a wide variety of statistical methods
A GLM generalizes ordinary regression models in two ways. First, it allows Y to have a distribution other than the normal. Second, it allows modeling some function of the mean.
Both generalizations are important for categorical data (Agresti, 2007).
GLMs for binary data
• Logit model: logit link
• Probit model: probit link (transform to Z scores from the standard normal distribution)
Logit for single SNP
Each subject in our sample consists of a (y_i, x_i) pair, where y_i is case/control status (1/0) and x_i (0, 1, 2) is the genotype at the typed locus.
Genotype    x_i    Odds           Parameters
aa          0      α              β0
Aa          1      α(1 + …)       β1
AA          2      α(1 + …)²      β2
(Ref: Oxford University website, http://www.stats.ox.ac.uk/~mcvean/gwa4.pdf)
Now the transformation logit(π) = log(π / (1 − π)) is applied to π_i, the disease risk of the i-th individual.
The value of logit(π_i) is equated to either β0, β1, or β2, according to the genotype of individual i (β1 for heterozygotes).
The likelihood ratio test of this general model, against the null hypothesis β0 = β1 = β2, has 2 d.f., and for large sample sizes is equivalent to the Pearson 2 d.f. test.
Users can improve the power to detect specific disease risks, at the cost of lower power against some other risk models, by restricting the values of β0, β1 and β2.
Tests for recessive or dominant effects can be obtained by requiring that β0 = β1 or β1 = β2 (Balding, 2006).
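Because the general model has one parameter per genotype, its maximum likelihood estimates are simply the observed log odds within each genotype, so the 2 d.f. likelihood ratio test can be sketched without an iterative fit. The function name and counts are hypothetical; all counts are assumed to be non-zero.

```python
from math import exp, log


def genotype_logit_lrt(cases, controls):
    """Fit logit(pi_g) = beta_g per genotype and test beta0 = beta1 = beta2."""
    betas = [log(a / b) for a, b in zip(cases, controls)]   # observed log odds per genotype
    ll_full = 0.0
    for a, b in zip(cases, controls):
        risk = a / (a + b)
        ll_full += a * log(risk) + b * log(1 - risk)
    n_case, n_ctrl = sum(cases), sum(controls)
    risk0 = n_case / (n_case + n_ctrl)                      # common risk under the null
    ll_null = n_case * log(risk0) + n_ctrl * log(1 - risk0)
    lr = max(2 * (ll_full - ll_null), 0.0)                  # guard against rounding below zero
    p_value = exp(-lr / 2)                                  # chi-square upper tail, 2 d.f.
    return betas, lr, p_value


# Hypothetical counts for genotypes coded 0, 1, 2.
print(genotype_logit_lrt([10, 20, 30], [30, 20, 10]))
```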
Logistic Regression of Melanoma Status on Genotype
Risk factor                              Odds ratio    95% CI        P value
Models without covariate
SNP (no. of copies of 'T' allele)        0.78          0.67–0.93     0.004
Models with intermediate factor as covariate
SNP (no. of copies of 'T' allele)        0.89          0.74–1.07     0.23
Nevus count                              2.60          2.28–2.97     10^-43
(Zeggini and Morris, 2011)
Test of association: Multiple SNPs
SNP set analysis
Set association evaluates sets of SNP markers at various positions in the genome (in particular, in different susceptibility genes).
This method performs a simultaneous significance test on several sets of loci while keeping the overall type I error in control.
SNP set based analysis borrows information from different but correlated SNPs that are grouped on the basis of prior biological knowledge, and hence has the possibility of providing results with improved reproducibility and increased power, especially when individual SNP effects are moderate, as well as improved interpretability.
To increase the power of the test, it is sometimes feasible to combine relevant sources of information for a given SNP, such as allelic association (AA), Hardy–Weinberg disequilibrium (HWD), and evidence for genotyping errors (Heidema et al.).
(Wu et al., 2010)
This mode of analysis proceeds via a two step procedure:
• SNPs are assigned to sets on the basis of some meaningful biological criteria (genomic features), e.g. genes
• Then, tests for the association between each genomic feature and a disease phenotype are performed with the use of a logistic kernel machine based multilocus test, across the genome
SNP set analysis can prove advantageous over the standard analysis of individual SNPs. By forming SNP sets and testing each SNP set as a unit, we reduce the number of hypotheses being tested and thus relax the stringent conditions for reaching genome wide significance in the case of GWA.
The following are ways of grouping SNPs into sets:
• SNP location in or near a gene (gene based set analysis)
• Set formation on the basis of KEGG pathway
• Grouping SNPs into evolutionarily conserved regions
• Grouping SNPs into haplotype blocks
(Wu et al., 2010)
Genome wide SNP set testing
Assume population based case control status (for a single set):
• let z_i1, z_i2, ..., z_ip be genotype values for the SNPs in the SNP set for the i-th subject (i = 1, ..., n)
• the case control status for the i-th subject is denoted by y_i (y_i = 1 for cases, and y_i = 0 for controls)
• z_ij = 0, 1, 2, corresponding to homozygotes for the major allele, heterozygotes, and homozygotes for the minor allele, respectively
• further assume a collection of m additional demographic, environmental and other confounding variables
For the i-th subject let x_i1, x_i2, ..., x_im denote the values of the covariates that we would like to adjust for.
The goal of SNP set analysis is then to test the global null of whether any of the p SNPs are related to the outcome while adjusting for the additional covariates.
(Wu et al., 2010)
Logistic Kernel Machine Regression Model
The kernel machine framework has become very popular for modelling high dimensional biomedical data because of its ability to allow for complex/nonlinear relationships between the dependent and independent variables (Brown et al., 2000) while adjusting for covariate effects.
Under the logistic kernel machine regression model, the following is the model for the joint effect of the SNPs, allowing for interaction and considering other covariates:
$\mathrm{logit}\, P(y_i = 1) = \alpha_0 + \alpha_1 x_{i1} + \cdots + \alpha_m x_{im} + h(z_{i1}, \ldots, z_{ip})$
• in which α0 is the intercept
• α1, α2, ..., αm are regression coefficients corresponding to the environmental and demographic covariates
• the SNPs, z_i1, ..., z_ip, influence y_i through the general function h(·), which is an arbitrary function that has a form defined only by a positive, semidefinite kernel function K(·,·)
(Wu et al., 2010)
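The kernel function measures how similar two subjects are across the SNPs in a set. Two kernels often used with genotype data are sketched below as an illustration: a linear kernel and an identity-by-state (IBS) kernel. These particular forms and the function names are assumptions made for this example, not details stated on the slide.

```python
def linear_kernel(zi, zj):
    """Linear kernel: sum of products of genotype values."""
    return sum(a * b for a, b in zip(zi, zj))


def ibs_kernel(zi, zj):
    """IBS kernel: average proportion of alleles shared identical by state."""
    p = len(zi)
    return sum(2 - abs(a - b) for a, b in zip(zi, zj)) / (2 * p)


def kernel_matrix(genotypes, kernel):
    """n x n matrix of pairwise similarities between subjects."""
    return [[kernel(zi, zj) for zj in genotypes] for zi in genotypes]


# Hypothetical SNP set: 3 subjects genotyped at 4 SNPs (0/1/2 coding).
z = [[0, 1, 2, 0], [0, 1, 1, 0], [2, 0, 0, 2]]
print(kernel_matrix(z, ibs_kernel))
```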
Multifactor Dimensionality Reduction
MDR is a nonparametric data mining approach that reduces two or more SNPs, for example, to a new single variable that is then evaluated using a classifier such as Bayes or logistic regression.
In MDR, each multi-locus genotype of a SNP combination is assigned to a high risk or low risk group, depending on the ratio of cases and non-cases with this multi-locus genotype. If this ratio exceeds a certain threshold, this multi-locus genotype is assigned as high risk; otherwise it is assigned as low risk.
By assigning all multi-locus genotypes for a certain combination of SNPs to either high risk or low risk, MDR reduces the number of multi-locus genotypes to one risk factor consisting of two levels, high risk or low risk.
The aim is to construct a new risk factor that facilitates the detection of nonlinear interactions among SNPs such that the prediction of the outcome variable is improved.
(Ritchie et al., 2001)
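The labelling step can be sketched for a pair of SNPs as below. The function name, the data, and the threshold of 1.0 (a natural choice when cases and controls are equally numerous) are assumptions made for this example; a full MDR analysis would also cross-validate the resulting classifier.

```python
from collections import Counter


def mdr_risk_labels(genotype_pairs, status, threshold=1.0):
    """Label each two-locus genotype 'high' or 'low' risk by its case/control ratio."""
    cases = Counter(g for g, y in zip(genotype_pairs, status) if y == 1)
    controls = Counter(g for g, y in zip(genotype_pairs, status) if y == 0)
    labels = {}
    for g in set(cases) | set(controls):
        # A cell with cases but no controls is treated as having an infinite ratio.
        ratio = cases[g] / controls[g] if controls[g] else float("inf")
        labels[g] = "high" if ratio > threshold else "low"
    return labels


# Hypothetical data: (genotype at SNP 1, genotype at SNP 2) and case status.
pairs = [(0, 0), (0, 0), (0, 0), (1, 2), (1, 2), (1, 2)]
status = [1, 1, 0, 0, 0, 1]
print(mdr_risk_labels(pairs, status))
```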
(Lee et al., 2008)
Logistic regression
Logistic regression analyses for L SNPs are a natural extension of the single SNP analyses that are discussed in previous slides: there is now a coefficient (β0, β1 or β2) for each SNP, leading to a general test with 2L d.f. By constraining the coefficients, tests with L d.f. can be obtained.
Covariates such as sex, age or environmental exposures are readily included. Similarly, interactions between SNPs can be included (Balding, 2006).
Including interactions conveys little benefit, and can reduce power to detect an association, if there is a single underlying causal variant and little or no recombination between SNPs, but it is potentially useful for investigating epistatic effects.
(Wu et al., 2010)
tSNP can sometimes provide greater analytical power than single marker analysis for genetic association studies.
This is because haplotypes are inherited together in the majority of cases, and they incorporate linkage disequilibrium information (Akey and Xiong, 2003; Schaid, 2004).
Conversely, haplotype based statistical analysis has a weakness, since haplotypes are often not directly observable.
Hence, haplotypes and their frequencies are inferred by statistical methods such as the Expectation Maximization (EM) algorithm (Dempster et al., 1977; Excoffier and Slatkin, 1995) or the Bayesian method (Stephens et al., 2001; Lin et al., 2002).
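A minimal sketch of EM haplotype frequency estimation for two biallelic SNPs is given below. With two SNPs only the double heterozygote is ambiguous: it carries either haplotypes (0,0) and (1,1) or haplotypes (0,1) and (1,0). The function name, the 0/1 allele coding and the counts are assumptions made for this example.

```python
def em_haplotype_freqs(geno_counts, n_iter=100):
    """geno_counts[(g1, g2)] = number of individuals with g1, g2 copies of allele 1."""
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freqs = {h: 0.25 for h in haps}                  # start from equal frequencies
    total = 2 * sum(geno_counts.values())            # two haplotypes per individual
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for (g1, g2), n in geno_counts.items():
            if g1 == 1 and g2 == 1:
                # E-step: split double heterozygotes between the two possible phases.
                cis = freqs[(0, 0)] * freqs[(1, 1)]
                trans = freqs[(0, 1)] * freqs[(1, 0)]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                counts[(0, 0)] += n * w
                counts[(1, 1)] += n * w
                counts[(0, 1)] += n * (1 - w)
                counts[(1, 0)] += n * (1 - w)
            else:
                # Phase is unambiguous when at most one SNP is heterozygous.
                a1 = [1] * g1 + [0] * (2 - g1)
                a2 = [1] * g2 + [0] * (2 - g2)
                for h in zip(a1, a2):
                    counts[h] += n
        # M-step: update frequencies from the expected haplotype counts.
        freqs = {h: c / total for h, c in counts.items()}
    return freqs


# Hypothetical genotype counts at two SNPs.
print(em_haplotype_freqs({(0, 0): 10, (2, 2): 10, (1, 1): 10}))
```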
Given haplotype assignments, the simplest analysis involves testing for independence of rows and columns in a 2 × k contingency table, where k denotes the number of distinct haplotypes (Sham, 1998).
Alternative approaches can be based on the estimated haplotype proportions among cases and controls, without an explicit haplotype assignment for individuals (Schaid, 2004): the test compares the product of separate multinomial likelihoods for cases and controls with that obtained by combining cases and controls.
The haplotype based regression model is very useful in haplotype based association studies.
!1ang et al., *++2'
Regression Models for Haplotypes
Within the framework of the generalized linear model (GLM), the haplotype effect on traits can be statistically described and tested.
The model can be expressed as $E(Y) = f^{-1}(X\beta)$, where
• Y denotes the trait,
• X represents the haplotypes that are coded into the design matrix,
• β denotes the effects of haplotype, and
• f is a function that generalizes the usual linear regression, such as logistic regression in the case control study
!4ohee et al., *++5'
Let Y = 0 for control and 1 for case.
Let (h_i, h_j) be a random variable that denotes the pair of haplotypes for each individual, i = j or i ≠ j. Let H = {h_1, h_2, ..., h_p} be a set of haplotypes.
The maximum number of possible haplotypes is 2^m, where m is the number of SNPs.
In association studies, the main interest lies in estimating the effects of H on Y.
So, in a nutshell, regression models for haplotypes consist of:
• Predictor: haplotype counts
• Regression parameters: phenotypic effect of each haplotype
• Outcome: the phenotype of interest
!4ohee et al .) *++5'
Direct Design Matrix
Individual   Haplotypes   Probability   h1     h2     h3     h4
1            (h1, h1)     1.0           1      0      0      0
2            (h1, h4)     0.2           0.1    0.4    0.4    0.1
             (h2, h3)     0.8
3            (h2, h2)     1.0           0      1      0      0
4            (h2, h4)     1.0           0      0.5    0      0.5
5            (h1, h2)     0.2           0.25   0.20   0.25   0.30
             (h1, h4)     0.3
             (h2, h3)     0.2
             (h3, h4)     0.3
The direct type of design matrix relies on the estimated haplotype probabilities (proportions).
!4ohee et al., *++5'91
Indirect Design Matrix
Individual   Haplotypes   Probability   h1   h2   h3   h4   Weight
1            (h1, h1)     1.0           2    0    0    0    1
2            (h1, h4)     0.2           1    0    0    1    0.2
             (h2, h3)     0.8           0    1    1    0    0.8
3            (h2, h2)     1.0           0    2    0    0    1
4            (h2, h4)     1.0           0    1    0    1    1
5            (h1, h2)     0.2           1    1    0    0    0.2
             (h1, h4)     0.3           1    0    0    1    0.3
             (h2, h3)     0.2           0    1    1    0    0.2
             (h3, h4)     0.3           0    0    1    1    0.3
!4ohee et al., *++5'9&
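The two codings can be sketched from an individual's possible haplotype pairs and their probabilities. In this sketch the direct row is taken to hold the expected proportion of each haplotype, and the indirect coding one row of haplotype counts per possible pair, weighted by its probability. The function name and the 1-based haplotype indices are assumptions made for this example.

```python
def design_rows(phases, n_haplotypes=4):
    """phases: list of ((i, j), probability) for one individual, haplotypes numbered from 1.

    Returns (direct_row, indirect_rows), where indirect_rows is a list of
    (haplotype_counts, weight) pairs.
    """
    direct = [0.0] * n_haplotypes
    indirect = []
    for (i, j), prob in phases:
        counts = [0] * n_haplotypes
        counts[i - 1] += 1
        counts[j - 1] += 1
        indirect.append((counts, prob))       # one weighted row per possible pair
        for k, c in enumerate(counts):
            direct[k] += prob * c / 2         # expected proportion of haplotype k
    return direct, indirect


# An individual whose pair is (h1, h4) with probability 0.2 or (h2, h3) with probability 0.8.
print(design_rows([((1, 4), 0.2), ((2, 3), 0.8)]))
```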
Introduction to Bayesian Approach
A statistical school of thought that holds that inferences about any unknown parameter or hypothesis should be encapsulated in a probability distribution, given the observed data. Computing this posterior probability distribution usually proceeds by specifying a prior distribution that summarizes knowledge about the unknown before the observed data are considered, and then using Bayes' theorem to transform the prior distribution into a posterior distribution.
Bayesian methods provide an alternative approach to assessing associations that alleviates the limitations of p-values at the cost of some additional modelling assumptions.
Bayesian methods compute measures of evidence that can be directly compared among SNPs within and across studies.
(Stephens & Balding, 2009)
Calculating Probabilities of Association
This deals with computing, for each SNP in a GWAS, the probability that it is truly associated with the phenotype, also known as the "Posterior Probability of Association (PPA)".
The PPA can be thought of as the Bayesian analogue of a p-value obtained, for example, by using the Armitage trend test (ATT) or the Fisher exact test.
The calculation of the PPA can be split into three different steps:
• Choose a value for π, the prior probability of H1
• Compute a Bayes factor for each SNP
• Calculate the posterior odds on H1
(Stephens & Balding, 2009)
Step I: Choose a value for π, the prior probability of H1
• The value of π quantifies our prior assumption of each SNP being associated
• The value of π for H1 depends on prior knowledge, for example MAF, proximity to certain genes of interest, etc.
• If π is assumed to be the same for all SNPs, it can be interpreted as a prior estimate of the overall proportion of SNPs that are truly associated with a phenotype
• Typically, only a minority of SNPs is expected to be truly associated with a given phenotype: the range 10^-4 to 10^-6 has been suggested for π (The Wellcome Trust Case Control Consortium, 2007)
• The probability of H0 is taken to be 1 − π
Step II: Compute a Bayes factor for each SNP
• A Bayes factor (BF) is the ratio between the probabilities of the data under H1 and under H0
• The BF is similar to a likelihood ratio, but it compares two different models rather than two parameter values
(Stephens & Balding, 2009)
Step III: Calculate the posterior odds on H1
• The BF and π can be used to compute the posterior odds on H1: $PO = BF \times \pi / (1 - \pi)$
• This can be used to calculate the PPA: $PPA = PO / (1 + PO)$
• The PPA can be interpreted directly as a probability, irrespective of power, sample size or how many other SNPs were tested
• Intuitively, the PPA combines the evidence in the observed association data (the BF) with the prior probability (π) that a SNP is truly associated with the phenotype. Because π is typically so small, the BF has to be large (for example, >10^4 – 10^6) to provide convincing evidence for an association (that is, to give a PPA close to 1)
• The requirement for a large BF is analogous to setting a stringent threshold for genome wide significance in a frequentist approach
(Stephens and Balding, 2009)
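The three steps reduce to a short calculation once a Bayes factor is available. The sketch assumes the standard relations: posterior odds = BF × prior odds, with prior odds π / (1 − π), and PPA = posterior odds / (1 + posterior odds). The function name and the example values are hypothetical.

```python
def posterior_probability_of_association(bayes_factor, prior):
    """PPA from a Bayes factor and the prior probability of H1."""
    prior_odds = prior / (1 - prior)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1 + posterior_odds)


# With a prior of 10^-4, a Bayes factor of 10^6 is needed for a PPA near 0.99,
# while a Bayes factor of 100 leaves the PPA at about 0.01.
print(posterior_probability_of_association(1e6, 1e-4))
print(posterior_probability_of_association(100, 1e-4))
```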
Population stratification is probably the most often cited reason for non-replication of genetic association results, which have unfortunately been more the rule than the exception (Tabor et al., 2003; Weiss and Terwilliger, 2000).
Leading scientific journals have noted the importance of population stratification as a cause of non-replicated association outcomes (Anon, 1999), and it is usual practice in grant applications and manuscript peer review to demand that stratification is explicitly addressed (Gauderman, 1999).
Two circumstances must be met for population stratification to affect genetic association studies:
i. differences in disease prevalence must exist between cases and controls; and
ii. variations in allele frequency between groups must be present
(Stephen et al., 2003)
QQ Plot
(McCarthy et al., 2008)
The effect of stratification on association studies. (a) Stratification inflates χ² association statistics by a factor λ, which changes depending on the sample size. Scenario 1 corresponds to gross stratification; scenarios 2 and 3 correspond to the range of stratification estimated in the African American prostate cancer study; and scenario 4 corresponds to no stratification (Freedman et al., 2004).
In Genomic Control (GC) the Armitage test statistic is computed at each of the null SNPs, and λ is calculated as the empirical median divided by its expectation under the χ² distribution with 1 degree of freedom (Devlin and Roeder, 1999).
Then the Armitage test is applied at the candidate SNPs, and if λ > 1 the test statistics are divided by λ.
The motivation for GC is that, as we expect few if any of the null SNPs to be associated with the phenotype, a value of λ > 1 is likely to be due to the effect of population stratification.
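The GC calculation described above can be sketched as follows. This is a minimal illustration that assumes the test statistics at the null SNPs are already available as a list of 1-df χ² values; the function names and the toy numbers are hypothetical.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
# If Z is standard normal, Z**2 is chi-square(1), so the median of Z**2 is the
# square of the 75th percentile of Z.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def inflation_factor(null_stats):
    """Lambda: empirical median of the null-SNP statistics over its expectation."""
    return median(null_stats) / CHI2_1DF_MEDIAN


def genomic_control(candidate_stats, null_stats):
    """Divide candidate statistics by lambda, but only when lambda > 1."""
    lam = inflation_factor(null_stats)
    if lam > 1.0:
        return [s / lam for s in candidate_stats], lam
    return list(candidate_stats), lam


# Toy values for illustration only.
adjusted, lam = genomic_control([10.0, 4.0], [0.2, 0.91, 1.5])
print(f"lambda = {lam:.3f}, adjusted statistics = {adjusted}")
```

Leaving the statistics untouched when λ ≤ 1 follows the rule stated above; other conventions exist and would need a different branch.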
Other approaches
Null SNPs can mitigate the effects of population structure when included as covariates in regression analyses (Setakis et al., 2006).
Like GC, this approach does not explicitly model the population structure and is computationally fast, but it is much more flexible than GC because epistatic and covariate effects can be included in the regression model (Balding, 2006).
Empirically, the logistic regression approaches show greater power than GC, but their type 1 error rate must be assessed through simulation (Setakis et al., 2006).
Multiple testing
It refers to the problem that arises when many null hypotheses are tested; some significant results are likely even if all the null hypotheses are true (Balding, 2006).
Especially in GWA, each SNP that is analyzed constitutes one hypothesis test. In traditional hypothesis testing, the significance level is often set at 5%.
However, as shown in the table below, as the number of SNPs tested increases, the number of SNPs falsely claimed to be significant increases, provided that none of the SNPs is truly associated.

No. of SNPs tested    False positives
100                   5
10,000                500
500,000               25,000
(Scott, 2009)
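The figures in the table follow from multiplying the number of tests by the significance level. The sketch below reproduces that arithmetic and, as an added illustration, shows the per-test threshold a plain Bonferroni correction would use. It assumes every tested SNP is truly null; the function names are illustrative.

```python
def expected_false_positives(n_snps, alpha=0.05):
    """Expected number of SNPs wrongly declared significant when none is associated."""
    return n_snps * alpha


def bonferroni_threshold(n_snps, alpha=0.05):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_snps


for n in (100, 10_000, 500_000):
    print(
        f"{n:>9,} SNPs: expected false positives = "
        f"{expected_false_positives(n):,.0f}, "
        f"Bonferroni threshold = {bonferroni_threshold(n):.1e}"
    )
```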
Sequential Bonferroni rejection
This was proposed by Holm (1979).
The method is based on the Bonferroni test and requires the type I error to be as small as possible.
Philosophically, for each of these n tests, the probability of committing a type I error is less than or equal to a small predetermined value.
Notation which we will use further in this method:
n null hypotheses: H1, H2, H3, ..., Hn
n alternative hypotheses, indexed 1, 2, 3, ..., n
n test statistics, indexed 1, 2, 3, ..., n
n critical regions: C1, C2, C3, ..., Cn
(Scott, 2009)
Now let the corresponding p-values generated from the n test statistics be P1, P2, P3, ..., Pn (Pk for k = 1, 2, ..., n).
When these p-values are ordered, P(1) ≤ P(2) ≤ P(3) ≤ ... ≤ P(n), along with their corresponding hypotheses H(1), H(2), H(3), ..., H(n), the most significant hypotheses have the smallest p-values.
The SRB method attempts to solve this multiple testing problem by adjusting the significance level, α, for each hypothesis tested, before comparing it with the p-values. Specifically, the ordered p-values are compared with the corresponding levels α/n, α/(n − 1), ..., α/1, that is, α/(n − k + 1) for the k-th smallest p-value.
Hypotheses are rejected in this order until no other rejections are possible. Since the most important hypotheses have the smallest p-values, they are compared with the smallest levels.
(Scott, 2009)
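The sequential procedure can be sketched as below. This is a minimal illustration of Holm's step-down rule under the usual formulation, in which the k-th smallest p-value is compared with α/(n − k + 1) and testing stops at the first non-rejection. The function name, the use of "≤" at the boundary and the example p-values are assumptions made for illustration.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    n = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for rank, idx in enumerate(order):  # rank 0 is the smallest p-value
        level = alpha / (n - rank)      # alpha/n, alpha/(n-1), ..., alpha/1
        if p_values[idx] <= level:
            reject[idx] = True
        else:
            break                       # no further rejections are possible
    return reject


# Illustrative p-values only.
print(holm_reject([0.01, 0.04, 0.03, 0.005]))
```

Because the comparison level grows at each step, the procedure rejects at least as many hypotheses as a single Bonferroni threshold of α/n would.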
Conclusions
Family-based linkage mapping is the best approach to detect regions involved in recessive, highly penetrant Mendelian disease and has low resolution, while population-based genetic association study helps in mapping complex traits and has high resolution.

Case-control design applies to disease traits and two-tailed sampling design applies to quantitative traits.

Huge genotyping data from the increased use of GWAS, together with spurious association due to study design, have put forward computational challenges.

HWE, MSP and MAF are the main components in QC, and imputation methods use HMM, ML and multiple imputation for imputing missing alleles.
Logistic regression can adjust for the effect of covariates, giving an advantage over other categorical data tests.
In the case of a quantitative trait, the linear regression model and its variants, such as fixed, random or mixed models, can be applied depending on the types of independent variables.

Multiple-SNP models, apart from logistic regression and linear regression, use SNP-set analysis, which gives better results using a kernel function, and also use machine learning methods such as MDR and RFA.

Bayesian models compute the BF and use various models to derive the PPA, which is the analogue of the P-value in the frequentist approach.

Population stratification is one of the main reasons behind non-replication of GWAS in case-control design.
Haplotype-based regression models alleviate the problem of multiple testing and also increase the power of the test.
Genomic control is a widely used method to detect stratification.

Multiple testing requires the application of Bonferroni correction and FDR to control type I error.
Thank You
"In God we trust, all others must bring data"
- W. Edwards Deming
Supplementary Slides
Data Quality Control
For GWAS conducted by the Wellcome Trust Case Control Consortium (WTCCC), the criteria for retaining a SNP are HWE P-value ≥ 5.7×10^-7, MSP ≤ 5% if MAF ≥ 5%, MSP ≤ 1% if MAF < 5%, and MAF > 0.01 (WTCCC, 2007).
Sladek et al. (2007) included SNPs when the HWE P-value > 0.001, MSP ≤ 5% and MAF > 0.01.
Unoki et al. (2008) included SNPs when the HWE P-value ≥ 10^-6 and MSP ≤ 10%.
Statistical method used: PCA
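The WTCCC retention rule quoted above can be expressed as a small filter. This is a sketch under stated assumptions: the thresholds are those listed in the text, while the function name, the argument names and the treatment of values exactly on a boundary are illustrative choices.

```python
def passes_wtccc_qc(hwe_p, msp, maf):
    """Return True if a SNP is retained under the WTCCC-style criteria.

    hwe_p: P-value from the Hardy-Weinberg equilibrium test.
    msp:   proportion of missing genotypes for the SNP (0 to 1).
    maf:   minor allele frequency (0 to 0.5).
    """
    if hwe_p < 5.7e-7:   # strong departure from HWE
        return False
    if maf <= 0.01:      # minor allele too rare
        return False
    # Rarer SNPs are held to a stricter missingness limit (assumed boundary handling).
    max_missing = 0.05 if maf >= 0.05 else 0.01
    return msp <= max_missing


# Hypothetical SNP records: (HWE P-value, MSP, MAF).
snps = [(0.5, 0.02, 0.20), (0.5, 0.02, 0.03), (1e-8, 0.0, 0.20)]
print([passes_wtccc_qc(*snp) for snp in snps])
```

The other two studies quoted above would use the same structure with their own thresholds.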
Genotyping Error
Variation in DNA sequence
Low quantity and quality of DNA
Biochemical artifact and low quality reagent
Human factor
SNP QC software
SAQC (SNP Array Quality Control) on RDBSCAN
PCA
PCA is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
Haplotype analysis
Analysis methods based on single SNPs have limited power to detect a true genetic effect that requires a specific allele at several SNPs. This may be detected using haplotype-based methods, analysing all SNPs concurrently.
Genehunter allows haplotype analysis of up to four SNPs. One of the most flexible programs for TDT-type analysis is Transmit.
Software for TDT and haplotype analysis
TDT/Sib TDT: http://genomics.med.upenn.edu/spielman/TDT.htm
Genehunter: http://www.fhcrc.org/labs/kruglyak/Downloads/index.html
PDT: http://www.chg.duke.edu/software/pdt.html
ETDT: http://www.mds.qmw.ac.uk/statgen/dcurtis/software.html
Transmit: http://ftp-gene.cimr.cam.ac.uk/clayton/software/
EH: ftp://linkage.rockefeller.edu/software/eh/
Ehplus: http://www.iop.kcl.ac.uk/IoP/Departments/PsychMed/GEpiBSt/softwar
Example of Logistic Regression
where x_i^T is the ith row of the design matrix. With a suitable choice of design matrix, the regression coefficients are the logarithms of the odds ratio parameters.
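The link between coefficients and odds ratios can be illustrated for the simplest design, a single binary predictor such as carrier versus non-carrier of an allele. In that case the maximum-likelihood estimate of the slope in a logistic regression equals the logarithm of the sample odds ratio from the 2×2 table. The sketch below computes that quantity directly; the counts and function names are hypothetical.

```python
import math


def odds_ratio(case_carrier, case_noncarrier, control_carrier, control_noncarrier):
    """Sample odds ratio from a 2x2 table of carrier status by disease status."""
    return (case_carrier * control_noncarrier) / (case_noncarrier * control_carrier)


def log_odds_ratio(case_carrier, case_noncarrier, control_carrier, control_noncarrier):
    """Log odds ratio, which matches the logistic slope for one binary predictor."""
    return math.log(
        odds_ratio(case_carrier, case_noncarrier, control_carrier, control_noncarrier)
    )


# Hypothetical counts for illustration only.
counts = (30, 70, 15, 85)
beta = log_odds_ratio(*counts)
print(f"odds ratio = {odds_ratio(*counts):.3f}, coefficient (log OR) = {beta:.3f}")
```

Exponentiating a fitted coefficient therefore recovers the odds ratio, which is how logistic regression output is usually reported in association studies.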