Statistics for association study


  • 8/11/2019 Statistics for association study

    Statistics for Association Studies

    DEPT. OF ANIMAL GENETICS & BREEDING
    COLLEGE OF VETERINARY SCIENCE AND ANIMAL HUSBANDRY

    ANAND AGRICULTURAL UNIVERSITY
    ANAND - 388 001.

    Maul ! U"a#$%a%REG. NO. - O'-1'0(-)010

    AGB-*+1

    MAJOR ADVISOR

    DR. C. G. Joshi
    Professor and Head

    Dept. of Animal Biotechnology

    MINOR ADVISOR

    DR. D. N. Rank
    Professor and Head

    Dept. of Animal Genetics & Breeding

    POST-GRADUATE SEMINAR

    Flow of Presentation

    - Definition, need and outline; linkage and association studies
    - SNP quality control; missing data imputation
    - Single SNP; multiple SNP; haplotype models; Bayesian introduction
    - Definition, need and scope; methods to control; multiple correction
    - Conclusion

    Increasing Trend

    Nature Reviews Genetics carried nearly 30 review articles related to association analysis in one way or another (up to 2007).

    Lancet published a series of review and introductory articles in 2005 on genetic epidemiology, with association as the major component.

    Annual Review journals published many reviews that can be linked to association studies (Lee, 2007).

    [Figure: Published GWA Reports, 2005 - 6/2011. x-axis: calendar quarter; y-axis: total number of publications; value shown: 951.]

    (http://www.genome.gov/gwastudies/)

    Definition

    An association between a SNP and a phenotype that is present in the population from which a sample is taken (Stephens and Balding, 2009).

    Genetic association studies aim to detect association between one or more genetic polymorphisms and a trait, which might be some quantitative characteristic or a discrete attribute or disease (Cordell and Clayton, 2005).

    Genetic association studies assess correlations between genetic variants and trait differences on a population scale (Cardon and Bell, 2001).

    Example of Association in Case-Control Study

    Controls:
    0 1 1 1 1 1 0 1 0 2 1 2 2 0 1 0 0 0 1 1 2 0 1 1 1 2 0 0 0 1 0 1 1 0 1 1 0 1 0 0
    2 0 1 2 2 0 1 2 1 0 0 1 1 0 1 0 0 1 1 1 1 2 1 1 2 1 1 1 1 0 1 1 1 0 0 2 2 2 0 2

    Cases:
    1 1 2 1 0 1 2 1 1 1 1 2 1 2 1 2 1 2 1 1 2 2 1 2 0 1 0 0 0 1 2 2 1 2 1 2 1 0 2 1 0 1 1 0 0 2 1 0 0 2 1 1 1 2 1 1 2 0 1 0 0 1 1 0 0 1 0 2 2 1 1 1 1 2 0 1 2 1 1 2

    Goal: to identify the genetic basis of given phenotypes or diseases.

    (Ref: Oxford University Website, http://www.stats.ox.ac.uk/~mcvean/gwa4.pdf)

    Linkage and Association

    Association differs from linkage in that the same allele (or alleles) is associated with the trait in a similar manner across the whole population, while linkage allows different alleles to be associated with the trait in different families (Cardon and Bell, 2001).

    (Ref: Oxford University Website, http://www.stats.ox.ac.uk/~mcvean/gwa4.pdf)

    Causes of association

    - The polymorphism has a causal role (direct association);
    - the polymorphism has no causal role but is associated with a nearby causal variant (indirect association); or
    - the association is due to some underlying stratification or admixture of the population (confounded association).

    (Cordell and Clayton, 2006)

    Types of genetic association

    - Candidate polymorphism
    - Candidate gene
    - Fine mapping
    - Genome-wide association

    (Stephens and Balding, 2009)

    Designs for genetic association studies

    Following are the different types of designs of genetic association studies and their statistical analysis:

    Design                                  Statistical analysis
    1) Cross sectional                      Logistic / linear regression, chi-square test
    2) Cohort studies                       Survival analysis method
    3) Case control                         Logistic / linear regression, chi-square test
    4) Extreme value                        Linear regression & permutation approach
    5) Case-parent triad                    TDT, logistic, log-linear method
    6) Case-parent-grandparent septets      Log-linear methods
    7) General pedigree                     PDT, family-based association test, TDT
    8) Case only                            Logistic regression, chi-square

    (Cordell and Clayton, 2005)

    Test of Association: Single SNP association

    - Chi-square test
    - Armitage test
    - Fisher's exact test
    - General linear model
    - Logistic regression models

    (Balding, 2006)

    Test of Association: Multiple SNP association

    - MDR
    - SNP set association
    - Logistic regression
    - Haplotype-based regression model

    SNP Quality Control

    The quality control (QC) filtering of single nucleotide polymorphisms (SNPs) is an important step, especially in genome-wide association studies, to minimize potential false findings.

    SNP QC commonly uses expert-guided filters based on QC variables to remove SNPs with insufficient genotyping quality, such as:
    - Hardy-Weinberg equilibrium
    - missing proportion (MSP)
    - minor allele frequency (MAF)

    Following are some of the criteria for SNP QC:
    (i) percentage of SNPs excluded due to low quality
    (ii) inflation factor of the test statistics (λ)
    (iii) number of false associations found in the filtered dataset
    (iv) number of true associations missed in the filtered dataset

    (Pongpanich et al., 2010)

    SNP quality control (QC) is commonly safeguarded by supervised (i.e. expert-guided) filters to exclude low-quality SNPs.

    The supervised expert filters aim to remove SNPs that fall into the extremes of QC variables, including Hardy-Weinberg equilibrium (HWE), missing proportion (MSP) and minor allele frequency (MAF).

    The rationale is clear:

    - Extreme deviation from HWE is typically used to identify gross genotyping error (Teo et al., 2007).
    - A high MSP indicates poor genotype probe performance and low genotyping accuracy (Neale and Purcell, 2008; WTCCC, 2007).
    - SNPs with low MAF are more prone to error, as fewer samples would be within a genotype cluster and most clustering-based calling algorithms do not perform well with rare alleles (Neale and Purcell, 2008; Teo, 2008).
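The MSP and MAF filters described above can be sketched as follows. This is a minimal illustration, not a pipeline used by the presenter: the function names, the 0/1/2 genotype coding with `None` for a missing call, and the thresholds (MAF of at least 0.01, MSP of at most 0.05) are all assumptions chosen for the example.

```python
from typing import List, Optional

Genotypes = List[Optional[int]]  # 0/1/2 copies of one allele, None = missing call


def maf(genotypes: Genotypes) -> float:
    """Minor allele frequency computed from the non-missing genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))  # frequency of the counted allele
    return min(freq, 1 - freq)


def missing_proportion(genotypes: Genotypes) -> float:
    """Fraction of samples with no genotype call (MSP)."""
    return sum(g is None for g in genotypes) / len(genotypes)


def passes_qc(genotypes: Genotypes, min_maf: float = 0.01,
              max_msp: float = 0.05) -> bool:
    """Keep a SNP only if it is neither too rare nor too often missing.

    The thresholds are illustrative assumptions, not values from the slides.
    """
    return maf(genotypes) >= min_maf and missing_proportion(genotypes) <= max_msp
```

A Hardy-Weinberg filter would be applied alongside these two; it needs the exact test covered later in the talk.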

    Missing Genotypic Imputation

    For single-SNP analyses, if a few genotypes are missing there is not much of a problem.

    For multipoint SNP analyses, missing data can be more problematic because many individuals might have one or more missing genotypes.

    One convenient solution is data imputation: replacing missing genotypes with predicted values that are based on the observed genotypes at neighbouring SNPs.

    Genotype imputation is the term used to describe the process of predicting or imputing genotypes that are not directly assayed in a sample of individuals.

    (Balding, 2006)

    There are several distinct scenarios in which genotype imputation is desirable, but the term now most often refers to the situation in which a reference panel of haplotypes at a dense set of SNPs is used to impute into a study sample of individuals that have been genotyped at a subset of the SNPs (Marchini and Howie, 2010).

    Imputation methods work by combining a reference panel of individuals genotyped at a dense set of polymorphic sites (usually single nucleotide polymorphisms, or SNPs) with a study sample collected from a genetically similar population and genotyped at a subset of these sites (Howie et al., 2009).

    Imputation methods either seek a best prediction of a missing genotype, such as a maximum likelihood estimate (single imputation), or randomly select it from a probability distribution (multiple imputations) (Balding, 2006).

    The goal is to predict the genotypes at the SNPs that are not directly genotyped in the study sample.
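The difference between single and multiple imputation can be illustrated with a short sketch. It assumes that some imputation method has already produced a probability for each genotype (0, 1 or 2 copies of an allele) at one untyped SNP in one individual; the function names and the expected-dosage summary are illustrative additions, not part of the slides.

```python
import random
from typing import List, Sequence


def best_guess(probs: Sequence[float]) -> int:
    """Single imputation: return the most probable genotype (0, 1 or 2)."""
    return max(range(3), key=lambda g: probs[g])


def dosage(probs: Sequence[float]) -> float:
    """Expected allele count, a common summary that keeps the uncertainty."""
    return probs[1] + 2 * probs[2]


def multiple_imputations(probs: Sequence[float], n_draws: int,
                         seed: int = 0) -> List[int]:
    """Multiple imputation: draw genotypes at random from the distribution."""
    rng = random.Random(seed)
    return rng.choices([0, 1, 2], weights=probs, k=n_draws)
```

For probabilities (0.1, 0.7, 0.2) the best guess is the heterozygote, while repeated draws will sometimes return either homozygote, which is how multiple imputation carries the uncertainty into downstream tests.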

    Imputation of Genotypes

    [Figure: observed genotypes (e.g. AC, GT, AA, GG) and an imputation reference are passed to an imputation algorithm, which returns a posterior probability for each candidate genotype at every SNP (e.g. AA : AC : CC) and, from these, the predicted genotypes.]

    (Marchini and Howie, 2010)

    Genotype Imputation Methods: How it works

    IMPUTE v1: Extension of HMM

    IMPUTE v2: More flexible than v1, SNPs divided into two sets: Set T

    Uses of Imputation

    - Boosting power
    - Fine mapping
    - Meta-analysis
    - Imputation of untyped variation
    - Imputation of non-SNP variation
    - Correction of genotyping variation

    (Marchini and Howie, 2010)

    Single Locus association analysis

    Pearson goodness-of-fit test

    Categorical data may be displayed in contingency tables.

    The chi-square statistic compares the observed count in each table cell to the count which would be expected under the assumption of no association between the row and column classifications.

    The chi-square statistic may be used to test the hypothesis of no association between two or more groups, populations, or criteria.

    For a single SNP with alleles A and B tested in a case-control study, the data generated consist of six counts of the numbers of genotypes (AA, AB and BB) in cases and controls.

    Observed value for AA genotypes in cases: $O_1 = a$
    Expected value for AA genotypes in cases: $E_1 = (n_{case} \times n_{AA}) / N$
    Chi-square statistic: $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$

    (a) Full genotype table for a general genetic model

                AA          AB          BB          Total
    Cases       a           b           c           n_case (R1)
    Controls    d           e           f           n_cont (R2)
    Total       n_AA (C1)   n_AB (C2)   n_BB (C3)   N

    (b) Dominant model: allele B increases risk

                AA      AB+BB
    Case        a       b+c
    Control     d       e+f

    (c) Recessive model: two copies of allele B required for increased risk

                AA+AB   BB
    Case        a+b     c
    Control     d+e     f

    (d) Multiplicative model: r-fold increased risk for AB, r^2 increased risk for BB. Analysed by allele, not by genotype

                A       B
    Case        2a+b    b+2c
    Control     2d+e    e+2f

    (Lewis, 2002)
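The genotype table and its collapsed versions can be worked through in a few lines. The sketch below computes the Pearson chi-square statistic for any count table and builds the dominant, recessive and allele-count tables from the six cell counts a to f. The function names and the example counts are hypothetical; the p-value helper only covers 1 and 2 degrees of freedom, where the chi-square tail has a closed form.

```python
import math
from typing import Dict, List, Sequence, Tuple

Table = List[List[float]]


def chi_square(table: Table) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected count under no association
            stat += (obs - exp) ** 2 / exp
    return stat, (len(table) - 1) * (len(col_tot) - 1)


def p_value(stat: float, df: int) -> float:
    """Upper-tail chi-square probability; closed forms for 1 and 2 d.f. only."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise ValueError("this sketch only supports 1 or 2 degrees of freedom")


def collapse(cases: Sequence[int], controls: Sequence[int]) -> Dict[str, Table]:
    """Build the 2 x 2 tables for the dominant, recessive and allelic analyses."""
    (a, b, c), (d, e, f) = cases, controls
    return {
        "dominant": [[a, b + c], [d, e + f]],
        "recessive": [[a + b, c], [d + e, f]],
        "allelic": [[2 * a + b, b + 2 * c], [2 * d + e, e + 2 * f]],
    }


# Hypothetical counts of AA, AB, BB in cases and in controls.
stat, df = chi_square([[10, 20, 30], [30, 20, 10]])
```

With these made-up counts every expected cell is 20, so the full-table statistic is 20 on 2 d.f.; each collapsed table gives a 1 d.f. test aimed at one mode of inheritance.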

    Exactly n_AB heterozygotes

    Thus, under the assumption of HWE, the probability of observing exactly n_AB heterozygotes in a sample of N individuals with n_A minor alleles is

    $$P(N_{AB} = n_{AB} \mid N, n_A) = \frac{2^{n_{AB}}\, N!}{n_{AA}!\, n_{AB}!\, n_{BB}!} \times \frac{n_A!\, n_B!}{(2N)!}$$

    where n_B = 2N - n_A, n_AA = (n_A - n_AB)/2 and n_BB = (n_B - n_AB)/2. This equation holds for each possible number of heterozygotes, n_AB.

    (Wigginton et al., 2005)

    The expression for P(N_AB = n_AB | N, n_A) given in the equation leads to natural tests for HWE.

    One-sided tests:
    - Deficit of heterozygotes: P_low = P(N_AB ≤ n_AB | N, n_A) (inbreeding, stratification)
    - Excess of heterozygotes: P_high = P(N_AB ≥ n_AB | N, n_A) (genotyping error)

    In each case, the statistic can be calculated by simply summing the equation over all possible values of N_AB that are no higher (for P_low) or no lower (for P_high) than the value observed in the actual data.

    (Wigginton et al., 2005)
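The exact HWE probability and the two one-sided tests can be sketched as below. The formula coded here is the standard exact expression from the cited Wigginton et al. paper, P(N_AB = n_AB | N, n_A) = 2^n_AB N! / (n_AA! n_AB! n_BB!) × n_A! n_B! / (2N)!; the function names are hypothetical, and log-gamma is used only to avoid very large factorials.

```python
import math
from typing import Tuple


def hwe_prob(n_ab: int, n: int, n_a: int) -> float:
    """P(N_AB = n_ab | N = n, n_A = n_a) under Hardy-Weinberg equilibrium."""
    n_b = 2 * n - n_a
    # Heterozygote counts must match the parity of n_a and fit both alleles.
    if n_ab < 0 or n_ab > min(n_a, n_b) or (n_a - n_ab) % 2:
        return 0.0
    n_aa = (n_a - n_ab) // 2
    n_bb = (n_b - n_ab) // 2
    log_p = (n_ab * math.log(2) + math.lgamma(n + 1)
             - math.lgamma(n_aa + 1) - math.lgamma(n_ab + 1)
             - math.lgamma(n_bb + 1)
             + math.lgamma(n_a + 1) + math.lgamma(n_b + 1)
             - math.lgamma(2 * n + 1))
    return math.exp(log_p)


def hwe_one_sided(n_ab: int, n: int, n_a: int) -> Tuple[float, float]:
    """Return (P_low, P_high): heterozygote deficit and excess tests."""
    n_b = 2 * n - n_a
    values = range(0, min(n_a, n_b) + 1)
    p_low = sum(hwe_prob(k, n, n_a) for k in values if k <= n_ab)
    p_high = sum(hwe_prob(k, n, n_a) for k in values if k >= n_ab)
    return p_low, p_high
```

As a small check, with N = 2 individuals and n_A = 2 the only possible outcomes are 0 or 2 heterozygotes, with probabilities 1/3 and 2/3.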

    Control genotypes should be in Hardy-Weinberg equilibrium, provided the population they are selected from is random mating and is large in size.

    Suppose the population frequency of allele A is p and allele B is q = 1 - p; then the genotypes AA, AB and BB should have frequencies p^2, 2pq and q^2.

    Provided the controls are in HWE, the cases may then be tested. If the SNP has a true genetic effect that is not controlled by a multiplicative model, the cases will not be in HWE (although again, the test has little power to detect small departures from HWE). If the cases are in HWE, the data may be analysed by allele counting, as any genetic effect is consistent with a multiplicative model.

    A significant result showing that controls are not in Hardy-Weinberg equilibrium (HWE) could arise because of:
    - random chance
    - genotyping problems
    - heterogeneous population

    (Lewis, 2005)

    The Odds Ratio: a Measure of Association

    A useful statistic for measuring the level of association in contingency tables is the odds ratio (OR). If the odds are equal, their ratio equals one. A sample estimator of the odds ratio is the cross-product ratio of the 2 × 2 table of counts.

    Odds ratio: a measurement of association that is commonly used in case-control studies. It is defined as the odds of exposure to the susceptible genetic variant in cases compared with that in controls. If the odds ratio is significantly greater than one, then the genetic variant is associated with the disease (Wang et al., 2005).

    LOD (log odds ratio) = ln(OR)

    Confidence Interval & Interpretation

    The standard error (SE) of the log odds ratio is needed to find the confidence interval used to assess the null hypothesis of no association.

    95% CI for OR = OR ± 1.96 × OR × SE
    95% CI for LOD = LOD ± 1.96 × SE

    The SNP has no influence on disease if the 95% CI for OR includes 1 or the CI for LOD includes 0.
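A worked sketch of the odds ratio and its 95% confidence interval for a 2 × 2 table follows. The cell names and example counts are hypothetical. The standard error is the usual large-sample value for the log odds ratio, sqrt(1/a + 1/b + 1/c + 1/d), and the interval for the OR is obtained here by exponentiating the log-odds interval (Woolf's method), which is a choice made for this example rather than something stated on the slide.

```python
import math
from typing import Tuple


def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96
                  ) -> Tuple[float, float, Tuple[float, float], Tuple[float, float]]:
    """Odds ratio, log odds ratio and their confidence intervals.

    a, b = cases with / without the variant;
    c, d = controls with / without the variant (hypothetical layout).
    """
    odds_ratio = (a * d) / (b * c)          # cross-product ratio
    log_or = math.log(odds_ratio)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_ci = (log_or - z * se, log_or + z * se)
    or_ci = (math.exp(log_ci[0]), math.exp(log_ci[1]))
    return odds_ratio, log_or, log_ci, or_ci


odds_ratio, log_or, log_ci, or_ci = odds_ratio_ci(20, 10, 10, 20)
```

For these made-up counts the OR is 4 and the log-odds interval lies entirely above 0, so the interval for the OR lies above 1 and the variant would be declared associated.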

    Armitage's Trend test

    The disadvantages of population stratification and confounding factors are overcome, to some extent, by applying Armitage's trend test, as suggested by Armitage (1955), Sasieni (1997), and Schaid and Jacobsen (1999).

    There are three common choices of scoring system:
    1) co-dominant score: x_0 = 0, x_1 = 1, and x_2 = 2;
    2) dominant score: x_0 = 0, x_1 = 1, and x_2 = 1;
    3) recessive score: x_0 = 0, x_1 = 0, and x_2 = 1.

    Here, the names of the scoring systems are in favour of the minor allele 'm'.

                Genotypes
                MM      Mm      mm      Total
    Case        n_10    n_11    n_12    N1
    Control     n_00    n_01    n_02    N0
    Total       N_0     N_1     N_2     N
    Score       x_0     x_1     x_2
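The trend statistic for the table above can be sketched as follows. The slide does not show the formula, so the standard Cochran-Armitage form is assumed here: with scores x_j, case counts r_j, genotype totals n_j, R cases and N subjects, chi-square = N (N Σ x_j r_j - R Σ x_j n_j)^2 / [R (N - R) (N Σ x_j^2 n_j - (Σ x_j n_j)^2)] on 1 d.f. Function and variable names are hypothetical.

```python
import math
from typing import Sequence, Tuple

# The three scoring systems listed on the slide.
SCORES = {
    "co-dominant": (0, 1, 2),
    "dominant": (0, 1, 1),
    "recessive": (0, 0, 1),
}


def armitage_trend(cases: Sequence[int], controls: Sequence[int],
                   scores: Sequence[int] = (0, 1, 2)) -> Tuple[float, float]:
    """Cochran-Armitage trend chi-square (1 d.f.) and its p-value."""
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)
    r = sum(cases)
    sx_r = sum(x * c for x, c in zip(scores, cases))
    sx_n = sum(x * t for x, t in zip(scores, totals))
    sxx_n = sum(x * x * t for x, t in zip(scores, totals))
    stat = n * (n * sx_r - r * sx_n) ** 2 / (
        r * (n - r) * (n * sxx_n - sx_n ** 2))
    return stat, math.erfc(math.sqrt(stat / 2))  # 1 d.f. chi-square tail


# Hypothetical counts of MM, Mm, mm in cases and controls.
stat, p = armitage_trend((10, 20, 30), (30, 20, 10), SCORES["co-dominant"])
```

In this made-up table the case proportion rises linearly with the number of minor alleles, so the 1 d.f. trend statistic captures all of the association.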

    Figure 2 | Armitage test of single-SNP association with case-control outcome.

    The dots indicate the proportion of cases, among cases and controls combined, at each of three SNP genotypes (coded as 0, 1 and 2), together with their least squares line.

    The Armitage test corresponds to testing the hypothesis that the line has zero slope. Here, the line fits the data reasonably well, as the heterozygote risk estimate is intermediate between the two homozygote risk estimates; this corresponds to additive genotype risks.

    The test has good power in this case, but power is reduced by deviations from additivity.

    In an extreme scenario, if the two homozygotes have the same risk but the heterozygote risk is different (over-dominance), then the Armitage test will have no power for any sample size.

    (Balding, 2006)

    Transmission Disequilibrium Test

    The TDT tests for both linkage and association in families with observed transmissions from parents to affected offspring (Spielman et al., 1993).

    It was originally developed to test for linkage in the presence of association, but its most common usage is now to test for association in the presence of linkage, since it is robust against population stratification.

    The TDT tests for distortion in transmission of alleles from a heterozygous parent to an affected offspring.

    (Lewis, 2002)
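The transmission distortion described above is usually measured with a McNemar-type statistic. The sketch below assumes the standard form, (b - c)^2 / (b + c) on 1 d.f., where b is the number of heterozygous parents who transmitted the allele of interest to an affected offspring and c the number who did not; the formula, names and counts are assumptions for illustration, since the slide gives no equation.

```python
import math
from typing import Tuple


def tdt(transmitted: int, untransmitted: int) -> Tuple[float, float]:
    """TDT chi-square (1 d.f.) and p-value from heterozygous-parent counts."""
    b, c = transmitted, untransmitted
    stat = (b - c) ** 2 / (b + c)
    return stat, math.erfc(math.sqrt(stat / 2))


# Hypothetical data: 60 transmissions versus 40 non-transmissions.
stat, p = tdt(60, 40)
```

Under no linkage or association each allele is transmitted half the time, so b and c should be roughly equal; only heterozygous parents are informative.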

    (Lewis, 2002)

    Continuous outcomes: Linear Regression (GLM)

    Linear models are used to study how a quantitative variable depends on one or more predictors or explanatory variables.

    The predictors themselves may be quantitative or qualitative (Rodriguez, 2007).

    $y = \beta_0 + \beta_1 x + \varepsilon$

    where
    y = dependent variable
    x = independent variable
    β_0, β_1 = regression parameters
    ε = random error
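For a quantitative trait regressed on a single SNP coded 0, 1 or 2, the least-squares estimates of the intercept and slope have a closed form. The sketch below is an illustration with made-up data, not the SAS procedure used in the seminar; the function name is hypothetical.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares estimates (b0, b1) for y = b0 + b1 * x + error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx              # change in trait per extra allele copy
    b0 = mean_y - b1 * mean_x   # expected trait value for genotype 0
    return b0, b1


# Hypothetical genotypes and trait values.
b0, b1 = fit_line([0, 1, 2, 0, 1, 2], [1.0, 3.0, 5.0, 1.0, 3.0, 5.0])
```

Here the slope estimates the additive effect of one extra copy of the counted allele on the trait.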

    Software used: SAS 8.02 (PROC GLM)

    Generalized Linear Model

    Generalized linear models (GLMs) are a large class of statistical models for relating responses to linear combinations of predictor variables, including many commonly encountered types of dependent variables and error structures as special cases.

    (Jakman, 2002)

    Advantages of using GLMs:
    - No need to transform the data into normality
    - GLMs unify a wide variety of statistical methods

    A GLM generalizes ordinary regression models in two ways. First, it allows Y to have a distribution other than the normal. Second, it allows modeling some function of the mean.

    Both generalizations are important for categorical data (Agresti, 2007).

    GLMs for binary data

    - Logit model: logit link
    - Probit model: probit link; transform to Z scores from the standard normal distribution (SND)

    Logit for single SNP

    Each subject in our sample consists of a (y_i, x_i) pair, where y_i is case/control status (1/0) and x_i (0, 1, 2) is the genotype at the typed locus.

    Genotype    x_i    Odds          Parameter
    aa          0      exp(β_0)      β_0
    Aa          1      exp(β_1)      β_1
    AA          2      exp(β_2)      β_2

    (Ref: Oxford University Website, http://www.stats.ox.ac.uk/~mcvean/gwa4.pdf)

    Now the transformation logit(π) = log(π / (1 - π)) is applied to π_i, the disease risk of the i-th individual.

    The value of logit(π_i) is equated to either β_0, β_1, or β_2, according to the genotype of individual i (β_1 for heterozygotes).

    The likelihood ratio test of this general model, against the null hypothesis β_0 = β_1 = β_2, has 2 d.f., and for large sample sizes is equivalent to the Pearson 2-d.f. test.

    Users can improve the power to detect specific disease risks, at the cost of lower power against some other risk models, by restricting the values of β_0, β_1 and β_2.

    Tests for recessive or dominant effects can be obtained by requiring that β_0 = β_1 or β_1 = β_2 (Balding, 2006).
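The general genotype model has one parameter per genotype, so its maximum-likelihood fit is simply the observed case proportion in each genotype class, and the likelihood ratio test against the null of equal parameters can be computed without an iterative fit. The sketch below relies on that fact; the names and counts are hypothetical and every cell is assumed to be non-zero.

```python
import math
from typing import List, Sequence, Tuple


def logit(p: float) -> float:
    """logit(p) = log(p / (1 - p))."""
    return math.log(p / (1 - p))


def genotype_lrt(cases: Sequence[int], controls: Sequence[int]
                 ) -> Tuple[List[float], float, float]:
    """Per-genotype logit parameters, 2 d.f. LRT statistic and p-value.

    Assumes all six counts are positive (an assumption of this sketch).
    """
    def loglik(r: int, s: int, p: float) -> float:
        # Binomial log-likelihood up to a constant that cancels in the LRT.
        return r * math.log(p) + s * math.log(1 - p)

    r_tot, s_tot = sum(cases), sum(controls)
    ll_null = loglik(r_tot, s_tot, r_tot / (r_tot + s_tot))
    ll_full = sum(loglik(r, s, r / (r + s)) for r, s in zip(cases, controls))
    betas = [logit(r / (r + s)) for r, s in zip(cases, controls)]
    stat = 2 * (ll_full - ll_null)
    return betas, stat, math.exp(-stat / 2)  # exact chi-square tail for 2 d.f.


# Hypothetical counts per genotype (0, 1, 2 copies) in cases and controls.
betas, stat, p = genotype_lrt((10, 20, 30), (30, 20, 10))
```

For these counts the statistic is close to the Pearson value of 20 obtained from the same table, in line with the large-sample equivalence noted above.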

    Logistic Regression of Melanoma Status on Genotype

    Risk factor                             Odds ratio    95% CI         P value
    Models without covariate
      SNP (no. of copies of 'T' allele)     0.78          0.67 - 0.93    0.004
    Models with intermediate factor as covariate
      SNP (no. of copies of 'T' allele)     0.89          0.74 - 1.07    0.23
      Nevus count                           2.60          2.28 - 2.97    10^-43

    (Zeggini and Morris, 2011)

    Test of association: Multiple SNPs

    SNP set analysis

    Set association is used to evaluate sets of SNP markers at various positions in the genome (in particular, in different susceptibility genes).

    This method performs a simultaneous significance test on several sets of loci while keeping the overall type I error in control.

    SNP-set-based analysis borrows information from different but correlated SNPs that are grouped on the basis of prior biological knowledge, and hence has the possibility of providing results with improved reproducibility and increased power, especially when individual SNP effects are moderate, as well as improved interpretability.

    To increase the power of the test, it is sometimes feasible to combine relevant sources of information for a given SNP, such as allelic association (AA), Hardy-Weinberg disequilibrium (HWD), and evidence for genotyping errors (Heidema et al.,

    (Wu et al., 2010)

    This mode of analysis proceeds via a two-step procedure:
    - SNPs are assigned to sets on the basis of some meaningful biological criteria (genomic features), e.g. genes.
    - Then, tests for the association between each genomic feature and a disease phenotype are performed with the use of a logistic kernel-machine-based multilocus test, across the genome.

    SNP set analysis can prove advantageous over the standard analysis of individual SNPs. By forming SNP sets and testing each SNP set as a unit, we reduce the number of hypotheses being tested and thus relax the stringent conditions for reaching genome-wide significance in the case of GWA.

    There are the following ways of grouping SNPs into sets:
    - SNP location in or near a gene (gene-based set analysis)
    - Set formation on the basis of KEGG pathway
    - Grouping SNPs onto evolutionarily conserved regions
    - Grouping SNPs into haplotype blocks

    (Wu et al., 2010)

    Genome-wide SNP set testing

    Assume a population-based case-control study (for a single set):
    - Let z_i1, z_i2, ..., z_ip be the genotype values for the SNPs in the SNP set for the i-th subject (i = 1, ..., n).
    - The case-control status for the i-th subject is denoted by y_i (y_i = 1 for cases, and y_i = 0 for controls).
    - z_ij = 0, 1, 2, corresponding to homozygotes for the major allele, heterozygotes, and homozygotes for the minor allele, respectively.
    - Further assume a collection of m additional demographic, environmental and other confounding variables. For the i-th subject, let x_i1, x_i2, ..., x_im denote the values of the covariates that we would like to adjust for.

    The goal of SNP set analysis is then to test the global null of whether any of the p SNPs are related to the outcome while adjusting for the additional covariates.

    (Wu et al., 2010)

    Logistic Kernel Machine Regression Model

    The kernel machine framework has become very popular for modelling high-dimensional biomedical data because of its ability to allow for complex/nonlinear relationships between the dependent and independent variables (Brown et al., 2000) while adjusting for covariate effects.

    Under the logistic kernel machine regression model, the following is the model for the joint SNP effect, considering other covariates:

    $$\operatorname{logit} P(y_i = 1) = \alpha_0 + \alpha_1 x_{i1} + \dots + \alpha_m x_{im} + h(z_{i1}, \dots, z_{ip})$$

    - α_0 is the intercept.
    - α_1, α_2, ..., α_m are regression coefficients corresponding to the environmental and demographic covariates.
    - The SNPs z_i1, ..., z_ip influence y_i through the general function h(·), which is an arbitrary function that has a form defined only by a positive semi-definite kernel function K(·,·).

    (Wu et al., 2010)

    Multifactor Dimensionality Reduction (MDR)

    MDR is a nonparametric data mining approach to reduce two or more SNPs, for example, to a new single variable that is then evaluated using a classifier such as Bayes or logistic regression.

    In MDR, each multi-locus genotype of a SNP combination is assigned to a high-risk or low-risk group, depending on the ratio of cases and non-cases with this multi-locus genotype. If this ratio exceeds a certain threshold, the multi-locus genotype is assigned as high risk; otherwise it is assigned as low risk.

    By assigning all multi-locus genotypes for a certain combination of SNPs to either high risk or low risk, MDR reduces the number of multi-locus genotypes to one risk factor consisting of two levels, high risk or low risk.

    The aim is to construct a new risk factor that facilitates the detection of nonlinear interactions among SNPs such that the prediction of the outcome variable is improved over the

    (Ritchie et al., 2001)
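The high-risk / low-risk pooling step of MDR can be sketched in a few lines. This covers only the labelling step described above, not the cross-validation that a full MDR analysis wraps around it. The function name, the threshold of 1.0 and the rule that a genotype seen only in cases counts as high risk are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

Genotype = Tuple[int, ...]  # one multi-locus genotype, e.g. (0, 2) for two SNPs


def mdr_risk_groups(genotypes: Sequence[Genotype], status: Sequence[int],
                    threshold: float = 1.0) -> Dict[Genotype, str]:
    """Label each observed multi-locus genotype as 'high' or 'low' risk."""
    cases = Counter(g for g, y in zip(genotypes, status) if y == 1)
    controls = Counter(g for g, y in zip(genotypes, status) if y == 0)
    groups = {}
    for g in set(cases) | set(controls):
        n_case, n_ctrl = cases[g], controls[g]
        # Assumption: a cell with cases but no controls is treated as high risk.
        high = n_ctrl == 0 or n_case / n_ctrl > threshold
        groups[g] = "high" if high else "low"
    return groups


# Hypothetical two-SNP genotypes and case (1) / control (0) labels.
genos = [(0, 0), (0, 0), (0, 0), (1, 2), (1, 2), (1, 2), (2, 1)]
labels = [1, 0, 0, 1, 1, 0, 1]
groups = mdr_risk_groups(genos, labels)
```

The resulting two-level variable replaces the nine possible two-SNP genotypes as a single predictor.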

    (Lee et al., 2008)

    Logistic regression

    Logistic regression analyses for L SNPs are a natural extension of the single-SNP analyses discussed in previous slides: there is now a coefficient (β_0, β_1 or β_2) for each SNP, leading to a general test with 2L d.f. By constraining the coefficients, tests with L d.f. can be obtained.

    Covariates such as sex, age or environmental exposures are readily included. Similarly, interactions between SNPs can be included (Balding, 2006).

    Including interactions conveys little benefit, and can reduce power to detect an association, if there is a single underlying causal variant and little or no recombination between SNPs, but it is potentially useful for investigating epistatic effects.

    (Wu et al., 2010)

    tSNP can sometimes provide greater analytical power than single-marker analysis for genetic association studies.

    This is because haplotypes are inherited together in the majority of cases, and they incorporate linkage disequilibrium information (Akey and Xiong, 2003; Schaid, 2004).

    Conversely, haplotype-based statistical analysis has a weakness, since haplotypes are often not directly observable.

    Hence, haplotypes and their frequencies are inferred by statistical methods such as the Expectation-Maximization (EM) algorithm (Dempster et al., 1977; Excoffier and Slatkin, 1995) or the Bayesian method (Stephens et al., 2001; Lin et al., 2002).

    Given haplotype assignments, the simplest analysis involves testing for independence of rows and columns in a 2 × k contingency table, where k denotes the number of distinct haplotypes (Sham, 1998).

    Alternative approaches can be based on the estimated haplotype proportions among cases and controls, without an explicit haplotype assignment for individuals (Schaid, 2004): the test compares the product of separate multinomial likelihoods for cases and controls with that obtained by combining cases and controls.

    The haplotype-based regression model is very useful in haplotype-based association studies.

    (Wang et al., 2005)

    Regression Models for Haplotypes

    Within the framework of the generalized linear model (GLM), the haplotype effect on traits can be statistically described and tested.

    The model can be expressed as

    $E(Y) = f^{-1}(X\beta)$

    where
    Y denotes the trait,
    X represents the haplotypes that are coded into the design matrix,
    β denotes the effects of haplotype, and
    f is a function that generalizes the usual linear regression, such as logistic regression in the case-control study.

    !4ohee et al., *++5'

    Let Y = 0 for control and 1 for case.

    Let (h_i, h_j) be a random variable that denotes the pair of haplotypes for each individual, with i = j or i ≠ j. Let H = {h_1, h_2, ..., h_p} be a set of haplotypes.

    The maximum number of possible haplotypes is 2^m, where m is the number of SNPs.

    In association studies, the main interest lies in estimating the effects of H on Y. So, in a nutshell, regression models for haplotypes consist of:

    - Predictor: haplotype counts
    - Regression parameters: phenotypic effect of each haplotype
    - Outcome: the phenotype of interest

    !4ohee et al .) *++5'

    Direct Design Matrix

    Individual   Haplotypes   Probability   h1     h2     h3     h4
    1            (h1, h1)     1             1      0      0      0
    2            (h1, h4)     0.2           0.1    0.4    0.4    0.1
                 (h2, h3)     0.8
    3            (h2, h2)     1             0      1      0      0
    4            (h2, h4)     1             0      0.5    0      0.5
    5            (h1, h2)     0.2           0.25   0.20   0.25   0.30
                 (h1, h4)     0.3
                 (h2, h3)     0.2
                 (h3, h4)     0.3

    The direct type of design matrix relies on the estimated haplotype probabilities (proportions).

    !4ohee et al., *++5'91

    Indirect Design Matrix

    Individual   Haplotypes   Probability   h1   h2   h3   h4   Weight
    1            (h1, h1)     1             2    0    0    0    1
    2            (h1, h4)     0.2           1    0    0    1    0.2
                 (h2, h3)     0.8           0    1    1    0    0.8
    3            (h2, h2)     1             0    2    0    0    1
    4            (h2, h4)     1             0    1    0    1    1
    5            (h1, h2)     0.2           1    1    0    0    0.2
                 (h1, h4)     0.3           1    0    0    1    0.3
                 (h2, h3)     0.2           0    1    1    0    0.2
                 (h3, h4)     0.3           0    0    1    1    0.3

    !4ohee et al., *++5'9&
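The two design matrices are linked: each direct row is the probability-weighted average of that individual's indirect rows, expressed as a proportion (haplotype count divided by 2). The sketch below shows this for one individual; the function name and the 1-based haplotype indexing are assumptions for the example.

```python
from typing import List, Sequence, Tuple

HaplotypePair = Tuple[Tuple[int, int], float]  # ((i, j), probability), 1-based


def direct_row(pairs: Sequence[HaplotypePair], n_haplotypes: int) -> List[float]:
    """Expected haplotype proportions for one individual.

    Each compatible pair (h_i, h_j) contributes half of its probability to
    h_i and half to h_j, so a homozygous pair gives its full weight to one
    haplotype.
    """
    row = [0.0] * n_haplotypes
    for (i, j), prob in pairs:
        row[i - 1] += prob / 2
        row[j - 1] += prob / 2
    return row


# Individual 2 from the tables: (h1, h4) with 0.2 and (h2, h3) with 0.8.
row_2 = direct_row([((1, 4), 0.2), ((2, 3), 0.8)], 4)
# Individual 4: (h2, h4) with probability 1.
row_4 = direct_row([((2, 4), 1.0)], 4)
```

Individual 2 gives (0.1, 0.4, 0.4, 0.1) and individual 4 gives (0, 0.5, 0, 0.5), matching the direct design matrix.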

    Introduction to Bayesian Approach

    A statistical school of thought that holds that inferences about any unknown parameter or hypothesis should be encapsulated in a probability distribution, given the observed data. Computing this posterior probability distribution usually proceeds by specifying a prior distribution that summarizes knowledge about the unknown before the observed data are considered, and then using Bayes' theorem to transform the prior distribution into a posterior distribution.

    Bayesian methods provide an alternative approach to assessing associations that alleviates the limitations of p-values at the cost of some additional modelling assumptions.

    Bayesian methods compute measures of evidence that can be directly compared among SNPs within and across studies.

    (Stephens & Balding, 2009)

    Calculating Probabilities of Association

    This deals with computing, for each SNP in a GWAS, the probability that it is truly associated with the phenotype, a.k.a. the 'Posterior Probability of Association (PPA)'.

    This posterior probability of association (PPA) can be thought of as the Bayesian analogue of a p-value obtained, for example, by using the Armitage trend test (ATT) or the Fisher exact test.

    The calculation of the PPA can be split into three different steps:
    - Choose a value for π, the prior probability of H_1
    - Compute a Bayes factor for each SNP
    - Calculate the posterior odds on H_1

    (Stephens & Balding, 2009)

  • 8/11/2019 Statistics for association study

    65/91

    Step I: Choose a value for π, the prior probability of H1
    • The value of π quantifies our prior assumption about each SNP being associated.
    • The value of π for H1 depends on prior knowledge, for example MAF, proximity to certain genes of interest, etc.
    • If π is assumed to be the same for all SNPs, it can be interpreted as a prior estimate of the overall proportion of SNPs that are truly associated with a phenotype.
    • Typically, only a minority of SNPs is expected to be truly associated with a given phenotype; the range 10^-4 to 10^-6 has been suggested for π (The Wellcome Trust Case Control Consortium, 2007).
    • The probability of H0 is taken to be 1 − π.

    Step II: Compute a Bayes factor for each SNP
    • A Bayes factor (BF) is the ratio between the probabilities of the data under H1 and under H0.
    • The BF is similar to a likelihood ratio, but it compares two different models rather than two parameter values.

    (Stephens & Balding, 2009)

    Step III: Calculate the posterior odds on H1

    • The BF and π can be used to compute the posterior odds on H1.

    • This can be used to calculate the PPA.

    • The PPA can be interpreted directly as a probability, irrespective of power, sample size or how many other SNPs were tested.

    • Intuitively, the PPA combines the evidence in the observed association data (the BF) with the prior probability (π) that a SNP is truly associated with the phenotype. Because π is typically so small, the BF has to be large (for example, of the order of 10^4 to 10^6) to provide convincing evidence for an association (that is, to give a PPA close to 1).

    • The requirement for a large BF is analogous to setting a stringent threshold for genome-wide significance in a frequentist approach.

    (Stephens and Balding, 2009)
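
The three steps can be sketched in a few lines of Python. This is only an illustration, assuming the usual relations posterior odds = BF × π / (1 − π) and PPA = posterior odds / (1 + posterior odds); the function name and the example values of the BF are hypothetical and are not taken from the slides.

```python
def posterior_prob_association(bayes_factor: float, prior: float) -> float:
    """Return the PPA of one SNP from its Bayes factor and the prior pi (Step I)."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)           # odds on H1 before seeing the data
    posterior_odds = bayes_factor * prior_odds   # Step III: combine BF with the prior
    return posterior_odds / (1.0 + posterior_odds)


# Illustrative Bayes factors (assumed values) with pi = 1e-4.
for bf in (1e2, 1e4, 1e6):
    ppa = posterior_prob_association(bf, prior=1e-4)
    print(f"BF = {bf:.0e}  ->  PPA = {ppa:.4f}")
```

With π = 10^-4, a BF near 10^4 gives a PPA of about one half, and a BF near 10^6 is needed before the PPA approaches 1, which is the point made in the bullets above.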

    Population stratification is probably the most often cited reason for non-replication of genetic association results, which have unfortunately been more the rule than the exception (Tabor et al., 2003; Weiss and Terwilliger, 2000).

    Leading scientific journals have noted the importance of population stratification as a cause of non-replicated association outcomes (Anon, 1999), and it is usual practice in grant applications and manuscript peer review to demand that stratification is explicitly addressed (Gauderman, 1999).

    Two circumstances must be met for population stratification to affect genetic association studies:

    i. differences in disease prevalence must exist between cases and controls; and
    ii. variations in allele frequency between groups must be present (Stephen et al., 2003).

    [Figure: Q-Q plot (McCarthy et al., 2008)]

    [Figure] The effect of stratification on association studies. (a) Stratification inflates χ² association statistics by a factor λ, which changes depending on the sample size. Scenario 1 corresponds to gross stratification; scenarios 2 and 3 correspond to the range of stratification estimated in the African American prostate cancer study; and scenario 4 corresponds to no stratification (Freedman et al., 2004).

    In Genomic Control (GC), the Armitage test statistic is computed at each of the null SNPs, and λ is calculated as the empirical median divided by its expectation under the χ² distribution with 1 degree of freedom (Devlin and Roeder, 1999).

    Then the Armitage test is applied at the candidate SNPs, and if λ > 1 the test statistics are divided by λ.

    The motivation for GC is that, as we expect few if any of the null SNPs to be associated with the phenotype, a value of λ > 1 is likely to be due to the effect of population stratification.
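
The GC adjustment described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the Armitage statistics have already been computed elsewhere, the function names are hypothetical, and the median of the 1-degree-of-freedom χ² distribution (about 0.455) is obtained as the square of the 75th percentile of a standard normal.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom (about 0.455),
# computed as the squared 75th percentile of the standard normal.
CHI2_1_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def inflation_factor(null_stats):
    """Lambda: empirical median of the null-SNP statistics over its expectation."""
    return median(null_stats) / CHI2_1_MEDIAN


def genomic_control(candidate_stats, null_stats):
    """Divide candidate statistics by lambda, but only when lambda > 1."""
    lam = inflation_factor(null_stats)
    if lam > 1.0:
        return [s / lam for s in candidate_stats], lam
    return list(candidate_stats), lam


# Hypothetical statistics, for illustration only.
null_stats = [0.2, 0.6, 0.9, 1.4, 3.1]
adjusted, lam = genomic_control([12.0, 7.5], null_stats)
print(f"lambda = {lam:.2f}, adjusted candidate statistics = {adjusted}")
```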

    Other approaches

    Null SNPs can mitigate the effects of population structure when included as covariates in regression analyses (Setakis et al., 2006).

    Like GC, this approach does not explicitly model the population structure and is computationally fast, but it is much more flexible than GC because epistatic and covariate effects can be included in the regression model (Balding, 2006).

    Empirically, the logistic regression approaches show greater power than GC, but their type 1 error rate must be assessed through simulation (Setakis et al., 2006).

    Multiple testing

    It refers to the problem that arises when many null hypotheses are tested; some significant results are likely even if all the null hypotheses are true (Balding, 2006).

    Especially in GWAS, each SNP that is analyzed constitutes one hypothesis test. In traditional hypothesis testing, the significance level is often set at 5%.

    However, as shown in the table below, as the number of SNPs tested increases, the number of SNPs falsely claimed to be significant increases, even when none of the SNPs is truly associated.

    No. of SNPs tested    False positives
    100                   5
    10,000                500
    500,000               25,000

    (Scott, 2009)
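
The counts in the table follow from multiplying the number of tests by the significance level. A short sketch, assuming every tested SNP is truly null; the function names are illustrative, and the Bonferroni per-test threshold (α divided by the number of tests) is included only for comparison.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of false positives when all n_tests null hypotheses are true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the overall type I error at most alpha."""
    return alpha / n_tests


for n in (100, 10_000, 500_000):
    print(f"{n:>7} SNPs: expected false positives = {expected_false_positives(n):,.0f}, "
          f"Bonferroni threshold = {bonferroni_threshold(n):.1e}")
```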

    Sequential Bonferroni rejection

    This was proposed by Holm (1979).

    The method is based on the Bonferroni test and requires the type I error to be as small as possible.

    Philosophically, for each of these n tests, the probability of committing a type I error is less than or equal to a small predetermined value.

    Notation which we will use further in this method:

    n null hypotheses: H1, H2, H3, …, Hn
    Alternative hypotheses: 1, 2, 3, …, n
    Test statistics: 1, 2, 3, …, n
    n critical regions: C1, C2, C3, …, Cn

    (Scott, 2009)

    Now let the corresponding p-values generated from the test statistics 1, 2, 3, …, n be P1, P2, P3, …, Pn, where k = 1, 2, …, n.

    When these p-values are ordered, P(1) ≤ P(2) ≤ P(3) ≤ … ≤ P(n), along with their corresponding hypotheses, H(1), H(2), H(3), …, H(n), the most significant ones would have the smallest p-values.

    The SRB method attempts to solve this multiple testing problem by adjusting the significance level, α, for each hypothesis tested, before comparing it with the p-values. Specifically, these p-values are compared to corresponding levels denoted by α/n, α/(n − 1), …, α/1; that is, P(k) is compared with α/(n − k + 1).

    The hypotheses are rejected until no other rejections are possible. Since the most important hypotheses would have the smallest p-values, they are compared with the smallest levels.

    (Scott, 2009)
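
A minimal sketch of the sequentially rejective procedure, assuming Holm's standard thresholds α/n, α/(n − 1), …, α/1 applied to the ordered p-values; the function name and the example p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    n = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    rejected = [False] * n
    for rank, idx in enumerate(order):        # rank 0 holds the smallest p-value
        threshold = alpha / (n - rank)        # alpha/n, alpha/(n-1), ..., alpha/1
        if p_values[idx] <= threshold:
            rejected[idx] = True
        else:
            break                             # stop: no further rejections are possible
    return rejected


# Illustrative p-values (assumed, not from the slides).
print(holm_reject([0.01, 0.04, 0.03, 0.005]))
```

Here the two smallest p-values pass their thresholds (0.0125 and about 0.0167), the third (0.03) fails against 0.025, and the procedure stops without examining the fourth.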

    Conclusions

    Family-based linkage mapping is the best approach to detect regions involved in recessive, highly penetrant Mendelian disease and has low resolution, while population-based genetic association study helps in mapping complex traits and has high resolution.

    Case-control design applies to disease traits and two-tailed sampling design applies to quantitative traits.

    The huge volume of genotyping data from the increased use of GWAS, and spurious association due to study design, have put forward computational challenges.

    HWE, MSP and MAF are the main components in QC, and imputation methods use HMM, ML and multiple imputation for imputing missing alleles.

    Logistic regression can adjust for the effect of covariates, giving an advantage over other categorical data tests.

    In the case of a quantitative trait, a linear regression model and its variants, such as fixed, random or mixed models, can be applied depending on the types of independent variables.

    Multiple-SNP models, apart from logistic regression and linear regression, use SNP-set analysis, which gives better results using a kernel function, and also use machine learning methods such as MDR and RFA.

    Bayesian models compute the BF and use various models to detect the PPA, which is the analogue of the p-value in the frequentist approach.

    Population stratification is one of the main reasons behind non-replication of GWAS in case-control design.

    Haplotype-based regression models alleviate the problem of multiple testing and also increase the power of the test.

    Genomic control is the widely used method to detect stratification.

    Multiple testing requires the application of Bonferroni correction and FDR to control type I error.

    Thank You

    "In God we trust, all others must bring data."

    W. Edwards Deming

    Supplementary Slides

    Data Quality Control

    For GWAS conducted by the Wellcome Trust Case Control Consortium (WTCCC), the criteria for retaining a SNP are HWE p-value ≥ 5.7×10^-7, MSP ≤ 5% if MAF ≥ 5%, MSP ≤ 1% if MAF < 5%, and MAF > 0.01 (WTCCC, 2007). Sladek et al. (2007) included SNPs when the HWE p-value > 0.001, MSP ≤ 5% and MAF > 0.01. Unoki et al. (2008) included SNPs when the HWE p-value ≥ 10^-6 and MSP ≤ 10%.

    Statistical method used: PCA
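
A SNP-level filter of this kind can be written as a simple predicate. The sketch below assumes the WTCCC thresholds as they are commonly reported (HWE p-value of at least 5.7×10^-7, MAF above 0.01, and a missing-proportion limit of 5% for common SNPs and 1% for SNPs with MAF below 5%); the function and argument names are hypothetical, and the per-SNP summaries are assumed to have been computed beforehand.

```python
def passes_wtccc_style_qc(hwe_p: float, msp: float, maf: float) -> bool:
    """Return True if a SNP would be retained under WTCCC-style criteria.

    hwe_p: p-value of the Hardy-Weinberg equilibrium test
    msp:   missing proportion of genotypes (0..1)
    maf:   minor allele frequency (0..0.5)
    """
    if hwe_p < 5.7e-7:          # strong departure from HWE
        return False
    if maf <= 0.01:             # allele too rare
        return False
    # Stricter missingness limit for SNPs with MAF below 5%.
    msp_limit = 0.05 if maf >= 0.05 else 0.01
    return msp <= msp_limit


# Illustrative SNP summaries (assumed values): (HWE p-value, MSP, MAF).
snps = {"snp_a": (0.20, 0.02, 0.30), "snp_b": (0.20, 0.02, 0.03), "snp_c": (1e-9, 0.00, 0.30)}
for name, (hwe_p, msp, maf) in snps.items():
    print(name, "retained" if passes_wtccc_style_qc(hwe_p, msp, maf) else "excluded")
```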

    Genotyping Error

    Variation in DNA sequence
    Low quantity and quality of DNA
    Biochemical artifact and low quality reagent
    Human factor

    SNP QC software

    SAQC (SNP array Quality Control) on RDBSCAN

    PCA

    PCA is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

    Haplotype analysis

    Analysis methods based on single SNPs have limited power to detect a true genetic effect that requires a specific allele at several SNPs. This may be detected using haplotype-based methods, analysing all SNPs concurrently.

    Genehunter allows haplotype analysis of up to four SNPs. One of the most flexible programs for TDT-type analysis is Transmit.

    Software for TDT and Haplotype analysis

    TDT/Sib TDT: http://genomics.med.upenn.edu/spielman/TDT.htm
    Genehunter: http://www.fhcrc.org/labs/kruglyak/Downloads/index.html
    PDT: http://www.chg.duke.edu/software/pdt.html
    ETDT: http://www.mds.qmw.ac.uk/statgen/dcurtis/software.html
    Transmit: http://ftp-gene.cimr.cam.ac.uk/clayton/software/
    EH: ftp://linkage.rockefeller.edu/software/eh/
    Ehplus: http://www.iop.kcl.ac.uk/IoP/Departments/PsychMed/GEpiBSt/softwar

    Example of Logistic Regression

    where x_i^T is the ith row of the design matrix. With a suitable choice of design matrix, the regression coefficients, β, are the logarithms of the odds ratio parameters.
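
The link between a regression coefficient and an odds ratio can be checked on a 2×2 table. The sketch below relies on a standard result that is not stated on the slide: for a logistic model with a single binary predictor, the fitted coefficient equals the logarithm of the sample odds ratio. The counts and the function name are hypothetical.

```python
import math


def log_odds_ratio(case_exposed, case_unexposed, control_exposed, control_unexposed):
    """Log odds ratio of a 2x2 table, i.e. the coefficient beta of a
    logistic regression with one binary predictor (carrier vs non-carrier)."""
    odds_cases = case_exposed / case_unexposed
    odds_controls = control_exposed / control_unexposed
    return math.log(odds_cases / odds_controls)


# Illustrative counts (assumed): 30 of 100 cases and 15 of 100 controls carry the allele.
beta = log_odds_ratio(30, 70, 15, 85)
print(f"beta = {beta:.3f}, odds ratio = exp(beta) = {math.exp(beta):.3f}")
```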