Study Design in Human Genetics and Genomics Gary Beecham, PhD John P. Hussman Institute for Human...

Study Design in Human Genetics and Genomics

Gary Beecham, PhD
John P. Hussman Institute for Human Genomics
University of Miami, Miller School of Medicine

Introduction

Primary Purpose of Genomics:
• Discover mechanisms underlying disease, to predict, to prevent, and to treat human disease

Introduction: Central Dogma

• DNA
– Transcription
• Messenger RNA (mRNA)
– Translation
• Proteins

Introduction: Central Dogma

Pre-Transcription
• Structural (chromatin)
• Methylation
• Small regulatory RNAs

Post-Transcription
• Splicing
• polyA, capping
• RNA degradation

Introduction

Purpose:
• Discover mechanisms underlying disease, to predict, to prevent, and to treat human disease

Hypothesis
• DNA, RNA, or other regulatory changes (e.g., miRNA, epigenetic factors) lead to altered proteins, altered abundance of proteins, or altered regulation of proteins, thereby influencing disease

Introduction

We are focusing on DNA:
• DNA is the building block
• Inherited
• Cheaper & easier to assess

• DNA is the primary focus for much research

Introduction: DNA variation

ACCCTTGAAAAGCTGATGAAGGCATTCGAG     Reference sequence
ACCCTTGAAAAGCGGATGAAGGCATTCGAG     SNP/SNV: single nucleotide polymorphism/variant
ACCCTTGAAAAGC-GATGAAGGCATTCGAG     Deletion (“indel”)
ACCCTTGAAAAGCTAGATGAAGGCATTCGAG    Insertion (“indel”)
ACCCTTGAAAAGCTGATGATGAAGGCATTCG    CNV/SV: copy number variant, structural variant

Specific “types” at specific loci are known as ALLELES; invariant loci are said to be monomorphic

Introduction: DNA variation

[Figure: paternal and maternal chromosomes carrying SNPs/markers and a disease variant/QTL (quantitative trait locus); labels mark allele 1 and allele 2, genotype 1 and genotype 2, and the haplotypes.]

Introduction: Refining the Hypothesis

Hypotheses
• DNA, RNA, or other regulatory changes (e.g., miRNA, epigenetic factors) lead to altered proteins, altered abundance of proteins, or altered regulation of proteins, thereby influencing disease

• Certain ALLELES on particular HAPLOTYPES/CHROMOSOMES lead to altered proteins, altered abundance of proteins, or altered regulation of proteins, thereby influencing disease

• Certain ALLELES/GENOTYPES are more common among those with disease than those without

Introduction

Primary Research Questions
• Are some genomic regions linked with disease phenotypes in families?
• Are some alleles associated with disease phenotypes?

Linkage and Association

Linkage vs Association

• Linkage analysis: co-segregation of a region/locus with disease through families
– qualitative and quantitative traits
– big or small families

• Association analysis: correlation of alleles with disease across populations/families
– qualitative and quantitative traits
– populations or small families

Linkage vs Association

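[Figure: pedigrees with marker genotypes. Four families with parents 12 x 34 and offspring 14 13 14; 24 23 24; 13 14 13; 14 13 14. Further genotype rows: 14 11 12 14 and 24 22 34 34.]

LINKAGE: Same marker, different allele (region)

LINKAGE + ASSOCIATION: Same marker, same allele (haplotype)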

Design Considerations

• Disease Type: Simple vs Complex
• Mechanism Hypothesis: Common vs Rare Variation
• Sample Collection: Families vs Population
• Scope: Genome vs Candidate
• Specificity: High vs Low

Design Considerations: Disease Type

SIMPLE DISEASE:
• Typically very rare, very severe, with “Mendelian” inheritance patterns; often “on” vs “off” diseases, syndromes, often earlier onset

• Typically ONE or very few genetic causes of disease (simple etiology)

• Examples
– Autosomal Recessive: cystic fibrosis, sickle cell disease, Tay-Sachs, phenylketonuria
– Autosomal Dominant: Huntington disease, achondroplasia, familial hypercholesterolaemia
– Sex-linked: Fragile X, haemophilia A, Duchenne muscular dystrophy

Design Considerations: Disease Type

COMPLEX DISEASE:
• Typically more common, less “severe”, with complex (or no clear) inheritance pattern. Less “on” vs “off”; complex and/or progressive disease course; often later or varied onset

• Generally POLYGENIC with important environmental and interaction effects (i.e., complex etiology)

• Examples
– Late-onset Alzheimer disease
– Parkinson disease
– Cardiovascular disease
– Dyslipidemia
– Multiple sclerosis

Design Considerations: Mechanism

Common Disease – Common Variant (CDCV)

• Most influences on common complex diseases are due to common polymorphisms (> 1% allele frequency)

• Basis for use of association methods to find complex disease trait loci; linkage will not work

Common Disease-Rare Variant (CDRV)

• Common diseases are caused by mixture of common and rare alleles

• Some gene associations might reflect aggregates of rare alleles

• Linkage and association both work and don’t work, in different ways

Bodmer & Bonilla, Nature Genetics 2008;40:695-701

Rare Disease – Rare Variant (RDRV)

• Most influences on rare, simple diseases are due to rare variants of strong effect

• Linkage will work; association may or may not depending on disease and allele frequencies and sample size

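Design Considerations: Sample Collections

[Figure: sample collection types and the analyses they support.]
• SINGLE AFFECTED: Case-Control, Trios, Case Only (association)
• RELATIVE PAIRS: Sibpair, Twins, Avuncular (linkage and/or association)
• EXTENDED FAMILIES (linkage and/or association)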

Design Considerations: Scope

Genome-Wide
• Look everywhere
– “unbiased”
• Millions of data points per person
• Millions of tests
– Need larger samples or stronger effect sizes to detect
• Generally more expensive (but more data per $)

Candidate Region
• Focus on one place
– “targeted”
– Biological or locational candidate
• Fewer tests
– Smaller samples, smaller effect sizes
– Larger chance of missing an effect
• Generally less expensive (less data per $)
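The cost of "millions of tests" can be made concrete with a multiple-testing correction. The sketch below uses a Bonferroni correction, which is one common choice and is not named on the slide; the test counts (one million for a genome-wide scan, 20 for a candidate region) are illustrative assumptions.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Hypothetical test counts, chosen only for illustration.
genome_wide = bonferroni_threshold(0.05, 1_000_000)
candidate = bonferroni_threshold(0.05, 20)

print(f"genome-wide per-test threshold: {genome_wide:.1e}")
print(f"candidate-region per-test threshold: {candidate:.4f}")
```

The much smaller per-test threshold in the genome-wide case is why such scans need larger samples or stronger effects to reach significance.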

Design Considerations: Specificity

• Linkage methods: low specificity; that is, they typically identify very broad regions of the genome

• Association methods: high specificity; they typically identify very narrow regions of the genome, relative to linkage.


Beecham et al., 2015, Neurology

Linkage Disequilibrium

• Linkage Equilibrium: alleles at different loci are inherited independently

• Linkage Disequilibrium: alleles at different loci are not independent and inheritance at one locus is correlated with the other

Linkage Disequilibrium

• The “A” allele (locus 1) tends to be inherited with the “B” allele (locus 2)

• The event “gamete carries A” is not independent of the event “gamete carries B”

• Locus 1 and 2 are not independent; they are in linkage disequilibrium

• The “A” allele is not preferentially inherited with “Z” or “z”

• The event “gamete carries A” is independent of the event “gamete carries Z”

• Locus 1 and 3 are independent; they are in linkage equilibrium
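The independence statements above can be checked numerically from haplotype counts. The sketch below computes the disequilibrium coefficient D = P(AB) - P(A)P(B), which is zero exactly when "gamete carries A" and "gamete carries B" are independent, together with the squared correlation r². The function name, the r² summary, and the counts are assumptions for illustration and do not come from the slides.

```python
def ld_stats(n_AB: int, n_Ab: int, n_aB: int, n_ab: int) -> tuple[float, float]:
    """Return (D, r^2) for two biallelic loci from counts of the four haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n          # frequency of allele A at the first locus
    p_B = (n_AB + n_aB) / n          # frequency of allele B at the second locus
    d = p_AB - p_A * p_B             # zero under linkage equilibrium
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Hypothetical counts: only AB and ab haplotypes observed (disequilibrium).
d_ld, r2_ld = ld_stats(4, 0, 0, 4)
# Hypothetical counts: all four haplotypes equally common (equilibrium).
d_le, r2_le = ld_stats(2, 2, 2, 2)

print(d_ld, r2_ld)
print(d_le, r2_le)
```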

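[Figure: haplotypes across loci 1, 2, and 3. At loci 1 and 2 only the AB and ab haplotypes occur; at loci 1 and 3 the haplotypes AZ, Az, aZ, and az all occur.]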

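Linkage Disequilibrium: New Mutation

[Figure: haplotypes at loci 1, 2, and 3 after a new mutation D arises at locus 1 on a wild-type (w) background. In the first panel D occurs on a single chromosome carrying B and Z. In the second panel D remains on B haplotypes at locus 2 but occurs with both Z and z at locus 3.]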

Linkage Disequilibrium

• All new mutations are in complete LD with everything on the initial chromosome

• So, why aren’t entire chromosomes linked to themselves?

• How does linkage disequilibrium decay over time?

• RECOMBINATION

Decay of LD

[Figure: two loci (Locus 1, Locus 2).
Ancestral haplotypes: A G and C G.
MUTATION: after the mutation event the haplotypes are A G, C G, and C T. Complete LD between C and T.
RECOMBINATION: the haplotypes are A G, C G, C T, and A T. A T is a recombinant haplotype; incomplete LD.]

Decay of LD

• In a large, random-mating population, in the absence of mutation, migration and selection:

• $D_t = D_0 (1 - \theta)^t$

• $D_t$ = disequilibrium coefficient after $t$ generations

• $D_0$ = disequilibrium coefficient in initial generation

• $\theta$ = recombination fraction

[Figure: LD (vertical axis, 0.0 to 1.0) against generation t (horizontal axis, 0 to 10), with one curve each for θ = 0, θ = 0.01, θ = 0.1, and θ = 0.5.]
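The decay formula can be evaluated directly. The sketch below tabulates the fraction of the initial disequilibrium remaining for the four recombination fractions shown in the plot, starting from D_0 = 1; the function name and the chosen generations are illustrative assumptions.

```python
def ld_after(d0: float, theta: float, t: int) -> float:
    """Disequilibrium after t generations: D_t = D_0 * (1 - theta)**t."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    return d0 * (1.0 - theta) ** t


for theta in (0.0, 0.01, 0.1, 0.5):
    row = [round(ld_after(1.0, theta, t), 3) for t in (0, 1, 5, 10)]
    print(f"theta={theta}: {row}")
```

Tightly linked loci (small θ) keep most of their disequilibrium over 10 generations, while unlinked loci (θ = 0.5) lose half of it every generation.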

What does this have to do with association and linkage?

• Linkage and Association BOTH rely on this idea of LD decay

• Linkage relies on the decay of chromosomal haplotypes within families due to meioses between relatives

• Association relies on the decay of haplotypes across individuals due to the cumulative meioses throughout the entire population

• MARKER loci are tested; positive tests indicate a disease locus is “near” the marker locus.


Beecham et al., 2015, Neurology

[Figure: sample collection types (SINGLE AFFECTED, RELATIVE PAIRS, EXTENDED FAMILIES) placed on a spectrum from LINKAGE to ASSOCIATION; single affecteds support association, while relative pairs and extended families support linkage and/or association.]

Linkage Analyses

Types of Linkage Analysis

• Parametric – LOD score – genetic model is specified
– Inheritance pattern (e.g., dominant, recessive, additive, etc.)
– Penetrance (e.g., strength of genetic effect)
– Disease allele frequency

• Non-parametric – affecteds only; e.g., affected sibling pair, affected relative pair – genetic model is not specified
– Analyzes allele sharing in affected relatives
– Unaffected relatives usually only used to establish phase of alleles

Parametric Linkage: LOD scores

LOD score (Z):
• Z > 3: significant evidence FOR linkage
• Z < -2: significant evidence FOR non-linkage
• -2 < Z < 3: insufficient evidence for either linkage or non-linkage
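For reference, the LOD score is the base-10 logarithm of the likelihood of the data at recombination fraction θ divided by the likelihood under no linkage (θ = 0.5). The sketch below covers only the simplest case, fully informative phase-known meioses where recombinants can be counted directly; that restriction and the function name are assumptions made for illustration, not the general pedigree likelihood.

```python
import math


def lod_phase_known(theta: float, n_meioses: int, n_recombinants: int) -> float:
    """Two-point LOD score for phase-known meioses:
    log10[ theta^k * (1 - theta)^(n - k) / 0.5^n ] with k recombinants out of n."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    n_non = n_meioses - n_recombinants
    if theta == 0.0:
        # Any observed recombinant is impossible at theta = 0.
        if n_recombinants > 0:
            return float("-inf")
        log_lik = 0.0
    else:
        log_lik = n_recombinants * math.log10(theta) + n_non * math.log10(1.0 - theta)
    return log_lik - n_meioses * math.log10(0.5)


# Ten informative meioses with no recombinants just clear the Z > 3 threshold.
print(round(lod_phase_known(0.0, 10, 0), 2))
# Scanning theta shows where the evidence for linkage peaks.
for theta in (0.05, 0.1, 0.2, 0.3, 0.5):
    print(theta, round(lod_phase_known(theta, 10, 1), 3))
```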

Parametric Linkage

• That’s the easy part!
– The Hard Part: determining recombinants, non-recombinants, determining phase, dealing with unknown phase, multipoint vs two-point, determining models, etc.

• For more detail: http://hihg.med.miami.edu/educational-programs/online-genetics-courses/

• Software: MERLIN (Abecasis)




Nonparametric Linkage Analysis

• If the same disease gene causes disease in both members of a relative pair, then the relatives should have inherited the same alleles of genetic markers near that gene more often than would be expected by chance alone (Penrose 1935, Suarez et al. 1978).

• This approach to linkage analysis makes no assumption about the inheritance pattern.

• Ignores unaffected family members

• ASP (Affected Sib-Pair)
• ARP (Affected Relative-Pair)

IBD versus IBS

• Identical by Descent (IBD) sharing
– relative pairs have inherited the same allele from a common ancestor
– we can trace that allele from a common ancestor down the family tree to the descendants

• Identical by State (IBS) sharing
– pairs share the same allele TYPE regardless of ancestral origin
– unrelated people can share alleles IBS

IBD and IBS

We infer IBD status using IBS and relationship information.

– The parents share no alleles IBD, one allele IBS.
– The daughters share two alleles IBD (and IBS).
– If the parents were not genotyped, the daughters’ IBD state could be 0, 1, or 2.

Inferring IBD: Parental Genotypes Known

Inferring IBD: Parental Genotypes Unknown

• Calculate the probability of all possible parental genotype mating types.
– 11 x 23 → Pr(IBD = 0) = ½, Pr(IBD = 1) = ½
– 12 x 13 → Pr(IBD = 0) = 1
– 1_ x 23 → Pr(IBD = 1) = 1, where _ denotes any allele other than allele 1

• Add up all probabilities for IBD = 0 and IBD = 1

• Different allele frequencies will result in different probabilities

• Adding additional siblings can reduce uncertainty
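The per-mating-type probabilities listed above can be reproduced by brute-force enumeration. The sketch below handles one candidate mating type at a time: it enumerates which of the two chromosomes each parent passes to each sib, keeps the transmissions consistent with the sib genotypes, and tallies IBD sharing. The sib genotypes 12 and 13 are an assumption (they are consistent with the three mating types listed but are not stated on the slide), and the weighting of mating types by allele frequencies is not shown.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sib_ibd_distribution(father, mother, sib1, sib2):
    """Pr(IBD = 0, 1, 2) for a sib pair, given one parental mating type.

    Each genotype is a pair of allele labels, e.g. (1, 2). The two copies in a
    homozygous parent are tracked separately, so sharing is by descent, not by state.
    """
    counts = Counter()
    # f1/m1: index of the paternal/maternal chromosome passed to sib 1; f2/m2 for sib 2.
    for f1, m1, f2, m2 in product(range(2), repeat=4):
        g1 = sorted((father[f1], mother[m1]))
        g2 = sorted((father[f2], mother[m2]))
        if g1 == sorted(sib1) and g2 == sorted(sib2):
            counts[int(f1 == f2) + int(m1 == m2)] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sib genotypes are incompatible with this mating type")
    return {k: Fraction(counts[k], total) for k in (0, 1, 2)}


# Assumed sib genotypes 12 and 13, evaluated under two of the listed mating types.
dist_11x23 = sib_ibd_distribution((1, 1), (2, 3), (1, 2), (1, 3))
dist_12x13 = sib_ibd_distribution((1, 2), (1, 3), (1, 2), (1, 3))
print(dist_11x23)
print(dist_12x13)
```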

Affected Sibling Pair (ASP) linkage tests

• Determine IBD sharing for each sibling pair.

• Tests:
– Chi-squared goodness of fit test: examine deviations of observed from expected IBD distribution
– Means test: compare the mean observed number of alleles shared IBD to the expected number (i.e., 1)
– Two allele test: compare the observed number of pairs sharing 2 alleles IBD with that expected

• Generally the means test is the most powerful
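The three ASP tests can be sketched from counts of sib pairs sharing 0, 1, or 2 alleles IBD. Under no linkage the expected proportions are 1/4, 1/2, 1/4, so the expected number shared is 1 with variance 1/2 per pair. The normal-approximation z statistics and the example counts below are assumptions for illustration, not a specific published implementation.

```python
import math


def asp_tests(n0: int, n1: int, n2: int) -> tuple[float, float, float]:
    """Return (goodness-of-fit chi-squared, means-test z, two-allele-test z)
    from counts of affected sib pairs sharing 0, 1, or 2 alleles IBD."""
    n = n0 + n1 + n2
    expected = (n / 4, n / 2, n / 4)
    # Goodness of fit against the 1/4, 1/2, 1/4 null distribution (2 df).
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n0, n1, n2), expected))
    # Means test: mean number of alleles shared versus 1; null variance is 0.5 / n.
    mean_shared = (n1 + 2 * n2) / n
    z_mean = (mean_shared - 1.0) / math.sqrt(0.5 / n)
    # Two-allele test: proportion of pairs sharing 2 alleles versus 1/4.
    z_two = (n2 / n - 0.25) / math.sqrt(0.25 * 0.75 / n)
    return chi2, z_mean, z_two


# Hypothetical data: 100 affected sib pairs with excess sharing.
chi2_asp, z_mean_asp, z_two_asp = asp_tests(15, 50, 35)
print(round(chi2_asp, 3), round(z_mean_asp, 3), round(z_two_asp, 3))
```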

ASP tests

ADVANTAGES
• ASPs may be easier to collect than large extended pedigrees, especially for late onset disorders
• Has reasonable power in the presence of genetic heterogeneity, provided that at least one gene has a detectable effect
• Uses only affected individuals, thus non-penetrant gene carriers do not reduce power

DISADVANTAGES
• Most tests require that IBD status be known. Pairs in which IBD status is unknown cannot be used.
– Requires parents or enough siblings so that parental genotypes can be inferred unambiguously
• May require large number of affected sibpairs to achieve reasonable power.

Affected Relative Pair Analysis

• Like affected sibpair analysis, it does not require assumptions about:
– Mode of inheritance
– Disease allele frequency
– Penetrance

• Unlike affected sibpair analysis, it uses all affected relatives regardless of relationship
– Not restricted to affected sibpairs
– Extracts more information from extended pedigrees

• Prefer to use all data possible
• Common approach: NPL statistic (Kruglyak et al., 1996) and extensions
• MERLIN (Abecasis et al.) is a frequently used implementation

JAMA, 1997


Association Analyses

Direct and Indirect allelic association

[Figure: Direct association links the causal allele to the phenotype. Indirect association links a genotyped MARKER allele to the phenotype through its correlation (LD) with the causal allele on the same haplotype.]

Ref: Balding. Nature Reviews. Vol 7. Oct 2006

• Direct approach requires some prior knowledge of the variant

• Indirect approach is agnostic with relation to functional relevance

Why test for association and not linkage?

• Resolution of mapping
– Meioses within families are limited. Linkage analysis generally identifies large regions
– Association analysis takes advantage of historical meioses. Better suited for fine-mapping

• Cost
– High-throughput genotyping makes genome-wide association studies (GWAS) competitive with linkage screens

• Association analysis is a well suited approach for investigating the common disease common variant (CDCV) hypothesis, through indirect association

• Direct association is becoming increasingly more feasible with cheaper next-generation sequencing technologies (CDRV), but sample size is still an issue

Linkage Disequilibrium

[Figure: two loci (Locus 1, Locus 2).
Ancestral haplotypes: A G and C G.
MUTATION: after the mutation event the haplotypes are A G, C G, and C T. Complete LD between C and T.
RECOMBINATION: the haplotypes are A G, C G, C T, and A T. A T is a recombinant haplotype; incomplete LD.]

Linkage disequilibrium in populations

[Figure: an ancestral chromosome and a sampling of different chromosomes at the present day. Molecular Psychiatry (2005) 10, 328]

Tests at marker (tag) SNPs on the ancestral chromosome become proxy tests for disease loci on the ancestral chromosome (indirect association)

Case-control tests: single SNP

• Samples of unrelated individuals
– Cases = Affected
– Controls = Unaffected

• Observe genotypes at a locus (loci)
– M1M1 (11: reference homozygote)
– M1M2 (12: heterozygote)
– M2M2 (22: non-reference homozygote)

• Two general categories of test statistics
– Chi-squared Tests
– Logistic Regression

Case-control tests: Chi-squared tests

• Allele-based
• $H_0: p_{1|case} = p_{1|control}$
• Tends to have good power under a variety of genetic models
• Requires Hardy-Weinberg Equilibrium

Allele counts:

            Cases     Controls    Total
  M1        n_1A      n_1U        n_1
  M2        n_2A      n_2U        n_2
  Total     n_A       n_U         N

$T_{cc} = \sum_{i=1}^{2} \frac{(n_{iA} - n_{iU})^2}{n_{iA} + n_{iU}}$

$T_{cc} \sim \chi^2$ with 1 df

• When the numbers of cases and controls are not equal (i.e., $n_A \neq n_U$)

Observed Counts

            Cases     Controls    Total
  M1        n_1A      n_1U        n_1
  M2        n_2A      n_2U        n_2
  Total     n_A       n_U         N

Expected Counts

            Cases         Controls      Total
  M1        n_A n_1 / N   n_U n_1 / N   n_1
  M2        n_A n_2 / N   n_U n_2 / N   n_2
  Total     n_A           n_U           N

$T_{cc} = \sum_{cells} \frac{(Obs - Exp)^2}{Exp}$

$T_{cc} \sim \chi^2$ with 1 df

• Example: Testing for association of APOE-4 and late-onset Alzheimer’s disease

Observed Counts

            Cases     Controls    Total
  e4        240       60          300
  Not e4    360       340         700
  Total     600       400

Expected Counts

            Cases     Controls    Total
  e4        180       120         300
  Not e4    420       280         700
  Total     600       400

$T_{cc} = \frac{(240 - 180)^2}{180} + \frac{(60 - 120)^2}{120} + \frac{(360 - 420)^2}{420} + \frac{(340 - 280)^2}{280} = 71.4$

p-value < $10^{-16}$
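The observed-versus-expected calculation for the APOE-4 table can be checked with a few lines of standard-library Python. This is a minimal sketch: the function name is hypothetical, and the p-value uses the identity that the upper tail of a chi-squared distribution with 1 df equals erfc(sqrt(x / 2)).

```python
import math


def chi2_2x2(case_1: int, case_2: int, ctrl_1: int, ctrl_2: int) -> tuple[float, float]:
    """Pearson chi-squared statistic and 1-df p-value for a 2x2 table
    (rows: allele or exposure category; columns: cases, controls)."""
    observed = [[case_1, ctrl_1], [case_2, ctrl_2]]
    row_totals = [sum(row) for row in observed]
    col_totals = [case_1 + case_2, ctrl_1 + ctrl_2]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail, chi-squared with 1 df
    return stat, p_value


# APOE-4 example: e4 (240 cases, 60 controls), not e4 (360 cases, 340 controls).
apoe_stat, apoe_p = chi2_2x2(240, 360, 60, 340)
print(round(apoe_stat, 2), f"{apoe_p:.1e}")
```

The statistic evaluates to 500/7, roughly 71.4, and the p-value falls below 10^-16.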

Case-Control Test Cont.

• Genotype-based
– $H_0: p_{11|case} = p_{11|control}$ and $p_{12|case} = p_{12|control}$ and $p_{22|case} = p_{22|control}$
– 2 df test
– Can also test a specific genotype
– Does not require Hardy-Weinberg Equilibrium

• Armitage’s Trend Test (Cochran-Armitage)
– Restricted alternative hypothesis assumes additive effect on 12 genotype (between 11 and 22).

Case-control tests: Logistic Regression

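• Models log odds of being affected given risk factors (e.g., genotypes)
– Full genotype model

$\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}$

– $\beta_1 X_{1i}$: term for 11 vs 22
– $\beta_2 X_{2i}$: term for 12 vs 22
– $H_0: \beta_1 = \beta_2 = 0$

$p_i$ = probability that the $i$th individual is affected
$X_{1i}$ = 1 if genotype 11 and 0 otherwise; $X_{2i}$ = 1 if genotype 12 and 0 otherwise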

Case-control tests: Logistic Regression

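• Linear model

$\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_{1i}$

– $\beta_1 X_{1i}$: term for linear genotype effect
– $X_{1i}$ = count of allele 1
– $H_0: \beta_1 = 0$
– $H_A: \beta_1 \neq 0$

• Can incorporate covariates, environmental factors and test interactions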

Estimation of Odds Ratios

• Can estimate effects, not just test them!

• In the general genotype model
– $e^{\hat{\beta}_1}$: estimates the ratio of the odds of being affected given genotype 11 relative to the odds of being affected given genotype 22
– $e^{\hat{\beta}_2}$: estimates the ratio of the odds of being affected given genotype 12 relative to the odds of being affected given genotype 22

• Confidence intervals are computed in standard statistical packages
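When the full genotype model has no other covariates, the fitted odds ratio for genotype 11 versus 22 coincides with the sample odds ratio from the corresponding 2x2 table, so it can be sketched without fitting a regression. The example below computes that odds ratio with a Wald 95% confidence interval on the log scale. The genotype counts and the function name are hypothetical; with covariates in the model, a statistical package would be needed instead.

```python
import math


def odds_ratio_ci(case_g: int, ctrl_g: int, case_ref: int, ctrl_ref: int, z: float = 1.96):
    """Odds ratio for a genotype versus the reference genotype, with a Wald
    confidence interval built on the log odds ratio."""
    odds_ratio = (case_g * ctrl_ref) / (ctrl_g * case_ref)
    se_log_or = math.sqrt(1 / case_g + 1 / ctrl_g + 1 / case_ref + 1 / ctrl_ref)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: genotype 11 (100 cases, 50 controls) vs genotype 22 (200 cases, 300 controls).
or_11, lo_11, hi_11 = odds_ratio_ci(100, 50, 200, 300)
print(round(or_11, 2), round(lo_11, 2), round(hi_11, 2))
```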

Quantitative Trait Tests

• Linear model

$y_i = \beta_0 + \beta_1 X_{1i}$

– $\beta_1 X_{1i}$: term for linear genotype effect
– $X_{1i}$ = count of allele 1
– $H_0: \beta_1 = 0$
– $H_A: \beta_1 \neq 0$
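A minimal sketch of fitting this linear model by ordinary least squares, using only the standard library, is given below. The trait values and allele counts are made up for illustration, and the function name is hypothetical; testing H0 would additionally require a standard error for the slope, which statistical packages report.

```python
def fit_additive_model(allele_counts: list[int], trait: list[float]) -> tuple[float, float]:
    """Least-squares estimates (beta0, beta1) for trait = beta0 + beta1 * allele count."""
    n = len(allele_counts)
    mean_x = sum(allele_counts) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in allele_counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(allele_counts, trait))
    beta1 = sxy / sxx                  # change in trait per copy of allele 1
    beta0 = mean_y - beta1 * mean_x    # expected trait with zero copies
    return beta0, beta1


# Hypothetical data: two individuals per genotype (0, 1, or 2 copies of allele 1).
x = [0, 0, 1, 1, 2, 2]
y = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
beta0_hat, beta1_hat = fit_additive_model(x, y)
print(round(beta0_hat, 3), round(beta1_hat, 3))
```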

Family Designs

• Regression-based Tests: covariates in the model account for correlations within families (e.g., between family members)
– Generalized Estimating Equations (GEE)
– Linear/Logistic Mixed Models

• Transmission-based tests (specific to genomics)

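Within-Family Tests of Association

• Parental Controls
– “transmitted”
– “nontransmitted”

• Cases and “controls” are well matched (e.g., population substructure is not an issue)

[Figure: parent-child trio with marker genotypes (M1M2, M1M1), labeling the transmitted (“Trans”) and nontransmitted (“Nontrans”) parental alleles.]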

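Transmission Disequilibrium Test (TDT)

• As a test for linkage, can use singleton families, affected sib families and extended families

• $H_0$: No difference in frequency between trans/nontrans

• Comparable to $\chi^2$ with 1 df (for large samples)

[Figure: trio pedigree with genotypes M1M2 and M1M1, and a 2 x 2 table of transmitted (“trans”) by nontransmitted (“nontrans”) alleles M1 and M2, whose off-diagonal cells are $n_{12}$ and $n_{21}$.]

$TDT = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}}$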
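The TDT statistic is simple enough to compute directly. In the sketch below, n12 is taken to be the number of heterozygous parents who transmit M1 (and not M2) to an affected child, and n21 the number who transmit M2 (and not M1); that mapping of the subscripts is an assumption, although the statistic is symmetric in the two counts. The counts are hypothetical, and the p-value uses the 1-df chi-squared tail identity erfc(sqrt(x / 2)).

```python
import math


def tdt(n12: int, n21: int) -> tuple[float, float]:
    """Transmission disequilibrium test from the two off-diagonal counts of the
    transmitted-by-nontransmitted table; returns (statistic, large-sample p-value)."""
    if n12 + n21 == 0:
        raise ValueError("no informative (heterozygous) parents")
    stat = (n12 - n21) ** 2 / (n12 + n21)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail, chi-squared with 1 df
    return stat, p_value


# Hypothetical counts: 60 heterozygous parents transmit M1, 40 transmit M2.
tdt_stat, tdt_p = tdt(60, 40)
print(tdt_stat, round(tdt_p, 4))
```

Only heterozygous parents contribute, because a homozygous parent transmits the same allele either way and carries no information about preferential transmission.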

Association in the Presence of Linkage

• TDT is a test of both linkage AND association; has problems with missing parental data

• Test for Association in the Presence of Linkage (APL) (Martin et al 2003; Chung et al 2006)
– Based on difference between counts of transmitted and nontransmitted alleles
– With missing parents APL estimates expected count in nontransmitted alleles using EM algorithm

Inferring Missing Parental Genotypes


Results

[Figure: association results, annotated “Interesting” and “HLA”.]

IMSGC, N Engl J Med. 2007 Aug 30;357(9):851-62

Association Software

• Genetics Specific: PLINK (Purcell et al)

• Standard Statistical packages
– R statistical programming language
– SAS

• The programming challenge is handling/looping through thousands of markers

Other Designs

• Longitudinal Studies (prospective cohorts)
– Very very expensive (-)
– Fewer biases in estimates (+)

• Founder or Isolated Populations
– Often less genetic heterogeneity (+)
– Less likely to be able to apply results to other populations (-)

• Imputation Analyses
– Infer genotypes at unobserved loci (+)
– Infer genotypes at unobserved loci (-)

• Meta-analyses
– Combine results across datasets (++++)
– Virtually all major genetics consortia use these methods

Other Analyses

Secondary Research Questions
• Do alleles interact with other alleles to influence phenotypes? (gene x gene interaction)
• Do alleles interact with environmental factors? (gene x environment interaction)
• Are sets of alleles associated with disease? (pathway analyses, gene-based tests)
• Do certain alleles predict disease? (risk prediction, risk score analyses)
• Do certain alleles influence mRNA abundance? (expression-QTL analyses)

Thanks! Any questions?
