Study Design in Human Genetics and Genomics
Gary Beecham, PhD
John P. Hussman Institute for Human Genomics
University of Miami, Miller School of Medicine
Introduction
Primary Purpose of Genomics:
• Discover mechanisms underlying disease, to predict, to prevent, and to treat human disease
Introduction: Central Dogma
• DNA
  – Transcription
• Messenger RNA (mRNA)
  – Translation
• Proteins
Introduction: Central Dogma
Pre-Transcription
• Structural (chromatin)
• Methylation
• Small regulatory RNAs
Post-Transcription
• Splicing
• polyA, capping
• RNA degradation
Introduction
Purpose:
• Discover mechanisms underlying disease, to predict, to prevent, and to treat human disease
Hypothesis
• DNA, RNA, or other regulatory changes (e.g., miRNA, epigenetic factors) lead to altered proteins, altered abundance of proteins, or altered regulation of proteins, thereby influencing disease
Introduction
We are focusing on DNA:
• DNA is the building block
• Inherited
• Cheaper & easier to assess
• DNA is the primary focus for much research
Introduction: DNA variation
ACCCTTGAAAAGCTGATGAAGGCATTCGAG     (reference sequence)
ACCCTTGAAAAGCGGATGAAGGCATTCGAG     SNP/SNV: single nucleotide polymorphism/variant
ACCCTTGAAAAGC-GATGAAGGCATTCGAG     Deletion (“indel”)
ACCCTTGAAAAGCTAGATGAAGGCATTCGAG    Insertion (“indel”)
ACCCTTGAAAAGCTGATGATGAAGGCATTCG    CNV/SV: copy number variant, structural variant
Specific “types” at specific loci are known as ALLELES; invariant loci are said to be monomorphic
Introduction: DNA variation
[Figure: paternal and maternal chromosomes illustrating allele 1 and allele 2, genotype 1 and genotype 2, and haplotypes across SNPs/markers and a disease variant/QTL (quantitative trait locus)]
Introduction: Refining the Hypothesis
Hypotheses
• DNA, RNA, or other regulatory changes (e.g., miRNA, epigenetic factors) lead to altered proteins, altered abundance of proteins, or altered regulation of proteins, thereby influencing disease
• Certain ALLELES on particular HAPLOTYPES/CHROMOSOMES lead to altered proteins, altered abundance of proteins, or altered regulation of proteins, thereby influencing disease
• Certain ALLELES/GENOTYPES are more common among those with disease than those without
Introduction
Primary Research Questions
• Are some genomic regions linked with disease phenotypes in families?
• Are some alleles associated with disease phenotypes?
Linkage and Association
Linkage vs Association
• Linkage analysis: co-segregation of a region/locus with disease through families
  – qualitative and quantitative traits
  – big or small families
• Association analysis: correlation of alleles with disease across populations/families
  – qualitative and quantitative traits
  – populations or small families
Linkage vs Association
[Figure: example pedigrees with marker genotypes]
LINKAGE: Same marker, different allele (region)
LINKAGE + ASSOCIATION: Same marker, same allele (haplotype)
Design Considerations
• Disease Type: Simple vs Complex
• Mechanism Hypothesis: Common vs Rare Variation
• Sample Collection: Families vs Population
• Scope: Genome vs Candidate
• Specificity: High vs Low
Design Considerations: Disease Type
SIMPLE DISEASE:
• Typically very rare, very severe, with “Mendelian” inheritance patterns; often “on” vs “off” diseases, syndromes, often earlier onset
• Typically ONE or very few genetic causes of disease (simple etiology)
• Examples
  – Autosomal Recessive: cystic fibrosis, sickle cell disease, Tay-Sachs, phenylketonuria
  – Autosomal Dominant: Huntington disease, achondroplasia, familial hypercholesterolaemia
  – Sex-linked: Fragile X, haemophilia A, Duchenne muscular dystrophy
Design Considerations: Disease Type
COMPLEX DISEASE:
• Typically more common, less “severe”, with complex (or no clear) inheritance pattern. Less “on” vs “off”; complex and/or progressive disease course; often later or varied onset
• Generally POLYGENIC with important environmental and interaction effects (i.e., complex etiology)
• Examples
  – Late-onset Alzheimer disease
  – Parkinson disease
  – Cardiovascular disease
  – Dyslipidemia
  – Multiple sclerosis
Design Considerations: Mechanism
Common Disease – Common Variant (CDCV)
• Most influences on common complex diseases are due to common polymorphisms (> 1% allele frequency)
• Basis for use of association methods to find complex disease trait loci; linkage will not work
Common Disease-Rare Variant (CDRV)
• Common diseases are caused by mixture of common and rare alleles
• Some gene associations might reflect aggregates of rare alleles
• Linkage and association both work and don’t work, in different ways
Bodmer & Bonilla, Nature Genetics 2008;40:695-701
Rare Disease – Rare Variant (RDRV)
• Most influences on rare, simple diseases are due to rare variants of strong effect
• Linkage will work; association may or may not depending on disease and allele frequencies and sample size
Design Considerations: Sample Collections
[Figure: sampling designs, with side labels “association” and “linkage and/or association”]
SINGLE AFFECTED: Case-Control, Trios, Case Only
RELATIVE PAIRS: Sibpair, Twins, Avuncular
EXTENDED FAMILIES
Design Considerations: Scope
Genome-Wide
• Look everywhere
  – “unbiased”
• Millions of data points per person
• Millions of tests
  – Need larger samples or stronger effect sizes to detect
• Generally more expensive (but more data per $)
Candidate Region
• Focus on one place
  – “targeted”
  – Biological or locational candidate
• Fewer tests
  – Smaller samples, smaller effect sizes
  – Larger chance of missing an effect
• Generally less expensive (less data per $)
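The cost of running millions of tests can be made concrete with a multiple-testing correction. The sketch below uses a Bonferroni correction at a family-wise error rate of 0.05; both choices are assumptions for illustration (the slides do not name a correction method), and the 50-marker candidate region is hypothetical.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

# Genome-wide scan: about one million tests (assumed round number)
genome_wide = bonferroni_threshold(1_000_000)
# Candidate region: hypothetical panel of 50 markers
candidate = bonferroni_threshold(50)

print(f"genome-wide threshold: {genome_wide:.1e}")
print(f"candidate threshold:   {candidate:.1e}")
```

The much stricter genome-wide threshold is why genome-wide designs need larger samples or stronger effects to reach significance.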
Design Considerations: Specificity
• Linkage methods: low specificity; that is, they typically identify very broad regions of the genome
• Association methods: high specificity; they typically identify very narrow regions of the genome, relative to linkage.
Design Considerations: Specificity
Beecham et al., 2015, Neurology
Linkage Disequilibrium
• Linkage Equilibrium: alleles at different loci are inherited independently
• Linkage Disequilibrium: alleles at different loci are not independent and inheritance at one locus is correlated with the other
Linkage Disequilibrium
• The “A” allele (locus 1) tends to be inherited with the “B” allele (locus 2)
• The event “gamete carries A” is not independent of the event “gamete carries B”
• Locus 1 and 2 are not independent; they are in linkage disequilibrium
• The “A” allele is not preferentially inherited with “Z” or “z”
• The event “gamete carries A” is independent of the event “gamete carries Z”
• Locus 1 and 3 are independent; they are in linkage equilibrium
[Figure: chromosomes at loci 1, 2, and 3. Haplotypes across loci 1 and 2 are AB or ab; haplotypes across loci 1 and 3 include AZ, Az, aZ, and az]
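The independence statements above can be checked numerically with the disequilibrium coefficient D = P(AB) - P(A)P(B), which is zero under linkage equilibrium. This is a minimal sketch; the gamete lists are hypothetical and are not the counts drawn in the slide figure.

```python
def ld_coefficient(haplotypes, allele1, allele2):
    """D = P(allele1 and allele2 on the same haplotype) - P(allele1) * P(allele2)."""
    n = len(haplotypes)
    p1 = sum(h[0] == allele1 for h in haplotypes) / n
    p2 = sum(h[1] == allele2 for h in haplotypes) / n
    p12 = sum(h[0] == allele1 and h[1] == allele2 for h in haplotypes) / n
    return p12 - p1 * p2

# Hypothetical gametes: first character is locus 1, second is locus 2 or locus 3
loci_1_2 = ["AB", "AB", "AB", "AB", "ab", "ab", "ab", "ab"]
loci_1_3 = ["AZ", "Az", "AZ", "Az", "aZ", "az", "aZ", "az"]

print(ld_coefficient(loci_1_2, "A", "B"))  # nonzero: loci 1 and 2 are in LD
print(ld_coefficient(loci_1_3, "A", "Z"))  # zero: loci 1 and 3 are in equilibrium
```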
Linkage Disequilibrium: New Mutation
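[Figure: a new mutation D arises at locus 1 on a chromosome carrying B at locus 2 and Z at locus 3 (w = wild type). In the later panel D is still found only with B, but occurs with both Z and z]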
Linkage Disequilibrium
• All new mutations are in complete LD with everything on the initial chromosome
• So, why aren’t entire chromosomes linked to themselves?
• How does linkage disequilibrium decay over time?
• RECOMBINATION
Decay of LD
[Figure: haplotypes at Locus 1 and Locus 2 over time]
Ancestral haplotypes: A G, C G
MUTATION
After mutation event: A G, C G, C T (complete LD between C and T)
RECOMBINATION
After recombination: A G, C G, C T, A T (A T is a recombinant haplotype; incomplete LD)
Decay of LD
• In a large, random-mating population, in the absence of mutation, migration and selection:
• $D_t = D_0 (1 - \theta)^t$
• $D_t$ = disequilibrium coefficient after t generations
• $D_0$ = disequilibrium coefficient in initial generation
• $\theta$ = recombination fraction
[Figure: decay curves over generations t = 0 to 10 (y-axis 0.0 to 1.0) for θ = 0, 0.01, 0.1, and 0.5]
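The decay formula can be tabulated directly. The sketch below evaluates $D_t = D_0 (1 - \theta)^t$ for the four recombination fractions shown in the plot, assuming $D_0 = 1$ so that the values read as the fraction of the initial disequilibrium that remains.

```python
def ld_after(d0, theta, t):
    """Disequilibrium after t generations: D_t = D_0 * (1 - theta)**t."""
    return d0 * (1 - theta) ** t

for theta in (0.0, 0.01, 0.1, 0.5):
    curve = [round(ld_after(1.0, theta, t), 3) for t in range(11)]
    print(f"theta={theta}: {curve}")
```

With θ = 0.5 (unlinked loci) the disequilibrium halves every generation, while tightly linked loci (θ = 0.01) keep most of it after ten generations.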
What does this have to do with association and linkage?
• Linkage and Association BOTH rely on this idea of LD decay
• Linkage relies on the decay of chromosomal haplotypes within families due to meioses between relatives
• Association relies on the decay of haplotypes across individuals due to the cumulative meioses throughout the entire population
• MARKER loci are tested; positive tests indicate a disease locus is “near” the marker locus.
Design Considerations: Specificity
Beecham et al., 2015, Neurology
[Figure annotations: LINKAGE, ASSOCIATION]
Linkage Analyses
Types of Linkage Analysis
• Parametric – LOD score – genetic model is specified
  – Inheritance pattern (e.g., dominant, recessive, additive, etc.)
  – Penetrance (e.g., strength of genetic effect)
  – Disease allele frequency
• Non-parametric – affecteds only; e.g., affected sibling pair, affected relative pair – genetic model is not specified
  – Analyzes allele sharing in affected relatives
  – Unaffected relatives usually only used to establish phase of alleles
Parametric Linkage: LOD scores
LOD score: $Z(\theta) = \log_{10} \frac{L(\theta)}{L(\theta = 0.5)}$
Z > 3 ~ significant evidence FOR linkage
Z < -2 ~ significant evidence FOR non-linkage
-2 < Z < 3 ~ insufficient evidence for either linkage or non-linkage
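For the simplest case, the LOD score can be computed by hand. The sketch below assumes phase-known meioses in which recombinants can be counted directly, so the likelihood is $\theta^k (1 - \theta)^{n-k}$ for k recombinants in n meioses; real pedigrees with unknown phase need dedicated software. The counts are hypothetical.

```python
import math

def lod_score(recombinants, meioses, theta):
    """Two-point LOD for phase-known meioses: log10[L(theta) / L(0.5)]."""
    k, n = recombinants, meioses
    likelihood = theta ** k * (1 - theta) ** (n - k)
    if likelihood == 0:
        return float("-inf")  # the data are impossible at this theta
    return math.log10(likelihood / 0.5 ** n)

# Hypothetical data: 10 informative meioses
print(round(lod_score(0, 10, 0.0), 2))  # no recombinants, tested at theta = 0
print(round(lod_score(1, 10, 0.1), 2))  # one recombinant, tested at theta = 0.1
```

Ten non-recombinant meioses are just enough to cross the Z > 3 threshold at θ = 0.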
Parametric Linkage
• That’s the easy part!
  – The Hard Part: determining recombinants, non-recombinants, determining phase, dealing with unknown phase, multipoint vs two-point, determining models, etc.
• For more detail: http://hihg.med.miami.edu/educational-programs/online-genetics-courses/
• Software: MERLIN (Abecasis)
Nonparametric Linkage Analysis
• If the same disease gene causes disease in both members of a relative pair, then the relatives should have inherited the same alleles of genetic markers near that gene more often than would be expected by chance alone (Penrose 1935, Suarez et al. 1978).
• This approach to linkage analysis makes no assumption about the inheritance pattern.
• Ignores unaffected family members
• ASP (Affected Sib-Pair)
• ARP (Affected Relative-Pair)
IBD versus IBS
• Identical by Descent (IBD) sharing
  – relative pairs have inherited the same allele from a common ancestor
  – we can trace that allele from a common ancestor down the family tree to the descendants
• Identical by State (IBS) sharing
  – pairs share the same allele TYPE regardless of ancestral origin
  – unrelated people can share alleles IBS
IBD and IBS
We infer IBD status using IBS and relationship information.
– The parents share no alleles IBD, one allele IBS.
– The daughters share two alleles IBD (and IBS).
– If the parents were not genotyped, the daughters' IBD state could be 0, 1, or 2.
Inferring IBD: Parental Genotypes Known
Inferring IBD: Parental Genotypes Unknown
• Calculate the probability of all possible parental genotype mating types.
  – 11 x 23 → Pr(IBD = 0) = ½, Pr(IBD = 1) = ½
  – 12 x 13 → Pr(IBD = 0) = 1
  – 1_ x 23 → Pr(IBD = 1) = 1, where _ denotes any alleles other than allele 1
• Add up all probabilities for IBD = 0 and IBD = 1
• Different allele frequencies will result in different probabilities
• Adding additional siblings can reduce uncertainty
Affected Sibling Pair (ASP) linkage tests
• Determine IBD sharing for each sibling pair.
• Tests:
  – Chi-squared goodness of fit test: examine deviations of observed from expected IBD distribution
  – Means test: compare the mean observed number of alleles shared IBD to the expected number (i.e. 1)
  – Two allele test: compare the observed number of pairs sharing 2 alleles IBD with that expected
• Generally the means test is the most powerful
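Two of the tests above can be sketched from counts of sib pairs sharing 0, 1, or 2 alleles IBD. The sketch assumes IBD status is known for every pair, uses the null sharing probabilities 1/4, 1/2, 1/4 for full siblings, and treats the means test as a one-sided normal approximation; the counts are hypothetical.

```python
import math

def asp_tests(n0, n1, n2):
    """Goodness-of-fit (2 df) and means test for sib pairs sharing 0/1/2 alleles IBD."""
    n = n0 + n1 + n2
    expected = (0.25 * n, 0.5 * n, 0.25 * n)  # null IBD distribution for sibs
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n0, n1, n2), expected))
    p_chi2 = math.exp(-chi2 / 2)  # chi-squared upper tail for 2 df
    mean_ibd = (n1 + 2 * n2) / n
    # Under the null the IBD count has mean 1 and variance 1/2
    z = (mean_ibd - 1) / math.sqrt(0.5 / n)
    p_means = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided: excess sharing
    return chi2, p_chi2, z, p_means

chi2, p_chi2, z, p_means = asp_tests(20, 50, 30)  # hypothetical 100 pairs
print(round(chi2, 3), round(p_chi2, 3), round(z, 3), round(p_means, 3))
```

On these counts the one-sided means test gives a smaller p-value than the 2 df goodness-of-fit test, in line with the remark that the means test is generally the most powerful.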
ASP tests
ADVANTAGES
• ASPs may be easier to collect than large extended pedigrees, especially for late onset disorders
• Has reasonable power in the presence of genetic heterogeneity, provided that at least one gene has a detectable effect
• Uses only affected individuals, thus non-penetrant gene carriers do not reduce power
DISADVANTAGES
• Most tests require that IBD status be known. Pairs in which IBD status is unknown cannot be used.
  – Requires parents or enough siblings so that parental genotypes can be inferred unambiguously
• May require large number of affected sibpairs to achieve reasonable power.
Affected Relative Pair Analysis
• Like affected sibpair analysis, it does not require assumptions about:
  – Mode of inheritance
  – Disease allele frequency
  – Penetrance
• Unlike affected sibpair analysis, it uses all affected relatives regardless of relationship
  – Not restricted to affected sibpairs
  – Extracts more information from extended pedigrees
• Prefer to use all data possible
• Common approach: NPL statistic (Kruglyak et al, 1996) and extensions
• MERLIN (Abecasis et al.) is a frequently used implementation
Association Analyses
Direct and Indirect allelic association
• Direct approach requires some prior knowledge of the variant
• Indirect approach is agnostic with relation to functional relevance
[Figure: direct association links the causal allele to the phenotype; indirect association links a genotyped MARKER allele to the phenotype through its correlation (LD) with the causal allele on the same haplotype]
Ref: Balding. Nature Reviews. Vol 7. Oct 2006
Why test for association and not linkage?
• Resolution of mapping
  – Meioses within families are limited. Linkage analysis generally identifies large regions
  – Association analysis takes advantage of historical meioses. Better suited for fine-mapping
• Cost
  – High-throughput genotyping makes genome-wide association studies (GWAS) competitive with linkage screens
• Association analysis is a well-suited approach for investigating the common disease common variant (CDCV) hypothesis, through indirect association
• Direct association is becoming increasingly feasible with cheaper next-generation sequencing technologies (CDRV), but sample size is still an issue
Linkage Disequilibrium
[Figure: ancestral haplotypes A G and C G at Locus 1 and Locus 2; a MUTATION creates C T, with complete LD between C and T; RECOMBINATION then produces the recombinant haplotype A T, giving incomplete LD]
Linkage disequilibrium in populations
[Figure: an ancestral chromosome and a sampling of different chromosomes at the present day]
Tests at marker (tag) SNPs on the ancestral chromosome become proxy tests for disease loci on the ancestral chromosome (indirect association)
Molecular Psychiatry (2005) 10, 328
Case-control tests: single SNP
• Samples of unrelated individuals
  – Cases = Affected
  – Controls = Unaffected
• Observe genotypes at a locus (loci)
  – M1M1 (11: reference homozygote)
  – M1M2 (12: heterozygote)
  – M2M2 (22: non-reference homozygote)
• Two general categories of test statistics
  – Chi-squared Tests
  – Logistic Regression
Case-control tests: Chi-squared tests
• Allele-based
• Ho: p1|case = p1|control
• Tends to have good power under a variety of genetic models
• Requires Hardy-Weinberg Equilibrium

Observed allele counts
         Cases   Controls   Total
M1       n1A     n1U        n1
M2       n2A     n2U        n2
Total    nA      nU         N

With equal numbers of cases and controls (nA = nU):
$T_{cc} = \sum_{i=1}^{2} \frac{(n_{iA} - n_{iU})^2}{n_{iA} + n_{iU}}$
Tcc ~ Chi-squared with 1 df
Case-control tests: Chi-squared tests
• When the numbers of cases and controls are not equal (i.e., nA ≠ nU), compare observed with expected counts

Observed Counts
         Cases   Controls   Total
M1       n1A     n1U        n1
M2       n2A     n2U        n2
Total    nA      nU         N

Expected Counts
         Cases       Controls    Total
M1       nA n1 / N   nU n1 / N   n1
M2       nA n2 / N   nU n2 / N   n2
Total    nA          nU          N

$T_{cc} = \sum_{\text{cells}} \frac{(\text{Obs} - \text{Exp})^2}{\text{Exp}}$
Tcc ~ Chi-squared with 1 df
• Example: Testing for association of APOE-4 and late-onset Alzheimer’s disease

Observed Counts
          Cases   Controls   Total
e4        240     60         300
Not e4    360     340        700
Total     600     400

Expected Counts
          Cases   Controls   Total
e4        180     120        300
Not e4    420     280        700
Total     600     400

$T_{cc} = \frac{(240 - 180)^2}{180} + \frac{(60 - 120)^2}{120} + \frac{(360 - 420)^2}{420} + \frac{(340 - 280)^2}{280} = 71.4$
p-value < $10^{-16}$
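The APOE-4 example can be reproduced in a few lines. This is a sketch using only the standard library: expected counts come from the row and column totals, and the 1 df p-value is taken from the complementary error function rather than a statistics package. The function name is illustrative.

```python
import math

def chi2_2x2(table):
    """Pearson chi-squared (1 df) for a 2x2 table given as [[a, b], [c, d]]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected count under Ho
            stat += (obs - exp) ** 2 / exp
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-squared upper tail, 1 df
    return stat, p_value

# Rows: e4, not e4; columns: cases, controls (counts from the example above)
stat, p = chi2_2x2([[240, 60], [360, 340]])
print(round(stat, 2), p)
```

The four terms are 20 + 30 + 8.57 + 12.86, so the statistic is about 71.4.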
Case-control tests: Chi-squared tests
• Genotype-based
  – Ho: p11|case = p11|control and p12|case = p12|control and p22|case = p22|control
  – 2 df test
  – Can also test a specific genotype
  – Does not require Hardy-Weinberg Equilibrium
• Armitage’s Trend Test (Cochran-Armitage)
  – Restricted alternative hypothesis assumes additive effect on 12 genotype (between 11 and 22).
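The trend test can be sketched from genotype counts in cases and controls. The sketch assumes additive scores 0, 1, 2 for the three genotypes and the usual 1 df chi-squared form of the Cochran-Armitage statistic; the genotype counts are hypothetical.

```python
import math

def trend_test(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test (1 df) from per-genotype case/control counts."""
    R, S = sum(cases), sum(controls)
    N = R + S
    totals = [r + s for r, s in zip(cases, controls)]
    sum_xr = sum(x * r for x, r in zip(scores, cases))
    sum_xn = sum(x * n for x, n in zip(scores, totals))
    sum_x2n = sum(x * x * n for x, n in zip(scores, totals))
    t = N * sum_xr - R * sum_xn                           # trend numerator
    var_t = (R * S / N) * (N * sum_x2n - sum_xn ** 2)     # its variance under Ho
    stat = t ** 2 / var_t
    return stat, math.erfc(math.sqrt(stat / 2))           # 1 df upper tail

# Hypothetical counts in genotype order (score 0, score 1, score 2)
stat, p = trend_test(cases=(10, 40, 50), controls=(30, 50, 20))
print(round(stat, 2), p)
```

Because the scores impose an ordering on the genotypes, the test spends a single degree of freedom instead of the two used by the general genotype test.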
Case-control tests: Logistic Regression
• Models log odds of being affected given risk factors (e.g., genotypes)
  – Full genotype model
$\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}$
  – $\beta_1 X_{1i}$: term for 11 vs 22
  – $\beta_2 X_{2i}$: term for 12 vs 22
  – Ho: $\beta_1 = \beta_2 = 0$
$p_i$ = probability that ith individual is affected
$X_{1i}$ = 1 if genotype 11 and 0 otherwise; $X_{2i}$ = 1 if genotype 12 and 0 otherwise
Case-control tests: Logistic Regression
• Linear model
$\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_{1i}$
  – $\beta_1 X_{1i}$: term for linear genotype effect
  – $X_{1i}$ = count of allele 1
  – Ho: $\beta_1 = 0$
  – HA: $\beta_1 \neq 0$
• Can incorporate covariates, environmental factors and test interactions
Estimation of Odds Ratios
• Can estimate effects, not just test them!
• In general genotype model
  – $e^{\hat{\beta}_1}$: estimates the ratio of the odds of being affected given genotype 11 relative to the odds of being affected given genotype 22
  – $e^{\hat{\beta}_2}$: estimates the ratio of the odds of being affected given genotype 12 relative to the odds of being affected given genotype 22
• Confidence intervals are computed in standard statistical packages
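For a single binary predictor, the odds ratio $e^{\hat{\beta}}$ from logistic regression equals the cross-product ratio of the 2x2 table, so it can be sketched without fitting a model. The example reuses the APOE-4 counts from the chi-squared example; the Wald interval on the log scale is an assumed choice, not necessarily what a given package reports.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald CI.

    a: exposed cases, b: exposed controls, c: unexposed cases, d: unexposed controls.
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of the log odds ratio
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)

# e4 cases, e4 controls, non-e4 cases, non-e4 controls
or_hat, lower, upper = odds_ratio_ci(240, 60, 360, 340)
print(round(or_hat, 2), round(lower, 2), round(upper, 2))
```

An interval that excludes 1 agrees with rejecting the null hypothesis of no association.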
Quantitative Trait Tests
• Linear model
$y_i = \beta_0 + \beta_1 X_{1i}$
  – $\beta_1 X_{1i}$: term for linear genotype effect
  – $X_{1i}$ = count of allele 1
  – Ho: $\beta_1 = 0$
  – HA: $\beta_1 \neq 0$
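The quantitative trait model is ordinary least squares with the allele count as the predictor. The sketch below fits it by hand and returns the t statistic for Ho: $\beta_1 = 0$; the trait values are made up, and a real analysis would normally use a statistical package and include covariates.

```python
import math

def fit_additive_model(allele_counts, trait_values):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, t statistic for b1)."""
    n = len(allele_counts)
    mean_x = sum(allele_counts) / n
    mean_y = sum(trait_values) / n
    sxx = sum((x - mean_x) ** 2 for x in allele_counts)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(allele_counts, trait_values))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((y - (b0 + b1 * x)) ** 2
              for x, y in zip(allele_counts, trait_values))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    return b0, b1, b1 / se_b1

# Hypothetical data: allele counts (0, 1, 2) and a quantitative trait
x = [0, 0, 1, 1, 2, 2]
y = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
b0, b1, t_stat = fit_additive_model(x, y)
print(round(b0, 3), round(b1, 3), round(t_stat, 1))
```

The t statistic is compared against a t distribution with n - 2 degrees of freedom.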
Family Designs
• Regression-based Tests: covariates in the model account for correlations within families (e.g., between family members)
  – Generalized Estimating Equations (GEE)
  – Linear/Logistic Mixed Models
• Transmission-based tests (specific to genomics)
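Within-Family Tests of Association
• Parental Controls
  – “transmitted”
  – “nontransmitted”
• Cases and “controls” are well matched (e.g., population substructure is not an issue)
[Figure: parent-child trio; parental alleles transmitted to the affected child (“Trans”) are compared with the nontransmitted alleles (“Nontrans”)]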
Transmission Disequilibrium Test (TDT)
• As a test for linkage, can use singleton families, affected sib families, and extended families
• H0: No difference in frequency between trans/nontrans
• Comparable to $\chi^2$ with 1 df (for large samples)
[Figure: 2 × 2 table of transmitted versus nontransmitted alleles (M1, M2) with off-diagonal counts n12 and n21, illustrated with M1M2 and M1M1 parents]
$TDT = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}}$
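The TDT statistic needs only the two off-diagonal counts of the transmitted/nontransmitted table. A minimal sketch with hypothetical counts, using the large-sample chi-squared approximation with 1 df:

```python
import math

def tdt(n12, n21):
    """TDT statistic (n12 - n21)^2 / (n12 + n21) and its 1 df chi-squared p-value."""
    stat = (n12 - n21) ** 2 / (n12 + n21)
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical off-diagonal counts of the transmitted/nontransmitted table
stat, p = tdt(60, 40)
print(stat, round(p, 4))
```

Under the null hypothesis the two off-diagonal counts are expected to be equal, so the statistic grows with the imbalance between them.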
Association in the Presence of Linkage
• TDT is a test of both linkage AND association; has problems with missing parental data
• Test for Association in the Presence of Linkage (APL) (Martin et al 2003; Chung et al 2006)
  – Based on difference between counts of transmitted and nontransmitted alleles
  – With missing parents APL estimates expected count in nontransmitted alleles using EM algorithm
Inferring Missing Parental Genotypes
• With missing parents, APL (Martin et al 2003; Chung et al 2006) estimates the expected count of nontransmitted alleles using the EM algorithm
Results
[Figure: association results, with the HLA region and other interesting signals labeled]
IMSGC, N Engl J Med. 2007 Aug 30;357(9):851-62
Association Software
• Genetics Specific: PLINK (Purcell et al)
• Standard Statistical packages
  – R statistical programming language
  – SAS
• The programming challenge is handling/looping through thousands of markers
Other Designs
• Longitudinal Studies (prospective cohorts)
  – Very very expensive (-)
  – Fewer biases in estimates (+)
• Founder or Isolated Populations
  – Often less genetic heterogeneity (+)
  – Less likely to be able to apply results to other populations (-)
• Imputation Analyses
  – Infer genotypes at unobserved loci (+)
  – Infer genotypes at unobserved loci (-)
• Meta-analyses
  – Combine results across datasets (++++)
  – Virtually all major genetics consortia use these methods
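One common way to combine results across datasets is a fixed-effect, inverse-variance weighted meta-analysis of per-study effect estimates. The slides do not say which method is meant, so this choice is an assumption, and the effect sizes and standard errors below are hypothetical.

```python
import math

def inverse_variance_meta(betas, ses):
    """Fixed-effect meta-analysis: pooled beta, pooled SE, and z statistic."""
    weights = [1 / se ** 2 for se in ses]  # precision of each study
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled_beta, pooled_se, pooled_beta / pooled_se

# Hypothetical log odds ratios and standard errors from two datasets
beta, se, z = inverse_variance_meta([0.2, 0.3], [0.1, 0.1])
print(round(beta, 3), round(se, 4), round(z, 2))
```

The pooled standard error is smaller than either study's own, which is the gain in power that motivates combining datasets.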
Other Analyses
Secondary Research Questions
• Do alleles interact with other alleles to influence phenotypes? (gene x gene interaction)
• Do alleles interact with environmental factors? (gene x environment interaction)
• Are sets of alleles associated with disease? (pathway analyses, gene-based tests)
• Do certain alleles predict disease? (risk prediction, risk score analyses)
• Do certain alleles influence mRNA abundance? (expression-QTL analyses)
Thanks! Any questions?