Population genetics
Nicky Mulder
Acknowledgments: some slides are based on a course by Noah Zaitlen
Contents
• Background
• Linkage disequilibrium
• SNP tagging
• Population studies
• Accessing data
What is population genetics?
Population genetics is the study of genetic variation both within and between human populations.
5-7% of worldwide human genetic variation is due to genetic differences between human populations.
The remaining 93-95% of human genetic variation is due to genetic variation within human populations (Rosenberg et al. 2002 Science).
Why study population genetics?
• Learn about human migration patterns and history
• Improve power to identify and localize disease genes
• Use differences in linkage disequilibrium for fine-mapping
• Avoid false positives due to population stratification
• Admixture mapping for diseases with varying prevalence
• Signals of natural selection at genes related to disease
Linkage Disequilibrium
Definition: Linkage Disequilibrium (LD) refers to correlations between genotypes of nearby markers.
Linkage Disequilibrium can be used for association studies
Linkage Disequilibrium: Example
Individuals (one row each; 3 billion letters per genome):
A A G A T T A A C G T T G G C C A A
A A G G T T A A C C T T G G T T A A
A A G G T T A A C C T T G G C T A A
A A A A T T A A G G T T G G T C A A
A A G G T T A A C C T T G G T T A A
A A G A T T A A C G T T G G C T A A
A A G G T T A A C C T T G G C T A A
A A G A T T A A C C T T G G C C A A
SNP 1 (columns 3-4) and SNP 2 (columns 9-10): YES, in LD
SNP 3 (columns 15-16): NOT in LD
Linkage Disequilibrium: Example
Individuals (one row each, alleles coded 0/1; 3 billion letters per genome):
0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SNP 1 (columns 3-4) and SNP 2 (columns 9-10): r2=1, in LD
SNP 3 (columns 15-16): r2=0, NOT in LD
r2 is the squared correlation
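As a rough check of the r2 values quoted for the 0/1 example, the sketch below recomputes them from the eight genotype rows. It assumes (this is not stated on the slide) that each SNP occupies two adjacent columns, that the left column is one chromosome and the right column the other, and that r2 is the squared correlation between allele indicators across the resulting 16 chromosomes. The function names are illustrative only.

```python
# Genotype rows copied from the 0/1 example; one row per individual.
ROWS = [
    "0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0",
    "0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0",
    "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0",
    "0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0",
    "0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0",
    "0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0",
    "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0",
    "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0",
]
G = [[int(x) for x in row.split()] for row in ROWS]


def chromosome_alleles(genotypes, first_col):
    """Alleles of one SNP on all chromosomes (assumed phase: left column, then right column)."""
    return [r[first_col] for r in genotypes] + [r[first_col + 1] for r in genotypes]


def r_squared(a, b):
    """Squared correlation between two 0/1 allele vectors: D^2 / (pA(1-pA) pB(1-pB))."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    d = pab - pa * pb  # the disequilibrium coefficient D
    return d * d / (pa * (1 - pa) * pb * (1 - pb))


snp1 = chromosome_alleles(G, 2)   # columns 3-4 (0-based start index 2)
snp2 = chromosome_alleles(G, 8)   # columns 9-10
snp3 = chromosome_alleles(G, 14)  # columns 15-16

print("r2(SNP1, SNP2) =", r_squared(snp1, snp2))
print("r2(SNP1, SNP3) =", r_squared(snp1, snp3))
```

Under these assumptions SNP 1 and SNP 2 carry identical allele patterns, while the SNP 3 allele occurs on SNP 1 carrier chromosomes exactly as often as expected by chance.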
LD: Haplotype Blocks
Individuals (one row each, alleles coded 0/1; 3 billion letters per genome):
0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SNP 1, SNP 2, SNP 3
These 3 SNPs form a “haplotype block”
Haplotype blocks
• Variants (alleles) that are located close to one another are inherited together (in LD)
• This pattern is disrupted by recombination (which shuffles chromosomes)
• Recombination is finite, and the time since the origin of humans is insufficient to break down all linkage => haplotype blocks
• African chromosomes: 50% of the genome lies in haplotype blocks >22kb.
• Europeans and Asians: 50% of the genome lies in haplotype blocks >44kb.
• Longer haplotype blocks in Europeans/Asians are due to the out-of-Africa population bottleneck: they are descended from a small number of ancestors who left Africa 60-40 kya.
Gabriel et al. 2002 Science (also see Reich 2001 Nature, Daly 2001 Nat Genet)
Population bottlenecks
Cavalli-Sforza & Feldman 2003 Nat Genet; also see Ramachandran et al. 2005 PNAS,
Green et al. 2010 Science (Neanderthal genome), Reich et al. 2010 Nature (Denisova)
Linkage Disequilibrium and tag SNPs
[Figure: 0/1 genotypes of individuals, grouped into Cases and Controls, across 3 billion letters. SNP 1 is the causal SNP.]
Direct association: genotype SNP1 in Cases and Controls.
Linkage Disequilibrium and tag SNPs
[Figure: 0/1 genotypes of individuals, grouped into Cases and Controls, across 3 billion letters. SNP 1 and SNP 2 are marked: r2=1, in LD.]
Indirect association: genotype SNP2 in Cases and Controls. If SNP1 affects disease risk, then SNP2 will also be associated!
LD: Haplotype Blocks
[Figure: haplotypes labelled Case or Control, with the risk haplotype marked.]
Question: Which SNP to genotype?
Answer: Choose 1 SNP per haplotype block, and take advantage of indirect association! Use known resources.
HapMap for “SNP tagging”
How to select SNPs to genotype in an association study:
• Choose genomic region(s) of interest.
• Look up HapMap SNPs in the genomic region(s).
• Choose a subset of HapMap SNPs which “tag” haplotype blocks in the genomic region(s).
Note: because LD patterns vary by population, it is important to choose tag SNPs using a HapMap population similar to the population in the association study.
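The last selection step can be sketched as a greedy search: repeatedly pick the SNP that is in strong LD (here r2 >= 0.8, an assumed threshold) with the largest number of SNPs that are not yet tagged. This is only one simple way to choose tags, not necessarily what any particular tagging tool does. The haplotype panel and SNP names below are made up for illustration, and all SNPs are assumed to be polymorphic.

```python
def r_squared(a, b):
    """Squared correlation between two 0/1 allele vectors (both must be polymorphic)."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    d = pab - pa * pb
    return d * d / (pa * (1 - pa) * pb * (1 - pb))


def greedy_tag_snps(haps, threshold=0.8):
    """Return {tag SNP: [SNPs it tags]} so that every SNP has r2 >= threshold with some tag."""
    names = list(haps)
    untagged = set(names)
    tags = {}
    while untagged:
        def covered_by(snp):
            return {t for t in untagged if r_squared(haps[snp], haps[t]) >= threshold}

        # Pick the SNP covering the most untagged SNPs (ties: first in input order).
        best = max(names, key=lambda s: len(covered_by(s)))
        tags[best] = sorted(covered_by(best))
        untagged -= set(tags[best])
    return tags


# Hypothetical panel: 8 chromosomes (list positions) typed at 6 SNPs.
PANEL = {
    "SNP1": [0, 0, 1, 1, 0, 1, 0, 0],
    "SNP2": [0, 0, 1, 1, 0, 1, 0, 0],
    "SNP3": [0, 0, 1, 1, 0, 1, 0, 0],
    "SNP4": [1, 0, 0, 1, 1, 0, 0, 1],
    "SNP5": [1, 0, 0, 1, 1, 0, 0, 1],
    "SNP6": [0, 1, 0, 0, 1, 1, 0, 0],
}

tags = greedy_tag_snps(PANEL)
for tag, tagged in tags.items():
    print(tag, "tags", tagged)
```

In this toy panel the six SNPs fall into three groups, so three genotyped tag SNPs capture all six.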
SNP Imputation
Howie et al. 2009 PLoS Genet; also see Marchini et al. 2007 Nat Genet
SNP Imputation cont.
[Figure labels: r2 = 0.8; Causal SNP]
Why do Imputation?
• Increase power to detect disease association at untyped causal SNP (imputed causal SNP may have stronger association than tag SNP)
• Enable meta-analysis of studies on Affymetrix + Illumina chips
• Improve genotype data quality
• Imputation algorithms available
Population studies
• Population structure: refers to genetic differences between populations due to geographic ancestry. Use genome-wide data to classify genome-wide ancestry.
• Population admixture: mixed ancestry from multiple continental populations, e.g. African Americans, Latino Americans. Classify local ancestry at each location in the genome.
• Population stratification: refers specifically to a genotype-phenotype association study. Differences in genetic ancestry between cases and controls.
Rosenberg et al. 2010, Nat Rev Genet
Previous GWAS studies
The International HapMap Project
270 samples from 4 populations
Found 3.1 million SNPs
Code  Population         Location  Samples
CEU   northern European  USA       90
CHB   Chinese            China     45
JPT   Japanese           Japan     45
YRI   Yoruba             Nigeria   90
HapMap3
Code  Population         Location  Samples
CEU   northern European  USA       180
CHB   Chinese            China     90
JPT   Japanese           Japan     90
YRI   Yoruba             Nigeria   180
TSI   Tuscan             Italy     90
CHD   Chinese            USA       100
LWK   Luhya              Kenya     90
MKK   Maasai             Kenya     180
ASW   African-American   USA       90
MXL   Mexican-American   USA       90
GIH   Gujarati-American  USA       90
Measuring distances: FST
The FST between two populations is the value such that the allele frequency difference between the two populations has mean 0 and variance 2·FST·p(1 - p), where p is the allele frequency in the ancestral population.
OR
The FST between two populations is equal to the proportion of genotypic variance in a set of N individuals from each population that is attributable to population differences.
High FST implies a high degree of differentiation
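The first definition suggests a simple moment-style calculation: average the squared allele frequency difference over many SNPs and divide by the average of 2p(1 - p). The sketch below checks this on simulated data. The simulation model is an assumption of this example, not part of the slides: each population's frequency is drawn from a beta distribution with mean p and variance FST·p(1 - p) (the Balding-Nichols model). The ancestral frequency p is treated as known, which is not the case with real data, where sample-size corrections would also be needed.

```python
import random


def simulate_and_estimate_fst(true_fst, n_snps, seed=0):
    """Simulate two drifted populations and recover FST from frequency differences."""
    rng = random.Random(seed)
    scale = (1.0 - true_fst) / true_fst
    sum_sq_diff = 0.0
    sum_expected = 0.0
    for _ in range(n_snps):
        p = rng.uniform(0.2, 0.8)  # ancestral allele frequency (assumed range)
        # Each population: mean p, variance true_fst * p * (1 - p).
        p1 = rng.betavariate(p * scale, (1.0 - p) * scale)
        p2 = rng.betavariate(p * scale, (1.0 - p) * scale)
        sum_sq_diff += (p1 - p2) ** 2          # has expectation 2 * FST * p(1 - p)
        sum_expected += 2.0 * p * (1.0 - p)
    return sum_sq_diff / sum_expected


estimate = simulate_and_estimate_fst(true_fst=0.01, n_snps=20000)
print("simulated FST = 0.01, estimated FST =", round(estimate, 4))
```

With tens of thousands of SNPs the ratio should land close to the simulated value, which illustrates why even small FST values, such as those between neighbouring populations, can be measured from genome-wide data.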
Some example FSTs
• Northwest Eur. vs Southeast Eur.: FST = 0.005
• Chinese vs Japanese: FST = 0.007
• Yoruba (Nigeria) vs Luhya (Kenya): FST = 0.008
More FST for HapMap3
HapMap3 population       Closest pop. from HapMap  FST
TSI (Tuscan)             CEU                       0.004
CHD (Chinese)            CHB                       0.001
LWK (Luhya)              YRI                       0.008
MKK (Maasai)             YRI                       0.03
ASW (African-American)   YRI                       0.01
MXL (Mexican-American)   CEU                       0.04
GIH (Gujarati-American)  CEU                       0.04
Studying population structure
Can study population structure by:
• Principal Component Analysis (PCA)
• Clustering
Principal Components Analysis
10 points in 1,000,000-dimensional space.
Axes of variation (PCs, eigenvectors)
Axis 1 is the axis explaining the maximum amount of variation.
[Figure sequence: the same 10 points with Axis 1, then Axis 2, Axis 3, …, Axis 9 and Axis 10 drawn in turn.]
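A minimal sketch of how Axis 1 can be found for a small genotype matrix, using only the standard library. The data are simulated under assumptions that are not in the slides: two populations of 10 individuals each, 1,000 markers, and allele frequencies drawn from a beta distribution around a shared ancestral frequency with FST = 0.1. Axis 1 is obtained by power iteration on the individual-by-individual matrix of inner products of mean-centred genotypes. In practice a dedicated PCA or SVD routine would be used.

```python
import math
import random


def simulate_genotypes(n_per_pop, n_markers, fst, rng):
    """Rows = individuals (population 1 first, then population 2); entries = 0/1/2 allele counts."""
    scale = (1.0 - fst) / fst
    rows = [[] for _ in range(2 * n_per_pop)]
    for _ in range(n_markers):
        p = rng.uniform(0.2, 0.8)  # shared ancestral frequency (assumed range)
        for pop in range(2):
            freq = rng.betavariate(p * scale, (1.0 - p) * scale)
            for i in range(n_per_pop):
                genotype = int(rng.random() < freq) + int(rng.random() < freq)
                rows[pop * n_per_pop + i].append(genotype)
    return rows


def first_pc_scores(rows, n_iter=200):
    """Coordinates of each individual on Axis 1 (the sign of the axis is arbitrary)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # n x n matrix of inner products between individuals.
    gram = [[sum(a * b for a, b in zip(x, y)) for y in centered] for x in centered]
    v = [float(i) for i in range(n)]  # arbitrary starting vector
    for _ in range(n_iter):
        w = [sum(g * x for g, x in zip(row, v)) for row in gram]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(g * x for g, x in zip(gram[i], v)) for i in range(n))
    return [math.sqrt(eigenvalue) * x for x in v]


rng = random.Random(1)
rows = simulate_genotypes(n_per_pop=10, n_markers=1000, fst=0.1, rng=rng)
pc1 = first_pc_scores(rows)
print("population 1:", [round(x, 1) for x in pc1[:10]])
print("population 2:", [round(x, 1) for x in pc1[10:]])
```

With this much simulated differentiation the two groups should fall on opposite sides of zero along Axis 1. With fewer markers or a smaller FST the separation becomes less clear.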
Distinguishing populations using PCA
[Figure panels: 100 markers; 3 million markers]
PCA in Europe
Novembre et al. 2008 Nature
Population structure using clustering
• Model-based clustering programs such as STRUCTURE (Pritchard et al. 2000 Genetics)
Rosenberg et al. 2002 Science
More examples
[Figure labels: Africa, Europe, Western Eurasia, East Asia, Oceania, America]
Clustering versus PCA
Model-based clustering:
• Output for each individual: ancestry in N population clusters
• Fractional ancestry (20% pop1, 80% pop2) may be allowed
• Number N of population clusters must be decided in advance
• Results may be sensitive to number of population clusters
Principal components analysis (PCA):
• Output for each individual: ancestry as principal components
• PCs do not necessarily correspond to specific populations
Ancestry Informative Markers
Standard approach to inferring genetic ancestry:
• Genotype each individual on a GWAS chip (500,000-1,000,000 random genetic markers).
• Apply model-based clustering or PCA.
OR
AIM approach to inferring genetic ancestry:
• Genotype each individual on a small set of 50-300 AIMs: markers that are highly informative for genetic ancestry.
• Apply model-based clustering or PCA.
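One simple way to shortlist candidate AIMs, sketched below, is to rank markers by the absolute allele frequency difference between two reference populations and keep the top few. This criterion is an assumption of the example; the slides do not say how AIMs are chosen, and other informativeness measures exist. The marker names and frequencies are invented for illustration.

```python
# Hypothetical allele frequencies: marker -> (frequency in population A, frequency in population B).
REFERENCE_FREQS = {
    "m1": (0.10, 0.85),
    "m2": (0.50, 0.52),
    "m3": (0.30, 0.70),
    "m4": (0.90, 0.20),
    "m5": (0.40, 0.45),
}


def top_aims(freqs, k):
    """Return the k markers with the largest absolute allele frequency difference."""
    def delta(marker):
        freq_a, freq_b = freqs[marker]
        return abs(freq_a - freq_b)

    return sorted(freqs, key=delta, reverse=True)[:k]


print(top_aims(REFERENCE_FREQS, 2))
```

Markers such as m2 and m5, whose frequencies barely differ between the two populations, carry little ancestry information and are dropped first.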
Working with the data
Public data available in e.g.
• HapMap
• 1000 Genomes
• dbSNP
• Ensembl, etc
Can be retrieved and used with user-owned data
The 1000 Genomes Project Aims
Characterize allele frequencies and linkage disequilibrium patterns of 95% of variants with allele frequency >1%.
Pioneer and evaluate methods for generating and analyzing data from next-generation sequencing platforms.
Sequence the entire genomes of 2,500 individuals: 500 each from Europe, East Asia, West Africa, South Asia and the Americas
Used next-generation sequencing technologies: e.g. Illumina/Solexa, 454, SOLiD (read lengths 25-400bp)
How much coverage is needed?
1000 Genomes Project Consortium 2010 Nature
1000 Genomes Pilot Projects
1. Trio pilot project: Sequence 1 CEU trio (mother, father, child) and 1 YRI trio (mother, father, child) at high coverage: >40x.
2. Low-coverage pilot project: Sequence 60 unrelated CEU and 60 unrelated YRI at low coverage: about 4x.
3. Exon sequencing pilot project: Exon capture sequencing of 8,140 exons from 906 genes in 697 individuals from 7 populations.
Implications of genetic diversity
Gene expression: Microarrays, RNASeq (thousands of data points)
Protein abundance: Mass spectrometry (thousands of possibilities)
Environmental data: socio-economic impact
Pathways and interactions: binary and directed interactions
Potential disease phenotype
Applications of genetic variation
Disease association studies depend on population group & genetic diversity
Genome wide association studies (GWAS)
• >1000 cases + >1000 controls
• Identify SNPs with significantly different frequencies between the groups
• Correlate this with the disease phenotype
Pharmacogenetics
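The GWAS comparison of allele frequencies between cases and controls can be illustrated for a single SNP with a chi-square test on a 2x2 table of allele counts. The counts below are hypothetical (1,000 cases and 1,000 controls, so 2,000 alleles per group). The allelic test is only one of several tests used in practice, and a real GWAS also has to correct for testing very many SNPs and for population stratification.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square (1 degree of freedom) on a 2x2 table of allele counts."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical SNP: allele frequency 0.35 in cases versus 0.30 in controls.
chi2, p_value = allelic_chi_square(case_alt=700, case_ref=1300,
                                   control_alt=600, control_ref=1400)
print("chi-square = %.2f, p = %.2g" % (chi2, p_value))
```

For these made-up counts the statistic is about 11.4, which shows how a modest frequency difference becomes detectable once thousands of alleles are counted in each group.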
Pharmacogenetics
• Most drugs only work for 40% of people
• Most likely related to population specificity
• Some drugs cause adverse drug reactions related to SNPs in metabolizing enzymes (60%)
• IRESSA cancer drug only works in 10% of patients, but higher success rate in Japan
• BiDil for heart failure: only approved for African Americans
• Warfarin anticoagulant: variations caused by SNPs in CYP2C9 or VKORC1
Summary
Population genetics is a large field! Used for:
• Identifying population structure for history and medical studies
• Looking for ancestry, e.g. in admixed populations
• Use LD, haplotypes and population structure in disease association studies
• Pharmacogenetics