Population genetics

Population genetics Nicky Mulder Acknowledgments -some slides based on course from Noah Zaitlen


Transcript of Population genetics

Page 1: Population genetics

Population genetics

Nicky Mulder

Acknowledgments -some slides based on course from Noah Zaitlen

Page 2: Population genetics

Contents

• Background
• Linkage disequilibrium
• SNP tagging
• Population studies
• Accessing data

Page 3: Population genetics

What is population genetics?

Population genetics is the study of genetic variation both within and between human populations.

5-7% of worldwide human genetic variation is due to genetic differences between human populations.

The remaining 93-95% of human genetic variation is due to genetic variation within human populations (Rosenberg et al. 2002 Science).

Page 4: Population genetics

Why study population genetics?

• Learn about human migration patterns and history
• Improve power to identify and localize disease genes
• Use differences in linkage disequilibrium for fine-mapping
• Avoid false positives due to population stratification
• Admixture mapping for diseases with varying prevalence
• Signals of natural selection at genes related to disease

Page 5: Population genetics

Linkage Disequilibrium

Definition: Linkage Disequilibrium (LD) refers to correlations between genotypes of nearby markers.

Linkage Disequilibrium can be used for association studies

Page 6: Population genetics

Linkage Disequilibrium: Example

Individuals

A A G A T T A A C G T T G C C A A A

A A G G T T A A C C T T G G T T A A

SNP 1

SNP 2

3 billion letters

A A G G T T A A C C T T G G C T A A

A A A A T T A A G G T T G G T C A A

A A G G T T A A C C T T G G T T A A

A A G A T T A A C G T T G G C T A A

A A G G T T A A C C T T G G C T A A

A A G A T T A A C C T T G G C C A A

YES, in LD

Page 7: Population genetics

Linkage Disequilibrium: Example

Individuals

A A G A T T A A C G T T G G C C A A

A A G G T T A A C C T T G G T T A A

SNP 1

SNP 2

3 billion letters

A A G G T T A A C C T T G G C T A A

A A A A T T A A G G T T G G T C A A

A A G G T T A A C C T T G G T T A A

A A G A T T A A C G T T G G C T A A

A A G G T T A A C C T T G G C T A A

A A G A T T A A C C T T G G C C A A

SNP 3

YES, in LD

NOT in LD

Page 8: Population genetics

Linkage Disequilibrium: Example

0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0

SNP 1

SNP 2

3 billion letters

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0

0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0

0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

SNP 3

r2=1, in LD

r2=0, NOT in LD

r2 is squared correlation

Individuals
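As a rough check of the r2 values quoted on this slide, the squared correlation can be computed directly from each individual's minor-allele count. The sketch below assumes that SNP 1, SNP 2 and SNP 3 sit at columns 3-4, 9-10 and 15-16 of the 0/1 matrix (inferred from where the 1s fall, not stated on the slide); the function name is illustrative.

```python
# Minor-allele counts (0, 1 or 2) for the 8 individuals in the 0/1 matrix.
# Assumed layout: SNP 1 = columns 3-4, SNP 2 = columns 9-10, SNP 3 = columns 15-16.
snp1 = [1, 0, 0, 2, 0, 1, 0, 0]
snp2 = [1, 0, 0, 2, 0, 1, 0, 0]
snp3 = [0, 2, 1, 1, 2, 1, 1, 0]


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov * cov / (var_x * var_y)


print(r_squared(snp1, snp2))  # 1.0
print(r_squared(snp1, snp3))  # 0.0625
```

On these eight individuals SNP 1 and SNP 2 are perfectly correlated, while SNP 1 and SNP 3 give a small value rather than exactly 0, which the slide summarizes as "NOT in LD".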

Page 9: Population genetics

LD: Haplotype Blocks

0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

SNP 1

SNP 2

3 billion letters

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 SNP 3

These 3 SNPs form a “haplotype block”

Individuals

Page 10: Population genetics

Haplotype blocks

• Variants (alleles) that are located close to one another are inherited together (in LD)
• This pattern is disrupted by recombination (shuffles chromosomes)
• Recombination is finite and time since origin of humans insufficient to break down all linkage => haplotype blocks
• African chromosomes: 50% of the genome lies in haplotype blocks >22kb.
• Europeans and Asians: 50% of the genome lies in haplotype blocks >44kb.
• Longer haplotype blocks in Europeans/Asians due to out-of-Africa population bottleneck: descended from small number of ancestors who left Africa 60-40 kya.

Gabriel et al. 2002 Science (also see Reich 2001 Nature, Daly 2001 Nat Genet)

Page 11: Population genetics

Population bottlenecks

Cavalli-Sforza & Feldman 2003 Nat Genet; also see Ramachandran et al. 2005 PNAS, Green et al. 2010 Science (Neanderthal genome), Reich et al. 2010 Nature (Denisova)

Page 12: Population genetics

Linkage Disequilibrium and tag SNPs

Individuals Cases Controls

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0

SNP 1: causal SNP

3 billion letters

Direct association: genotype SNP1 in Cases and Controls.

Page 13: Population genetics

Linkage Disequilibrium and tag SNPs

Individuals Cases Controls

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0

SNP 1

3 billion letters

Indirect association: genotype SNP2 in Cases and Controls. If SNP1 affects disease risk, then SNP2 will also be associated!

SNP 2

r2=1, in LD

Page 14: Population genetics

LD: Haplotype Blocks

[Figure: haplotypes labeled Case or Control, with the risk haplotype marked]

Question: Which SNP to genotype?

Answer: Choose 1 SNP per haplotype block, and take advantage of indirect association! Use known resources.

Page 15: Population genetics

HapMap for “SNP tagging”

How to select SNPs to genotype in an association study:
• Choose genomic region(s) of interest.
• Look up HapMap SNPs in the genomic region(s).
• Choose a subset of HapMap SNPs which “tag” haplotype blocks in the genomic region(s).

Note: because LD patterns vary by population, it is important to choose tag SNPs using a HapMap population similar to the population in the association study
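The tag-selection step can be sketched as a greedy cover over a pairwise r2 matrix. This is a simplified illustration, not the procedure used by any particular tool: the function name, the r2 >= 0.8 threshold and the toy matrix are all assumptions, and the r2 values would in practice come from a reference panel that matches the study population.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Pick tag SNPs so that every SNP has r2 >= threshold with a chosen tag.

    r2 is a symmetric list-of-lists of pairwise r2 values (hypothetical input).
    """
    untagged = set(range(len(r2)))
    tags = []
    while untagged:
        # Choose the untagged SNP that covers the most still-untagged SNPs
        # (ties go to the lowest index because candidates are sorted).
        best = max(
            sorted(untagged),
            key=lambda i: sum(1 for j in untagged if i == j or r2[i][j] >= threshold),
        )
        tags.append(best)
        untagged -= {j for j in untagged if j == best or r2[best][j] >= threshold}
    return tags


# Toy region: SNPs 0-2 form one block, SNPs 3-4 another.
toy_r2 = [
    [1.00, 0.90, 0.85, 0.10, 0.00],
    [0.90, 1.00, 0.95, 0.10, 0.10],
    [0.85, 0.95, 1.00, 0.20, 0.10],
    [0.10, 0.10, 0.20, 1.00, 0.95],
    [0.00, 0.10, 0.10, 0.95, 1.00],
]
print(greedy_tag_snps(toy_r2))  # [0, 3]
```

One tag per block is enough here, so two genotyped SNPs stand in for all five.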

Page 16: Population genetics

SNP Imputation

Howie et al. 2009 PLoS Genet; also see Marchini et al. 2007 Nat Genet

Page 17: Population genetics

SNP Imputation cont.

r2 = 0.8

Causal SNP

Howie et al. 2009 PLoS Genet; also see Marchini et al. 2007 Nat Genet

Page 18: Population genetics

Causal SNP

Howie et al. 2009 PLoS Genet; also see Marchini et al. 2007 Nat Genet

SNP Imputation cont.

Page 19: Population genetics

Why do Imputation?

Increase power to detect disease association at untyped causal SNP (imputed causal SNP may have stronger association than tag SNP)

Enable meta-analysis of studies on Affymetrix + Illumina chips

Improve genotype data quality

Imputation algorithms available

Page 20: Population genetics

Population studies

Population structure: refers to genetic differences between populations due to geographic ancestry.
• Use genome-wide data to classify genome-wide ancestry

Population admixture: mixed ancestry from multiple continental populations, e.g. African Americans, Latino Americans.
• Classify local ancestry at each location in the genome

Population stratification: refers specifically to a genotype-phenotype association study.
• Differences in genetic ancestry between cases and controls
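A small numerical sketch of why stratification matters: if cases and controls are drawn in different proportions from two populations with different allele frequencies, the pooled data show a frequency difference even when the allele has no effect within either population. All counts below are hypothetical.

```python
# Hypothetical allele counts: (copies of the allele, total allele copies).
counts = {
    "pop1": {"cases": (640, 800), "controls": (160, 200)},
    "pop2": {"cases": (40, 200), "controls": (160, 800)},
}


def freq(allele_copies, total_copies):
    """Allele frequency from a pair of counts."""
    return allele_copies / total_copies


def pooled_freq(group):
    """Allele frequency in cases or controls after pooling both populations."""
    allele_copies = sum(counts[pop][group][0] for pop in counts)
    total_copies = sum(counts[pop][group][1] for pop in counts)
    return allele_copies / total_copies


for pop, groups in counts.items():
    # Within each population, cases and controls have the same frequency.
    print(pop, freq(*groups["cases"]), freq(*groups["controls"]))

# Pooled, cases (mostly pop1) look very different from controls (mostly pop2).
print("pooled", pooled_freq("cases"), pooled_freq("controls"))
```

Within each population the frequencies are identical (0.8 and 0.2), but the pooled frequencies are 0.68 in cases and 0.32 in controls, which an association test would flag as a strong, spurious signal.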

Page 21: Population genetics

Rosenberg et al. 2010, Nat Rev Genet

Previous GWAS studies

Page 22: Population genetics

The International HapMap Project

• 270 samples from 4 populations
• Found 3.1 million SNPs

Code  Population         Location  Samples
CEU   northern European  USA       90
CHB   Chinese            China     45
JPT   Japanese           Japan     45
YRI   Yoruba             Nigeria   90

Page 23: Population genetics

HapMap3

Code  Population         Location  Samples
CEU   northern European  USA       180
CHB   Chinese            China     90
JPT   Japanese           Japan     90
YRI   Yoruba             Nigeria   180
TSI   Tuscan             Italy     90
CHD   Chinese            USA       100
LWK   Luhya              Kenya     90
MKK   Maasai             Kenya     180
ASW   African-American   USA       90
MXL   Mexican-American   USA       90
GIH   Gujarati-American  USA       90

Page 24: Population genetics

Measuring distances: FST

The FST between two populations is the value such that the allele frequency difference between the two populations has mean 0 and variance 2 FST p(1 - p), where p is the allele frequency in the ancestral population.

OR

The FST between two populations is equal to the proportion of genotypic variance in a set of N individuals from each population that is attributable to population differences.

High FST implies a high degree of differentiation
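The first definition suggests a simple way to get a feel for FST from allele frequencies. The sketch below is a naive illustration under stated assumptions: the ancestral frequency p is approximated by the average of the two populations, sampling noise from finite sample sizes is ignored, and the frequencies are made up. Published estimators correct for these effects.

```python
def fst_from_frequencies(freqs_pop1, freqs_pop2):
    """Naive FST from per-SNP allele frequencies in two populations.

    Uses Var(p1 - p2) = 2 * FST * p * (1 - p), with the ancestral frequency p
    approximated by the mean of the two populations (an assumption).
    Numerator and denominator are summed over SNPs before dividing.
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        p = (p1 + p2) / 2
        numerator += (p1 - p2) ** 2
        denominator += 2 * p * (1 - p)
    return numerator / denominator


# Hypothetical allele frequencies at three SNPs in two closely related populations.
pop1 = [0.10, 0.50, 0.30]
pop2 = [0.14, 0.44, 0.35]
fst = fst_from_frequencies(pop1, pop2)
print(round(fst, 4))
```

Frequency differences of a few percent per SNP give an FST well below 0.01, the same order of magnitude as closely related populations.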

Page 25: Population genetics

Some example FSTs

Northwest Eur. vs Southeast Eur.: FST = 0.005

Chinese vs Japanese: FST = 0.007

Yoruba (Nigeria) vs Luhya (Kenya): FST = 0.008

Page 26: Population genetics

More FST for HapMap3

HapMap3 population       Closest pop. from HapMap  FST
TSI (Tuscan)             CEU                       0.004
CHD (Chinese)            CHB                       0.001
LWK (Luhya)              YRI                       0.008
MKK (Maasai)             YRI                       0.03
ASW (African-American)   YRI                       0.01
MXL (Mexican-American)   CEU                       0.04
GIH (Gujarati-American)  CEU                       0.04

Page 27: Population genetics

Studying population structure

Can study population structure by:
• Principal Component Analysis (PCA)
• Clustering

Page 28: Population genetics

Principal Components Analysis


10 points in 1,000,000-dimensional space.

Page 29: Population genetics


Axis 1

Axis 1 is the axis explaining the maximum amount of variation.

Axes of variation (PCs, eigenvectors)

Page 30: Population genetics

Axes of variation (PCs, eigenvectors)


Axis 1

Axis 2

Page 31: Population genetics


Axis 1

Axis 2

Axis 3

Axes of variation (PCs, eigenvectors)

Page 32: Population genetics


Axis 1

Axis 2

Axis 10

Axis 9

Axis 3

Axes of variation (PCs, eigenvectors)
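To make "Axis 1 is the axis explaining the maximum amount of variation" concrete, the sketch below finds the first axis of a tiny genotype matrix by power iteration. It is a minimal illustration with made-up genotypes and an illustrative function name; real data sets with millions of markers are analyzed with optimized numerical software rather than plain loops.

```python
from math import sqrt


def first_principal_component(genotypes, iterations=200):
    """Return (axis, scores): the top axis of variation and each individual's
    coordinate along it, found by power iteration."""
    n = len(genotypes)
    m = len(genotypes[0])
    # Center each marker (column) at its mean.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]

    axis = [1.0 / (j + 1) for j in range(m)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        # Multiply by X^T X without forming it: X @ axis, then X^T @ result.
        scores = [sum(x * a for x, a in zip(row, axis)) for row in centered]
        axis = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = sqrt(sum(a * a for a in axis))
        axis = [a / norm for a in axis]
    scores = [sum(x * a for x, a in zip(row, axis)) for row in centered]
    return axis, scores


# Hypothetical minor-allele counts: 3 individuals from each of two populations,
# typed at 4 markers.
genotypes = [
    [0, 0, 2, 1],
    [0, 1, 2, 1],
    [0, 0, 2, 0],
    [2, 2, 0, 1],
    [2, 1, 0, 1],
    [2, 2, 0, 2],
]
axis, scores = first_principal_component(genotypes)
print([round(s, 2) for s in scores])
```

The first three individuals land on one side of zero along Axis 1 and the last three on the other; the overall sign of a principal component is arbitrary.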

Page 33: Population genetics

Distinguishing populations using PCA

100 markers

Page 34: Population genetics

3 million markers

Distinguishing populations using PCA

Page 35: Population genetics

PCA in Europe

Novembre et al. 2008 Nature

Page 36: Population genetics

Population structure using clustering

• Model-based clustering programs such as STRUCTURE (Pritchard et al. 2000 Genetics)

Rosenberg et al. 2002 Science

Page 37: Population genetics

More examples

[Figure: clustering results by region: Africa, Europe, Western Eurasia, East Asia, Oceania, America]

Page 38: Population genetics

Clustering versus PCA

Model-based clustering:
• Output for each individual: ancestry in N population clusters
• Fractional ancestry (20% pop1, 80% pop2) may be allowed
• Number N of population clusters must be decided in advance
• Results may be sensitive to number of population clusters

Principal components analysis (PCA):
• Output for each individual: ancestry as principal components
• PCs do not necessarily correspond to specific populations

Page 39: Population genetics

Ancestry Informative Markers

Standard approach to inferring genetic ancestry:

Genotype each individual on a GWAS chip (500,000-1,000,000 random genetic markers).

Apply model-based clustering or PCA.

OR AIM approach to inferring genetic ancestry:

Genotype each individual on a small set of 50-300 AIMs: markers that are highly informative for genetic ancestry.

Apply model-based clustering or PCA.
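One simple way to think about which markers are "highly informative" is to rank them by the absolute allele-frequency difference between two reference populations. The sketch below uses that criterion purely as an illustration; it is an assumption rather than the method behind any specific AIM panel, and the frequencies are made up.

```python
def rank_aims(freqs_pop1, freqs_pop2, top_n):
    """Return indices of the top_n markers with the largest |p1 - p2|."""
    deltas = [abs(p1 - p2) for p1, p2 in zip(freqs_pop1, freqs_pop2)]
    order = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)
    return order[:top_n]


# Hypothetical allele frequencies at four markers in two reference populations.
pop1 = [0.05, 0.50, 0.90, 0.30]
pop2 = [0.95, 0.55, 0.20, 0.35]
print(rank_aims(pop1, pop2, top_n=2))  # [0, 2]
```

Markers 0 and 2 differ strongly between the two populations, so a handful of such markers can carry much of the ancestry signal that a random marker set spreads thinly.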

Page 40: Population genetics

Working with the data

Public data available in e.g.
• HapMap
• 1000 Genomes
• dbSNP
• Ensembl, etc.

Can be retrieved and used with user-owned data

Page 41: Population genetics

The 1000 Genomes Project Aims

Characterize allele frequencies and linkage disequilibrium patterns of 95% of variants with allele frequency >1%.

Pioneer and evaluate methods for generating and analyzing data from next-generation sequencing platforms.

Sequence the entire genomes of 2,500 individuals: 500 each from Europe, East Asia, West Africa, South Asia and the Americas

Used next-generation sequencing technologies: e.g. Illumina/Solexa, 454, SOLiD (read lengths 25-400bp)

How much coverage is needed?

1000 Genomes Project Consortium 2010 Nature

Page 42: Population genetics

1000 Genomes Pilot Projects

1. Trio pilot project: Sequence 1 CEU trio (mother, father, child) and 1 YRI trio (mother, father, child) at high coverage: >40x.

2. Low-coverage pilot project: Sequence 60 unrelated CEU and 60 unrelated YRI at low coverage: about 4x.

3. Exon sequencing pilot project: Exon capture sequencing of 8,140 exons from 906 genes in 697 individuals from 7 populations.
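To get a feel for what 4x versus 40x coverage means, a common idealization treats the number of reads covering a given base as Poisson-distributed with mean equal to the average coverage. The sketch below works under that assumption only (it ignores mapping problems and sequence-dependent biases); the function names and the read count are illustrative.

```python
from math import exp, factorial


def expected_coverage(n_reads, read_length, genome_size):
    """Average number of reads covering each base."""
    return n_reads * read_length / genome_size


def prob_depth_at_least(k, depth):
    """P(a base is covered by >= k reads), assuming Poisson(depth) read counts."""
    below = sum(exp(-depth) * depth ** d / factorial(d) for d in range(k))
    return 1.0 - below


# 120 million reads of 100 bp over a 3-billion-letter genome: 4x coverage.
print(expected_coverage(120_000_000, 100, 3_000_000_000))  # 4.0

# Fraction of bases seen at least once, and at least 10 times, at 4x and 40x.
for depth in (4, 40):
    print(depth, prob_depth_at_least(1, depth), prob_depth_at_least(10, depth))
```

Under this model roughly 2% of bases receive no reads at all at 4x and few bases reach a depth of 10, whereas at 40x nearly every base does. This is why low-coverage data are usually analyzed jointly across many individuals rather than one genome at a time.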

Page 43: Population genetics

Implications of genetic diversity

Gene expression: Microarrays, RNASeq: thousands of data points

Protein abundance: Mass spectrometry: thousands of possibilities

Environmental data: socio-economic impact

Pathways and interactions: binary and directed interactions

Potential disease phenotype

Page 44: Population genetics

Applications of genetic variation

Disease association studies depend on population group & genetic diversity

Genome wide association studies (GWAS)
• >1000 cases + >1000 controls
• Identify SNPs with significantly different frequencies between the groups
• Correlate this with the disease phenotype

Pharmacogenetics
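The "significantly different frequencies" step can be sketched as a chi-square test on a 2x2 table of allele counts. This is one common single-SNP test, shown as an illustration only: the counts are hypothetical, and the 5e-8 cutoff is the conventional genome-wide threshold rather than a figure taken from these slides.

```python
from math import erfc, sqrt

GENOME_WIDE_ALPHA = 5e-8  # conventional threshold, not from the slides


def allelic_test(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square (1 d.f., no continuity correction) on allele counts.

    Returns (chi2, p_value).
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    numerator = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2
    denominator = (
        (case_alt + case_ref) * (ctrl_alt + ctrl_ref)
        * (case_alt + ctrl_alt) * (case_ref + ctrl_ref)
    )
    chi2 = numerator / denominator
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical SNP: 1000 cases and 1000 controls, so 2000 allele copies each;
# allele frequency 0.35 in cases versus 0.30 in controls.
chi2, p_value = allelic_test(700, 1300, 600, 1400)
print(round(chi2, 2), p_value, p_value < GENOME_WIDE_ALPHA)
```

A five-point frequency difference in this sample size is nominally significant but does not pass the genome-wide threshold, which is one reason GWAS need large numbers of cases and controls.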

Page 45: Population genetics

Pharmacogenetics

• Most drugs only work for 40% of people
• Most likely related to population specificity
• Some drugs cause adverse drug reactions related to SNPs in metabolizing enzymes (60%)
• IRESSA cancer drug only works in 10% of patients, but has a higher success rate in Japan
• BiDil for heart failure: only approved for African Americans
• Warfarin anticoagulant: variations caused by SNPs in CYP2C9 or VKORC1

Page 46: Population genetics

Summary

Population genetics is a large field! Used for:
• Identifying population structure for history and medical studies
• Looking for ancestry, e.g. in admixed populations
• Using LD, haplotypes and population structure in disease association studies
• Pharmacogenetics