Population genetics

Nicky Mulder

Acknowledgments: some slides are based on a course by Noah Zaitlen

Contents

• Background
• Linkage disequilibrium
• SNP tagging
• Population studies
• Accessing data

What is population genetics?

Population genetics is the study of genetic variation both within and between human populations.

• 5-7% of worldwide human genetic variation is due to genetic differences between human populations.
• The remaining 93-95% of human genetic variation is due to genetic variation within human populations (Rosenberg et al. 2002 Science).

Why study population genetics?

• Learn about human migration patterns and history
• Improve power to identify and localize disease genes
• Use differences in linkage disequilibrium for fine-mapping
• Avoid false positives due to population stratification
• Admixture mapping for diseases with varying prevalence
• Signals of natural selection at genes related to disease

Linkage Disequilibrium

Definition: Linkage Disequilibrium (LD) refers to correlations between genotypes of nearby markers.

Linkage Disequilibrium can be used for association studies.

Linkage Disequilibrium: Example

[Figure: genotypes of several individuals at SNP 1, SNP 2 and SNP 3 along the 3-billion-letter genome, shown first as nucleotides and then as 0/1 allele codes.]

• SNP 1 and SNP 2: r2 = 1, YES, in LD
• SNP 1 and SNP 3: r2 = 0, NOT in LD

r2 is the squared correlation.
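The squared correlation between two SNPs can be computed directly from genotype vectors. The following is a minimal sketch, assuming genotypes are coded as the number of copies of one allele (0, 1 or 2) per individual; the three genotype vectors are made-up illustrative values, not data from the slides.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 allele counts)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("genotype vectors must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("r2 is undefined when a SNP does not vary in the sample")
    return cov * cov / (var_x * var_y)


# Illustrative (made-up) genotypes for 8 individuals.
snp1 = [1, 0, 0, 2, 0, 1, 0, 0]
snp2 = [1, 0, 0, 2, 0, 1, 0, 0]  # always matches snp1
snp3 = [0, 2, 1, 1, 0, 1, 0, 1]  # varies independently of snp1

print("SNP 1 vs SNP 2:", r_squared(snp1, snp2))
print("SNP 1 vs SNP 3:", r_squared(snp1, snp3))
```

In this toy data the first pair is perfectly correlated (in LD) and the second pair is uncorrelated (not in LD).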

LD: Haplotype Blocks

[Figure: 0/1 genotypes of several individuals at SNP 1, SNP 2 and SNP 3.]

These 3 SNPs form a “haplotype block”

Haplotype blocks

• Variants (alleles) that are located close to one another are inherited together (in LD)
• This pattern is disrupted by recombination (shuffles chromosomes)
• Recombination is finite and time since origin of humans insufficient to break down all linkage => haplotype blocks
• African chromosomes: 50% of the genome lies in haplotype blocks >22kb.
• Europeans and Asians: 50% of the genome lies in haplotype blocks >44kb.
• Longer haplotype blocks in Europeans/Asians due to out-of-Africa population bottleneck: descended from small number of ancestors who left Africa 60-40 kya.

Gabriel et al. 2002 Science (also see Reich 2001 Nature, Daly 2001 Nat Genet)

Population bottlenecks

Cavalli-Sforza & Feldman 2003 Nat Genet; also see Ramachandran et al. 2005 PNAS, Green et al. 2010 Science (Neanderthal genome), Reich et al. 2010 Nature (Denisova)

Linkage Disequilibrium and tag SNPs

[Figure: 0/1 genotypes of cases and controls along the 3-billion-letter genome, with SNP 1 (the causal SNP) and SNP 2 marked. SNP 1 and SNP 2: r2 = 1, in LD.]

• Direct association: genotype SNP1 in Cases and Controls.
• Indirect association: genotype SNP2 in Cases and Controls. If SNP1 affects disease risk, then SNP2 will also be associated!

LD: Haplotype Blocks

[Figure: haplotypes of cases and controls, with the risk haplotype marked.]

Question: Which SNP to genotype?

Answer: Choose 1 SNP per haplotype block, and take advantage of indirect association! Use known resources.

HapMap for “SNP tagging”

How to select SNPs to genotype in an association study:
• Choose genomic region(s) of interest.
• Look up HapMap SNPs in the genomic region(s).
• Choose a subset of HapMap SNPs which “tag” haplotype blocks in the genomic region(s).

Note: because LD patterns vary by population, it is important to choose tag SNPs using a HapMap population similar to the population in the association study
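The tagging step can be sketched as a greedy selection: repeatedly pick the SNP that is in high LD with the most SNPs not yet tagged. This is only an illustrative sketch, not the procedure used by HapMap tools; the SNP names, the genotype vectors and the r2 threshold of 0.8 are all assumptions made for the example.

```python
def r_squared(x, y):
    """Squared correlation between two genotype vectors; 0.0 if either SNP does not vary."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov * cov / (var_x * var_y)


def pick_tag_snps(genotypes, threshold=0.8):
    """Greedy tagging: each chosen SNP covers itself and every SNP with r2 >= threshold."""
    names = list(genotypes)
    covers = {
        a: {a} | {b for b in names
                  if r_squared(genotypes[a], genotypes[b]) >= threshold}
        for a in names
    }
    untagged = set(names)
    tags = []
    while untagged:
        # Choose the SNP that covers the largest number of still-untagged SNPs.
        best = max(names, key=lambda s: len(covers[s] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


# Hypothetical region with two haplotype blocks (made-up genotypes, 8 individuals).
region = {
    "snp_a": [1, 0, 0, 2, 0, 1, 0, 0],
    "snp_b": [1, 0, 0, 2, 0, 1, 0, 0],
    "snp_c": [1, 0, 0, 2, 0, 1, 0, 0],
    "snp_d": [0, 2, 1, 1, 0, 1, 0, 1],
    "snp_e": [0, 2, 1, 1, 0, 1, 0, 1],
}
print(pick_tag_snps(region))
```

With these toy genotypes one SNP is chosen per block, so only two of the five SNPs need to be genotyped.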

SNP Imputation

Howie et al. 2009 PLoS Genet; also see Marchini et al. 2007 Nat Genet

SNP Imputation cont.

[Figure labels: causal SNP; r2 = 0.8]

SNP Imputation cont.

Why do imputation?

• Increase power to detect disease association at untyped causal SNP (imputed causal SNP may have stronger association than tag SNP)
• Enable meta-analysis of studies on Affymetrix + Illumina chips
• Improve genotype data quality

Imputation algorithms available

Population studies

• Population structure: refers to genetic differences between populations due to geographic ancestry. Use genome-wide data to classify genome-wide ancestry.
• Population admixture: mixed ancestry from multiple continental populations, e.g. African Americans, Latino Americans. Classify local ancestry at each location in the genome.
• Population stratification: refers specifically to a genotype-phenotype association study. Differences in genetic ancestry between cases and controls.

Rosenberg et al. 2010, Nat Rev Genet

Previous GWAS studies

The International HapMap Project
• 270 samples from 4 populations
• Found 3.1 million SNPs

Code  Population         Location  Samples
CEU   northern European  USA       90
CHB   Chinese            China     45
JPT   Japanese           Japan     45
YRI   Yoruba             Nigeria   90

HapMap3

Code  Population         Location  Samples
CEU   northern European  USA       180
CHB   Chinese            China     90
JPT   Japanese           Japan     90
YRI   Yoruba             Nigeria   180
TSI   Tuscan             Italy     90
CHD   Chinese            USA       100
LWK   Luhya              Kenya     90
MKK   Maasai             Kenya     180
ASW   African-American   USA       90
MXL   Mexican-American   USA       90
GIH   Gujarati-American  USA       90

Measuring distances: FST

The FST between two populations is the value such that the allele frequency difference between the two populations has mean 0 and variance 2 FST p(1 - p), where p is the allele frequency in the ancestral population.

OR

The FST between two populations is equal to the proportion of genotypic variance in a set of N individuals from each population that is attributable to population differences.

High FST implies a high degree of differentiation
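The first definition suggests a simple moment estimate: since the squared frequency difference has expectation 2 FST p(1 - p), summing over markers gives FST ≈ Σ(p1 - p2)^2 / Σ 2p(1 - p). The sketch below makes two loudly stated assumptions: the unknown ancestral frequency p is replaced by the average of the two population frequencies, and the frequencies are treated as known exactly (estimators used in practice also correct for finite sample size). The frequencies are made-up values.

```python
def fst_from_frequencies(freqs_pop1, freqs_pop2):
    """Ratio-of-sums FST estimate from per-marker allele frequencies in two populations."""
    if len(freqs_pop1) != len(freqs_pop2):
        raise ValueError("both populations need a frequency for every marker")
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        # Assumption: the mean of the two frequencies stands in for the ancestral p.
        p_bar = (p1 + p2) / 2
        numerator += (p1 - p2) ** 2
        denominator += 2 * p_bar * (1 - p_bar)
    if denominator == 0:
        raise ValueError("no polymorphic markers")
    return numerator / denominator


# Illustrative (made-up) allele frequencies at three markers.
pop1 = [0.10, 0.50, 0.30]
pop2 = [0.12, 0.46, 0.33]
print(fst_from_frequencies(pop1, pop2))
```

Small frequency differences like these give an FST of a fraction of a percent; identical frequencies give exactly 0.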

Some example FSTs

Population 1      Population 2    FST
Northwest Eur.    Southeast Eur.  0.005
Chinese           Japanese        0.007
Yoruba (Nigeria)  Luhya (Kenya)   0.008

More FST for HapMap3

HapMap3 population       Closest pop. from HapMap  FST
TSI (Tuscan)             CEU                       0.004
CHD (Chinese)            CHB                       0.001
LWK (Luhya)              YRI                       0.008
MKK (Maasai)             YRI                       0.03
ASW (African-American)   YRI                       0.01
MXL (Mexican-American)   CEU                       0.04
GIH (Gujarati-American)  CEU                       0.04

Studying population structure

Can study population structure by:
• Principal Component Analysis (PCA)
• Clustering

Principal Components Analysis

[Figure: 10 points in 1,000,000-dimensional space, with axes drawn through them one at a time: Axis 1, Axis 2, Axis 3, …, Axis 9, Axis 10.]

Axis 1 is the axis explaining the maximum amount of variation.

Axes of variation (PCs, eigenvectors)
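To make the idea of Axis 1 concrete, the sketch below finds the first axis of variation for a small genotype matrix, treating each individual as a point in marker space. It uses plain power iteration on the individual-by-individual matrix of centred genotypes so that it runs with the standard library only; real analyses use dedicated software and far more markers. The genotype matrix, iteration count and seed are assumptions chosen for illustration.

```python
import random
from math import sqrt


def first_pc_scores(genotypes, iterations=200, seed=0):
    """Return values proportional to each individual's coordinate on Axis 1.

    genotypes: list of rows, one per individual, each a list of 0/1/2 marker values.
    The sign of the returned axis is arbitrary.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Centre each marker so that the axis describes variation around the mean.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # n x n matrix of dot products between individuals.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centred] for ri in centred]
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in range(n)]
    for _ in range(iterations):
        # Power iteration converges to the leading eigenvector of the matrix.
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("genotypes show no variation")
        v = [x / norm for x in w]
    return v


# Made-up data: individuals 0-2 and 3-5 come from two different groups.
toy = [
    [0, 0, 1, 2, 2, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 0, 1, 2, 1, 0],
    [2, 2, 1, 0, 0, 0],
    [2, 1, 1, 0, 0, 1],
    [2, 2, 1, 0, 1, 0],
]
scores = first_pc_scores(toy)
print([round(s, 2) for s in scores])
```

In this toy example the two groups fall on opposite sides of zero along Axis 1, which is the sense in which PCA can distinguish populations.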


Distinguishing populations using PCA

[Figure: PCA plots computed from 100 markers and from 3 million markers.]

PCA in Europe

Novembre et al. 2008 Nature

Population structure using clustering

• Model-based clustering programs such as STRUCTURE (Pritchard et al. 2000 Genetics)

Rosenberg et al. 2002 Science

More examples

[Figure labels: Africa, Europe, Western Eurasia, East Asia, Oceania, America]

Clustering versus PCA

Model-based clustering:
• Output for each individual: ancestry in N population clusters
• Fractional ancestry (20% pop1, 80% pop2) may be allowed
• Number N of population clusters must be decided in advance
• Results may be sensitive to number of population clusters

Principal components analysis (PCA):
• Output for each individual: ancestry as principal components
• PCs do not necessarily correspond to specific populations

Ancestry Informative Markers

Standard approach to inferring genetic ancestry:
• Genotype each individual on a GWAS chip (500,000-1,000,000 random genetic markers).
• Apply model-based clustering or PCA.

OR

AIM approach to inferring genetic ancestry:
• Genotype each individual on a small set of 50-300 AIMs: markers that are highly informative for genetic ancestry.
• Apply model-based clustering or PCA.
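One simple way to decide which markers are "highly informative" is to rank them by how much their allele frequency differs between two reference populations. The sketch below does exactly that; it is an assumption about how AIMs might be chosen, not the method behind any particular AIM panel, and the marker names and frequencies are made up.

```python
def select_aims(freqs_pop1, freqs_pop2, k):
    """Return the k markers with the largest absolute allele-frequency difference.

    freqs_pop1, freqs_pop2: dicts mapping marker name -> allele frequency.
    Ties are broken alphabetically so that the result is deterministic.
    """
    shared = set(freqs_pop1) & set(freqs_pop2)
    ranked = sorted(
        shared,
        key=lambda marker: (-abs(freqs_pop1[marker] - freqs_pop2[marker]), marker),
    )
    return ranked[:k]


# Hypothetical reference allele frequencies in two populations.
pop1 = {"m1": 0.10, "m2": 0.50, "m3": 0.90, "m4": 0.30}
pop2 = {"m1": 0.15, "m2": 0.05, "m3": 0.20, "m4": 0.35}
print(select_aims(pop1, pop2, k=2))
```

Markers with nearly equal frequencies in both populations (here m1 and m4) say little about ancestry and are left out.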

Working with the data

Public data available in e.g.
• HapMap
• 1000 Genomes
• dbSNP
• Ensembl, etc.

Can be retrieved and used with user-owned data

The 1000 Genomes Project

Aims:
• Characterize allele frequencies and linkage disequilibrium patterns of 95% of variants with allele frequency >1%.
• Pioneer and evaluate methods for generating and analyzing data from next-generation sequencing platforms.
• Sequence the entire genomes of 2,500 individuals: 500 each from Europe, East Asia, West Africa, South Asia and the Americas
• Used next-generation sequencing technologies: e.g. Illumina/Solexa, 454, SOLiD (read lengths 25-400bp)

How much coverage is needed?

1000 Genomes Project Consortium 2010 Nature

1000 Genomes Pilot Projects

1. Trio pilot project: Sequence 1 CEU trio (mother, father, child) and 1 YRI trio (mother, father, child) at high coverage: >40x.

2. Low-coverage pilot project: Sequence 60 unrelated CEU and 60 unrelated YRI at low coverage: about 4x.

3. Exon sequencing pilot project: Exon capture sequencing of 8,140 exons from 906 genes in 697 individuals from 7 populations.
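The gap between the >40x and 4x pilots can be illustrated with a back-of-the-envelope calculation: how likely is it that both alleles at a heterozygous site are each seen in at least a couple of reads? The sketch assumes the number of reads covering each allele is Poisson with mean equal to half the coverage, independently for the two alleles, and that there are no sequencing errors; the minimum of 2 reads per allele is an arbitrary illustrative choice.

```python
from math import exp, factorial


def poisson_tail(lam, k):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - sum(exp(-lam) * lam ** i / factorial(i) for i in range(k))


def prob_both_alleles_seen(coverage, min_reads=2):
    """Chance that each allele at a heterozygous site is covered by >= min_reads reads.

    Assumes reads per allele are independent Poisson(coverage / 2) counts
    and ignores sequencing errors.
    """
    per_allele = poisson_tail(coverage / 2.0, min_reads)
    return per_allele ** 2


for coverage in (4, 40):
    print(f"{coverage}x coverage: {prob_both_alleles_seen(coverage):.3f}")
```

Under these assumptions a single 4x genome misses many heterozygous sites, while at 40x almost every such site is seen clearly.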

Implications of genetic diversity

Gene expression: microarrays, RNA-Seq (thousands of data points)

Protein abundance: mass spectrometry (thousands of possibilities)

Environmental data: socio-economic impact

Pathways and interactions: binary and directed interactions

Potential disease phenotype

Applications of genetic variation

Disease association studies depend on population group & genetic diversity

Genome-wide association studies (GWAS):
• >1000 cases + >1000 controls
• Identify SNPs with significantly different frequencies between the groups
• Correlate this with the disease phenotype
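Testing whether a single SNP has "significantly different frequencies between the groups" can be sketched as a chi-square test on a 2x2 table of allele counts (cases vs controls, one allele vs the other). This is a minimal illustration of one common test, not a full GWAS pipeline; the genotype counts are made up, and a real study would repeat the test for every SNP and correct for multiple testing.

```python
from math import erfc, sqrt


def allelic_test(case_genotypes, control_genotypes):
    """Chi-square test (1 degree of freedom) comparing allele counts in cases and controls.

    Genotypes are coded as copies of the tested allele (0, 1 or 2) per individual.
    Returns (chi-square statistic, p-value).
    """
    def allele_counts(genotypes):
        tested = sum(genotypes)
        return tested, 2 * len(genotypes) - tested

    a, b = allele_counts(case_genotypes)      # cases: tested allele, other allele
    c, d = allele_counts(control_genotypes)   # controls: tested allele, other allele
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("the SNP does not vary or a group is empty")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Made-up genotype counts: 100 cases and 100 controls.
cases = [2] * 10 + [1] * 40 + [0] * 50
controls = [2] * 5 + [1] * 30 + [0] * 65
chi2, p_value = allelic_test(cases, controls)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
```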

Pharmacogenetics

• Most drugs only work for 40% of people
• Most likely related to population specificity
• Some drugs cause adverse drug reactions related to SNPs in metabolizing enzymes (60%)
• IRESSA cancer drug only works in 10% of patients, but has a higher success rate in Japan
• BiDil for heart failure: only approved for African Americans
• Warfarin anticoagulant: variations caused by SNPs in CYP2C9 or VKORC1

Summary

Population genetics is a large field! Used for:
• Identifying population structure for history and medical studies
• Looking for ancestry, e.g. in admixed populations
• Using LD, haplotypes and population structure in disease association studies
• Pharmacogenetics
