genetic variation is meaningful only in the context of a population

genetic variation is meaningfulonly in the context of a population

the minor allele frequency f says how often a particular allele (variant) occurs at a particular site in a given population; by definition it is < 0.5

ccagtcagagAtgtgcacatggcttagttttcatacaGagcctgggctgggggtggggtgccagtcagagAtgtgcacatggcttagttttcatacacagcctgggctgggggtggggtgccagtcagagttgtgcacatggcttagttttcatacacagcctgggctgggggtggggtgccagtcagagAtgtgcacatggcttagttttcatacacagcctgggctgggggtggggtgccagtcagagAtgtgcacatggcttagttttcatacacagcctgggctgggggtggggtgccagtcagagttgtgcacatggcttagttttcatacacagcctgggctCggggtggggtgccagtcagagttgtgcacatggcttagttttcatacacagcctgggctgggggtggggtgccagtcagagttgtgcacatgTcttagttttcatacacagcctgggctgggggtggggtgccagtcagagttgtgcacatggcttagttttcatacaGagcctgggctgggggtggggtgccagtcagagttgtgcacatggcttagttttcatacacagcctgggctgggggtggggtg

f = 4/10 f = 1/10 f = 2/10 f = 1/10

indi

vidu

als

1-10

the polymorphisms most analyzed are:

single nucleotide polymorphisms (SNPs) replace one bp with another but do not change lengths

1 SNP per 1000 bp between any two individuals; almost every bp is variable when we consider the world population

SNPs are essentially always bi-allelic; not because tri-allelics are impossible, just highly unlikely

other polymorphism categories include:

small insertions-deletions (indels) below reads length

large structural variations consisting of insertions and deletions and inversions above reads length

SNPs approximate 1/f distribution

this is the observed frequency distribution from the complete sequencing a large population; however many SNP discovery projects sequence a small population

and then consider the absence or presence of those previously discovered SNPs in a large population; this is known to underestimate the number of rare

variants

0 0.1 0.2 0.3 0.4 0.50

250

500

750

1000

f (minor allele)

# of

SNP

s

# of SNPs found = 1541

NonSyn Synon 5'-UTR 3'-UTR Frame Splice 5'-Flank3'-FlankIntron

population specific SNPs arefound at lower f than shared ones

minor allele frequencies classified by occurrence within individuals of either African or European descent (population specific) or presence in both (shared)

as shown by Halushka MK, …, Chakravarti A. 1999. Nat Genet 22: 239-347

GENOTYPE

first we identify variant sites by sequencing a small number of individuals; then we test (i.e. genotype) only those variant sites to determine which alleles are present

RE-SEQUENCE [inconsistently used terminology]

generate low coverage (2x) sequence from one individual and compare that data against the reference genome

DE NOVO

generate high coverage (50x) sequence from one individual and perform de novo assembly of that genome without making use of the existing reference genome

growth in public databases for “common” human polymorphisms

27 October 2005 one million SNPs genotyped in 269 individuals from

four diverse populations

28 October 2010 15 million SNPs, one million short insertion-deletion,

20000 structural variants genotyped in up to 697

individuals from 7 diverse populations worldwide

18 October 2007 3.1 million SNPs genotyped in 270 individuals from

four diverse populations

structural variations detected by fosmid end-sequence pairs (ESPs)

the fosmid cloning system generates an exceptionally narrow distribution of clone insert sizes, 40 ± 2.8 kb; each of these fosmid clones is sequenced from both ends, creating an end-sequence pair with two 500 bp sequence reads separated by a known distance in the test genome from which the fosmid clone was made; insertions-deletions-inversions are detected by computationally aligning end-sequence pairs to the reference genome

10 kb deletion relative to reference

50 kb on REF

REF genome

test genome

40 kb on test

deleted 10 kb

arrows indicate direction of

sequence read

structural variations follow a1/f distribution just like SNPs

15% (261 of 1,695) of discovered sites represent the more common configuration than the reference human genome

JM Kidd, et al. 2008. Mapping and sequencing of structural variation from eight human genomes. Nature 453: 56-64

human pan-genome: non-redundant collection of sequences found across the entire world’s human population

de novo assembly of individual genome reveals ~5 Mb of novel sequence not present in reference genome; complete human pan-genome contains 19~40 Mb of novel sequence not present in reference genome

R Li, et al. 2010. Building the sequence map of the human pan-genome. Nat Biotechnol 28: 57-63.

Lewontin’s (in)famous paper on non-existence of “race” in genetics

Lewontin RC. 1972. "The apportionment of human diversity“, in Evolutionary Biology 6: 391-398

most of the variations (85%) found in human populations is found within local geographic groups and any differences attributable to race groups is just a small fraction of human genetic variability (15%); race is an invalid taxonomic construct because the probability of a racial misclassification is approximately 30% based on a single genetic locus

Edwards AW. 2003. Human genetic diversity: Lewontin's fallacy. Bioessays 25: 798-801

even if the probability of misclassifying an individual’s race based on a single locus is as high as 30%, the misclassification probability based on 10 loci can drop to a few percent

Structure clustering of genotype data

Rosenberg NA, …, Feldman MW. 2002. Science 298: 2381-2385

This analysis is based on 377 microsatellites in 1056 individuals from 52 populations. Variations within populations account for 93 to 95% of the data. Nevertheless we can identify clusters that are consistent with known populations. K is chosen in advance. For any given K, each individual is represented by a thin vertical line, which is partitioned into K colored segments indicating the individual’s estimated membership in the preordained K clusters.

Africa AsiaEurope

science does notdictate public policy

science can and should inform policy but that is never the only consideration, and in the meantime there are better (or at least more fun) things to do

with the decreasing cost of sequencing, the age of personal genomics is fast approaching; we need not limit ourselves to sequencing live individuals

human genome sequencefrom an extinct Palaeo-Eskimo

evidence of migration from Siberia into the New World some 5,500 years ago independent of migrations giving rise to modern Native Americans and Inuits

M Rasmussen, et al. Feb 2010. Nature 463: 757-762.

Kennewick Man is but the tip of the iceberg for the New World Entrada controversy. Who occupied the American continents first and where did they come from? These questions are intricately connected with the rights of indigenous Native Americans. Sequencing of a pre-Clovis genome over 11,500 years old would rattle the field.

personal genomes for $199

Anne Wojcicki and Sergey Brin

genetic variation is meaningful only in the context of a population

Documents

Transcript of genetic variation is meaningful only in the context of a population