The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.
Date posted: 19-Dec-2015
![Page 1: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/1.jpg)
The 1000 Genomes Project
Gil McVean, Department of Statistics, Oxford
![Page 2: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/2.jpg)
What is the 1000 Genomes Project?
• A catalogue of all types of genetic variation, including rare variants (c. 1% frequency) obtained by sequencing at least 1000 individuals from geographic centres of major medical genetics interest
• A large international collaboration: UK, USA, China, Germany
• An exploration of the use of next-generation technologies for population-scale genome sequencing
• A resource for accelerating the rate of identifying disease mechanisms in the follow-up to disease-association studies
![Page 3: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/3.jpg)
Samples for the main project
[Figure: map of sampled populations. Labels as extracted: UK, FIN, TSI, ESP, CEU, JPT, CHB, CHS, DAI, KVT, GMB, GHN, YRI, MLW, LWK, MXL, ASW, new CMB, PRO]
Major population groups comprised of subpopulations of c. 100 each
![Page 4: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/4.jpg)
Population-scale genome sequencing
[Figure labels: Haplotypes, 2x, 10x]
![Page 5: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/5.jpg)
Pilot experiments
• Pilot 1: Low-coverage (2x-4x) on 60 unrelated individuals from each of CEU, YRI and CHB+JPT
• Pilot 2: High-coverage (20x diploid) on 2 trios (one from CEU, one from YRI)
• Pilot 3: Exons from 1000 genes to 20x in c. 1000 samples (largely European)
Complete!
![Page 6: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/6.jpg)
The 1000G Low Coverage Pilot
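• 185 individuals from 4 populations: CEU (63), CHB (30), JPT (30), YRI (62)

| Population | Technology | N Individuals | Mapped Bases (billions) | Mean Coverage / Individual |
|---|---|---|---|---|
| CEU | SLX | 52 | 482 | 3.09 |
| CEU | SOLiD | 30 | 240 | 2.66 |
| CEU | 454 | 18 | 132 | 2.45 |
| CHB | SLX | 30 | 234 | 2.60 |
| JPT | SLX | 28 | 227 | 2.70 |
| JPT | 454 | 2 | 9.6 | 1.60 |
| YRI | SLX | 60 | 594 | 3.30 |
| YRI | SOLiD | 5 | 20.6 | 1.38 |
| YRI | 454 | 2 | 10.8 | 1.80 |
| Combined | | 185 | 1,884 | 3.52 |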
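
The per-technology coverage figures in this table appear to follow from mapped bases divided by the number of individuals times the genome size. The sketch below checks that arithmetic for the per-technology rows only. The 3.0 Gb genome size is an assumption (the slides do not state the value used), and the function and variable names are illustrative.

```python
# Assumed haploid genome size in gigabases; not stated on the slide.
GENOME_SIZE_GB = 3.0

# (population, technology, N individuals, mapped bases in billions, reported mean coverage)
ROWS = [
    ("CEU", "SLX", 52, 482.0, 3.09),
    ("CEU", "SOLiD", 30, 240.0, 2.66),
    ("CEU", "454", 18, 132.0, 2.45),
    ("CHB", "SLX", 30, 234.0, 2.60),
    ("JPT", "SLX", 28, 227.0, 2.70),
    ("JPT", "454", 2, 9.6, 1.60),
    ("YRI", "SLX", 60, 594.0, 3.30),
    ("YRI", "SOLiD", 5, 20.6, 1.38),
    ("YRI", "454", 2, 10.8, 1.80),
]


def mean_coverage(mapped_gb, n_individuals, genome_size_gb=GENOME_SIZE_GB):
    """Mean per-individual coverage: total mapped bases spread over N genomes."""
    return mapped_gb / (n_individuals * genome_size_gb)


for pop, tech, n, mapped, reported in ROWS:
    computed = mean_coverage(mapped, n)
    print(f"{pop:4s} {tech:6s} computed={computed:.2f} reported={reported:.2f}")
```

Under the 3.0 Gb assumption each computed value lands within about 0.01 of the reported one, which suggests the small differences are rounding.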
![Page 7: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/7.jpg)
Even so, a lot of data isn’t much
• In the Pilot 1 sample, 1 tera-basepair leaves the CEU with…
– 6% of genotypes with 0 reads
– 16% of genotypes with < 2 reads
– 29% of genotypes with < 3 reads
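
One way to see why low coverage leaves many genotypes thinly covered: under an idealised model in which reads land uniformly at random, the depth at a site is roughly Poisson distributed. The sketch below assumes that model and an illustrative mean depth of 3x. Neither assumption comes from the slides, and real sequencing depth is more variable, so the output is not expected to reproduce the empirical percentages above.

```python
import math


def fraction_below(k, mean_depth):
    """P(depth < k) when depth ~ Poisson(mean_depth). Idealised model, an assumption."""
    return sum(
        math.exp(-mean_depth) * mean_depth**i / math.factorial(i)
        for i in range(k)
    )


MEAN_DEPTH = 3.0  # illustrative value, not taken from the slides

for k in (1, 2, 3):
    share = fraction_below(k, MEAN_DEPTH)
    print(f"fraction of sites with < {k} reads at {MEAN_DEPTH}x: {share:.3f}")
```

Even in this simplified picture, a sizeable share of sites has too few reads to call a genotype from that individual's data alone.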
![Page 8: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/8.jpg)
ftp.1000genomes.ebi.ac.uk
www.1000genomes.org
Pilot release expected Nov/Dec 2009
ftp-trace.ncbi.nih.gov/1000genomes/ftp
![Page 9: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/9.jpg)
![Page 10: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/10.jpg)
What has the project already generated?
![Page 11: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/11.jpg)
Over 9 million novel SNPs
• Total 17.2 M SNPs called
• Previously ~12M SNPs “known” (dbSNP 129)
– 7.9M confirmed
– 9.2M novel
[Figure: total SNPs and novel SNPs across CEU, YRI and CHB+JPT. Values as extracted, total SNPs: 4.84, 1.09, 0.78, 0.48, 2.80, 5.65, 1.54; novel SNPs: 0.50, 0.38, 0.29, 0.26, 2.20, 4.38, 1.35]
Le Quang
![Page 12: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/12.jpg)
A near complete record of common SNPs
Durbin, Le Quang
![Page 13: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/13.jpg)
A set of accurate genotypes
[Figure: values on an axis from 0.88 to 1.00 for HomRef, Het, HomNonRef and Average, shown for CEU, JPT, CHB and YRI]
• This is about where simulations suggest we should be with 2-4x on 60 samples
• Note that this quality is much better than if calls were made marginally
Durbin, Le Quang
![Page 14: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/14.jpg)
Many novel indels and larger structural variants
![Page 15: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/15.jpg)
[Figure: bar chart with axis from 0 to 3000. Categories: Ref-free insertions, Ref-assisted insertions, Ref-free deletions, Ref-assisted deletions. Legend text as extracted: Calls, >50bp, DGV, 1000g release]
Zam Iqbal
Up to 50kb
Novel sequence from de novo assembly
![Page 16: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/16.jpg)
Some interesting biology - variation in SNP density
![Page 17: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/17.jpg)
Some more interesting biology – high Fst SNPs
Ryan Hernandez, Adam Auton
![Page 18: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/18.jpg)
Even more interesting biology – loss of function mutations
Daniel MacArthur
![Page 19: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/19.jpg)
A robust and modular pipeline for analysis of population-scale sequence data
![Page 20: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/20.jpg)
An efficient format for storing aligned reads and a set of tools to manipulate and view the files
• SAM/BAM format for storing (aligned) reads
Bioinformatics (2009) http://samtools.sourceforge.net
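
As a rough illustration of what a SAM record holds, the sketch below parses one text line into its 11 mandatory tab-separated fields and decodes two flag bits. It uses only the Python standard library rather than samtools. The record, the `SamRecord` type and the helper names are made up for this example.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM fields, in file order."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line):
    """Parse one alignment line; optional tag fields after column 11 are ignored."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def is_unmapped(rec):
    return bool(rec.flag & 0x4)   # flag bit 0x4: read is unmapped


def is_reverse(rec):
    return bool(rec.flag & 0x10)  # flag bit 0x10: read maps to the reverse strand


# Hypothetical record, for illustration only.
example = "read1\t16\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
rec = parse_sam_line(example)
print(rec.rname, rec.pos, rec.cigar, is_reverse(rec), is_unmapped(rec))
```

BAM is the compressed binary form of the same records, so it would be read through samtools or a library rather than split as text.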
![Page 21: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/21.jpg)
An information-rich format for storing generic haplotype/genotype data and tools for manipulating the files
www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2
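
To give a feel for the kind of record such a format stores, the sketch below parses one tab-separated variant line into site fields and per-sample genotypes. The column layout (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample) follows the general VCF convention. Details of the v3.2 draft linked above may differ, so treat this as an assumption. The sample names and the record are hypothetical.

```python
import re


def parse_vcf_line(line, sample_names):
    """Parse one VCF data line into site fields plus per-sample alt-allele counts."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]
    keys = fmt.split(":")
    dosages = {}
    for name, column in zip(sample_names, fields[9:]):
        values = dict(zip(keys, column.split(":")))
        alleles = re.split(r"[/|]", values.get("GT", "."))
        # Count non-reference alleles; None marks a missing genotype.
        dosages[name] = None if "." in alleles else sum(a != "0" for a in alleles)
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
            "dosages": dosages}


# Hypothetical record and sample names, for illustration only.
line = "1\t12345\t.\tA\tG\t50\tPASS\tDP=14\tGT:DP\t0|1:7\t1|1:7\t.:0"
site = parse_vcf_line(line, ["sample1", "sample2", "sample3"])
print(site["chrom"], site["pos"], site["dosages"])
```

The 0/1/2 alt-allele count used here is the same genotype coding that appears in the imputation slide later in the deck.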
![Page 22: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/22.jpg)
Using the 1000G data now
![Page 23: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/23.jpg)
IMPUTE
Reference panel (1000G):
… 11101010101011 …
… 00111110000111 …
… 11110000011101 …
… 00101011100101 …
Genotypes in additional samples from standard product:
… 1.2..1.0.0..22 …
Imputation
Imputed genotypes:
… 11220110200122 …
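
The idea in the diagram can be sketched with a toy example: find the pair of reference haplotypes that best explains the typed genotypes, then fill in the untyped sites from that pair. This is a deliberately simplified stand-in and not the method IMPUTE uses, which models each sample as a mosaic of reference haplotypes. The haplotypes, genotypes and function name below are invented for illustration.

```python
from itertools import combinations_with_replacement


def impute_from_best_pair(reference_haplotypes, observed):
    """Fill '.' sites in `observed` (0/1/2 genotypes) from the best-matching haplotype pair."""
    best_mismatches, best_dosage = None, None
    for h1, h2 in combinations_with_replacement(reference_haplotypes, 2):
        # Genotype implied by this pair: alt-allele count at each site.
        dosage = [int(a) + int(b) for a, b in zip(h1, h2)]
        mismatches = sum(
            1 for d, o in zip(dosage, observed) if o != "." and int(o) != d
        )
        if best_mismatches is None or mismatches < best_mismatches:
            best_mismatches, best_dosage = mismatches, dosage
    # Keep typed genotypes as observed; only untyped sites are filled in.
    return "".join(
        str(d) if o == "." else o for d, o in zip(best_dosage, observed)
    )


# Toy data, not taken from the slide.
panel = ["010110", "110001", "001101"]
typed = "0.12.1"
print(impute_from_best_pair(panel, typed))
```

With this toy panel the first and third haplotypes match every typed site, so the two missing sites are filled from their sum.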
![Page 24: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/24.jpg)
Imputation performance across SNP types from P1 (CEU) from Affy 500k

| Annotation | # SNPs | Info measure |
|---|---|---|
| All | 414,321 | 0.780 |
| MAF < 5% | 102,000 | 0.543 |
| MAF > 5% | 312,321 | 0.857 |
| UCSC Genes | 6,628 | 0.736 |
| Depth < 100 | 3,153 (0.7%) | 0.611 |
| SimpRpts | 25,625 | 0.607 |
| SimpRpts + Depth < 100 | 1,652 (6.5%) | 0.671 |
| SegDups | 24,301 | 0.686 |
| SegDups + Depth < 100 | 665 (2.7%) | 0.388 |
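
A quick consistency check on this table: the two MAF strata partition the full SNP set, so if each info measure is a per-SNP mean (an assumption, since the slide does not say how the values were summarised), the count-weighted average of the two strata should reproduce the "All" row. The sketch below does that arithmetic with the table's own numbers.

```python
# (number of SNPs, info measure) for the two MAF strata, copied from the table.
strata = {
    "MAF < 5%": (102_000, 0.543),
    "MAF > 5%": (312_321, 0.857),
}

total_snps = sum(n for n, _ in strata.values())
weighted_info = sum(n * info for n, info in strata.values()) / total_snps

print(f"total SNPs: {total_snps:,}")
print(f"count-weighted info: {weighted_info:.3f}")
```

The strata sum to the 414,321 SNPs in the "All" row and the weighted value rounds to the reported 0.780, so the overall figure is dominated by the well-imputed common SNPs while the rarer ones fare notably worse.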
Jonathan Marchini
![Page 25: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/25.jpg)
Looking forward...
• Already have data generated for c. 200 more Europeans
– Data generation largely complete by mid 2010
• Much work still to be done on accurate inference of all types of variation from NGS data
• Data already proven useful for a number of projects – please use it
![Page 26: The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.](https://reader038.fdocuments.net/reader038/viewer/2022110322/56649d2f5503460f94a06e24/html5/thumbnails/26.jpg)
Thanks to the many...