1000 Genomes Project: Datasets
Gabor Marth
Boston College Biology Department
1000 Genomes Project Tutorial
ASHG 2010, Washington, DC
November 3, 2010
3 pilot coverage strategies
Samples

| Population | YRI | LWK | CHB | CHD | JPT | CEU | TSI | All |
|---|---|---|---|---|---|---|---|---|
| Samples | 112 | 108 | 109 | 107 | 105 | 90 | 66 | 697 |
Pilot datasets

| Pilot | Populations | Samples | Coverage |
|---|---|---|---|
| Trios | 2 | 6 | 20-40x |
| Low coverage | 4 | 179 | 2-4x |
| Exon pilot | 7 | 697 | 20-50x |
Data processing / variant calling
(i) base calling
(ii) read mapping
(iii) SNP and short INDEL calling
(iv) SV calling
[Figure labels: REF, IND]
More than 80% of the genome accessible with short reads
[Figure labels: >10% non-unique mapping; depth too high; no coverage; M & D]
PCR-duplicate reads
Locally misaligned bases
Un-calibrated base quality values
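
One way to make "un-calibrated" concrete is to compare the quality value a base caller reports with the quality implied by the observed mismatch rate. The sketch below uses only the standard Phred definition, Q = -10 log10(p). The function names and the mismatch counts are hypothetical and are not taken from the project's recalibration procedure.

```python
import math

def phred_to_error(q):
    """Convert a Phred quality value Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def empirical_quality(mismatches, total_bases):
    """Phred-scaled quality implied by an observed mismatch rate."""
    if mismatches <= 0 or total_bases <= 0:
        raise ValueError("need positive counts to compute an empirical quality")
    return -10 * math.log10(mismatches / total_bases)

# Hypothetical bin: bases reported as Q30, but 50 mismatches seen in 10,000 bases.
reported_q = 30
observed_q = empirical_quality(50, 10_000)
print(f"reported Q{reported_q} (p={phred_to_error(reported_q):.4f}), "
      f"empirical Q{observed_q:.1f}")
```

If the empirical value is well below the reported one, as in this made-up bin, the reported qualities are over-confident and would need recalibrating before they feed into variant calling.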
SNP calling
SNP calling (continued)
Reads covering one site, per sample:
• B1 = aacc
• Bi = aaaac
• Bn = cccc

“genotype likelihoods”
• P(B1=aacc|G1=aa), P(B1=aacc|G1=cc), P(B1=aacc|G1=ac)
• P(Bi=aaaac|Gi=aa), P(Bi=aaaac|Gi=cc), P(Bi=aaaac|Gi=ac)
• P(Bn=cccc|Gn=aa), P(Bn=cccc|Gn=cc), P(Bn=cccc|Gn=ac)

Prior(G1, .., Gi, .., Gn)

“SNP call” and “genotype call” (posterior genotype probabilities)
• P(G1=aa|B1=aacc; Bi=aaaac; Bn=cccc), P(G1=cc|B1=aacc; Bi=aaaac; Bn=cccc), P(G1=ac|B1=aacc; Bi=aaaac; Bn=cccc)
• P(Gi=aa|B1=aacc; Bi=aaaac; Bn=cccc), P(Gi=cc|B1=aacc; Bi=aaaac; Bn=cccc), P(Gi=ac|B1=aacc; Bi=aaaac; Bn=cccc)
• P(Gn=aa|B1=aacc; Bi=aaaac; Bn=cccc), P(Gn=cc|B1=aacc; Bi=aaaac; Bn=cccc), P(Gn=ac|B1=aacc; Bi=aaaac; Bn=cccc)
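
The genotype likelihoods, prior, and posterior genotype probabilities on this slide can be illustrated with a small Bayesian calculation. This is a minimal sketch under stated assumptions, not the project's caller. It assumes independent reads, a flat per-base error rate of 1%, only the two alleles a and c, and a per-sample Hardy-Weinberg prior at an assumed allele frequency. The slide's prior is joint across samples; this simplification treats each sample separately. All function names are hypothetical.

```python
from math import prod

GENOTYPES = ("aa", "ac", "cc")

def genotype_likelihoods(bases, err=0.01):
    """P(B | G) for each genotype, assuming independent reads and a flat error rate."""
    def p_base(base, genotype):
        # Each read samples one of the two alleles with probability 1/2, then is
        # read correctly with probability 1 - err (two-allele simplification).
        return sum(0.5 * ((1 - err) if base == allele else err) for allele in genotype)
    return {g: prod(p_base(b, g) for b in bases) for g in GENOTYPES}

def genotype_posteriors(bases, freq_c=0.5, err=0.01):
    """P(G | B) using a Hardy-Weinberg prior at an assumed frequency of allele c."""
    lik = genotype_likelihoods(bases, err)
    freq_a = 1 - freq_c
    prior = {"aa": freq_a ** 2, "ac": 2 * freq_a * freq_c, "cc": freq_c ** 2}
    unnorm = {g: lik[g] * prior[g] for g in GENOTYPES}
    total = sum(unnorm.values())
    return {g: unnorm[g] / total for g in GENOTYPES}

# The three read sets from the slide: B1 = aacc, Bi = aaaac, Bn = cccc.
for name, reads in [("B1", "aacc"), ("Bi", "aaaac"), ("Bn", "cccc")]:
    post = genotype_posteriors(reads)
    call = max(post, key=post.get)
    print(name, reads, "->", call, round(post[call], 3))
```

A multi-sample caller would replace the per-sample prior with one that ties the samples together, for example through a shared allele frequency estimated from all reads at the site.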
Data processing / variant calling pipeline
Validation by typing a random sample of novel variants
Overall FDR < 5% (10% for large SVs)
SNP calls from the 3 pilot datasets

| | Trios | Low coverage | Exon pilot |
|---|---|---|---|
| Samples | 6 | 179 | 697 |
| Raw data | 1.08 Tb | 2.22 Tb | 1.43 Tb |
| SNPs found | 4.03M (CEU), 5.01M (YRI) | 14.5M | 12,761 |
| % novel | 15% (CEU), 29% (YRI) | 55% | 70% |
| Short indels | 0.68M | 1.12M | - |
| Deletions | ~10,000 | 15,765 | - |
| SV breakpoints | 6,169 | 9,092 | - |
| Mobile element insertions | 2,528 | 4,774 | - |
Imputation helps genotype calls
Power (sensitivity)
Novel variants
Variants per sample genome
• 3,000,000-4,000,000 variants
• 10,000-11,000 nonsynonymous changes
• 220-250 in-frame indels
• 80-100 premature stop codons
• 40-50 splice site disruptions
• 50-100 HGMD “recessive disease causing” mutations
Exon Pilot: high sensitivity for rare variants
Exon Pilot: most sites low-frequency and novel
1000G data also supports structural variants
Opportunity: different variants from the same data
Data types delivered
Reads: FASTQ
Alignments: SAM/BAM
Variants: VCF
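
As an illustration of the variant format, here is a minimal sketch that reads the eight fixed, tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) using only the Python standard library. The function name and the example record are illustrative and are not taken from the project's files. For real work, the tools listed in this tutorial are the better choice.

```python
def parse_vcf_line(line):
    """Split one VCF data line into its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            # Keys without "=value" are flags, stored here as True.
            info_dict[key] = value if sep else True
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }

# Illustrative record (not real project data).
example = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB"
record = parse_vcf_line(example)
print(record["chrom"], record["pos"], record["ref"], record["alt"], record["info"])
```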
Tools for analyzing / manipulating 1000G data
• samtools: http://samtools.sourceforge.net/
• BamTools: http://sourceforge.net/projects/bamtools/
• GATK: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
• VCFTools: http://vcftools.sourceforge.net/
Alignments: SAM/BAM
Variants: VCF
Alignment visualization
IGV viewer, GAMBIT viewer
Current status based on 629 samples

| Samples | # SNPs: Known | # SNPs: Novel | # SNPs: Total | FN metrics: dbSNP missed | FN metrics: HM |
|---|---|---|---|---|---|
| 629 | 7,922,125 | 17,564,935 | 25,487,060 | 31.08% | 1.21% |

• As of 11/02/2010
• Calls present in at least 2 of the Broad Institute, University of Michigan, NCBI, and Boston College call sets
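
The "present in at least 2 of 4 call sets" rule can be sketched as a simple vote over sites. This is a minimal illustration, assuming each call set is a collection of (chromosome, position) pairs. The site coordinates and the function name are hypothetical, and the real merge may also compare alleles and apply filters.

```python
from collections import Counter

def consensus_sites(call_sets, min_support=2):
    """Return the sites reported by at least `min_support` call sets."""
    counts = Counter()
    for sites in call_sets.values():
        counts.update(set(sites))  # count each site once per call set
    return {site for site, n in counts.items() if n >= min_support}

# Hypothetical sites keyed by the four centers named on the slide.
call_sets = {
    "Broad Institute": [("20", 1000), ("20", 2000), ("20", 3000)],
    "University of Michigan": [("20", 1000), ("20", 3000)],
    "NCBI": [("20", 2000), ("20", 4000)],
    "Boston College": [("20", 1000), ("20", 5000)],
}
print(sorted(consensus_sites(call_sets)))
```

Sites seen by only one center (positions 4000 and 5000 in this made-up example) are dropped, which trades some sensitivity for a lower false discovery rate.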
The full 1000 Genomes Project data
1,100 samples early 2011; 2,500 samples 2011/12
[Map labels: CEU, JPT, CHB, YRI, LWK, MXL, ASW, GBR, FIN, TSI, CHS, CLM, PUR, IBS, CDX, KHV, GWD, ACB, AJM, PEL, PJL, MAB, ADH, KAK, RDH, MRM]
Complementary strategies
• Low-coverage WGS (~4x per sample): a near-complete catalog of SNPs with AF > 1% across the genome
• Deep-coverage exomes: rare variants, i.e. AF < 1%, in genes
[Figure label: YRI]
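
A rough calculation shows why low per-sample coverage can still catalog variants above 1% allele frequency while missing rarer ones. This is only a sketch under simplifying assumptions that are not from the slides: reads carrying the variant allele are Poisson distributed across the cohort, sequencing errors are ignored, and a site counts as detectable once at least two variant-supporting reads are seen. The cohort size of 179 and the 4x depth are taken from the low-coverage pilot figures in this tutorial. The function name is hypothetical.

```python
import math

def detection_probability(allele_freq, n_samples, depth, min_alt_reads=2):
    """P(at least `min_alt_reads` variant reads across the cohort), Poisson model."""
    # Expected variant reads: 2 chromosomes per sample, each covered at depth / 2.
    lam = n_samples * 2 * allele_freq * (depth / 2)
    p_fewer = sum(math.exp(-lam) * lam ** k / math.factorial(k)
                  for k in range(min_alt_reads))
    return 1 - p_fewer

for af in (0.05, 0.01, 0.001):
    print(f"AF={af}: {detection_probability(af, n_samples=179, depth=4):.3f}")
```

Under these assumptions the probability is close to 1 at AF = 1% and falls sharply at AF = 0.1%, which is where deep coverage of targeted regions becomes the more useful strategy.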