2014 agbt giab data integration poster 140206
Genome in a Bottle Consortium
• As sequencing moves to clinical applications, assessing accuracy becomes very important.
• With the Genome in a Bottle Consortium, NIST is developing methods to characterize whole genome Reference Materials that can be used to assess the performance of whole genome sequencing.
• Data from multiple sequencing platforms and runs can be used to understand and compensate for the errors and biases of each method.
• We propose a method using 14 datasets for CEPH/HapMap sample NA12878 to find characteristics of highly confident genotype calls and use these characteristics to arbitrate between discordant calls.
Genome in a Bottle: Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls
Justin Zook1, Brad Chapman2, Oliver Hofmann2, Winston Hide2, Jason Wang3, David Mittelman3, Marc Salit1
1National Institute of Standards and Technology, Gaithersburg, MD; 2Harvard School of Public Health, Cambridge, MA; 3Arpeggi, Inc., Austin, TX
[Figure labels: Samples; Sample Preparation; Sequencing; Bioinformatics; Variant list, Performance metrics]
NA12878 Data sets
Performance assessment using integrated calls
• Calls hosted on GCAT website
• www.bioplanet.com/gcat
• Interactive comparison of bioinformatics methods to our integrated calls
• Using microarrays to assess performance underestimates the false negative (FN) rate
• Integrated calls have >20x higher percentage of low complexity regions than microarrays
• Freebayes has significantly improved its indel calls over the past year:
Discussion
• Genome in a Bottle Consortium
• New members welcome!
• www.genomeinabottle.org
Spike-ins
Integrating SNPs & indels
Overlap of SNP calls for NA12878 between three variant call files. (a) The three sets of variant calls come from: (1) Illumina HiSeq reads mapped with bwa and with variants called by GATK; (2) the same Illumina HiSeq reads mapped with bwa but with variants called by samtools; (3) Complete Genomics called with CGTools 2.0. (b) The samtools calls are replaced by SOLiD 4 reads called with GATK. The gray numbers in parentheses are the numbers of variants that are not filtered in the other datasets.
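The overlap counts in a three-way comparison like this one can be sketched with plain set operations. This is a minimal illustration, not the code behind the figure: the function name `venn3_counts` and the choice to key each variant as a `(chrom, pos, ref, alt)` tuple are assumptions, and the toy variants are invented.

```python
def venn3_counts(a, b, c):
    """Count variants in each of the seven regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }


# Toy call sets (invented data), keyed as (chrom, pos, ref, alt).
hiseq_gatk = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 300, "G", "A")}
hiseq_samtools = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 400, "T", "C")}
complete_genomics = {("chr1", 100, "A", "G"), ("chr1", 300, "G", "A"), ("chr1", 500, "A", "T")}

counts = venn3_counts(hiseq_gatk, hiseq_samtools, complete_genomics)
print(counts)
```

Exact-tuple matching works for SNPs; indels and complex variants can have several equivalent representations, so this keying would undercount their overlap.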
Characteristics of bias used for arbitration
• Systematic sequencing errors (SSEs)
  • Strand bias
  • Base Quality Rank Sum
• Local Alignment
  • Distance from end of read
  • Mean position within read
  • Read Position Rank Sum
  • HaplotypeScore
  • Length of aligned reads
• Mapping problems
  • Mapping Quality
  • Abnormal coverage – CNV
  • Length of aligned reads
• Abnormal allele balance
  • Allele Balance
  • Quality/Depth
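Three of these characteristics (strand bias, allele balance, and quality/depth) can be computed from per-site read counts. The sketch below is illustrative only: the function names and all cutoffs are hypothetical, strand bias is measured here with a two-sided Fisher's exact test, and allele balance is taken as the reference fraction of reads. The poster's flowchart lists VQSR steps, not fixed thresholds.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def bias_flags(ref_fwd, ref_rev, alt_fwd, alt_rev, qual,
               max_strand_p=0.01, max_ab_deviation=0.3, min_qd=2.0):
    """Flag a heterozygous call using hypothetical cutoffs (assumptions, not from the poster)."""
    ref, alt = ref_fwd + ref_rev, alt_fwd + alt_rev
    depth = ref + alt
    allele_balance = ref / depth   # about 0.5 is expected for a true heterozygote
    quality_by_depth = qual / depth
    strand_p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return {
        "strand_bias": strand_p < max_strand_p,
        "abnormal_allele_balance": abs(allele_balance - 0.5) > max_ab_deviation,
        "low_quality_by_depth": quality_by_depth < min_qd,
    }


balanced = bias_flags(ref_fwd=10, ref_rev=10, alt_fwd=10, alt_rev=10, qual=400)
one_strand = bias_flags(ref_fwd=10, ref_rev=10, alt_fwd=20, alt_rev=0, qual=400)
print(balanced)
print(one_strand)
```

In the second call every alternate-allele read sits on the forward strand, so only the strand-bias flag is raised; the allele balance is still 0.5.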
[Figure labels: Complete Genomics; Illumina HiSeq]
Performance Assessment
• Within “highly confident” regions, all datasets are highly sensitive and specific
• Most “false” positives and negatives appear to be microarray errors
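A comparison of this kind reduces to counting true positives, false positives, and false negatives inside the confident regions. The following is a minimal sketch under stated assumptions: the function name `assess_calls`, the `(chrom, pos, ref, alt)` variant keys, and the 0-based half-open region convention are all illustrative choices, and the data are invented. It reports sensitivity and precision; specificity would also need a count of true negative sites.

```python
def assess_calls(calls, benchmark, confident_regions):
    """Compare a call set to benchmark calls, restricted to confident regions.

    confident_regions: list of (chrom, start, end), 0-based half-open (an assumption).
    """
    def in_regions(chrom, pos):
        return any(c == chrom and start <= pos < end for c, start, end in confident_regions)

    calls_in = {v for v in calls if in_regions(v[0], v[1])}
    bench_in = {v for v in benchmark if in_regions(v[0], v[1])}
    tp = len(calls_in & bench_in)   # called and in the benchmark
    fp = len(calls_in - bench_in)   # called but absent from the benchmark
    fn = len(bench_in - calls_in)   # in the benchmark but missed
    return {
        "tp": tp, "fp": fp, "fn": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
    }


# Invented example data.
calls = {("chr1", 100, "A", "G"), ("chr1", 150, "C", "T"), ("chr1", 900, "G", "A")}
benchmark = {("chr1", 100, "A", "G"), ("chr1", 180, "T", "C"), ("chr1", 950, "G", "C")}
regions = [("chr1", 50, 200)]

metrics = assess_calls(calls, benchmark, regions)
print(metrics)
```

Variants outside the confident regions (positions 900 and 950 here) are ignored, not counted as errors.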
[Flowchart: integrating Dataset #1 … Dataset #14]
• Callers shown for each dataset: UnifiedGenotyper and HaplotypeCaller (Cortex also appears in this row)
• Candidate SNP & indel sites
• For each dataset: force calls with UnifiedGenotyper and force de novo assembly with HaplotypeCaller
• Integrate UG and HC calls for each dataset (boxes shown for dataset #1 … dataset #11)
• Find high-confidence SNP & indel sites: for each dataset, separate VQSR for HomRef, Het, and HomVar SNPs and for HomRef, Het, and HomVar indels
• Arbitrate using characteristics of mapping and alignment bias and systematic sequencing errors to find consensus SNP & indel sites
Indels/Complex Variants
• Multiple correct representations of complex variants often exist
• Comparing complex variants is difficult. Try RTG’s vcfeval!
Filter sites if <2 datasets are free of bias
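The filtering rule can be sketched as a small per-site decision function. This is a simplified illustration, not the consortium's implementation: the record layout (`dataset`, `genotype`, `biased`) and the name `arbitrate_site` are hypothetical, and marking a site "uncertain" when unbiased datasets disagree is an assumption, since the poster does not say how that case is handled.

```python
def arbitrate_site(site_calls, min_unbiased=2):
    """Return (status, genotype) for one site from per-dataset calls.

    site_calls: list of dicts with hypothetical keys 'dataset', 'genotype', 'biased'.
    """
    unbiased = [c["genotype"] for c in site_calls if not c["biased"]]
    if len(unbiased) < min_unbiased:
        return ("filtered", None)           # fewer than 2 datasets are free of bias
    if len(set(unbiased)) == 1:
        return ("consensus", unbiased[0])   # all unbiased datasets agree
    return ("uncertain", None)              # assumption: disagreement is not resolved here


# Invented example: two unbiased datasets agree, one biased dataset disagrees.
site = [
    {"dataset": "HiSeq", "genotype": "0/1", "biased": False},
    {"dataset": "CompleteGenomics", "genotype": "0/1", "biased": False},
    {"dataset": "SOLiD", "genotype": "0/0", "biased": True},
]
result = arbitrate_site(site)
print(result)
```

The discordant call from the biased dataset is ignored, so the site keeps the genotype supported by the two datasets that show no evidence of bias.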
a http://genomeinabottle.org/blog-entry/existing-and-future-na12878-datasets.
Pedigree Methods
• Real Time Genomics and Illumina Platinum Genomes have developed methods to use the 11 children of NA12878
• High-confidence variants are in haplotypes that are properly inherited in the children
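A much simpler relative of these pedigree methods is a per-site Mendelian consistency check, sketched below. It is a stand-in for illustration only: the haplotype-based methods named above track phase across all the children, which is stricter than this check, and the function names and genotype encoding (allele-index pairs) are assumptions.

```python
def mendelian_consistent(mother, father, child):
    """True if the child's genotype can be formed from one allele of each parent."""
    return any(sorted((m, f)) == sorted(child) for m in mother for f in father)


def site_consistent_in_family(mother, father, children):
    """True if every child's genotype at this site is consistent with both parents."""
    return all(mendelian_consistent(mother, father, child) for child in children)


# Invented genotypes: 0 = reference allele, 1 = alternate allele.
mother, father = (0, 1), (0, 0)
ok_children = [(0, 0), (0, 1), (1, 0)]
bad_children = [(0, 1), (1, 1)]   # (1, 1) needs an alternate allele from the father

print(site_consistent_in_family(mother, father, ok_children))
print(site_consistent_in_family(mother, father, bad_children))
```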
Structural Variants
• Can we use similar methods for SVs?
• Arbitrate using coverage, insert size, discordant paired ends, mapping quality, soft-clipping, heterozygous/homozygous ratio, allele fraction, …
• How to use long-read technologies?
CAGTGA > TCTCT complex variant
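The complex variant above can be written as one block substitution or as several smaller variants, and the two look different record by record. One way to see that they are equivalent is to apply each representation to the reference and compare the resulting sequences. The sketch below is a toy illustration of that idea, not RTG's vcfeval algorithm: the flanking reference bases, the 0-based coordinates, and the name `apply_variants` are all assumptions.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) edits to a reference string.

    pos is 0-based; alt may be shorter or longer than ref.
    """
    pieces, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"ref allele {ref!r} does not match reference at {pos}")
        pieces.append(reference[cursor:pos])   # unchanged bases before this variant
        pieces.append(alt)                     # substituted bases
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


# Hypothetical reference: invented flanks around the CAGTGA site.
REFERENCE = "ACGT" + "CAGTGA" + "TTGC"

# Representation A: one complex variant.
rep_a = [(4, "CAGTGA", "TCTCT")]
# Representation B: four SNPs plus a two-base substitution that removes one base.
rep_b = [(4, "C", "T"), (5, "A", "C"), (6, "G", "T"), (7, "T", "C"), (8, "GA", "T")]

hap_a = apply_variants(REFERENCE, rep_a)
hap_b = apply_variants(REFERENCE, rep_b)
print(hap_a, hap_b, hap_a == hap_b)
```

A position-by-position comparison of `rep_a` and `rep_b` would report no shared records, while the sequence-level comparison shows they describe the same haplotype.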