SMaSH: A Benchmarking Toolkit for Variant Calling
SMaSH and GIAB: A Good Match
High overlap with features in the Performance Metrics Specifications doc.
Many of the features not currently supported are ones we'd like to integrate.
About me
Worked at the UC Berkeley AMPlab for about a year
Currently the primary SMaSH developer
Starting a CS PhD at Berkeley in Programming Languages this fall
About this talk
SMaSH as it is now
SMaSH in the future
SMaSH and GIAB
SMaSH as it is now
SMaSH
Project out of the AMP-X group at UC Berkeley
Talwalkar et al., 2014, Bioinformatics
smash.cs.berkeley.edu
Initial goal
Create a unified way of benchmarking germline variant calling pipelines.
SMaSH components
Codebase for comparing VCF callsets
Reads and ground truth datasets
Metrics for accuracy and computational performance
Codebase
For benchmarking purposes, we compare a predicted callset against a ground truth callset.
Comparing two predicted callsets works exactly the same way.
Variant Classification
SNPs
Indels (less than 50 base pairs)
Structural variants
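The three classes above can be sketched as a small function. This is a minimal illustration, not SMaSH's actual code: the function name `classify_variant`, the use of the longer allele's length as the variant size, and treating exactly 50 bp as structural are all assumptions, and symbolic alleles such as `<DEL>` are not handled.

```python
def classify_variant(ref: str, alt: str, sv_threshold: int = 50) -> str:
    """Classify a variant as SNP, INDEL, or SV from its ref/alt alleles.

    Assumption: variant size is the length of the longer allele, and
    anything at or above sv_threshold counts as a structural variant.
    """
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    size = max(len(ref), len(alt))
    if size < sv_threshold:
        return "INDEL"
    return "SV"
```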
Evaluation
SNPs and indels are strictly evaluated.
Structural variants are evaluated on:
  Same type (insertion/deletion/other)
  Length same as true variant within specified tolerance
  Position same as true variant within specified tolerance
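The three structural variant criteria can be expressed as a single predicate. The sketch below is illustrative only: the `StructuralVariant` record, the function `sv_matches`, and the choice of absolute tolerances in base pairs are assumptions, not SMaSH's own data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuralVariant:
    pos: int      # start position on the chromosome
    length: int   # size of the event in base pairs
    kind: str     # "insertion", "deletion", or "other"


def sv_matches(truth: StructuralVariant, predicted: StructuralVariant,
               pos_tolerance: int, length_tolerance: int) -> bool:
    """True if the predicted SV agrees with the true SV on type,
    and on length and position within the given tolerances (in bp)."""
    return (
        truth.kind == predicted.kind
        and abs(truth.length - predicted.length) <= length_tolerance
        and abs(truth.pos - predicted.pos) <= pos_tolerance
    )
```

A tolerance expressed as a fraction of the true length would be an equally reasonable design; the absolute form is used here only because it is the simplest to read.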
Accuracy Metrics
Evaluate variants as true positive, false positive, or false negative
Evaluate accuracy of genotyping
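Once every variant is labeled TP, FP, or FN, precision and recall follow from the standard definitions. A minimal sketch (the function name `precision_recall` is hypothetical, and returning 0.0 for an empty denominator is a convention chosen here):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Standard definitions: precision = TP / (TP + FP),
    recall = TP / (TP + FN). Empty denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```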
Error bars
Calculated on confidence in ground truth calls
Choose some upper bound on the ground truth call error rate based on validation methodology
E.g., 2 out of every 1000 SNPs are wrong.
Use this error rate to calculate upper/lower bounds on precision and recall.
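One simple way to turn a ground truth error rate into bounds is sketched below. It is an assumption made for illustration, not the formula from the SMaSH paper or codebase: it supposes that at most `round(rate * n_truth)` calls could be mislabeled, and that in the worst (best) case all of them move out of (into) the true positive count. The name `metric_bounds` is hypothetical.

```python
def metric_bounds(tp: int, fp: int, fn: int, truth_error_rate: float) -> dict:
    """Lower/upper bounds on precision and recall, assuming up to
    max_errors calls are mislabeled because the ground truth is wrong."""
    n_truth = tp + fn  # number of ground truth variants
    max_errors = round(truth_error_rate * n_truth)

    def bounds(denominator: int) -> tuple[float, float]:
        if denominator == 0:
            return (0.0, 0.0)
        low = max(tp - max_errors, 0) / denominator
        high = min(tp + max_errors, denominator) / denominator
        return (low, high)

    return {"precision": bounds(tp + fp), "recall": bounds(tp + fn)}
```

With an error rate of 2 in 1000 and 1000 ground truth SNPs, this shifts at most two calls in either direction.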
The VCF format is ambiguous!
SMaSH addresses this problem with two strategies:
  Normalization
  Rescue
Guiding principle: metrics should never be worse after normalization/rescue than they were without them.
Normalization
A single variant may be plausibly placed in many different positions but describe the same change.
For example, we normalize this variant:
First, we remove the longest common proper suffix from the ref and alt alleles.
Then, we "slide" the variant by adding a base from the reference to the head of each allele and removing a base from the tail, until the last bases of the two alleles are no longer the same.
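The two normalization steps can be sketched as follows. This is a minimal illustration under stated assumptions, not SMaSH's implementation: positions are 0-based, the reference is passed as a plain string `ref_seq`, and the function name `normalize` is hypothetical.

```python
def normalize(ref_seq: str, pos: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Left-normalize one variant against the reference sequence.

    pos is the 0-based position of the first base of the ref allele.
    """
    # Step 1: trim the shared suffix, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]

    # Step 2: slide left while both alleles still end in the same base.
    # Each step prepends the preceding reference base and drops the tail base.
    while pos > 0 and ref[-1] == alt[-1]:
        pos -= 1
        base = ref_seq[pos]
        ref = base + ref[:-1]
        alt = base + alt[:-1]

    return pos, ref, alt
```

For instance, in the reference `GGCACACAT` a deletion of one `CA` unit written at position 5 as `ACA` to `A` slides left through the repeat and ends up anchored on the `G` at position 1; both forms produce the same sequence, `GGCACAT`.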
Rescue
The same underlying haplotype can be represented by different sets of variants.
True callset
Predicted callset
Rescue Algorithm
For every false negative, we attempt rescue:
  Build up a window around the variant position for the true and predicted callsets.
  For all sets of non-overlapping variants, expand the underlying haplotypes for the variants within those windows.
  If the haplotypes match, mark all false negatives/false positives as true positives.
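The core of the rescue check, expanding haplotypes and comparing them, can be sketched as below. This is a simplified illustration rather than SMaSH's code: it applies all variants in the window at once instead of enumerating every set of non-overlapping variants, it uses 0-based positions, and the names `expand_haplotype` and `haplotypes_match` are hypothetical.

```python
Variant = tuple[int, str, str]  # (0-based pos, ref allele, alt allele)


def expand_haplotype(ref_seq: str, start: int, end: int,
                     variants: list[Variant]) -> str:
    """Apply non-overlapping variants to ref_seq[start:end] and
    return the resulting haplotype sequence."""
    pieces = []
    cursor = start
    for pos, ref, alt in sorted(variants):
        if pos < cursor or pos + len(ref) > end:
            raise ValueError("variant overlaps another or leaves the window")
        if ref_seq[pos:pos + len(ref)] != ref:
            raise ValueError("ref allele does not match the reference")
        pieces.append(ref_seq[cursor:pos])  # unchanged reference bases
        pieces.append(alt)                  # substituted allele
        cursor = pos + len(ref)
    pieces.append(ref_seq[cursor:end])
    return "".join(pieces)


def haplotypes_match(ref_seq: str, start: int, end: int,
                     true_variants: list[Variant],
                     predicted_variants: list[Variant]) -> bool:
    """True if both variant sets yield the same sequence in the window."""
    return (expand_haplotype(ref_seq, start, end, true_variants)
            == expand_haplotype(ref_seq, start, end, predicted_variants))
```

As a usage example, on the reference `ACGTACGT` a true callset with one two-base substitution `(2, "GT", "CA")` and a predicted callset with two SNPs `(2, "G", "C")` and `(3, "T", "A")` expand to the same haplotype, so the calls would be rescued.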
Rescue Example
Outputs
Statistics, including counts for all categories, in plain text, TSV, and JSON formats
Calculations for precision and recall, including error bars
VCF containing variants from both callsets, annotated with the callset they came from and their categorization (TP/FP/FN/rescued)
Where is SMaSH headed?
Global Alliance for Genomics & Health
ga4gh.org
The benchmarking task force includes:
  Illumina, Amazon, Google
  UC Berkeley, UC Santa Cruz, NIST
Development continues by GA4GH
Chief maintainers will be Kelly Westbrooks and Cassie Doll (Google).
Feature Roadmap
New variant types: complex variants, compound heterozygous variants, etc.
Phasing evaluation
Better handling of known false positives
SMaSH and GIAB
Try it and let us know what you think!
git clone https://github.com/amplab/smash.git
Complete documentation available at smash.cs.berkeley.edu
Post feedback at the Google Group smash-benchmarking
Code contributions
Open source and BSD-licensed; pull requests and issues very welcome
github.com/amplab/smash
Datasets
The SMaSH paper proposed eight datasets, including synthetic, sampled human, and mouse.
Other data to use as ground truth?
  NIST pedigree calls for NA12878
  The Illumina Platinum Genome
  Others?
Interpretation of results
Tools for Downstream Analysis?
Visualizations?
Compatibility with genome browsers?
Other?
[email protected]
github.com/amplab/smash