

BIGGIE: A Distributed Pipeline for Genomic Variant Calling

Richard Xia¹, Sara Sheehan¹, Yuchen Zhang¹, Ameet Talwalkar¹, Matei Zaharia¹, Jonathan Terhorst², Michael Jordan¹,², Yun S. Song¹,², Armando Fox¹, David Patterson¹

¹ Computer Science Division, UC Berkeley; ² Department of Statistics, UC Berkeley

Motivation: faster, open-source genome variant calling tools

Impact: Human genome variation is increasingly used to inform disease diagnosis and treatment; however:

- current tools frequently disagree on variant calls
- different types of variation require specialized tools

Current Tools:

- GATK [2]: slow and difficult to use
- CASAVA [1]: fast, but not free
- samtools mpileup [3]: slow, with some accuracy issues

Our Goal:

- fast, distributed variant caller
- separate the genome into regions of high and low complexity and use the right tool for the right region

Figure 1: BIGGIE pipeline

Per-base SNP caller

Main idea:

- Distributed pipeline for variant calling using Spark [4]
- Assign a complexity score to each base
- Use a simple SNP caller at bases with a low complexity score
- Use more robust structural variant callers at high-complexity bases

Complexity region examples:

Figure 2: Different variant calling tools should be used for different regions of the genome.

Complexity score features:

Name          Weight  Description
Substitution  3       Number of aligned reads showing a substitution with respect to the reference.
Insertion     10      Number of aligned reads showing an insertion with respect to the reference.
Deletion      10      Number of aligned reads showing a deletion with respect to the reference.
Low Quality   3       Number of reads aligned with low map quality (a common indicator of a repetitive region).

Table 1: Relative weight of features for computing complexity.
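As a rough illustration of how the weights in Table 1 could be combined, the Python sketch below computes a per-base score as a weighted sum of the four feature counts. The poster does not spell out the exact formula, so several details are assumptions made for this example: dividing by read depth, the hypothetical `PileupCounts` container, and the 0.1 cutoff (borrowed from the "base=0.1" label in the Figure 5 legend).

```python
from dataclasses import dataclass

# Feature weights as listed in Table 1.
WEIGHTS = {"substitution": 3, "insertion": 10, "deletion": 10, "low_quality": 3}


@dataclass
class PileupCounts:
    """Per-base read counts (hypothetical container, not from the poster)."""
    substitution: int
    insertion: int
    deletion: int
    low_quality: int
    depth: int  # total number of reads covering this base


def complexity_score(counts: PileupCounts) -> float:
    """Weighted sum of the Table 1 features.

    Dividing by depth is an assumption of this sketch, made so that
    scores are comparable across bases with different coverage.
    """
    if counts.depth == 0:
        return 0.0
    weighted = (
        WEIGHTS["substitution"] * counts.substitution
        + WEIGHTS["insertion"] * counts.insertion
        + WEIGHTS["deletion"] * counts.deletion
        + WEIGHTS["low_quality"] * counts.low_quality
    )
    return weighted / counts.depth


def is_low_complexity(counts: PileupCounts, threshold: float = 0.1) -> bool:
    """True if the base would go to the simple SNP caller (assumed cutoff)."""
    return complexity_score(counts) <= threshold


if __name__ == "__main__":
    clean = PileupCounts(substitution=1, insertion=0, deletion=0, low_quality=0, depth=30)
    messy = PileupCounts(substitution=2, insertion=1, deletion=1, low_quality=4, depth=20)
    print(complexity_score(clean), is_low_complexity(clean))
    print(complexity_score(messy), is_low_complexity(messy))
```

Under these assumptions, a base where one read in thirty shows a substitution stays at or below the cutoff, while a base with indels and low-quality alignments is flagged as high complexity.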

Results

Simulating data:

- Used reads simulated from the consensus sequence for Venter’s genome
- This better approximates the pattern of SNPs, indels, and structural variants found in a real genome
- Reads were aligned using BWA and SNAP

Effect of thresholds:

[Plot, left panel: "Accuracy vs. complexity score". x-axis: complexity score (log scale, 10^-2 to 10^2); y-axis: accuracy (0 to 12000); series: false pos, false neg.]

[Plot, right panel: "Accuracy vs. percentage complex bases". x-axis: percentage of complex bases in region (4 to 10); y-axis: accuracy (1000 to 10000); series: false pos, false neg.]

Figure 3: On the left are the per-base results, measuring false negatives only on the regions we called. Both error counts (false positives and false negatives) increase as the threshold increases, but the number of correct calls increases as well. We see a similar pattern on the right for the region results, where the numbers of false positives and false negatives increase with the density of complex bases in high-complexity regions, but the number of true calls increases as well.

Incorporating high complexity regions

[Plots: two panels, each titled "High Complexity Regions, chr 21, part 1". x-axis: genome position (×1e7), spanning 0.8 to 2.0 in the first panel and 2.0 to 3.0 in the second; y-axis: high complexity.]

Figure 4: Regions are fairly uniformly distributed, except near the chromosome ends.

- We group bases into high-complexity regions greedily, requiring that the overall density of high-complexity bases within each region stays > t
- We filter out regions that are < 500 bases long
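One possible reading of the greedy grouping and length filter is sketched below in Python. The poster does not give the exact procedure, so the extension rule is an assumption: a region grows to the next high-complexity base only while its density stays above t. The function name `greedy_regions` and its input format (sorted positions of high-complexity bases) are likewise assumed.

```python
from typing import List, Tuple


def greedy_regions(
    complex_positions: List[int],
    t: float = 0.05,
    min_len: int = 500,
) -> List[Tuple[int, int]]:
    """Group sorted high-complexity base positions into (start, end) regions.

    A region is extended to the next high-complexity base only if the
    fraction of high-complexity bases in the extended region stays > t.
    Regions shorter than min_len bases are discarded afterwards.
    Coordinates are inclusive. This is a sketch, not the poster's code.
    """
    regions: List[Tuple[int, int]] = []
    start = end = None
    count = 0  # high-complexity bases inside the current region

    for pos in complex_positions:
        if start is None:
            start, end, count = pos, pos, 1
            continue
        extended_len = pos - start + 1
        if (count + 1) / extended_len > t:
            # Density is still above t: absorb this base into the region.
            end, count = pos, count + 1
        else:
            # Extending would dilute the region: close it and start a new one.
            regions.append((start, end))
            start, end, count = pos, pos, 1

    if start is not None:
        regions.append((start, end))

    return [(s, e) for s, e in regions if e - s + 1 >= min_len]


if __name__ == "__main__":
    # One complex base every 10 positions up to 990, then a tiny distant cluster.
    positions = list(range(0, 1000, 10)) + [5000, 5001, 5002]
    print(greedy_regions(positions))
```

In the usage example, the dense stretch survives as a single region, while the three-base cluster is dropped by the 500-base length filter.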

Stats, t = 5%:

Number of high-complexity regions                 3603
Percentage of genome in high-complexity regions   16.6%

Results (continued)

Timing Results:

Algorithm  Runtime
GATK       35m 17s
mpileup    49m 53s
BIGGIE     4m 38s

Table 2: Timing results for GATK, mpileup, and BIGGIE. The runtime is not significantly impacted by the complexity threshold.
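To put the runtimes in Table 2 on a common scale, the short Python sketch below converts them to seconds and computes how many times longer each baseline takes than BIGGIE. The helper names are made up for this example; only the three runtimes come from the table.

```python
import re

# Runtimes copied from Table 2.
RUNTIMES = {"GATK": "35m 17s", "mpileup": "49m 53s", "BIGGIE": "4m 38s"}


def to_seconds(text: str) -> int:
    """Convert a string such as '35m 17s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)m (\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized runtime format: {text!r}")
    minutes, seconds = int(match.group(1)), int(match.group(2))
    return minutes * 60 + seconds


def slowdown_vs_biggie(tool: str) -> float:
    """Ratio of a tool's runtime to BIGGIE's runtime."""
    return to_seconds(RUNTIMES[tool]) / to_seconds(RUNTIMES["BIGGIE"])


if __name__ == "__main__":
    for tool in ("GATK", "mpileup"):
        print(f"{tool}: {slowdown_vs_biggie(tool):.1f}x the runtime of BIGGIE")
```

On these numbers, GATK takes roughly 7.6 times and mpileup roughly 10.8 times as long as BIGGIE.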

Low vs. High Complexity:

region type      false pos  false neg  correct
low-complexity   1824       7455       38232
high-complexity  2289       2788       13046

Table 3: Our performance degrades in the high-complexity regions, which is why a special-purpose variant caller should be used.
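The degradation in Table 3 is easier to compare as rates than as raw counts. The Python sketch below derives precision and recall per region type, under the assumption that the "correct" column counts true-positive SNP calls. The poster reports only the raw counts, so the derived rates are illustrative.

```python
from typing import Dict, Tuple

# (false positives, false negatives, correct calls) copied from Table 3.
TABLE_3: Dict[str, Tuple[int, int, int]] = {
    "low-complexity": (1824, 7455, 38232),
    "high-complexity": (2289, 2788, 13046),
}


def precision_recall(false_pos: int, false_neg: int, correct: int) -> Tuple[float, float]:
    """Treat 'correct' as true positives (an assumption of this sketch)."""
    precision = correct / (correct + false_pos)
    recall = correct / (correct + false_neg)
    return precision, recall


if __name__ == "__main__":
    for region, counts in TABLE_3.items():
        precision, recall = precision_recall(*counts)
        print(f"{region}: precision={precision:.3f}, recall={recall:.3f}")
```

Under that assumption, precision drops from about 0.95 in low-complexity regions to about 0.85 in high-complexity regions, while recall changes much less (about 0.84 versus 0.82).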

[Plots: SNP false pos and SNP false neg (y-axis 0 to 16000), and correct SNPs (y-axis 0 to 60000). Series in both: BWA+GATK; SNAP+GATK; BWA+mpileup; BWA+BIGGIE, base=0.1; BWA+BIGGIE, region=5%.]

Figure 5: Accuracy comparison of BIGGIE with mpileup and GATK. False positives in BIGGIE are often associated with alignment errors or confusion with a small indel. For each algorithm, a very small percentage of correct SNP bases actually have the incorrect (unphased) genotype.

Future Work: Use the high and low complexity regions to distribute the reads across machines, then call variants using appropriate algorithms.

References

[1] CASAVA. (2012) http://support.illumina.com/sequencing/sequencing software/casava.ilmn.
[2] DePristo M. et al., “A framework for variation discovery and genotyping using next-generation DNA sequencing data.” Nature Genetics (2011), 43:491-498.
[3] Li H. et al. and 1000 Genome Project Data Processing Subgroup, “The Sequence alignment/map (SAM) format and SAMtools.” Bioinformatics (2009), 25:2078-9.
[4] Zaharia M. et al., “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.” NSDI (2012).

{rxia, ssheehan, yuczhang}@eecs.berkeley.edu