BIGGIE: A Distributed Pipeline for Genomic Variant Calling

Richard Xia1, Sara Sheehan1, Yuchen Zhang1, Ameet Talwalkar1, Matei Zaharia1, Jonathan Terhorst2, Michael Jordan1,2, Yun S. Song1,2, Armando Fox1, David Patterson1

1 Computer Science Division, UC Berkeley; 2 Department of Statistics, UC Berkeley

Motivation: faster, open-source genome variant calling tools

Impact: Human genome variation is increasingly used to inform disease diagnosis and treatment. However:

- current tools frequently disagree on variant calls
- different types of variation require specialized tools

Current Tools:

- GATK [2]: slow and difficult to use
- CASAVA [1]: fast, but not free
- samtools mpileup [3]: slow, with some accuracy issues

Our Goal:

- a fast, distributed variant caller
- separate the genome into regions of high and low complexity and use the right tool for the right region

Figure 1: BIGGIE pipeline

Per-base SNP caller

Main idea:

- Distributed pipeline for variant calling using Spark [4]
- Assign a complexity score to each base
- Use a simple SNP caller at bases with a low complexity score
- Use more robust structural variant callers at high-complexity bases

Complexity region examples:

Figure 2: Different variant calling tools should be used for different regions of the genome.

Complexity score features:

Name          Weight  Description
Substitution  3       Number of aligned reads showing a substitution with respect to the reference.
Insertion     10      Number of aligned reads showing an insertion with respect to the reference.
Deletion      10      Number of aligned reads showing a deletion with respect to the reference.
Low Quality   3       Number of reads aligned with low map quality (a common indicator of a repetitive region).

Table 1: Relative weight of features for computing complexity.
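One way to read Table 1 is as a weighted sum of per-base feature counts. The sketch below is a minimal illustration under that assumption. The poster does not state the exact formula, so the division by read depth, the `BaseCounts` container, and the function names are all hypothetical. The default threshold of 0.1 mirrors the "base=0.1" label in Figure 5, though how the authors apply that threshold is likewise assumed.

```python
from dataclasses import dataclass

# Feature weights copied from Table 1.
WEIGHTS = {"substitution": 3, "insertion": 10, "deletion": 10, "low_quality": 3}


@dataclass
class BaseCounts:
    """Read counts observed at one reference base (hypothetical container)."""
    substitution: int
    insertion: int
    deletion: int
    low_quality: int
    depth: int  # total number of reads covering this base


def complexity_score(c: BaseCounts) -> float:
    """Weighted sum of feature counts, divided by depth.

    The normalization by depth is an assumption, not taken from the poster.
    """
    if c.depth == 0:
        return 0.0
    raw = (
        WEIGHTS["substitution"] * c.substitution
        + WEIGHTS["insertion"] * c.insertion
        + WEIGHTS["deletion"] * c.deletion
        + WEIGHTS["low_quality"] * c.low_quality
    )
    return raw / c.depth


def is_high_complexity(c: BaseCounts, threshold: float = 0.1) -> bool:
    """Flag a base whose score exceeds the (assumed) per-base threshold."""
    return complexity_score(c) > threshold
```

With weights this skewed, a single read showing an insertion or deletion contributes more than three reads showing a substitution, so bases near indels cross the threshold quickly.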

Results

Simulating data:

- Used reads simulated from the consensus sequence for Venter’s genome
- This better approximates the true pattern of SNPs, indels, and structural variants found in a real genome; reads were aligned using BWA and SNAP

Effect of thresholds:

[Figure 3, left panel: "Accuracy vs. complexity score"; x-axis: complexity score (log scale, 10^-2 to 10^2); y-axis: accuracy (0 to 12000); series: false pos, false neg.]

[Figure 3, right panel: "Accuracy vs. percentage complex bases"; x-axis: percentage of complex bases in region (4 to 10); y-axis: accuracy (1000 to 10000); series: false pos, false neg.]

Figure 3: On the left are the per-base results, measuring false negatives only on the regions we called. Both accuracy measures increase as the threshold increases, but the number of correct calls increases as well. We see a similar pattern on the right for the region results, where the numbers of false positives and false negatives increase with the density of complex bases in high-complexity regions, but the number of true calls increases as well.

Incorporating high complexity regions

[Figure 4: two panels, each titled "High Complexity Regions, chr 21, part 1"; x-axis: genome position (0.8e7 to 2.0e7 in the first panel, 2.0e7 to 3.0e7 in the second); y-axis: high complexity.]

Figure 4: Regions are fairly uniformly distributed, except near the chromosome ends.

- We group bases into a high-complexity region in a greedy fashion, maintaining that the overall high-complexity base density is > t
- We filter out regions that are < 500 bases long

Stats, t = 5%:

Number of high-complexity regions                3603
Percentage of genome in high-complexity regions  16.6%
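The greedy grouping and length filter described above can be sketched as follows. This is one plausible reading rather than the authors' implementation: the poster does not say how a region is seeded or extended, so the left-to-right scan, the function name `greedy_regions`, and its parameters are assumptions.

```python
def greedy_regions(positions, t=0.05, min_len=500):
    """Group high-complexity bases into regions (assumed greedy strategy).

    positions: sorted genome positions of bases flagged as high-complexity.
    t:         minimum fraction of high-complexity bases within a region.
    min_len:   regions shorter than this many bases are dropped.
    Returns a list of (start, end) tuples with inclusive coordinates.
    """
    regions = []
    i, n = 0, len(positions)
    while i < n:
        start = end = positions[i]
        count = 1  # high-complexity bases currently inside the region
        j = i + 1
        # Extend to the next flagged base while the density stays above t.
        while j < n:
            span = positions[j] - start + 1
            if (count + 1) / span > t:
                end = positions[j]
                count += 1
                j += 1
            else:
                break
        if end - start + 1 >= min_len:
            regions.append((start, end))
        i = j  # the next region is seeded at the first base not absorbed
    return regions
```

For example, flagged bases every 10 positions from 0 to 990 form a single region `(0, 990)`, while an isolated flagged base far away fails the density test and is then dropped by the 500-base filter.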

Results

Timing Results:

Algorithm  Runtime
GATK       35m 17s
mpileup    49m 53s
BIGGIE     4m 38s

Table 2: Timing results for GATK, mpileup, and BIGGIE. The runtime is not significantly impacted by the complexity threshold.
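To put Table 2 in relative terms, the runtimes can be converted to seconds and divided. The short sketch below does only that arithmetic; the helper names `to_seconds` and `speedup` are illustrative and not part of BIGGIE.

```python
import re

# Runtimes copied from Table 2.
RUNTIMES = {"GATK": "35m 17s", "mpileup": "49m 53s", "BIGGIE": "4m 38s"}


def to_seconds(text):
    """Convert a string such as '35m 17s' to a number of seconds."""
    m = re.fullmatch(r"(\d+)m\s+(\d+)s", text.strip())
    if m is None:
        raise ValueError(f"unrecognized runtime: {text!r}")
    return int(m.group(1)) * 60 + int(m.group(2))


def speedup(baseline, tool="BIGGIE"):
    """Ratio of the baseline runtime to the tool's runtime."""
    return to_seconds(RUNTIMES[baseline]) / to_seconds(RUNTIMES[tool])


for name in ("GATK", "mpileup"):
    print(f"BIGGIE vs {name}: {speedup(name):.1f}x faster")
```

On these figures BIGGIE is roughly 7.6 times faster than GATK and roughly 10.8 times faster than mpileup.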

Low vs. High Complexity:

region type      false pos  false neg  correct
low-complexity   1824       7455       38232
high-complexity  2289       2788       13046

Table 3: Our performance degrades in the high-complexity regions, which is why a special-purpose variant caller should be used.
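The degradation in Table 3 is easier to compare as rates than as raw counts. The sketch below computes precision and recall for each region type, assuming the "correct" column counts true-positive SNP calls. That interpretation, and the function names, are assumptions.

```python
# Counts copied from Table 3.
TABLE_3 = {
    "low-complexity": {"false_pos": 1824, "false_neg": 7455, "correct": 38232},
    "high-complexity": {"false_pos": 2289, "false_neg": 2788, "correct": 13046},
}


def precision(row):
    """Fraction of calls that are correct, treating 'correct' as true positives."""
    return row["correct"] / (row["correct"] + row["false_pos"])


def recall(row):
    """Fraction of true variants that were called."""
    return row["correct"] / (row["correct"] + row["false_neg"])


for name, row in TABLE_3.items():
    print(f"{name}: precision={precision(row):.3f}, recall={recall(row):.3f}")
```

Under this reading, precision drops from about 0.95 in low-complexity regions to about 0.85 in high-complexity regions, while recall changes less (about 0.84 versus 0.82).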

[Figure 5, left panel: SNP false pos and SNP false neg counts (y-axis 0 to 16000); right panel: correct SNPs (y-axis 0 to 60000). Series in both panels: BWA+GATK, SNAP+GATK, BWA+mpileup, BWA+BIGGIE (base=0.1), BWA+BIGGIE (region=5%).]

Figure 5: Accuracy comparison of BIGGIE with mpileup and GATK. False positives in BIGGIE are often associated with alignment errors or confusion with a small indel. For each algorithm, a very small percentage of correct SNP bases actually have the incorrect (unphased) genotype.

Future Work: Use the high and low complexity regions to distribute the reads across machines, then call variants using appropriate algorithms.

References

[1] CASAVA. (2012) http://support.illumina.com/sequencing/sequencing_software/casava.ilmn.
[2] DePristo M. et al., “A framework for variation discovery and genotyping using next-generation DNA sequencing data.” Nature Genetics (2011), 43:491-498.
[3] Li H. et al. and 1000 Genome Project Data Processing Subgroup, “The Sequence alignment/map (SAM) format and SAMtools.” Bioinformatics (2009), 25:2078-9.
[4] Zaharia M. et al., “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.” NSDI (2012).

{rxia, ssheehan, yuczhang}@eecs.berkeley.edu
