Comparison of LUMPY vs. DELLY for structural variant detection

1
A Comparison of Genomic Structural Variant Detection using LUMPY and DELLY Lance Tan 1 , Ronak H. Shah 2 , Michael F. Berger 2 1 Newark Academy, Livingston, NJ 2 Department of Pathology, Memorial Sloan Ke@ering Cancer Center, New York, NY INTRODUCTION METHODS Background Structural variants (SVs), which are deviaGons from normal chromosomal structure affecGng regions approximately 1 kilobase or longer in size, represent one of the largest and most diverse categories of mutaGons to the human genome. As cancer is a disease caused by the accumulaGon of somaGc mutaGons in an individual's genome, structural variants are clearly implicated as a cause of cancer. Recent developments in highthroughput, nextgeneraGon sequencing technology have allowed researchers to sequence large, targeted regions of tumor DNA to locate and treat specific mutaGons; the MSKIMPACT assay (Memorial Sloan Ke@ering Integrated MutaGon Profiling of AcGonable Cancer Targets) and its associated computaGonal pipeline is an example. Despite these recent advances, accurately and efficiently determining the presence and locaGon of SVs from sequencing data remains a cumbersome task due to a number of hurdles: a wide range of SV sizes (less than one kilobase to tens of megabases), mulGple different structural variant types and complexity levels, and different types of SV evidence including pairedend reads (PE), split reads (SR), and read depth (RD). Here, the LUMPY structural variant discovery so[ware is compared with DELLY, its contemporary program in the MSKIMPACT computaGonal pipeline, in order to determine whether integraGng LUMPY into the IMPACT pipeline will be of benefit. Method LUMPY was used to call structural variants on 122 tumornormal sample pairs from 8 sequencing runs for which DELLY had already called SV mutaGons, and the results were compared. SPEEDSEQ, a framework that simplifies and bundles together mulGple tools, including LUMPY and BWAMEM (a sequence aligner), was used to align raw sequencing reads and call structural variants with LUMPY. Python and shell scripts were wri@en to process reads and interface with the components of SPEEDSEQ. All computer processing was done on a computer cluster at MSKCC through the LSF queuing system. Align and process (SPEEDSEQ ALIGN v0.0.3a) • Map reads to human genome (BWAMEM v0.7.8.r455) • Mark duplicates, extract discordant/split reads (SAMBLASTER v0.1.21) • SorGng and indexing (Sambamba v0.4.7) Call SVs (SPEEDSEQ SV) • LUMPY (v0.2.9) run on pairs of tumor/normal samples Filter and annotate • Filter by support, hotspotness, variant size (custom Python script) • Annotate breakpoints (iAnnotateSV v0.0.2) Manually review and compare with DELLY Figure 2: DELLY (Rausch et. al., European Molecular Biology Laboratory, Heidelberg, Germany) is the SV caller currently used in the IMPACT pipeline. Its sequenGal strategy calculates SVcontaining ranges from pairedend reads first and then localizes these ranges using split reads. DELLY contains a modified version of the Gotoh algorithm aligner for split reads idenGficaGon, unlike LUMPY, which must rely on generalized tools. Figure from: Rausch et. al. Bioinforma<cs 28, no. 18 (September 15, 2012): i333–39. Figure 3: LUMPY (Layer et. al., University of Virginia, Charlo@esville, VA) uses a modular framework for detecGng structural variants. It and accounts for mulGple types of evidence in parallel by calculaGng separate breakpoint ranges from each evidence category and then adding these ranges together. In this study, pairedend reads and split reads were used while opGonal copy number variaGon and previously known variants were omi@ed. Figure from: Layer et. al. Genome Biology 15, no. 6 (June 26, 2014): R84. 114 37 17 10 35 20 4 20 14 78 34 17 10 18 17 4 19 7 0 20 40 60 80 100 120 DEL (ROS1EZR) DUP (RETNCOA4) TRA (EWSR1FLI1) TRA (ROS1CD74) INV (RETCCDC6) INV (DIS3DAOA) TRA (FLI1EWSR1) TRA (WT1EWSR1) #1 TRA (WT1EWSR1) #2 PE support (reads) Pairedend support of known structural variants from LUMPY and DELLY LUMPY Tumor PE support DELLY Tumor PE Support 115 40 15 0 37 19 2 26 13 77 47 26 12 43 24 15 29 12 0 20 40 60 80 100 120 DEL (ROS1EZR) DUP (RETNCOA4) TRA (EWSR1FLI1) TRA (ROS1CD74) INV (RETCCDC6) INV (DIS3DAOA) TRA (FLI1EWSR1) TRA (WT1EWSR1) #1 TRA (WT1EWSR1) #2 SR support (reads) Splitread support of known structural variants from LUMPY and DELLY LUMPY Tumor SR support DELLY Tumor SR Support Figure 5: For the nine true structural variant calls made by DELLY, LUMPY's detecGon algorithm tended to detect more pairedend reads than DELLY. This is an advantage to using LUMPY over DELLY since pairedend reads increase confidence that a variant is real. Figure 6: DELLY generally finds more splitread support for the same nine mutaGons. Since split reads uniquely of all evidence types allow localizaGon of a breakpoint to the exact base, this presents a significant advantage over LUMPY. This difference may be that DELLY uses a modified version of the Gotoh algorithm to align sequences with kmers to find split reads, whereas LUMPY must rely on split read support found by BWAMEM. LUMPY calls gain pairedend support but lose splitread support compared to exisLng DELLY calls. Figure 1: Categories of geneGc structural variaGon. LUMPY and DELLY both group variants broadly into deleGons, inserGons, duplicaGons and translocaGons. In addiGon to these categories, mulGple structural changes can occur in overlapping regions, creaGng complex and hardtocategorize mutaGons. Figure from: Alkan et. al. Nature Reviews Gene<cs 12, no. 5 (May 2011): 363–76. CONCLUSION •While LUMPY detects more pairedend read support than DELLY for the same mutaGons, it also detects less split read support. •LUMPY exhibits a significant bias towards calling deleGons and inversions, the majority of which are false posiGves that distract manual reviewers from significant SVs. •LUMPY detects the majority of variants that DELLY does. DELLY in the IMPACT pipeline's implementaGon detected more variants that LUMPY in this study's implementaGon did not than vice versa. Overall, replacing DELLY with LUMPY as the structural variant detector in the MSKIMPACT pipeline would not produce much benefit. Whether a combined approach including LUMPY as a supplement to DELLY is more effecGve remains to be seen. ACKNOWLEDGEMENTS I would like to thank Ronak Shah and Dr. Michael Berger for all of their support and instrucGon in developing this project and seeing it to compleGon. Figure 4: (A) A ROS1EZR deleGon, (B) a ROS1CD74 translocaGon, (C) a RETNCOA4 duplicaGon, and (D) a RETCCDC6 inversion called by LUMPY. All four structural variants are also true posiGves called by DELLY. The different orientaGons of pairedend reads allow variant categorizaGon into one of these four classes. Like other structural variant callers, LUMPY is imperfect and may not detect all the evidence that supports a mutaGon, such as in (B). 0 200 400 600 800 1000 10 20 30 40 50 60 70 80 90 100 Number of SVs Breakpoint difference threshold (bases) Common and unique SVs with varying breakpoint difference threshold SVs found by both SVs found by LUMPY only SVs found by DELLY only 685 1025 95 DELLY calls (1120 total) LUMPY calls (780 total) LUMPY trends towards calling deleLons and inversions over duplicaLons and translocaLons. DEL, 208711, 90.62% INV, 16987, 7.38% DUP,656, 0.28% TRA, 3951, 1.72% Unfiltered LUMPY calls DEL, 2689, 52.63% INV, 2366, 46.31% DUP, 17, 0.33% TRA, 37, 0.72% Filtered LUMPY calls DEL, 103, 48.82% INV, 70, 33.18% DUP, 9, 4.27% TRA, 29, 13.74% Filtered LUMPY calls w/o outliers Figure 8: The breakdown of calls made by LUMPY for each structural variant type before and a[er filtering. (A) DistribuGon of calls before filtering. (B) DistribuGon a[er filtering by support. A more lenient filter was applied to mutaGon calls in hotspot regions than to nonhotspot regions. (C) The distribuGon a[er filtering and removing outlier samples (all samples with more than 20 postfilter variant calls). These sixteen samples account for a disproporGonate 96.2% of postfilter deleGons and 97.0% of postfilter inversions. A B C A C B D Figure 7: (A) Comparison of the common and unique calls made by LUMPY and DELLY at various breakpoint difference thresholds. Since LUMPY and DELLY may not necessarily resolve the breakpoints of an SV to the same base, we defined a threshold difference between two equivalent LUMPY and DELLY calls: two calls with an absolute difference lower than this threshold were considered equivalent. The ideal threshold would be the minimum that produces a maximum number of equivalent SVs. Since this ideal could not be determined, a threshold of 10 was chosen based on the absolute difference of breakpoints in true posiGve calls. (B) The number of common and unique calls at a threshold of 10. Both figures demonstrate that with the currently used pipeline serngs DELLY calls a significantly larger number of SVs than LUMPY. A B RESULTS Examples of true structural variants found by LUMPY. SR (115) PE (114) All reads ROS1 EZR SR (0) PE (12) All reads CD74 ROS1 SR (40) PE (37) All reads RET NCOA4 SR (47) PE (35) All reads RET CCDC6 The current DELLYbased computaLonal pipeline idenLfied more SVs than LUMPY.

Transcript of Comparison of LUMPY vs. DELLY for structural variant detection

Page 1: Comparison of LUMPY vs. DELLY for structural variant detection

A  Comparison  of  Genomic  Structural  Variant  Detection  using  LUMPY  and  DELLY  

Lance  Tan1,  Ronak  H.  Shah2,  Michael  F.  Berger2    1Newark  Academy,  Livingston,  NJ  2Department  of  Pathology,  Memorial  Sloan  Ke@ering  Cancer  Center,  New  York,  NY    

INTRODUCTION  

METHODS  

Background  

Structural  variants  (SVs),  which  are  deviaGons  from  normal  chromosomal  structure  affecGng  regions  approximately  1  kilobase  or  longer  in  size,  represent  one  of  the  largest  and  most  diverse  categories  of  mutaGons  to  the  human  genome.  As  cancer  is  a  disease  caused  by  the  accumulaGon  of  somaGc  mutaGons  in  an  individual's  genome,  structural  variants  are  clearly  implicated  as  a  cause  of  cancer.  Recent  developments  in  high-­‐throughput,  next-­‐generaGon  sequencing  technology  have  allowed  researchers  to  sequence  large,  targeted  regions  of  tumor  DNA  to  locate  and  treat  specific  mutaGons;  the  MSK-­‐IMPACT  assay  (Memorial  Sloan  Ke@ering  -­‐  Integrated  MutaGon  Profiling  of  AcGonable  Cancer  Targets)  and  its  associated  computaGonal  pipeline  is  an  example.  Despite  these  recent  advances,  accurately  and  efficiently  determining  the  presence  and  locaGon  of  SVs  from  sequencing  data  remains  a  cumbersome  task  due  to  a  number  of  hurdles:  a  wide  range  of  SV  sizes  (less  than  one  kilobase  to  tens  of  megabases),  mulGple  different  structural  variant  types  and  complexity  levels,  and  different  types  of  SV  evidence  including  paired-­‐end  reads  (PE),  split  reads  (SR),  and  read  depth  (RD).  Here,  the  LUMPY  structural  variant  discovery  so[ware  is  compared  with  DELLY,  its  contemporary  program  in  the  MSK-­‐IMPACT  computaGonal  pipeline,  in  order  to  determine  whether  integraGng  LUMPY  into  the  IMPACT  pipeline  will  be  of  benefit.    

Method  

LUMPY  was  used  to  call  structural  variants  on  122  tumor-­‐normal  sample  pairs  from  8  sequencing  runs  for  which  DELLY  had  already  called  SV  mutaGons,  and  the  results  were  compared.  SPEEDSEQ,  a  framework  that  simplifies  and  bundles  together  mulGple  tools,  including  LUMPY  and  BWA-­‐MEM  (a  sequence  aligner),  was  used  to  align  raw  sequencing  reads  and  call  structural  variants  with  LUMPY.  Python  and  shell  scripts  were  wri@en  to  process  reads  and  interface  with  the  components  of  SPEEDSEQ.  All  computer  processing  was  done  on  a  computer  cluster  at  MSKCC  through  the  LSF  queuing  system.  

Align  and  process  (SPEEDSEQ  ALIGN  v0.0.3a)  • Map  reads  to  human  genome  (BWA-­‐MEM  v0.7.8.r455)  • Mark  duplicates,  extract  discordant/split  reads  (SAMBLASTER  v0.1.21)  • SorGng  and  indexing  (Sambamba  v0.4.7)  

Call  SVs  (SPEEDSEQ  SV)  • LUMPY  (v0.2.9)  run  on  pairs  of  tumor/normal  samples  

Filter  and  annotate  • Filter  by  support,  hotspotness,  variant  size  (custom  Python  script)  • Annotate  breakpoints  (iAnnotateSV  v0.0.2)  

Manually  review  and  compare  with  DELLY  

Figure  2:  DELLY  (Rausch  et.  al.,  European  Molecular  Biology  Laboratory,  Heidelberg,  Germany)  is  the  SV  caller  currently  used  in  the  IMPACT  pipeline.  Its  sequenGal  strategy  calculates  SV-­‐containing  ranges  from  paired-­‐end  reads  first  and  then  localizes  these  ranges  using  split  reads.  DELLY  contains  a  modified  version  of  the  Gotoh  algorithm  aligner  for  split  reads  idenGficaGon,  unlike  LUMPY,  which  must  rely  on  generalized  tools.    Figure  from:  Rausch  et.  al.  Bioinforma<cs  28,  no.  18  (September  15,  2012):  i333–39.  

Figure  3:  LUMPY  (Layer  et.  al.,  University  of  Virginia,  Charlo@esville,  VA)  uses  a  modular  framework  for  detecGng  structural  variants.  It  and  accounts  for  mulGple  types  of  evidence  in  parallel  by  calculaGng  separate  breakpoint  ranges  from  each  evidence  category  and  then  adding  these  ranges  together.  In  this  study,  paired-­‐end  reads  and  split  reads  were  used  while  opGonal  copy  number  variaGon  and  previously  known  variants  were  omi@ed.  Figure  from:  Layer  et.  al.  Genome  Biology  15,  no.  6  (June  26,  2014):  R84.    

114  

37  

17  10  

35  

20  

4  

20  14  

78  

34  

17  10  

18   17  

4  

19  

7  

0  

20  

40  

60  

80  

100  

120  

DEL  (ROS1-­‐EZR)   DUP  (RET-­‐NCOA4)   TRA  (EWSR1-­‐FLI1)   TRA  (ROS1-­‐CD74)   INV  (RET-­‐CCDC6)   INV  (DIS3-­‐DAOA)   TRA  (FLI1-­‐EWSR1)   TRA  (WT1-­‐EWSR1)  #1   TRA  (WT1-­‐EWSR1)  #2  

PE  su

pport  (read

s)  

Paired-­‐end  support  of  known  structural  variants  from  LUMPY  and  DELLY  

LUMPY  Tumor    PE  support   DELLY  Tumor  PE  Support  

115  

40  

15  

0  

37  

19  

2  

26  

13  

77  

47  

26  

12  

43  

24  

15  

29  

12  

0  

20  

40  

60  

80  

100  

120  

DEL  (ROS1-­‐EZR)   DUP  (RET-­‐NCOA4)   TRA  (EWSR1-­‐FLI1)   TRA  (ROS1-­‐CD74)   INV  (RET-­‐CCDC6)   INV  (DIS3-­‐DAOA)   TRA  (FLI1-­‐EWSR1)   TRA  (WT1-­‐EWSR1)  #1   TRA  (WT1-­‐EWSR1)  #2  

SR  su

pport  (read

s)  

Split-­‐read  support  of  known  structural  variants  from  LUMPY  and  DELLY  

LUMPY  Tumor  SR  support   DELLY  Tumor  SR  Support  

Figure  5:  For  the  nine  true  structural  variant  calls  made  by  DELLY,  LUMPY's  detecGon  algorithm  tended  to  detect  more  paired-­‐end  reads  than  DELLY.  This  is  an  advantage  to  using  LUMPY  over  DELLY  since  paired-­‐end  reads  increase  confidence  that  a  variant  is  real.  

Figure  6:  DELLY  generally  finds  more  split-­‐read  support  for  the  same  nine  mutaGons.  Since  split  reads  uniquely  of  all  evidence  types  allow  localizaGon  of  a  breakpoint  to  the  exact  base,  this  presents  a  significant  advantage  over  LUMPY.  This  difference  may  be  that  DELLY  uses  a  modified  version  of  the  Gotoh  algorithm  to  align  sequences  with  k-­‐mers  to  find  split  reads,  whereas  LUMPY  must  rely  on  split  read  support  found  by  BWA-­‐MEM.  

LUMPY  calls  gain  paired-­‐end  support  but  lose  split-­‐read  support  compared  to  exisLng  DELLY  calls.  

Figure  1:  Categories  of  geneGc  structural  variaGon.  LUMPY  and  DELLY  both  group  variants  broadly  into  deleGons,  inserGons,  duplicaGons  and  translocaGons.  In  addiGon  to  these  categories,  mulGple  structural  changes  can  occur  in  overlapping  regions,  creaGng  complex  and  hard-­‐to-­‐categorize  mutaGons.    Figure  from:  Alkan  et.  al.  Nature  Reviews  Gene<cs  12,  no.  5  (May  2011):  363–76.    

CONCLUSION  • While  LUMPY  detects  more  paired-­‐end  read  support  than  DELLY  for  the  same  mutaGons,  it  also  detects  less  split  read  support.  

• LUMPY  exhibits  a  significant  bias  towards  calling  deleGons  and  inversions,  the  majority  of  which  are  false  posiGves  that  distract  

manual  reviewers  from  significant  SVs.  

• LUMPY  detects  the  majority  of  variants  that  DELLY  does.  DELLY  in  the  IMPACT  pipeline's  implementaGon  detected  more  

variants  that  LUMPY  in  this  study's  implementaGon  did  not  than  vice  versa.  

 

Overall,  replacing  DELLY  with  LUMPY  as  the  structural  variant  detector  in  the  MSK-­‐IMPACT  pipeline  would  not  produce  much  

benefit.  Whether  a  combined  approach  including  LUMPY  as  a  supplement  to  DELLY  is  more  effecGve  remains  to  be  seen.  

ACKNOWLEDGEMENTS    

I  would  like  to  thank  Ronak  Shah  and  Dr.  Michael  Berger  for  all  of  their  support  and  instrucGon  in  developing  this  project  and  seeing  it  to  compleGon.  

Figure  4:  (A)  A  ROS1-­‐EZR  deleGon,  (B)  a  ROS1-­‐CD74  translocaGon,  (C)  a  RET-­‐NCOA4  duplicaGon,  and  (D)  a  RET-­‐CCDC6  inversion  called  by  LUMPY.  All  four  structural  variants  are  also  true  posiGves  called  by  DELLY.  The  different  orientaGons  of  paired-­‐end  reads  allow  variant  categorizaGon  into  one  of  these  four  classes.  Like  other  structural  variant  callers,  LUMPY  is  imperfect  and  may  not  detect  all  the  evidence  that  supports  a  mutaGon,  such  as  in  (B).  

0  

200  

400  

600  

800  

1000  

10   20   30   40   50   60   70   80   90   100  

Num

ber  o

f  SVs  

Breakpoint  difference  threshold  (bases)  

Common  and  unique  SVs  with  varying  breakpoint  difference  threshold  

SVs  found  by  both   SVs  found  by  LUMPY  only   SVs  found  by  DELLY  only  

685   1025  95  

DELLY  calls    (1120  total)  

LUMPY  calls    (780  total)  

LUMPY  trends  towards  calling  deleLons  and  inversions  over  duplicaLons  and  translocaLons.  

DEL,  208711,  90.62%  

INV,  16987,  7.38%  

DUP,656,  0.28%  

TRA,  3951,  1.72%  

Unfiltered  LUMPY  calls  

DEL,  2689,  52.63%  

INV,  2366,  46.31%  

DUP,  17,  0.33%  

TRA,  37,  0.72%  

Filtered  LUMPY  calls  

DEL,  103,  48.82%  

INV,  70,  33.18%  

DUP,  9,  4.27%  

TRA,  29,  13.74%  

Filtered  LUMPY  calls  w/o  outliers  

Figure  8:  The  breakdown  of  calls  made  by  LUMPY  for  each  structural  variant  type  before  and  a[er  filtering.  (A)  DistribuGon  of  calls  before  filtering.  (B)  DistribuGon  a[er  filtering  by  support.  A  more  lenient  filter  was  applied  to  mutaGon  calls  in  hotspot  regions  than  to  non-­‐hotspot  regions.  (C)  The  distribuGon  a[er  filtering  and  removing  outlier  samples  (all  samples  with  more  than  20  post-­‐filter  variant  calls).  These  sixteen  samples  account  for  a  disproporGonate  96.2%  of  post-­‐filter  deleGons  and  97.0%  of  post-­‐filter  inversions.    

A   B   C  

A  

C  

B  

D  

Figure  7:  (A)  Comparison  of  the  common  and  unique  calls  made  by  LUMPY  and  DELLY  at  various  breakpoint  difference  thresholds.  Since  LUMPY  and  DELLY  may  not  necessarily  resolve  the  breakpoints  of  an  SV  to  the  same  base,  we  defined  a  threshold  difference  between  two  equivalent  LUMPY  and  DELLY  calls:  two  calls  with  an  absolute  difference  lower  than  this  threshold  were  considered  equivalent.  The  ideal  threshold  would  be  the  minimum  that  produces  a  maximum  number  of  equivalent  SVs.  Since  this  ideal  could  not  be  determined,  a  threshold  of  10  was  chosen  based  on  the  absolute  difference  of  breakpoints  in  true  posiGve  calls.  (B)  The  number  of  common  and  unique  calls  at  a  threshold  of  10.  Both  figures  demonstrate  that  with  the  currently  used  pipeline  serngs  DELLY  calls  a  significantly  larger  number  of  SVs  than  LUMPY.  

A   B  RESULTS  Examples  of  true  structural  variants  found  by  LUMPY.  

SR  (115)  

PE  (114)  

All  reads  

ROS1   EZR  

SR  (0)  

PE  (12)  

All  reads  

CD74   ROS1  

SR  (40)  

PE  (37)  

All  reads  

RET   NCOA4  

SR  (47)  

PE  (35)  

All  reads  

RET   CCDC6  

The  current  DELLY-­‐based  computaLonal  pipeline  idenLfied  more  SVs  than  LUMPY.