Research Article Long Read Alignment with Parallel...

14
Research Article Long Read Alignment with Parallel MapReduce Cloud Platform Ahmed Abdulhakim Al-Absi 1 and Dae-Ki Kang 2 1 Department of Ubiquitous IT, Graduate School, Dongseo University, 47 Jurye-ro, Sasang-gu, Busan 47011, Republic of Korea 2 Department of Computer & Information Engineering, Dongseo University, 47 Jurye-ro, Sasang-gu, Busan 47011, Republic of Korea Correspondence should be addressed to Dae-Ki Kang; [email protected] Received 22 September 2015; Revised 18 November 2015; Accepted 18 November 2015 Academic Editor: Elaine Leung Copyright © 2015 A. A. Al-Absi and D.-K. Kang. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Genomic sequence alignment is an important technique to decode genome sequences in bioinformatics. Next-Generation Sequencing technologies produce genomic data of longer reads. Cloud platforms are adopted to address the problems arising from storage and analysis of large genomic data. Existing genes sequencing tools for cloud platforms predominantly consider short read gene sequences and adopt the Hadoop MapReduce framework for computation. However, serial execution of map and reduce phases is a problem in such systems. erefore, in this paper, we introduce Burrows-Wheeler Aligner’s Smith-Waterman Alignment on Parallel MapReduce (BWASW-PMR) cloud platform for long sequence alignment. e proposed cloud platform adopts a widely accepted and accurate BWA-SW algorithm for long sequence alignment. A custom MapReduce platform is developed to overcome the drawbacks of the Hadoop framework. A parallel execution strategy of the MapReduce phases and optimization of Smith- Waterman algorithm are considered. Performance evaluation results exhibit an average speed-up of 6.7 considering BWASW-PMR compared with the state-of-the-art Bwasw-Cloud. An average reduction of 30% in the map phase makespan is reported across all experiments comparing BWASW-PMR with Bwasw-Cloud. Optimization of Smith-Waterman results in reducing the execution time by 91.8%. e experimental study proves the efficiency of BWASW-PMR for aligning long genomic sequences on cloud platforms. 1. Introduction Bioinformatics involves the biological, genomic, statistics, mathematics, and computer science disciplines of study. Analysis of genomic and biological sequence data is one of the important parts of bioinformatics [1]. e analysis of genomic sequences enables us to understand the genetic structure, and its bases of drug response and disease. Numerous researchers who are working in this interdisciplinary field perform gene structure and functionality analysis in order to discover new gene sequences and understand gene origin. For the analysis purpose, existing genomic databases such as Google Genomics [2] and NCBI [3] (generally in 100’s of GB in size) are used to identify the similarities. Identification of sim- ilarities/dissimilarities is achieved by sequence comparison algorithms. Comparisons of biological sequences produce matching alignments and similarity scores. ese similar- ity scores represent the similarities/dissimilarities between the considered biological sequences. e matching align- ments and similarity scores are used for secondary structure predictions and multiple sequence alignments which are highly complex operations that rely on the accuracy of the comparison algorithm used. Applications related to cancer research, forensics, agrigenomics, genetic disease identifica- tion, microbial research, reproductive health, human whole- genome sequencing, and many more rely on sequence align- ment algorithms for analysis. Genomic sequencing data is obtained from the devel- oped Next-Generation Sequencing (NGS) technologies (e.g., from Illumina, Solexa, and Pacific Bioscience). Most of the genomic data available consists of millions of genomic sequences of short reads [4]. With the advances in sequencing technologies, millions of sequences with greater read lengths > 1000 bp are now being generated. Based on genomic data, the sequencing tools can be basically classified into two cat- egories: Hindawi Publishing Corporation BioMed Research International Volume 2015, Article ID 807407, 13 pages http://dx.doi.org/10.1155/2015/807407

Transcript of Research Article Long Read Alignment with Parallel...

Page 1: Research Article Long Read Alignment with Parallel ...downloads.hindawi.com/journals/bmri/2015/807407.pdfSmith-Waterman (SW) Algorithm. Burrows-Wheeler Aligner s Smith-Waterman (BWA-SW)

Research ArticleLong Read Alignment with Parallel MapReduce Cloud Platform

Ahmed Abdulhakim Al-Absi1 and Dae-Ki Kang2

1Department of Ubiquitous IT Graduate School Dongseo University 47 Jurye-ro Sasang-gu Busan 47011 Republic of Korea2Department of Computer amp Information Engineering Dongseo University 47 Jurye-ro Sasang-gu Busan 47011 Republic of Korea

Correspondence should be addressed to Dae-Ki Kang dkkangdongseoackr

Received 22 September 2015 Revised 18 November 2015 Accepted 18 November 2015

Academic Editor Elaine Leung

Copyright copy 2015 A A Al-Absi and D-K KangThis is an open access article distributed under theCreativeCommonsAttributionLicense which permits unrestricted use distribution and reproduction in anymedium provided the originalwork is properly cited

Genomic sequence alignment is an important technique to decode genome sequences in bioinformatics Next-GenerationSequencing technologies produce genomic data of longer reads Cloud platforms are adopted to address the problems arising fromstorage and analysis of large genomic data Existing genes sequencing tools for cloud platforms predominantly consider shortread gene sequences and adopt the HadoopMapReduce framework for computation However serial execution of map and reducephases is a problem in such systemsTherefore in this paper we introduce Burrows-Wheeler Alignerrsquos Smith-Waterman Alignmenton Parallel MapReduce (BWASW-PMR) cloud platform for long sequence alignmentThe proposed cloud platform adopts a widelyaccepted and accurate BWA-SW algorithm for long sequence alignment A customMapReduce platform is developed to overcomethe drawbacks of the Hadoop framework A parallel execution strategy of the MapReduce phases and optimization of Smith-Waterman algorithm are considered Performance evaluation results exhibit an average speed-up of 67 considering BWASW-PMRcompared with the state-of-the-art Bwasw-Cloud An average reduction of 30 in the map phase makespan is reported across allexperiments comparing BWASW-PMR with Bwasw-Cloud Optimization of Smith-Waterman results in reducing the executiontime by 918 The experimental study proves the efficiency of BWASW-PMR for aligning long genomic sequences on cloudplatforms

1 Introduction

Bioinformatics involves the biological genomic statisticsmathematics and computer science disciplines of studyAnalysis of genomic and biological sequence data is one of theimportant parts of bioinformatics [1]The analysis of genomicsequences enables us to understand the genetic structure andits bases of drug response and disease Numerous researcherswho are working in this interdisciplinary field perform genestructure and functionality analysis in order to discovernew gene sequences and understand gene origin For theanalysis purpose existing genomic databases such as GoogleGenomics [2] and NCBI [3] (generally in 100rsquos of GB in size)are used to identify the similarities Identification of sim-ilaritiesdissimilarities is achieved by sequence comparisonalgorithms Comparisons of biological sequences producematching alignments and similarity scores These similar-ity scores represent the similaritiesdissimilarities between

the considered biological sequences The matching align-ments and similarity scores are used for secondary structurepredictions and multiple sequence alignments which arehighly complex operations that rely on the accuracy of thecomparison algorithm used Applications related to cancerresearch forensics agrigenomics genetic disease identifica-tion microbial research reproductive health human whole-genome sequencing and many more rely on sequence align-ment algorithms for analysis

Genomic sequencing data is obtained from the devel-oped Next-Generation Sequencing (NGS) technologies (egfrom Illumina Solexa and Pacific Bioscience) Most ofthe genomic data available consists of millions of genomicsequences of short reads [4]With the advances in sequencingtechnologies millions of sequences with greater read lengthsgt 1000 bp are now being generated Based on genomic datathe sequencing tools can be basically classified into two cat-egories

Hindawi Publishing CorporationBioMed Research InternationalVolume 2015 Article ID 807407 13 pageshttpdxdoiorg1011552015807407

2 BioMed Research International

(1) Short read aligners used to align genomic sequenceswhose read length is between 32 bp and 200 bp thatis BWA [4] and Bowtie2 [5]

(2) Long read aligners used to align genomic sequenceswhose length is greater than 1000 bp that is Illumina

Examples of short read aligners include FASTA [6]BLAST [7] BLAT [8] SOAP [9] and Burrows-WheelerAligner (BWA) [4] For long read alignment researchershave proposed Burrows-Wheeler Alignerrsquos Smith-Waterman(BWA-SW) Alignment [10] Bowtie2 [5] Cushaw2 [11] andBWA-MEM [12]

Existing bioinformatics applications require both man-agement of huge amounts of data and heavy computationfor analysis Nearly 3 billion US dollars and 10 years wererequired to produce the initial human reference genomecontaining about 35 billion base pairs The latest develop-ments in sequencing technologies available produce enor-mous amount of data in terms of gigabytes per run [13]In NGS large sample size applications users are required towait for sufficient resources to become available in which thetime needed to complete processing becomes unpredictableAnalyzing such enormous amount of data and its storageare problems that exist To address these high computa-tion needs grid or cluster computing platforms have beenprovided to researchers [14] The provided grid or clustercomputation platforms are constrained by hardware capacityand concurrent access capacity to support multiple usersThe ever growing gap between computing capabilities andsequencing throughput is presented in [15]

Using cloud platforms one can solve the storage andcomputing problems that exist in gene sequencing Cloudcomputing platforms provide flexibility scalability and on-demand access to resources Moreover adoption of cloudcomputing technologies eliminates the costs incurred inestablishing maintaining large physical storage systemsand computing clusters Cloud users pay for the uti-lized storage and computing resources without worryingabout maintenance availability and reliability related issuesAlthough cloud computing provides scalable and flexibleinfrastructure parallel computing models on cloud plat-formsinfrastructure are required to achieve the desired goalof analyzing gene sequences One such framework for thecloud called MapReduce model [16] was introduced byGoogle Another model developed by University of Califor-nia at Berkeley proposed the Spark platform [17] for cloudcomputing The Pregel framework for cloud environmentsis presented in [18] A comprehensive survey of most recentresearch and development in cloud-based service computingsolutions in research of genome informatics is available in[19 20] In research of genomic analysis there are a numberof cloud computing providers that offer cloud-based servicessolutions such as Google Cloud Genomics [3] Amazon [21]andMicrosoft Azure [22]The necessity for cloud computingfor genomic analysis has been well discussed by leadersin bioinformatics and computational biology [23] Galaxy[24] is a powerful web-based system for performing genomeanalysis tasks and one of many models that manage bioinfor-matics databases Galaxy built on Amazonrsquos Elastic Compute

Cloud (EC2) with CloudMan framework to assist researchersexecutes their own Galaxy server on cloud infrastructureHowever Galaxy has data storage and data manipulationbottlenecks for large datasets with capability of analyzingone sample at a time and does not utilize the elastic cloudcompute capabilities completely This drawback is caused byits dependence on a single shared file system A significantbottleneck has been reported in [25] when processing largedatasets across distributed compute resources Among all thecloud computing platforms available Hadoop MapReduce isby far the most popular choice due to its ease to tune opensource nature and acceptability by industry and academicorganizations In order to utilize the full potential of cloudinfrastructure public cloud service providers offer virtualizedresources in terms of both hardware and software [26] Thevirtualization enables user specific customization flexibilityapplication execution environments and cost efficiency andminimizes power consumption [27 28]

The Hadoop framework [29] adopts a MapReduce modelfor computing a user specific application on cloud platformsIn the MapReduce model a dual phase execution approachis adopted In the initial phase input data to be processed issplit into chunks Each chunk is associated with a mapper ora map worker that provides ⟨Key | Value⟩ pairs as outputsThe outputs values are sorted on the basis of the Key valuesassociatedThe sorted values are provided to reduce workersthat is ⟨Key | SortedList(Value)⟩ Reduce workers store theresults inHadoop distributed file systemThemap and reduceworkers are generally virtual machines (VMs) in public cloudenvironments A simple MapReduce model deployed on theVM based computing environment is shown in Figure 1

Next-Generation Sequencing (NGS) tools like CloudA-ligner [13] Cloud Burst [30] SeqMapReduce [31] and Cross-bow [32] adopt the Hadoop frameworkThemajor drawbackof these alignment tools is that they are short read alignersThe short read aligners prove to be efficient when single-gapor ungapped alignment is to be computed Considering longreads these alignment algorithms exhibit performance deg-radation and affect accuracy proved by the results presentedin [10] For alignment considering long sequence readsoptimization of the BLAST algorithm and its deployment onHadoop platform is presented in [33] For long read alignersthe BWA-SW algorithm in [10] has been found to be efficientand suitable In [34] the Bwasw-Cloud algorithm consider-ing Hadoop platform is presented The Bwasw-Cloud adoptsBWA-SW algorithm for alignment Based on the literaturereviewed it is evident that limited work has been carried outconsidering long read aligners required to support analysis ofgenomic data generated from current and future sequencingtechnologies There is an increasingly urgent need to pro-vide a reliable and scalable support for the ever-increasingconvergence of NGS data However there is clearly a lackof standardized and affordable NGS management solutionson the cloud to support the growing needs of translationalgenomics research [20] This is a major motivating factor forthe authors of this paper

All the existing long read aligners for cloud environmentsconsider the Hadoop framework Genome alignment is an

BioMed Research International 3

Split

Merge

MergeMerge

Inpu

t

Out

put

Map

Map

Partition

Partition

Reduce

Reduce

Worker 1 on VM Worker 1 on VM

Worker N on VM Worker M on VM

Map stage Reduce stage

Cloud storage

Cloud storage

VM based virtualized computing environment

Cloud storage

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

Figure 1 MapReduce model deployed using VM based computing environment

iterative process and Hadoop incurs considerable perfor-mance overheads when iterative applications are hosted onthe framework [18 35]The aligners presented in [33 34] needto rely on multiway joins as genomic sequences are split andneeded to be merged prior to the reduce phase The perfor-mance of Hadoop suffers whenmultiway joins are considered[36]TheHadoop framework considers sequential processingof the map followed by reduce stage that affects performance[37]

Therefore in this paper we present Burrows-WheelerAlignerrsquos Smith-Waterman Alignment on Parallel MapRe-duce (BWASW-PMR) to perform long read alignments ona cloud platform BWASW-PMR adopts the BWA-SW pre-sented in [10] to perform long read alignments of genomicdata MapReduce model is considered for the execution onthe public cloud environment The Smith-Waterman (SW)algorithm [38 39] in BWA-SW is optimized using paral-lel computation technique To overcome the drawbacks ofHadoopMapReduce a parallel execution strategy of the mapand reduce workers is considered in BWASW-PMRThemapand reduce functions executions on the VM based workernodes are parallelized to reduce execution time The alignerpresented in [34] bears the closest similarity to the BWASW-PMR proposed and is considered for comparison The majorcontribution of our work can be summarized here

(i) Optimization of Smith-Waterman in the BWA-SWlong read sequence alignment

(ii) A custom MapReduce framework to support therequired computations for long sequence alignment

(iii) Parallel map and reduce workers execution strategy

(iv) Parallel execution of the map and reduce functions atworker nodes

This paper is organized as follows Section 2 presents relatedwork background Section 3 discusses the proposed Burrows-Wheeler Alignerrsquos Smith-Waterman Alignment on the Par-allel MapReduce BWASW-PMR The SW algorithm and itsconsidered optimization are also presented in Section 3 Theresults and the experimental study are shown in Section 4Comparison notes with related work counterparts are dis-cussed in the penultimate section The concluding remarksand future work are discussed in the last section

2 Background

21 Smith-Waterman (SW) Algorithm Burrows-WheelerAlignerrsquos Smith-Waterman (BWA-SW) Alignment relies onSW algorithm to align the seed matches of sequences Let 119902

0

and 1199021represent two genomic sequences obtained from the

seed matches The Smith-Waterman algorithm computes thesimilarity matrix score initially Using the similarity matrixand backtracking technique optimal alignment is obtainedLet X represent the size of 119902

0and Y represents the size of

1199021 For sequence 119902

0 there are X + 1 possible prefixes and

for 1199021 there areY + 1 possible prefixes including the empty

sequence Let 119902seq[1 Y] denote a prefix of Y charactersand 119902seq[X] represents the Xth character of a sequence119902seq The similarity score between prefixes 119902

0[1 119886] and

1199021[1 119887] is represented as 119902

119886119887 The similarity matrix is

denoted by 119885 and its size is (X + 1Y + 1) The first row and

4 BioMed Research International

column of 119885 are initialized to 0 The remaining elements of119885 (indexed by (119886 119887)) are computed using

119885119886119887

= max

0

119868119886119887

119869119886119887

119885119886minus1119887minus1

+119877 (119886 119887)

(1)

where 119877(119886 119887) is a function providing values of exact match ormismatch that is if 119902

0[119886] = 119902

1[119887] then119877(119886 119887) = 119883

119894(identical

characters among sequences 1199020and 1199021) otherwise 119877(119886 119887) =

119872119894(if characters among sequences 119902

0and 119902

1are unique) 119868

and 119869 represent thematrices in accordance with the affine gapmodel The matrices 119868 and 119869 are used to determine the gapsand are defined as

119868119886119887

= max

119868119886119887minus1

minus 119866119864

119885119886119887minus1

minus 119866119865

119869119886119887

= max

119869119886minus1119887

minus 119866119864

119885119886minus1119887

minus 119866119865

(2)

where 119866119864and 119866

119865represent the first and successive gap

penaltyBacktracking algorithm is used to find the optimal align-

ment between 1199020and 1199021 The backtracking begins with a cell

in 119885 that holds the highest score and proceeds till a zerovalued cell is reached Smith-Waterman algorithm providesoptimal alignments and similarity scores It must be notedthat computation ofmatrix119885 is bounded by the running timeof the slowest 119885 task due to the dependencies it exhibits

22 Burrows-Wheeler Alignerrsquos Smith-Waterman (BWA-SW)Alignment Algorithm BWA-SW algorithm constructs afull-text index in minute space (FM-index) [40] of the querysequence 119876 and the reference sequence 119877 A prefix directedacyclic word graph (119875119903119890119891119894119909 119863119860119882119866) is built using sequence119876 119901119903119890119891119894119909 119905119903119894119890 is built using the sequence 119877 The prefix trie isa tree representation of sequence 119877 The tree is constructedby concatenating all edge symbols from any node in the graphas a route to the root node providing a unique string Theunique string obtained is a substring of the sequence 119877 Eachnode of the tree is represented using 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897Traversing from nodes generates strings that are lexico-graphically sorted A node in 119901119903119890119891119894119909 119905119903119894119890 represents a stringThe 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 of the node and the string itrepresents are considered to be equivalent to each otherFrom 119901119903119890119891119894119909 119905119903119894119890 nodes exhibiting identical or similar119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 are collapsed to form 119875119903119890119891119894119909 119863119860119882119866Each node of 119875119903119890119891119894119909 119863119860119882119866 established represents oneor more substrings of 119877 Figure 2 represents 119901119903119890119891119894119909 119905119903119894119890119875119903119890119891119894119909 119863119860119882119866 and 119904119906119891119891119894119909 119886119903119903119886119910 of the input stringldquoBANANA$rdquo

Let PT(119883) and DG(119883) represent two functions thatare used to form 119901119903119890119891119894119909 119905119903119894119890 and 119875119903119890119891119894119909 119863119860119882119866 graph In

the BWA-SA algorithm PT(119877) and DG(119876) are initially com-puted Let 119886 be the root node of PT(119877) and 119887 represents theroot node of DG(119876) The best score between the sequences119876 and 119877 is computed using a dynamic programming mecha-nism Initialize 119864

119886119887= 119865119886119887= 119870119886119887= null considering the root

nodes of PT(119877) and DG(119876) For each of the parents that is119886119901 of the 119886th node in DG(119876) computation of 119865

119886119887|119886119901 119870119886119887|119886119901

and 119864

119886119887|119886119901is considered The computation of 119865

119886119887|119886119901is

119865119886119887|119886119901

= max 119865119886119901119887 119864119886119901119887

minus 119892119900 minus 119892119890 (3)

where 119892119900 is the gap open penalty and 119892119890 is the gap extensionpenalty119870

119886119887|119886119901is computed using

119870119886119887|119886119901

= max 119870119886119887119901 119864119886119887119901

minus 119892119900 minus 119892119890 (4)

where 119887119901is the parent of the 119887th node in PT(119877) The compu-

tation of 119864119886119887|119886119901

is achieved using

119864119886119887|119886119901

= max 119864119886119901119887119901

+ 119874 (119886119901 119886 119887119901 119887) 119865

119886119901119887119901 119870119886119901119887119901

0 (5)

where119874(119886119901 119886 119887119901 119887) represents the score between the symbol

on edge (119886119901 119886) and (119887

119901 119887) The computation of 119864

119886119887 119865119886119887 and

119870119886119887is defined as

(119864119886119887 119865119886119887 119870119886119887)

=

(119864119886119887|1198861015840 119865119886119887|1198861015840 119870119886119887|1198861015840) When 119864

119886119887|1198861015840 gt 0

(minusinfin minusinfin minusinfin) all other cases

(6)

where 1198861015840 = argmax119886119901isin119886119903(119886)

119864119886119887|119886119901

and 119886119903(119886) is a set containingparent nodes of 119886 Variable 119864

119886119887represents the best match-

ing score between the substrings that is substring 119886 andsubstring 119887 Consider 119864

119886119887gt 0 when the substrings 119887 and

119886 match It is known that the Smith-Waterman algorithmprovides accurate alignment results but requires large com-putation time To optimize the computations the reversepostorder traversal scheme on DG(119883) and PT(119883) is adoptedThe dynamic programming mechanism in the BWA-SWalgorithm enables identifying the seed matches of thegenomic sequences Based on the partial matches the Smith-Waterman Alignment is considered to the extended matches119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903 that is (119886 119887) is formed when the bestscore that is 119864

119886119887 is high and 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 size of 119887

is within the threshold setThe seedmatches are derived from119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903119904 by analyzing the suffix array of sequences119876 and119877Matching of the extended seedmatches using Smith-Waterman algorithm is considered only when high matchingscores are observed

3 Proposed Burrows-Wheeler AlignerrsquosSmith-Waterman Alignment on the ParallelMapReduce BWASW-PMR

BWASW-PMR provides a cloud platform to perform longread alignments considering genomic data obtained from

BioMed Research International 5

A NB$

$ N

$ NA

A

A

$

A

$ N

A

$

N

A

N

A

$

(a)

A NABANANA$

$ NA $ NA$

NA$$

(b)

B A N A N A $

NA

$

$

$

(c)

0 BANANA$

1 ANANA$B2 NANA$BA3 ANA$BAN4 NA$BANA5 A$BANAN6 $BANANA

String sorting

0 6 $BANAN1 5 A$BANA2 3 ANA$BA3 1 ANANA$

4 0 BANANA5 4 NA$BAN6 2 NANA$B

N

NB

$

A

A

A

(d)

Figure 2 Example of reference in a prefix trie query in prefix DAWG and the suffix array of the input string ldquoBANANA$rdquo (a) representsreference prefix trie searching for any substring of the string by staring at the root and followingmatches down the tree until being exhausted(b) represents prefix DAWG constructed by collapsing nodes with the identical suffix array interval DAWG transformed from the prefix trieof the query sequence (c) represents a constructed string suffix automaton (DAWG) (d) represents the suffix array for string ldquoBANANA$rdquoThe dollar sign $ is a regular expression that denotes the end of a line in the reference sequence

NGS techniques The BWASW-PMR adopts the MapReducecomputation model for cloud computation To support scal-ability in BWASW-PMR map and reduce worker nodes aredeployed on a cloud cluster consisting of VMs For long readalignments the BWA-SW algorithm is adopted The advan-tages of adopting the BWA-SW algorithm for long readalignments and its advantages over the existing aligners arefound in [10] The BWASW-PMR considers the genomicsequence alignment in dual phases that is map and reducephase Existing long read aligners for cloud platforms adoptthe Hadoop framework In the Hadoop framework basedsolutions the map phase is executed and then the reducephase is initiated To overcome the drawbacks of Hadoop aparallel execution strategy of the map and reduce phases isconsidered Optimization of the Smith-Waterman algorithmis an additional feature considered in BWASW-PMR Execu-tion of the map and reduce functions is modelled to run inparallel utilizing all programming cores available in workerVMs

31 Overview and Preliminaries Let 119877 represent a referencegenomic sequence and119876 the query sequence BWASW-PMR

is deployed on a cloud platform comprising a master nodemap worker nodes and reduce worker nodes The masternode of BWASW-PMR initializes 119908 map and reduce workernodes using virtual computing nodes Each computing nodeorVM is assumed to have119901CPUcores available for computa-tion LetTVM Config represent the time taken to initialize thevirtual computing platform The sequence 119877 is split into 1198771015840chunks with overlapping sections (the overlapping sectionsare depicted in grey in Figure 2) The reference chunk and 119876are sent to map worker nodes as input key value pairs Thekey value pairs considering the chunk 1198771015840 are represented as(119896119903 V119903) where 119896119903 is a key and V119903 contains the overlappingoffset data The key value pair of 119876 is represented as (119896119902 V119902)where 119896119902 is a key and V119902 is the query sequence In eachof the 119908 map workers sequence 119876 is further split into 119876

1015840

chunks and stored in the local memory available Alignmentis performed using the BWA-SW algorithm considering 1198771015840and each 119876

1015840 in a parallel fashion utilizing the 119901 coresThe Smith-Waterman algorithm in BWA-SW is optimizedby a parallelization technique to reduce execution time LetTMAP represent the average execution time of the 119908 mapworker nodes The map workers after computation provide

6 BioMed Research International

ChunkR

998400

Local memory1 2

Local memory1 2

Mapoutput

data

Mapoutput

data

Temporary cloud storageMap workers Cloud storageReduce workers

Shuffle Sort Reduce

Shuffle Sort Reduce

Chunk 1

Chunk 2

Burrows-Wheeler Alignerrsquos Smith-

Burrows-Wheeler Alignerrsquos Smith-

Smith-Waterman parallelized

Smith-Waterman parallelized

BWASW-PMR master node

Sequence alignment

results

Query genome Q

Reference genome R

Parallel execution platform

Parallel execution platform

Q998400

Q998400

Cloud storage

Waterman (BWA-SW) Alignment

Waterman (BWA-SW) Alignment

Virtual machinemdashmap worker w

Virtual machinemdashmap worker 1

Virtual machinemdashreduce worker w

Virtual machinemdashreduce worker 1

Figure 3 BWASW-PMR cloud model for long read sequence alignment

alignment locations between 119876 and chunk 1198771015840 along with thescore Multiple alignment locations and scores (ie V119898 alongthe chunk id of 1198771015840 and 119896119898) are stored in the temporary cloudmemory as list(119896119898 V119898) The map function of BWASW-PMRcan be defined as map((119896119903 V119903) (119896119902 V119902)) rarr list(119896119898 V119898)Reduce worker nodes 119908 obtain intermediate data that islist(119896119898 V119898) to perform the shuffle and sort function Inthe reduce phase aggregation of all alignment locationsthat is list(V119889) that are nonredundant and nonoverlappingis considered The reduce operation can be defined asreduce(119896119898 list(V119898)) rarr list(V119889) LetTREDUCE represent theaverage time required by 119908 reduce worker nodes to performthe shuffle sort and reduce computationThe totalmakespanof the BWASW-PMR cloud platform to align the sequence 119876against 119877 is computed as

T = TVM Config +TMAP +TREDUCE (7)

The overview of the BWASW-PMR cloud platform is shownin Figure 3

In the map phase of the BWASW-PMR cloud platformalignment locations and corresponding scores between119876 andchunk 119877

1015840 are computed The required data that is 119876 andchunk 1198771015840 is obtained from the cloud memory Let tGet Data Mrepresent the time taken to obtain the data To reducecomputation time using parallel computing techniques thegenomic sequence119876 is split into1198761015840 chunks Let t

119876119868represent

the time taken to split sequence 119876 into 1198761015840 chunks Sequencealignment considering one chunk of 1198761015840 and 1198771015840 is performedusing the BWA-SW algorithm DG(1198761015840) and PT(1198771015840) areinitially constructed The seed matches of sequences 1198761015840 and1198771015840 are computed using a dynamic programming mechanism

The seeds computed are extended to ensure rule alignment ofthe genomic sequences using the SW algorithm To reducecomputation time parallelization of the SW algorithm isconsidered in BWASW-PMR The parallelization techniqueto optimize computation is discussed in the latter subsection

Let tBWASW represent the time taken to align 1198761015840th chunkand 1198771015840 using the BWA-SW algorithmThe total time taken toalign the total 1198761015840 chunks is (1198761015840 times tBWASW) As 119901 computingcores are available with each worker node the parallelcomputation of 119901 number of chunks of 119876 is possible Thecomputation time considering all 1198761015840 chunks and utilizing119901 cores is defined as ((1198761015840 times tBWASW)119901) The alignmentlocations and scores obtained are stored in the cloud forreduce workers The time taken to store this data per mapworker is represented as tStore Data M The makespan of the119908th map worker node can be defined as

T119908 MAP = (tGet Data M + t

119876119868+

1198761015840times tBWASW119901

+ tStore Data M)

(8)

The average execution time of all the119908mapworker nodesis defined as

TMAP =sum119908

119894=1T119894 MAP

119908

(9)

32 Reduce Phase Themaster node in BWASW-PMR initial-izes119908 reduceworker nodes and119908mapworker nodes simulta-neouslyThis parallel initializationmechanism enables reduc-ing the total makespan T In the reduce phase the shuffle

BioMed Research International 7

Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19

Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29

Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39

Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49

Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59

(a)

Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19

Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29

Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39

Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49

Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59

(b)

Figure 4 Smith-Waterman algorithm (a) conventional computation technique and data dependencies and (b) wavefront based parallelizedtechnique adopted in BWASW-PMR

function obtains the intermediate data (produced by theworker nodes) from the cloud storage The alignment loca-tions obtained from the intermediate data are sorted basedon the offset data The time taken by the 119908th reduce workernode to obtain the intermediate data and perform shuffleand sort operations is represented as tGet Data R The reducefunction in BWASW-PMR is used to aggregate the alignmentlocations The overlapping and redundant alignments areneglected Parallelization of the reduce function is achievedby utilizing all 119901 computing cores available with workernodes Let tFn R119901 represent the time taken to compute thereduce function utilizing the 119901 computing cores The reducefunction provides the alignment results between sequences 119877and 119876 that are written to cloud storage for further analysisLet tStore Data R represent the time taken by the 119908th reduceworker node to write alignment results into the cloud storageThe makespan of the119908th reduce worker node can be definedas

T119908 REDUCE = (tGet Data R +

tFn R119901

+ tStore Data R) (10)

The average makespan of the 119908 reduce worker nodes that isTREDUCE is defined as

TREDUCE =sum119908

119894=1T119894 REDUCE119908

(11)

Using (11) and (9) the total makespan of the BWASW-PMRcan be defined as

T = TVM Config +sum119908

119894=1(T119894 MAP +T

119894 REDUCE)

119908

(12)

From (12) it can be observed thatT (the computing timeor makespan of BWASW-PMR) depends on the computationtime of map and reduce worker nodes The core alignmentsteps (based on the BWA-SW algorithm) are performed by

the map worker nodes that isTREDUCE ≪ TMAP The timetaken to initialize the map and reduce computing clustersbased on VMs that isTVM Config is dependent on the cloudplatform considered for deployment and can be neglectedTherefore it can be stated by the total makespan

T asymp

sum119908

119894=1(T119894 MAP)

119908

(13)

Optimizing the SW algorithm in BWA-SW is a possiblesolution to reduce the total makespanT

33 Smith-Waterman Algorithm Optimization In the SWalgorithm computation of the similarity matrix score thatis 119885 requires the maximum time Adopting a parallelizationtechnique for computation of 119885 is considered in BWASW-PMR Let us consider a query sequence 119876 and referencesequence 119877 whose sizes are 4 and 8 respectively Thereforethe matrix 119885 to be computed is of size (59) and is shown inFigure 4(a) Data dependencies that exist in computing eachelement of 119885 make the parallelization technique difficult toimplement For computation of 119885

34(shown as black lines in

Figure 4(a)) it requires that the computation of1198852311988524 and

11988533

must be completed Based on the sequences119876 and119877 andvalues of 119885

23 11988524 and 119885

33 the value of 119885

34is obtained

The computation of 119885 is more often done serially as shownthrough the grey arrows in Figure 4(a) To parallelize thecomputation of 119885 the adoption of CUDAGPU based tech-niques was considered by researchers [33 41] However theavailability and cost of such computing platforms on publicclouds are an issue Tomaximize resource utilization availablewith theVMworker node atminimal costs in BWASW-PMRwavefront based parallelization technique [42] is consideredThe parallel computation technique adopted is shown by greydiagonal lines in Figure 4(b) Computation of all cells of 119885that fall under the diagonal lines can be done in a parallelizedfashion

8 BioMed Research International

Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference

Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp

The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP

4 Evaluation Experiments

In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available

41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1

Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm

Number of bp considered for alignment

Sequence alignmentmdashexecution time

SWSW-optimized

01

1

10

100

Log

scal

ed ex

ecut

ion

time (

s)

sim5 k sim8 k sim10 k sim15 k

Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths

42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud

BioMed Research International 9

Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud

Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession

number ) Length

1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome V (BK0069392) 576874 bp

2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XI (BK0069442) 666816 bp

3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome X (BK0069432) 745751 bp

4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome II (BK0069362) 813184 bp

5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XVI (BK0069492) 948066 bp

1 2 3 4 5Experiment number

Long read alignmentmdashmakespan time

35

55

75

95

115

135

155

175

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Figure 6 Makespan observations considering 5 long read sequencealignment experiments

43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the

1000 5000 10000Length of reads

05

1015202530354045

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Long read alignmentmdashmakespan time on public cloud

Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments

execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study

Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce

10 BioMed Research International

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

2 4 6 8 10 12 14 16 180Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

5 10 15 20 250Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

5 10 15 20 25 300Makespan time (s)

Bwasw-Cloud (10000 b read)

Bwasw-Cloud (1000 b read)

Bwasw-Cloud (5000 b read)

(a)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 120Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 12 14 16 180Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

nLong read sequence aligner execution considering

5 10 15 200Makespan time (s)

BWASW-PMR (10000 b read)

BWASW-PMR (5000 b read)

BWASW-PMR (1000 b read)

(b)

Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes

function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs

The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results

obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform

5 Comparison Notes withRelated Work Counterparts

In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is

BioMed Research International 11

Table 3 Comparison of BWASW-PMR with its related work counterparts

BWASW-PMR Bwasw-Cloud[34]

MapReduce-BLAST[33] CloudAligner [13] BWA-SW

[10]Long sequence alignment support Yes Yes Yes Yes Yes

Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW

SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes

a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3

6 Conclusion and Future Work

Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long

sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study

In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)

References

[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004

[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics

[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov

[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009

[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988

12 BioMed Research International

[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997

[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002

[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010

[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012

[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1

[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999

[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004

[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010

[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010

[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014

[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics

[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us

[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics

[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-

ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014

[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012

[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015

[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005

[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with

MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service

for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009

[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009

[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011

[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014

[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010

[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011

[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981

[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982

[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000

[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011

[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995

[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml

BioMed Research International 13

[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg

[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 2: Research Article Long Read Alignment with Parallel ...downloads.hindawi.com/journals/bmri/2015/807407.pdfSmith-Waterman (SW) Algorithm. Burrows-Wheeler Aligner s Smith-Waterman (BWA-SW)

2 BioMed Research International

(1) Short read aligners used to align genomic sequenceswhose read length is between 32 bp and 200 bp thatis BWA [4] and Bowtie2 [5]

(2) Long read aligners used to align genomic sequenceswhose length is greater than 1000 bp that is Illumina

Examples of short read aligners include FASTA [6]BLAST [7] BLAT [8] SOAP [9] and Burrows-WheelerAligner (BWA) [4] For long read alignment researchershave proposed Burrows-Wheeler Alignerrsquos Smith-Waterman(BWA-SW) Alignment [10] Bowtie2 [5] Cushaw2 [11] andBWA-MEM [12]

Existing bioinformatics applications require both man-agement of huge amounts of data and heavy computationfor analysis Nearly 3 billion US dollars and 10 years wererequired to produce the initial human reference genomecontaining about 35 billion base pairs The latest develop-ments in sequencing technologies available produce enor-mous amount of data in terms of gigabytes per run [13]In NGS large sample size applications users are required towait for sufficient resources to become available in which thetime needed to complete processing becomes unpredictableAnalyzing such enormous amount of data and its storageare problems that exist To address these high computa-tion needs grid or cluster computing platforms have beenprovided to researchers [14] The provided grid or clustercomputation platforms are constrained by hardware capacityand concurrent access capacity to support multiple usersThe ever growing gap between computing capabilities andsequencing throughput is presented in [15]

Using cloud platforms one can solve the storage andcomputing problems that exist in gene sequencing Cloudcomputing platforms provide flexibility scalability and on-demand access to resources Moreover adoption of cloudcomputing technologies eliminates the costs incurred inestablishing maintaining large physical storage systemsand computing clusters Cloud users pay for the uti-lized storage and computing resources without worryingabout maintenance availability and reliability related issuesAlthough cloud computing provides scalable and flexibleinfrastructure parallel computing models on cloud plat-formsinfrastructure are required to achieve the desired goalof analyzing gene sequences One such framework for thecloud called MapReduce model [16] was introduced byGoogle Another model developed by University of Califor-nia at Berkeley proposed the Spark platform [17] for cloudcomputing The Pregel framework for cloud environmentsis presented in [18] A comprehensive survey of most recentresearch and development in cloud-based service computingsolutions in research of genome informatics is available in[19 20] In research of genomic analysis there are a numberof cloud computing providers that offer cloud-based servicessolutions such as Google Cloud Genomics [3] Amazon [21]andMicrosoft Azure [22]The necessity for cloud computingfor genomic analysis has been well discussed by leadersin bioinformatics and computational biology [23] Galaxy[24] is a powerful web-based system for performing genomeanalysis tasks and one of many models that manage bioinfor-matics databases Galaxy built on Amazonrsquos Elastic Compute

Cloud (EC2) with CloudMan framework to assist researchersexecutes their own Galaxy server on cloud infrastructureHowever Galaxy has data storage and data manipulationbottlenecks for large datasets with capability of analyzingone sample at a time and does not utilize the elastic cloudcompute capabilities completely This drawback is caused byits dependence on a single shared file system A significantbottleneck has been reported in [25] when processing largedatasets across distributed compute resources Among all thecloud computing platforms available Hadoop MapReduce isby far the most popular choice due to its ease to tune opensource nature and acceptability by industry and academicorganizations In order to utilize the full potential of cloudinfrastructure public cloud service providers offer virtualizedresources in terms of both hardware and software [26] Thevirtualization enables user specific customization flexibilityapplication execution environments and cost efficiency andminimizes power consumption [27 28]

The Hadoop framework [29] adopts a MapReduce modelfor computing a user specific application on cloud platformsIn the MapReduce model a dual phase execution approachis adopted In the initial phase input data to be processed issplit into chunks Each chunk is associated with a mapper ora map worker that provides ⟨Key | Value⟩ pairs as outputsThe outputs values are sorted on the basis of the Key valuesassociatedThe sorted values are provided to reduce workersthat is ⟨Key | SortedList(Value)⟩ Reduce workers store theresults inHadoop distributed file systemThemap and reduceworkers are generally virtual machines (VMs) in public cloudenvironments A simple MapReduce model deployed on theVM based computing environment is shown in Figure 1

Next-Generation Sequencing (NGS) tools like CloudA-ligner [13] Cloud Burst [30] SeqMapReduce [31] and Cross-bow [32] adopt the Hadoop frameworkThemajor drawbackof these alignment tools is that they are short read alignersThe short read aligners prove to be efficient when single-gapor ungapped alignment is to be computed Considering longreads these alignment algorithms exhibit performance deg-radation and affect accuracy proved by the results presentedin [10] For alignment considering long sequence readsoptimization of the BLAST algorithm and its deployment onHadoop platform is presented in [33] For long read alignersthe BWA-SW algorithm in [10] has been found to be efficientand suitable In [34] the Bwasw-Cloud algorithm consider-ing Hadoop platform is presented The Bwasw-Cloud adoptsBWA-SW algorithm for alignment Based on the literaturereviewed it is evident that limited work has been carried outconsidering long read aligners required to support analysis ofgenomic data generated from current and future sequencingtechnologies There is an increasingly urgent need to pro-vide a reliable and scalable support for the ever-increasingconvergence of NGS data However there is clearly a lackof standardized and affordable NGS management solutionson the cloud to support the growing needs of translationalgenomics research [20] This is a major motivating factor forthe authors of this paper

All the existing long read aligners for cloud environmentsconsider the Hadoop framework Genome alignment is an

BioMed Research International 3

Split

Merge

MergeMerge

Inpu

t

Out

put

Map

Map

Partition

Partition

Reduce

Reduce

Worker 1 on VM Worker 1 on VM

Worker N on VM Worker M on VM

Map stage Reduce stage

Cloud storage

Cloud storage

VM based virtualized computing environment

Cloud storage

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

Figure 1 MapReduce model deployed using VM based computing environment

iterative process and Hadoop incurs considerable perfor-mance overheads when iterative applications are hosted onthe framework [18 35]The aligners presented in [33 34] needto rely on multiway joins as genomic sequences are split andneeded to be merged prior to the reduce phase The perfor-mance of Hadoop suffers whenmultiway joins are considered[36]TheHadoop framework considers sequential processingof the map followed by reduce stage that affects performance[37]

Therefore in this paper we present Burrows-WheelerAlignerrsquos Smith-Waterman Alignment on Parallel MapRe-duce (BWASW-PMR) to perform long read alignments ona cloud platform BWASW-PMR adopts the BWA-SW pre-sented in [10] to perform long read alignments of genomicdata MapReduce model is considered for the execution onthe public cloud environment The Smith-Waterman (SW)algorithm [38 39] in BWA-SW is optimized using paral-lel computation technique To overcome the drawbacks ofHadoopMapReduce a parallel execution strategy of the mapand reduce workers is considered in BWASW-PMRThemapand reduce functions executions on the VM based workernodes are parallelized to reduce execution time The alignerpresented in [34] bears the closest similarity to the BWASW-PMR proposed and is considered for comparison The majorcontribution of our work can be summarized here

(i) Optimization of Smith-Waterman in the BWA-SWlong read sequence alignment

(ii) A custom MapReduce framework to support therequired computations for long sequence alignment

(iii) Parallel map and reduce workers execution strategy

(iv) Parallel execution of the map and reduce functions atworker nodes

This paper is organized as follows Section 2 presents relatedwork background Section 3 discusses the proposed Burrows-Wheeler Alignerrsquos Smith-Waterman Alignment on the Par-allel MapReduce BWASW-PMR The SW algorithm and itsconsidered optimization are also presented in Section 3 Theresults and the experimental study are shown in Section 4Comparison notes with related work counterparts are dis-cussed in the penultimate section The concluding remarksand future work are discussed in the last section

2 Background

21 Smith-Waterman (SW) Algorithm Burrows-WheelerAlignerrsquos Smith-Waterman (BWA-SW) Alignment relies onSW algorithm to align the seed matches of sequences Let 119902

0

and 1199021represent two genomic sequences obtained from the

seed matches The Smith-Waterman algorithm computes thesimilarity matrix score initially Using the similarity matrixand backtracking technique optimal alignment is obtainedLet X represent the size of 119902

0and Y represents the size of

1199021 For sequence 119902

0 there are X + 1 possible prefixes and

for 1199021 there areY + 1 possible prefixes including the empty

sequence Let 119902seq[1 Y] denote a prefix of Y charactersand 119902seq[X] represents the Xth character of a sequence119902seq The similarity score between prefixes 119902

0[1 119886] and

1199021[1 119887] is represented as 119902

119886119887 The similarity matrix is

denoted by 119885 and its size is (X + 1Y + 1) The first row and

4 BioMed Research International

column of 119885 are initialized to 0 The remaining elements of119885 (indexed by (119886 119887)) are computed using

119885119886119887

= max

0

119868119886119887

119869119886119887

119885119886minus1119887minus1

+119877 (119886 119887)

(1)

where 119877(119886 119887) is a function providing values of exact match ormismatch that is if 119902

0[119886] = 119902

1[119887] then119877(119886 119887) = 119883

119894(identical

characters among sequences 1199020and 1199021) otherwise 119877(119886 119887) =

119872119894(if characters among sequences 119902

0and 119902

1are unique) 119868

and 119869 represent thematrices in accordance with the affine gapmodel The matrices 119868 and 119869 are used to determine the gapsand are defined as

119868119886119887

= max

119868119886119887minus1

minus 119866119864

119885119886119887minus1

minus 119866119865

119869119886119887

= max

119869119886minus1119887

minus 119866119864

119885119886minus1119887

minus 119866119865

(2)

where 119866119864and 119866

119865represent the first and successive gap

penaltyBacktracking algorithm is used to find the optimal align-

ment between 1199020and 1199021 The backtracking begins with a cell

in 119885 that holds the highest score and proceeds till a zerovalued cell is reached Smith-Waterman algorithm providesoptimal alignments and similarity scores It must be notedthat computation ofmatrix119885 is bounded by the running timeof the slowest 119885 task due to the dependencies it exhibits

22 Burrows-Wheeler Alignerrsquos Smith-Waterman (BWA-SW)Alignment Algorithm BWA-SW algorithm constructs afull-text index in minute space (FM-index) [40] of the querysequence 119876 and the reference sequence 119877 A prefix directedacyclic word graph (119875119903119890119891119894119909 119863119860119882119866) is built using sequence119876 119901119903119890119891119894119909 119905119903119894119890 is built using the sequence 119877 The prefix trie isa tree representation of sequence 119877 The tree is constructedby concatenating all edge symbols from any node in the graphas a route to the root node providing a unique string Theunique string obtained is a substring of the sequence 119877 Eachnode of the tree is represented using 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897Traversing from nodes generates strings that are lexico-graphically sorted A node in 119901119903119890119891119894119909 119905119903119894119890 represents a stringThe 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 of the node and the string itrepresents are considered to be equivalent to each otherFrom 119901119903119890119891119894119909 119905119903119894119890 nodes exhibiting identical or similar119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 are collapsed to form 119875119903119890119891119894119909 119863119860119882119866Each node of 119875119903119890119891119894119909 119863119860119882119866 established represents oneor more substrings of 119877 Figure 2 represents 119901119903119890119891119894119909 119905119903119894119890119875119903119890119891119894119909 119863119860119882119866 and 119904119906119891119891119894119909 119886119903119903119886119910 of the input stringldquoBANANA$rdquo

Let PT(119883) and DG(119883) represent two functions thatare used to form 119901119903119890119891119894119909 119905119903119894119890 and 119875119903119890119891119894119909 119863119860119882119866 graph In

the BWA-SA algorithm PT(119877) and DG(119876) are initially com-puted Let 119886 be the root node of PT(119877) and 119887 represents theroot node of DG(119876) The best score between the sequences119876 and 119877 is computed using a dynamic programming mecha-nism Initialize 119864

119886119887= 119865119886119887= 119870119886119887= null considering the root

nodes of PT(119877) and DG(119876) For each of the parents that is119886119901 of the 119886th node in DG(119876) computation of 119865

119886119887|119886119901 119870119886119887|119886119901

and 119864

119886119887|119886119901is considered The computation of 119865

119886119887|119886119901is

119865119886119887|119886119901

= max 119865119886119901119887 119864119886119901119887

minus 119892119900 minus 119892119890 (3)

where 119892119900 is the gap open penalty and 119892119890 is the gap extensionpenalty119870

119886119887|119886119901is computed using

119870119886119887|119886119901

= max 119870119886119887119901 119864119886119887119901

minus 119892119900 minus 119892119890 (4)

where 119887119901is the parent of the 119887th node in PT(119877) The compu-

tation of 119864119886119887|119886119901

is achieved using

119864119886119887|119886119901

= max 119864119886119901119887119901

+ 119874 (119886119901 119886 119887119901 119887) 119865

119886119901119887119901 119870119886119901119887119901

0 (5)

where119874(119886119901 119886 119887119901 119887) represents the score between the symbol

on edge (119886119901 119886) and (119887

119901 119887) The computation of 119864

119886119887 119865119886119887 and

119870119886119887is defined as

(119864119886119887 119865119886119887 119870119886119887)

=

(119864119886119887|1198861015840 119865119886119887|1198861015840 119870119886119887|1198861015840) When 119864

119886119887|1198861015840 gt 0

(minusinfin minusinfin minusinfin) all other cases

(6)

where 1198861015840 = argmax119886119901isin119886119903(119886)

119864119886119887|119886119901

and 119886119903(119886) is a set containingparent nodes of 119886 Variable 119864

119886119887represents the best match-

ing score between the substrings that is substring 119886 andsubstring 119887 Consider 119864

119886119887gt 0 when the substrings 119887 and

119886 match It is known that the Smith-Waterman algorithmprovides accurate alignment results but requires large com-putation time To optimize the computations the reversepostorder traversal scheme on DG(119883) and PT(119883) is adoptedThe dynamic programming mechanism in the BWA-SWalgorithm enables identifying the seed matches of thegenomic sequences Based on the partial matches the Smith-Waterman Alignment is considered to the extended matches119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903 that is (119886 119887) is formed when the bestscore that is 119864

119886119887 is high and 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 size of 119887

is within the threshold setThe seedmatches are derived from119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903119904 by analyzing the suffix array of sequences119876 and119877Matching of the extended seedmatches using Smith-Waterman algorithm is considered only when high matchingscores are observed

3 Proposed Burrows-Wheeler AlignerrsquosSmith-Waterman Alignment on the ParallelMapReduce BWASW-PMR

BWASW-PMR provides a cloud platform to perform longread alignments considering genomic data obtained from

BioMed Research International 5

A NB$

$ N

$ NA

A

A

$

A

$ N

A

$

N

A

N

A

$

(a)

A NABANANA$

$ NA $ NA$

NA$$

(b)

B A N A N A $

NA

$

$

$

(c)

0 BANANA$

1 ANANA$B2 NANA$BA3 ANA$BAN4 NA$BANA5 A$BANAN6 $BANANA

String sorting

0 6 $BANAN1 5 A$BANA2 3 ANA$BA3 1 ANANA$

4 0 BANANA5 4 NA$BAN6 2 NANA$B

N

NB

$

A

A

A

(d)

Figure 2 Example of reference in a prefix trie query in prefix DAWG and the suffix array of the input string ldquoBANANA$rdquo (a) representsreference prefix trie searching for any substring of the string by staring at the root and followingmatches down the tree until being exhausted(b) represents prefix DAWG constructed by collapsing nodes with the identical suffix array interval DAWG transformed from the prefix trieof the query sequence (c) represents a constructed string suffix automaton (DAWG) (d) represents the suffix array for string ldquoBANANA$rdquoThe dollar sign $ is a regular expression that denotes the end of a line in the reference sequence

NGS techniques The BWASW-PMR adopts the MapReducecomputation model for cloud computation To support scal-ability in BWASW-PMR map and reduce worker nodes aredeployed on a cloud cluster consisting of VMs For long readalignments the BWA-SW algorithm is adopted The advan-tages of adopting the BWA-SW algorithm for long readalignments and its advantages over the existing aligners arefound in [10] The BWASW-PMR considers the genomicsequence alignment in dual phases that is map and reducephase Existing long read aligners for cloud platforms adoptthe Hadoop framework In the Hadoop framework basedsolutions the map phase is executed and then the reducephase is initiated To overcome the drawbacks of Hadoop aparallel execution strategy of the map and reduce phases isconsidered Optimization of the Smith-Waterman algorithmis an additional feature considered in BWASW-PMR Execu-tion of the map and reduce functions is modelled to run inparallel utilizing all programming cores available in workerVMs

31 Overview and Preliminaries Let 119877 represent a referencegenomic sequence and119876 the query sequence BWASW-PMR

is deployed on a cloud platform comprising a master nodemap worker nodes and reduce worker nodes The masternode of BWASW-PMR initializes 119908 map and reduce workernodes using virtual computing nodes Each computing nodeorVM is assumed to have119901CPUcores available for computa-tion LetTVM Config represent the time taken to initialize thevirtual computing platform The sequence 119877 is split into 1198771015840chunks with overlapping sections (the overlapping sectionsare depicted in grey in Figure 2) The reference chunk and 119876are sent to map worker nodes as input key value pairs Thekey value pairs considering the chunk 1198771015840 are represented as(119896119903 V119903) where 119896119903 is a key and V119903 contains the overlappingoffset data The key value pair of 119876 is represented as (119896119902 V119902)where 119896119902 is a key and V119902 is the query sequence In eachof the 119908 map workers sequence 119876 is further split into 119876

1015840

chunks and stored in the local memory available Alignmentis performed using the BWA-SW algorithm considering 1198771015840and each 119876

1015840 in a parallel fashion utilizing the 119901 coresThe Smith-Waterman algorithm in BWA-SW is optimizedby a parallelization technique to reduce execution time LetTMAP represent the average execution time of the 119908 mapworker nodes The map workers after computation provide

6 BioMed Research International

ChunkR

998400

Local memory1 2

Local memory1 2

Mapoutput

data

Mapoutput

data

Temporary cloud storageMap workers Cloud storageReduce workers

Shuffle Sort Reduce

Shuffle Sort Reduce

Chunk 1

Chunk 2

Burrows-Wheeler Alignerrsquos Smith-

Burrows-Wheeler Alignerrsquos Smith-

Smith-Waterman parallelized

Smith-Waterman parallelized

BWASW-PMR master node

Sequence alignment

results

Query genome Q

Reference genome R

Parallel execution platform

Parallel execution platform

Q998400

Q998400

Cloud storage

Waterman (BWA-SW) Alignment

Waterman (BWA-SW) Alignment

Virtual machinemdashmap worker w

Virtual machinemdashmap worker 1

Virtual machinemdashreduce worker w

Virtual machinemdashreduce worker 1

Figure 3 BWASW-PMR cloud model for long read sequence alignment

alignment locations between 119876 and chunk 1198771015840 along with thescore Multiple alignment locations and scores (ie V119898 alongthe chunk id of 1198771015840 and 119896119898) are stored in the temporary cloudmemory as list(119896119898 V119898) The map function of BWASW-PMRcan be defined as map((119896119903 V119903) (119896119902 V119902)) rarr list(119896119898 V119898)Reduce worker nodes 119908 obtain intermediate data that islist(119896119898 V119898) to perform the shuffle and sort function Inthe reduce phase aggregation of all alignment locationsthat is list(V119889) that are nonredundant and nonoverlappingis considered The reduce operation can be defined asreduce(119896119898 list(V119898)) rarr list(V119889) LetTREDUCE represent theaverage time required by 119908 reduce worker nodes to performthe shuffle sort and reduce computationThe totalmakespanof the BWASW-PMR cloud platform to align the sequence 119876against 119877 is computed as

T = TVM Config +TMAP +TREDUCE (7)

The overview of the BWASW-PMR cloud platform is shownin Figure 3

In the map phase of the BWASW-PMR cloud platformalignment locations and corresponding scores between119876 andchunk 119877

1015840 are computed The required data that is 119876 andchunk 1198771015840 is obtained from the cloud memory Let tGet Data Mrepresent the time taken to obtain the data To reducecomputation time using parallel computing techniques thegenomic sequence119876 is split into1198761015840 chunks Let t

119876119868represent

the time taken to split sequence 119876 into 1198761015840 chunks Sequencealignment considering one chunk of 1198761015840 and 1198771015840 is performedusing the BWA-SW algorithm DG(1198761015840) and PT(1198771015840) areinitially constructed The seed matches of sequences 1198761015840 and1198771015840 are computed using a dynamic programming mechanism

The seeds computed are extended to ensure rule alignment ofthe genomic sequences using the SW algorithm To reducecomputation time parallelization of the SW algorithm isconsidered in BWASW-PMR The parallelization techniqueto optimize computation is discussed in the latter subsection

Let tBWASW represent the time taken to align 1198761015840th chunkand 1198771015840 using the BWA-SW algorithmThe total time taken toalign the total 1198761015840 chunks is (1198761015840 times tBWASW) As 119901 computingcores are available with each worker node the parallelcomputation of 119901 number of chunks of 119876 is possible Thecomputation time considering all 1198761015840 chunks and utilizing119901 cores is defined as ((1198761015840 times tBWASW)119901) The alignmentlocations and scores obtained are stored in the cloud forreduce workers The time taken to store this data per mapworker is represented as tStore Data M The makespan of the119908th map worker node can be defined as

T119908 MAP = (tGet Data M + t

119876119868+

1198761015840times tBWASW119901

+ tStore Data M)

(8)

The average execution time of all the119908mapworker nodesis defined as

TMAP =sum119908

119894=1T119894 MAP

119908

(9)

32 Reduce Phase Themaster node in BWASW-PMR initial-izes119908 reduceworker nodes and119908mapworker nodes simulta-neouslyThis parallel initializationmechanism enables reduc-ing the total makespan T In the reduce phase the shuffle

BioMed Research International 7

Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19

Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29

Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39

Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49

Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59

(a)

Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19

Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29

Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39

Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49

Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59

(b)

Figure 4 Smith-Waterman algorithm (a) conventional computation technique and data dependencies and (b) wavefront based parallelizedtechnique adopted in BWASW-PMR

function obtains the intermediate data (produced by theworker nodes) from the cloud storage The alignment loca-tions obtained from the intermediate data are sorted basedon the offset data The time taken by the 119908th reduce workernode to obtain the intermediate data and perform shuffleand sort operations is represented as tGet Data R The reducefunction in BWASW-PMR is used to aggregate the alignmentlocations The overlapping and redundant alignments areneglected Parallelization of the reduce function is achievedby utilizing all 119901 computing cores available with workernodes Let tFn R119901 represent the time taken to compute thereduce function utilizing the 119901 computing cores The reducefunction provides the alignment results between sequences 119877and 119876 that are written to cloud storage for further analysisLet tStore Data R represent the time taken by the 119908th reduceworker node to write alignment results into the cloud storageThe makespan of the119908th reduce worker node can be definedas

T119908 REDUCE = (tGet Data R +

tFn R119901

+ tStore Data R) (10)

The average makespan of the 119908 reduce worker nodes that isTREDUCE is defined as

TREDUCE =sum119908

119894=1T119894 REDUCE119908

(11)

Using (11) and (9) the total makespan of the BWASW-PMRcan be defined as

T = TVM Config +sum119908

119894=1(T119894 MAP +T

119894 REDUCE)

119908

(12)

From (12) it can be observed thatT (the computing timeor makespan of BWASW-PMR) depends on the computationtime of map and reduce worker nodes The core alignmentsteps (based on the BWA-SW algorithm) are performed by

the map worker nodes that isTREDUCE ≪ TMAP The timetaken to initialize the map and reduce computing clustersbased on VMs that isTVM Config is dependent on the cloudplatform considered for deployment and can be neglectedTherefore it can be stated by the total makespan

T asymp

sum119908

119894=1(T119894 MAP)

119908

(13)

Optimizing the SW algorithm in BWA-SW is a possiblesolution to reduce the total makespanT

33 Smith-Waterman Algorithm Optimization In the SWalgorithm computation of the similarity matrix score thatis 119885 requires the maximum time Adopting a parallelizationtechnique for computation of 119885 is considered in BWASW-PMR Let us consider a query sequence 119876 and referencesequence 119877 whose sizes are 4 and 8 respectively Thereforethe matrix 119885 to be computed is of size (59) and is shown inFigure 4(a) Data dependencies that exist in computing eachelement of 119885 make the parallelization technique difficult toimplement For computation of 119885

34(shown as black lines in

Figure 4(a)) it requires that the computation of1198852311988524 and

11988533

must be completed Based on the sequences119876 and119877 andvalues of 119885

23 11988524 and 119885

33 the value of 119885

34is obtained

The computation of 119885 is more often done serially as shownthrough the grey arrows in Figure 4(a) To parallelize thecomputation of 119885 the adoption of CUDAGPU based tech-niques was considered by researchers [33 41] However theavailability and cost of such computing platforms on publicclouds are an issue Tomaximize resource utilization availablewith theVMworker node atminimal costs in BWASW-PMRwavefront based parallelization technique [42] is consideredThe parallel computation technique adopted is shown by greydiagonal lines in Figure 4(b) Computation of all cells of 119885that fall under the diagonal lines can be done in a parallelizedfashion

8 BioMed Research International

Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference

Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp

The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP

4 Evaluation Experiments

In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available

41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1

Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm

Number of bp considered for alignment

Sequence alignmentmdashexecution time

SWSW-optimized

01

1

10

100

Log

scal

ed ex

ecut

ion

time (

s)

sim5 k sim8 k sim10 k sim15 k

Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths

42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud

BioMed Research International 9

Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud

Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession

number ) Length

1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome V (BK0069392) 576874 bp

2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XI (BK0069442) 666816 bp

3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome X (BK0069432) 745751 bp

4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome II (BK0069362) 813184 bp

5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XVI (BK0069492) 948066 bp

1 2 3 4 5Experiment number

Long read alignmentmdashmakespan time

35

55

75

95

115

135

155

175

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Figure 6 Makespan observations considering 5 long read sequencealignment experiments

43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the

1000 5000 10000Length of reads

05

1015202530354045

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Long read alignmentmdashmakespan time on public cloud

Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments

execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study

Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce

10 BioMed Research International

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

2 4 6 8 10 12 14 16 180Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

5 10 15 20 250Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

5 10 15 20 25 300Makespan time (s)

Bwasw-Cloud (10000 b read)

Bwasw-Cloud (1000 b read)

Bwasw-Cloud (5000 b read)

(a)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 120Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 12 14 16 180Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

nLong read sequence aligner execution considering

5 10 15 200Makespan time (s)

BWASW-PMR (10000 b read)

BWASW-PMR (5000 b read)

BWASW-PMR (1000 b read)

(b)

Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes

function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs

The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results

obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform

5 Comparison Notes withRelated Work Counterparts

In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is

BioMed Research International 11

Table 3 Comparison of BWASW-PMR with its related work counterparts

BWASW-PMR Bwasw-Cloud[34]

MapReduce-BLAST[33] CloudAligner [13] BWA-SW

[10]Long sequence alignment support Yes Yes Yes Yes Yes

Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW

SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes

a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3

6 Conclusion and Future Work

Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long

sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study

In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)

References

[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004

[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics

[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov

[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009

[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988

12 BioMed Research International

[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997

[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002

[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010

[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012

[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1

[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999

[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004

[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010

[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010

[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014

[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics

[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us

[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics

[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-

ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014

[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012

[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015

[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005

[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with

MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service

for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009

[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009

[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011

[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014

[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010

[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011

[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981

[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982

[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000

[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011

[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995

[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml

BioMed Research International 13

[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg

[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 3: Research Article Long Read Alignment with Parallel ...downloads.hindawi.com/journals/bmri/2015/807407.pdfSmith-Waterman (SW) Algorithm. Burrows-Wheeler Aligner s Smith-Waterman (BWA-SW)

BioMed Research International 3

Split

Merge

MergeMerge

Inpu

t

Out

put

Map

Map

Partition

Partition

Reduce

Reduce

Worker 1 on VM Worker 1 on VM

Worker N on VM Worker M on VM

Map stage Reduce stage

Cloud storage

Cloud storage

VM based virtualized computing environment

Cloud storage

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

middot middot middot

Figure 1 MapReduce model deployed using VM based computing environment

iterative process and Hadoop incurs considerable perfor-mance overheads when iterative applications are hosted onthe framework [18 35]The aligners presented in [33 34] needto rely on multiway joins as genomic sequences are split andneeded to be merged prior to the reduce phase The perfor-mance of Hadoop suffers whenmultiway joins are considered[36]TheHadoop framework considers sequential processingof the map followed by reduce stage that affects performance[37]

Therefore in this paper we present Burrows-WheelerAlignerrsquos Smith-Waterman Alignment on Parallel MapRe-duce (BWASW-PMR) to perform long read alignments ona cloud platform BWASW-PMR adopts the BWA-SW pre-sented in [10] to perform long read alignments of genomicdata MapReduce model is considered for the execution onthe public cloud environment The Smith-Waterman (SW)algorithm [38 39] in BWA-SW is optimized using paral-lel computation technique To overcome the drawbacks ofHadoopMapReduce a parallel execution strategy of the mapand reduce workers is considered in BWASW-PMRThemapand reduce functions executions on the VM based workernodes are parallelized to reduce execution time The alignerpresented in [34] bears the closest similarity to the BWASW-PMR proposed and is considered for comparison The majorcontribution of our work can be summarized here

(i) Optimization of Smith-Waterman in the BWA-SWlong read sequence alignment

(ii) A custom MapReduce framework to support therequired computations for long sequence alignment

(iii) Parallel map and reduce workers execution strategy

(iv) Parallel execution of the map and reduce functions atworker nodes

This paper is organized as follows Section 2 presents relatedwork background Section 3 discusses the proposed Burrows-Wheeler Alignerrsquos Smith-Waterman Alignment on the Par-allel MapReduce BWASW-PMR The SW algorithm and itsconsidered optimization are also presented in Section 3 Theresults and the experimental study are shown in Section 4Comparison notes with related work counterparts are dis-cussed in the penultimate section The concluding remarksand future work are discussed in the last section

2 Background

21 Smith-Waterman (SW) Algorithm Burrows-WheelerAlignerrsquos Smith-Waterman (BWA-SW) Alignment relies onSW algorithm to align the seed matches of sequences Let 119902

0

and 1199021represent two genomic sequences obtained from the

seed matches The Smith-Waterman algorithm computes thesimilarity matrix score initially Using the similarity matrixand backtracking technique optimal alignment is obtainedLet X represent the size of 119902

0and Y represents the size of

1199021 For sequence 119902

0 there are X + 1 possible prefixes and

for 1199021 there areY + 1 possible prefixes including the empty

sequence Let 119902seq[1 Y] denote a prefix of Y charactersand 119902seq[X] represents the Xth character of a sequence119902seq The similarity score between prefixes 119902

0[1 119886] and

1199021[1 119887] is represented as 119902

119886119887 The similarity matrix is

denoted by 119885 and its size is (X + 1Y + 1) The first row and

4 BioMed Research International

column of 119885 are initialized to 0 The remaining elements of119885 (indexed by (119886 119887)) are computed using

119885119886119887

= max

0

119868119886119887

119869119886119887

119885119886minus1119887minus1

+119877 (119886 119887)

(1)

where 119877(119886 119887) is a function providing values of exact match ormismatch that is if 119902

0[119886] = 119902

1[119887] then119877(119886 119887) = 119883

119894(identical

characters among sequences 1199020and 1199021) otherwise 119877(119886 119887) =

119872119894(if characters among sequences 119902

0and 119902

1are unique) 119868

and 119869 represent thematrices in accordance with the affine gapmodel The matrices 119868 and 119869 are used to determine the gapsand are defined as

119868119886119887

= max

119868119886119887minus1

minus 119866119864

119885119886119887minus1

minus 119866119865

119869119886119887

= max

119869119886minus1119887

minus 119866119864

119885119886minus1119887

minus 119866119865

(2)

where 119866119864and 119866

119865represent the first and successive gap

penaltyBacktracking algorithm is used to find the optimal align-

ment between 1199020and 1199021 The backtracking begins with a cell

in 119885 that holds the highest score and proceeds till a zerovalued cell is reached Smith-Waterman algorithm providesoptimal alignments and similarity scores It must be notedthat computation ofmatrix119885 is bounded by the running timeof the slowest 119885 task due to the dependencies it exhibits

22 Burrows-Wheeler Alignerrsquos Smith-Waterman (BWA-SW)Alignment Algorithm BWA-SW algorithm constructs afull-text index in minute space (FM-index) [40] of the querysequence 119876 and the reference sequence 119877 A prefix directedacyclic word graph (119875119903119890119891119894119909 119863119860119882119866) is built using sequence119876 119901119903119890119891119894119909 119905119903119894119890 is built using the sequence 119877 The prefix trie isa tree representation of sequence 119877 The tree is constructedby concatenating all edge symbols from any node in the graphas a route to the root node providing a unique string Theunique string obtained is a substring of the sequence 119877 Eachnode of the tree is represented using 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897Traversing from nodes generates strings that are lexico-graphically sorted A node in 119901119903119890119891119894119909 119905119903119894119890 represents a stringThe 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 of the node and the string itrepresents are considered to be equivalent to each otherFrom 119901119903119890119891119894119909 119905119903119894119890 nodes exhibiting identical or similar119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 are collapsed to form 119875119903119890119891119894119909 119863119860119882119866Each node of 119875119903119890119891119894119909 119863119860119882119866 established represents oneor more substrings of 119877 Figure 2 represents 119901119903119890119891119894119909 119905119903119894119890119875119903119890119891119894119909 119863119860119882119866 and 119904119906119891119891119894119909 119886119903119903119886119910 of the input stringldquoBANANA$rdquo

Let PT(119883) and DG(119883) represent two functions thatare used to form 119901119903119890119891119894119909 119905119903119894119890 and 119875119903119890119891119894119909 119863119860119882119866 graph In

the BWA-SA algorithm PT(119877) and DG(119876) are initially com-puted Let 119886 be the root node of PT(119877) and 119887 represents theroot node of DG(119876) The best score between the sequences119876 and 119877 is computed using a dynamic programming mecha-nism Initialize 119864

119886119887= 119865119886119887= 119870119886119887= null considering the root

nodes of PT(119877) and DG(119876) For each of the parents that is119886119901 of the 119886th node in DG(119876) computation of 119865

119886119887|119886119901 119870119886119887|119886119901

and 119864

119886119887|119886119901is considered The computation of 119865

119886119887|119886119901is

119865119886119887|119886119901

= max 119865119886119901119887 119864119886119901119887

minus 119892119900 minus 119892119890 (3)

where 119892119900 is the gap open penalty and 119892119890 is the gap extensionpenalty119870

119886119887|119886119901is computed using

119870119886119887|119886119901

= max 119870119886119887119901 119864119886119887119901

minus 119892119900 minus 119892119890 (4)

where 119887119901is the parent of the 119887th node in PT(119877) The compu-

tation of 119864119886119887|119886119901

is achieved using

119864119886119887|119886119901

= max 119864119886119901119887119901

+ 119874 (119886119901 119886 119887119901 119887) 119865

119886119901119887119901 119870119886119901119887119901

0 (5)

where119874(119886119901 119886 119887119901 119887) represents the score between the symbol

on edge (119886119901 119886) and (119887

119901 119887) The computation of 119864

119886119887 119865119886119887 and

119870119886119887is defined as

(119864119886119887 119865119886119887 119870119886119887)

=

(119864119886119887|1198861015840 119865119886119887|1198861015840 119870119886119887|1198861015840) When 119864

119886119887|1198861015840 gt 0

(minusinfin minusinfin minusinfin) all other cases

(6)

where 1198861015840 = argmax119886119901isin119886119903(119886)

119864119886119887|119886119901

and 119886119903(119886) is a set containingparent nodes of 119886 Variable 119864

119886119887represents the best match-

ing score between the substrings that is substring 119886 andsubstring 119887 Consider 119864

119886119887gt 0 when the substrings 119887 and

119886 match It is known that the Smith-Waterman algorithmprovides accurate alignment results but requires large com-putation time To optimize the computations the reversepostorder traversal scheme on DG(119883) and PT(119883) is adoptedThe dynamic programming mechanism in the BWA-SWalgorithm enables identifying the seed matches of thegenomic sequences Based on the partial matches the Smith-Waterman Alignment is considered to the extended matches119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903 that is (119886 119887) is formed when the bestscore that is 119864

119886119887 is high and 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 size of 119887

is within the threshold setThe seedmatches are derived from119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903119904 by analyzing the suffix array of sequences119876 and119877Matching of the extended seedmatches using Smith-Waterman algorithm is considered only when high matchingscores are observed

3 Proposed Burrows-Wheeler AlignerrsquosSmith-Waterman Alignment on the ParallelMapReduce BWASW-PMR

BWASW-PMR provides a cloud platform to perform longread alignments considering genomic data obtained from

BioMed Research International 5

A NB$

$ N

$ NA

A

A

$

A

$ N

A

$

N

A

N

A

$

(a)

A NABANANA$

$ NA $ NA$

NA$$

(b)

B A N A N A $

NA

$

$

$

(c)

0 BANANA$

1 ANANA$B2 NANA$BA3 ANA$BAN4 NA$BANA5 A$BANAN6 $BANANA

String sorting

0 6 $BANAN1 5 A$BANA2 3 ANA$BA3 1 ANANA$

4 0 BANANA5 4 NA$BAN6 2 NANA$B

N

NB

$

A

A

A

(d)

Figure 2 Example of reference in a prefix trie query in prefix DAWG and the suffix array of the input string ldquoBANANA$rdquo (a) representsreference prefix trie searching for any substring of the string by staring at the root and followingmatches down the tree until being exhausted(b) represents prefix DAWG constructed by collapsing nodes with the identical suffix array interval DAWG transformed from the prefix trieof the query sequence (c) represents a constructed string suffix automaton (DAWG) (d) represents the suffix array for string ldquoBANANA$rdquoThe dollar sign $ is a regular expression that denotes the end of a line in the reference sequence

NGS techniques The BWASW-PMR adopts the MapReducecomputation model for cloud computation To support scal-ability in BWASW-PMR map and reduce worker nodes aredeployed on a cloud cluster consisting of VMs For long readalignments the BWA-SW algorithm is adopted The advan-tages of adopting the BWA-SW algorithm for long readalignments and its advantages over the existing aligners arefound in [10] The BWASW-PMR considers the genomicsequence alignment in dual phases that is map and reducephase Existing long read aligners for cloud platforms adoptthe Hadoop framework In the Hadoop framework basedsolutions the map phase is executed and then the reducephase is initiated To overcome the drawbacks of Hadoop aparallel execution strategy of the map and reduce phases isconsidered Optimization of the Smith-Waterman algorithmis an additional feature considered in BWASW-PMR Execu-tion of the map and reduce functions is modelled to run inparallel utilizing all programming cores available in workerVMs

31 Overview and Preliminaries Let 119877 represent a referencegenomic sequence and119876 the query sequence BWASW-PMR

is deployed on a cloud platform comprising a master nodemap worker nodes and reduce worker nodes The masternode of BWASW-PMR initializes 119908 map and reduce workernodes using virtual computing nodes Each computing nodeorVM is assumed to have119901CPUcores available for computa-tion LetTVM Config represent the time taken to initialize thevirtual computing platform The sequence 119877 is split into 1198771015840chunks with overlapping sections (the overlapping sectionsare depicted in grey in Figure 2) The reference chunk and 119876are sent to map worker nodes as input key value pairs Thekey value pairs considering the chunk 1198771015840 are represented as(119896119903 V119903) where 119896119903 is a key and V119903 contains the overlappingoffset data The key value pair of 119876 is represented as (119896119902 V119902)where 119896119902 is a key and V119902 is the query sequence In eachof the 119908 map workers sequence 119876 is further split into 119876

1015840

chunks and stored in the local memory available Alignmentis performed using the BWA-SW algorithm considering 1198771015840and each 119876

1015840 in a parallel fashion utilizing the 119901 coresThe Smith-Waterman algorithm in BWA-SW is optimizedby a parallelization technique to reduce execution time LetTMAP represent the average execution time of the 119908 mapworker nodes The map workers after computation provide

6 BioMed Research International

ChunkR

998400

Local memory1 2

Local memory1 2

Mapoutput

data

Mapoutput

data

Temporary cloud storageMap workers Cloud storageReduce workers

Shuffle Sort Reduce

Shuffle Sort Reduce

Chunk 1

Chunk 2

Burrows-Wheeler Alignerrsquos Smith-

Burrows-Wheeler Alignerrsquos Smith-

Smith-Waterman parallelized

Smith-Waterman parallelized

BWASW-PMR master node

Sequence alignment

results

Query genome Q

Reference genome R

Parallel execution platform

Parallel execution platform

Q998400

Q998400

Cloud storage

Waterman (BWA-SW) Alignment

Waterman (BWA-SW) Alignment

Virtual machinemdashmap worker w

Virtual machinemdashmap worker 1

Virtual machinemdashreduce worker w

Virtual machinemdashreduce worker 1

Figure 3 BWASW-PMR cloud model for long read sequence alignment

alignment locations between 119876 and chunk 1198771015840 along with thescore Multiple alignment locations and scores (ie V119898 alongthe chunk id of 1198771015840 and 119896119898) are stored in the temporary cloudmemory as list(119896119898 V119898) The map function of BWASW-PMRcan be defined as map((119896119903 V119903) (119896119902 V119902)) rarr list(119896119898 V119898)Reduce worker nodes 119908 obtain intermediate data that islist(119896119898 V119898) to perform the shuffle and sort function Inthe reduce phase aggregation of all alignment locationsthat is list(V119889) that are nonredundant and nonoverlappingis considered The reduce operation can be defined asreduce(119896119898 list(V119898)) rarr list(V119889) LetTREDUCE represent theaverage time required by 119908 reduce worker nodes to performthe shuffle sort and reduce computationThe totalmakespanof the BWASW-PMR cloud platform to align the sequence 119876against 119877 is computed as

T = TVM Config +TMAP +TREDUCE (7)

The overview of the BWASW-PMR cloud platform is shownin Figure 3

In the map phase of the BWASW-PMR cloud platformalignment locations and corresponding scores between119876 andchunk 119877

1015840 are computed The required data that is 119876 andchunk 1198771015840 is obtained from the cloud memory Let tGet Data Mrepresent the time taken to obtain the data To reducecomputation time using parallel computing techniques thegenomic sequence119876 is split into1198761015840 chunks Let t

119876119868represent

the time taken to split sequence 119876 into 1198761015840 chunks Sequencealignment considering one chunk of 1198761015840 and 1198771015840 is performedusing the BWA-SW algorithm DG(1198761015840) and PT(1198771015840) areinitially constructed The seed matches of sequences 1198761015840 and1198771015840 are computed using a dynamic programming mechanism

The seeds computed are extended to ensure rule alignment ofthe genomic sequences using the SW algorithm To reducecomputation time parallelization of the SW algorithm isconsidered in BWASW-PMR The parallelization techniqueto optimize computation is discussed in the latter subsection

Let tBWASW represent the time taken to align 1198761015840th chunkand 1198771015840 using the BWA-SW algorithmThe total time taken toalign the total 1198761015840 chunks is (1198761015840 times tBWASW) As 119901 computingcores are available with each worker node the parallelcomputation of 119901 number of chunks of 119876 is possible Thecomputation time considering all 1198761015840 chunks and utilizing119901 cores is defined as ((1198761015840 times tBWASW)119901) The alignmentlocations and scores obtained are stored in the cloud forreduce workers The time taken to store this data per mapworker is represented as tStore Data M The makespan of the119908th map worker node can be defined as

T119908 MAP = (tGet Data M + t

119876119868+

1198761015840times tBWASW119901

+ tStore Data M)

(8)

The average execution time of all the119908mapworker nodesis defined as

TMAP =sum119908

119894=1T119894 MAP

119908

(9)

32 Reduce Phase Themaster node in BWASW-PMR initial-izes119908 reduceworker nodes and119908mapworker nodes simulta-neouslyThis parallel initializationmechanism enables reduc-ing the total makespan T In the reduce phase the shuffle

BioMed Research International 7

Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19

Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29

Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39

Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49

Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59

(a)

Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19

Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29

Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39

Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49

Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59

(b)

Figure 4 Smith-Waterman algorithm (a) conventional computation technique and data dependencies and (b) wavefront based parallelizedtechnique adopted in BWASW-PMR

function obtains the intermediate data (produced by theworker nodes) from the cloud storage The alignment loca-tions obtained from the intermediate data are sorted basedon the offset data The time taken by the 119908th reduce workernode to obtain the intermediate data and perform shuffleand sort operations is represented as tGet Data R The reducefunction in BWASW-PMR is used to aggregate the alignmentlocations The overlapping and redundant alignments areneglected Parallelization of the reduce function is achievedby utilizing all 119901 computing cores available with workernodes Let tFn R119901 represent the time taken to compute thereduce function utilizing the 119901 computing cores The reducefunction provides the alignment results between sequences 119877and 119876 that are written to cloud storage for further analysisLet tStore Data R represent the time taken by the 119908th reduceworker node to write alignment results into the cloud storageThe makespan of the119908th reduce worker node can be definedas

T119908 REDUCE = (tGet Data R +

tFn R119901

+ tStore Data R) (10)

The average makespan of the 119908 reduce worker nodes that isTREDUCE is defined as

TREDUCE =sum119908

119894=1T119894 REDUCE119908

(11)

Using (11) and (9) the total makespan of the BWASW-PMRcan be defined as

T = TVM Config +sum119908

119894=1(T119894 MAP +T

119894 REDUCE)

119908

(12)

From (12) it can be observed thatT (the computing timeor makespan of BWASW-PMR) depends on the computationtime of map and reduce worker nodes The core alignmentsteps (based on the BWA-SW algorithm) are performed by

the map worker nodes that isTREDUCE ≪ TMAP The timetaken to initialize the map and reduce computing clustersbased on VMs that isTVM Config is dependent on the cloudplatform considered for deployment and can be neglectedTherefore it can be stated by the total makespan

T asymp

sum119908

119894=1(T119894 MAP)

119908

(13)

Optimizing the SW algorithm in BWA-SW is a possiblesolution to reduce the total makespanT

33 Smith-Waterman Algorithm Optimization In the SWalgorithm computation of the similarity matrix score thatis 119885 requires the maximum time Adopting a parallelizationtechnique for computation of 119885 is considered in BWASW-PMR Let us consider a query sequence 119876 and referencesequence 119877 whose sizes are 4 and 8 respectively Thereforethe matrix 119885 to be computed is of size (59) and is shown inFigure 4(a) Data dependencies that exist in computing eachelement of 119885 make the parallelization technique difficult toimplement For computation of 119885

34(shown as black lines in

Figure 4(a)) it requires that the computation of1198852311988524 and

11988533

must be completed Based on the sequences119876 and119877 andvalues of 119885

23 11988524 and 119885

33 the value of 119885

34is obtained

The computation of 119885 is more often done serially as shownthrough the grey arrows in Figure 4(a) To parallelize thecomputation of 119885 the adoption of CUDAGPU based tech-niques was considered by researchers [33 41] However theavailability and cost of such computing platforms on publicclouds are an issue Tomaximize resource utilization availablewith theVMworker node atminimal costs in BWASW-PMRwavefront based parallelization technique [42] is consideredThe parallel computation technique adopted is shown by greydiagonal lines in Figure 4(b) Computation of all cells of 119885that fall under the diagonal lines can be done in a parallelizedfashion

8 BioMed Research International

Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference

Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp

The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP

4 Evaluation Experiments

In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available

41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1

Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm

Number of bp considered for alignment

Sequence alignmentmdashexecution time

SWSW-optimized

01

1

10

100

Log

scal

ed ex

ecut

ion

time (

s)

sim5 k sim8 k sim10 k sim15 k

Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths

42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud

BioMed Research International 9

Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud

Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession

number ) Length

1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome V (BK0069392) 576874 bp

2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XI (BK0069442) 666816 bp

3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome X (BK0069432) 745751 bp

4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome II (BK0069362) 813184 bp

5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XVI (BK0069492) 948066 bp

1 2 3 4 5Experiment number

Long read alignmentmdashmakespan time

35

55

75

95

115

135

155

175

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Figure 6 Makespan observations considering 5 long read sequencealignment experiments

43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the

1000 5000 10000Length of reads

05

1015202530354045

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Long read alignmentmdashmakespan time on public cloud

Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments

execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study

Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce

10 BioMed Research International

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

2 4 6 8 10 12 14 16 180Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

5 10 15 20 250Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

5 10 15 20 25 300Makespan time (s)

Bwasw-Cloud (10000 b read)

Bwasw-Cloud (1000 b read)

Bwasw-Cloud (5000 b read)

(a)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 120Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 12 14 16 180Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

nLong read sequence aligner execution considering

5 10 15 200Makespan time (s)

BWASW-PMR (10000 b read)

BWASW-PMR (5000 b read)

BWASW-PMR (1000 b read)

(b)

Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes

function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs

The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results

obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform

5 Comparison Notes withRelated Work Counterparts

In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is

BioMed Research International 11

Table 3 Comparison of BWASW-PMR with its related work counterparts

BWASW-PMR Bwasw-Cloud[34]

MapReduce-BLAST[33] CloudAligner [13] BWA-SW

[10]Long sequence alignment support Yes Yes Yes Yes Yes

Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW

SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes

a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3

6 Conclusion and Future Work

Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long

sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study

In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)

References

[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004

[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics

[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov

[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009

[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988

12 BioMed Research International

[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997

[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002

[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010

[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012

[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1

[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999

[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004

[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010

[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010

[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014

[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics

[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us

[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics

[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-

ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014

[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012

[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015

[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005

[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with

MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service

for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009

[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009

[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011

[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014

[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010

[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011

[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981

[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982

[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000

[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011

[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995

[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml

BioMed Research International 13

[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg

[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 4: Research Article Long Read Alignment with Parallel ...downloads.hindawi.com/journals/bmri/2015/807407.pdfSmith-Waterman (SW) Algorithm. Burrows-Wheeler Aligner s Smith-Waterman (BWA-SW)

4 BioMed Research International

column of 119885 are initialized to 0 The remaining elements of119885 (indexed by (119886 119887)) are computed using

119885119886119887

= max

0

119868119886119887

119869119886119887

119885119886minus1119887minus1

+119877 (119886 119887)

(1)

where 119877(119886 119887) is a function providing values of exact match ormismatch that is if 119902

0[119886] = 119902

1[119887] then119877(119886 119887) = 119883

119894(identical

characters among sequences 1199020and 1199021) otherwise 119877(119886 119887) =

119872119894(if characters among sequences 119902

0and 119902

1are unique) 119868

and 119869 represent thematrices in accordance with the affine gapmodel The matrices 119868 and 119869 are used to determine the gapsand are defined as

119868119886119887

= max

119868119886119887minus1

minus 119866119864

119885119886119887minus1

minus 119866119865

119869119886119887

= max

119869119886minus1119887

minus 119866119864

119885119886minus1119887

minus 119866119865

(2)

where 119866119864and 119866

119865represent the first and successive gap

penaltyBacktracking algorithm is used to find the optimal align-

ment between 1199020and 1199021 The backtracking begins with a cell

in 119885 that holds the highest score and proceeds till a zerovalued cell is reached Smith-Waterman algorithm providesoptimal alignments and similarity scores It must be notedthat computation ofmatrix119885 is bounded by the running timeof the slowest 119885 task due to the dependencies it exhibits

22 Burrows-Wheeler Alignerrsquos Smith-Waterman (BWA-SW)Alignment Algorithm BWA-SW algorithm constructs afull-text index in minute space (FM-index) [40] of the querysequence 119876 and the reference sequence 119877 A prefix directedacyclic word graph (119875119903119890119891119894119909 119863119860119882119866) is built using sequence119876 119901119903119890119891119894119909 119905119903119894119890 is built using the sequence 119877 The prefix trie isa tree representation of sequence 119877 The tree is constructedby concatenating all edge symbols from any node in the graphas a route to the root node providing a unique string Theunique string obtained is a substring of the sequence 119877 Eachnode of the tree is represented using 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897Traversing from nodes generates strings that are lexico-graphically sorted A node in 119901119903119890119891119894119909 119905119903119894119890 represents a stringThe 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 of the node and the string itrepresents are considered to be equivalent to each otherFrom 119901119903119890119891119894119909 119905119903119894119890 nodes exhibiting identical or similar119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 are collapsed to form 119875119903119890119891119894119909 119863119860119882119866Each node of 119875119903119890119891119894119909 119863119860119882119866 established represents oneor more substrings of 119877 Figure 2 represents 119901119903119890119891119894119909 119905119903119894119890119875119903119890119891119894119909 119863119860119882119866 and 119904119906119891119891119894119909 119886119903119903119886119910 of the input stringldquoBANANA$rdquo

Let PT(119883) and DG(119883) represent two functions thatare used to form 119901119903119890119891119894119909 119905119903119894119890 and 119875119903119890119891119894119909 119863119860119882119866 graph In

the BWA-SA algorithm PT(119877) and DG(119876) are initially com-puted Let 119886 be the root node of PT(119877) and 119887 represents theroot node of DG(119876) The best score between the sequences119876 and 119877 is computed using a dynamic programming mecha-nism Initialize 119864

119886119887= 119865119886119887= 119870119886119887= null considering the root

nodes of PT(119877) and DG(119876) For each of the parents that is119886119901 of the 119886th node in DG(119876) computation of 119865

119886119887|119886119901 119870119886119887|119886119901

and 119864

119886119887|119886119901is considered The computation of 119865

119886119887|119886119901is

119865119886119887|119886119901

= max 119865119886119901119887 119864119886119901119887

minus 119892119900 minus 119892119890 (3)

where 119892119900 is the gap open penalty and 119892119890 is the gap extensionpenalty119870

119886119887|119886119901is computed using

119870119886119887|119886119901

= max 119870119886119887119901 119864119886119887119901

minus 119892119900 minus 119892119890 (4)

where 119887119901is the parent of the 119887th node in PT(119877) The compu-

tation of 119864119886119887|119886119901

is achieved using

119864119886119887|119886119901

= max 119864119886119901119887119901

+ 119874 (119886119901 119886 119887119901 119887) 119865

119886119901119887119901 119870119886119901119887119901

0 (5)

where119874(119886119901 119886 119887119901 119887) represents the score between the symbol

on edge (119886119901 119886) and (119887

119901 119887) The computation of 119864

119886119887 119865119886119887 and

119870119886119887is defined as

(119864119886119887 119865119886119887 119870119886119887)

=

(119864119886119887|1198861015840 119865119886119887|1198861015840 119870119886119887|1198861015840) When 119864

119886119887|1198861015840 gt 0

(minusinfin minusinfin minusinfin) all other cases

(6)

where 1198861015840 = argmax119886119901isin119886119903(119886)

119864119886119887|119886119901

and 119886119903(119886) is a set containingparent nodes of 119886 Variable 119864

119886119887represents the best match-

ing score between the substrings that is substring 119886 andsubstring 119887 Consider 119864

119886119887gt 0 when the substrings 119887 and

119886 match It is known that the Smith-Waterman algorithmprovides accurate alignment results but requires large com-putation time To optimize the computations the reversepostorder traversal scheme on DG(119883) and PT(119883) is adoptedThe dynamic programming mechanism in the BWA-SWalgorithm enables identifying the seed matches of thegenomic sequences Based on the partial matches the Smith-Waterman Alignment is considered to the extended matches119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903 that is (119886 119887) is formed when the bestscore that is 119864

119886119887 is high and 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 size of 119887

is within the threshold setThe seedmatches are derived from119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903119904 by analyzing the suffix array of sequences119876 and119877Matching of the extended seedmatches using Smith-Waterman algorithm is considered only when high matchingscores are observed

3 Proposed Burrows-Wheeler AlignerrsquosSmith-Waterman Alignment on the ParallelMapReduce BWASW-PMR

BWASW-PMR provides a cloud platform to perform longread alignments considering genomic data obtained from

BioMed Research International 5

A NB$

$ N

$ NA

A

A

$

A

$ N

A

$

N

A

N

A

$

(a)

A NABANANA$

$ NA $ NA$

NA$$

(b)

B A N A N A $

NA

$

$

$

(c)

0 BANANA$

1 ANANA$B2 NANA$BA3 ANA$BAN4 NA$BANA5 A$BANAN6 $BANANA

String sorting

0 6 $BANAN1 5 A$BANA2 3 ANA$BA3 1 ANANA$

4 0 BANANA5 4 NA$BAN6 2 NANA$B

N

NB

$

A

A

A

(d)

Figure 2 Example of reference in a prefix trie query in prefix DAWG and the suffix array of the input string ldquoBANANA$rdquo (a) representsreference prefix trie searching for any substring of the string by staring at the root and followingmatches down the tree until being exhausted(b) represents prefix DAWG constructed by collapsing nodes with the identical suffix array interval DAWG transformed from the prefix trieof the query sequence (c) represents a constructed string suffix automaton (DAWG) (d) represents the suffix array for string ldquoBANANA$rdquoThe dollar sign $ is a regular expression that denotes the end of a line in the reference sequence

NGS techniques The BWASW-PMR adopts the MapReducecomputation model for cloud computation To support scal-ability in BWASW-PMR map and reduce worker nodes aredeployed on a cloud cluster consisting of VMs For long readalignments the BWA-SW algorithm is adopted The advan-tages of adopting the BWA-SW algorithm for long readalignments and its advantages over the existing aligners arefound in [10] The BWASW-PMR considers the genomicsequence alignment in dual phases that is map and reducephase Existing long read aligners for cloud platforms adoptthe Hadoop framework In the Hadoop framework basedsolutions the map phase is executed and then the reducephase is initiated To overcome the drawbacks of Hadoop aparallel execution strategy of the map and reduce phases isconsidered Optimization of the Smith-Waterman algorithmis an additional feature considered in BWASW-PMR Execu-tion of the map and reduce functions is modelled to run inparallel utilizing all programming cores available in workerVMs

31 Overview and Preliminaries Let 119877 represent a referencegenomic sequence and119876 the query sequence BWASW-PMR

is deployed on a cloud platform comprising a master nodemap worker nodes and reduce worker nodes The masternode of BWASW-PMR initializes 119908 map and reduce workernodes using virtual computing nodes Each computing nodeorVM is assumed to have119901CPUcores available for computa-tion LetTVM Config represent the time taken to initialize thevirtual computing platform The sequence 119877 is split into 1198771015840chunks with overlapping sections (the overlapping sectionsare depicted in grey in Figure 2) The reference chunk and 119876are sent to map worker nodes as input key value pairs Thekey value pairs considering the chunk 1198771015840 are represented as(119896119903 V119903) where 119896119903 is a key and V119903 contains the overlappingoffset data The key value pair of 119876 is represented as (119896119902 V119902)where 119896119902 is a key and V119902 is the query sequence In eachof the 119908 map workers sequence 119876 is further split into 119876

1015840

chunks and stored in the local memory available Alignmentis performed using the BWA-SW algorithm considering 1198771015840and each 119876

1015840 in a parallel fashion utilizing the 119901 coresThe Smith-Waterman algorithm in BWA-SW is optimizedby a parallelization technique to reduce execution time LetTMAP represent the average execution time of the 119908 mapworker nodes The map workers after computation provide

6 BioMed Research International

ChunkR

998400

Local memory1 2

Local memory1 2

Mapoutput

data

Mapoutput

data

Temporary cloud storageMap workers Cloud storageReduce workers

Shuffle Sort Reduce

Shuffle Sort Reduce

Chunk 1

Chunk 2

Burrows-Wheeler Alignerrsquos Smith-

Burrows-Wheeler Alignerrsquos Smith-

Smith-Waterman parallelized

Smith-Waterman parallelized

BWASW-PMR master node

Sequence alignment

results

Query genome Q

Reference genome R

Parallel execution platform

Parallel execution platform

Q998400

Q998400

Cloud storage

Waterman (BWA-SW) Alignment

Waterman (BWA-SW) Alignment

Virtual machinemdashmap worker w

Virtual machinemdashmap worker 1

Virtual machinemdashreduce worker w

Virtual machinemdashreduce worker 1

Figure 3 BWASW-PMR cloud model for long read sequence alignment

alignment locations between 119876 and chunk 1198771015840 along with thescore Multiple alignment locations and scores (ie V119898 alongthe chunk id of 1198771015840 and 119896119898) are stored in the temporary cloudmemory as list(119896119898 V119898) The map function of BWASW-PMRcan be defined as map((119896119903 V119903) (119896119902 V119902)) rarr list(119896119898 V119898)Reduce worker nodes 119908 obtain intermediate data that islist(119896119898 V119898) to perform the shuffle and sort function Inthe reduce phase aggregation of all alignment locationsthat is list(V119889) that are nonredundant and nonoverlappingis considered The reduce operation can be defined asreduce(119896119898 list(V119898)) rarr list(V119889) LetTREDUCE represent theaverage time required by 119908 reduce worker nodes to performthe shuffle sort and reduce computationThe totalmakespanof the BWASW-PMR cloud platform to align the sequence 119876against 119877 is computed as

T = TVM Config +TMAP +TREDUCE (7)

The overview of the BWASW-PMR cloud platform is shownin Figure 3

In the map phase of the BWASW-PMR cloud platformalignment locations and corresponding scores between119876 andchunk 119877

1015840 are computed The required data that is 119876 andchunk 1198771015840 is obtained from the cloud memory Let tGet Data Mrepresent the time taken to obtain the data To reducecomputation time using parallel computing techniques thegenomic sequence119876 is split into1198761015840 chunks Let t

119876119868represent

the time taken to split sequence 119876 into 1198761015840 chunks Sequencealignment considering one chunk of 1198761015840 and 1198771015840 is performedusing the BWA-SW algorithm DG(1198761015840) and PT(1198771015840) areinitially constructed The seed matches of sequences 1198761015840 and1198771015840 are computed using a dynamic programming mechanism

The seeds computed are extended to ensure rule alignment ofthe genomic sequences using the SW algorithm To reducecomputation time parallelization of the SW algorithm isconsidered in BWASW-PMR The parallelization techniqueto optimize computation is discussed in the latter subsection

Let tBWASW represent the time taken to align 1198761015840th chunkand 1198771015840 using the BWA-SW algorithmThe total time taken toalign the total 1198761015840 chunks is (1198761015840 times tBWASW) As 119901 computingcores are available with each worker node the parallelcomputation of 119901 number of chunks of 119876 is possible Thecomputation time considering all 1198761015840 chunks and utilizing119901 cores is defined as ((1198761015840 times tBWASW)119901) The alignmentlocations and scores obtained are stored in the cloud forreduce workers The time taken to store this data per mapworker is represented as tStore Data M The makespan of the119908th map worker node can be defined as

T119908 MAP = (tGet Data M + t

119876119868+

1198761015840times tBWASW119901

+ tStore Data M)

(8)

The average execution time of all the119908mapworker nodesis defined as

TMAP =sum119908

119894=1T119894 MAP

119908

(9)

32 Reduce Phase Themaster node in BWASW-PMR initial-izes119908 reduceworker nodes and119908mapworker nodes simulta-neouslyThis parallel initializationmechanism enables reduc-ing the total makespan T In the reduce phase the shuffle

BioMed Research International 7

Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19

Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29

Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39

Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49

Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59

(a)

Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19

Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29

Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39

Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49

Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59

(b)

Figure 4 Smith-Waterman algorithm (a) conventional computation technique and data dependencies and (b) wavefront based parallelizedtechnique adopted in BWASW-PMR

function obtains the intermediate data (produced by theworker nodes) from the cloud storage The alignment loca-tions obtained from the intermediate data are sorted basedon the offset data The time taken by the 119908th reduce workernode to obtain the intermediate data and perform shuffleand sort operations is represented as tGet Data R The reducefunction in BWASW-PMR is used to aggregate the alignmentlocations The overlapping and redundant alignments areneglected Parallelization of the reduce function is achievedby utilizing all 119901 computing cores available with workernodes Let tFn R119901 represent the time taken to compute thereduce function utilizing the 119901 computing cores The reducefunction provides the alignment results between sequences 119877and 119876 that are written to cloud storage for further analysisLet tStore Data R represent the time taken by the 119908th reduceworker node to write alignment results into the cloud storageThe makespan of the119908th reduce worker node can be definedas

T119908 REDUCE = (tGet Data R +

tFn R119901

+ tStore Data R) (10)

The average makespan of the 119908 reduce worker nodes that isTREDUCE is defined as

TREDUCE =sum119908

119894=1T119894 REDUCE119908

(11)

Using (11) and (9) the total makespan of the BWASW-PMRcan be defined as

T = TVM Config +sum119908

119894=1(T119894 MAP +T

119894 REDUCE)

119908

(12)

From (12) it can be observed thatT (the computing timeor makespan of BWASW-PMR) depends on the computationtime of map and reduce worker nodes The core alignmentsteps (based on the BWA-SW algorithm) are performed by

the map worker nodes that isTREDUCE ≪ TMAP The timetaken to initialize the map and reduce computing clustersbased on VMs that isTVM Config is dependent on the cloudplatform considered for deployment and can be neglectedTherefore it can be stated by the total makespan

T asymp

sum119908

119894=1(T119894 MAP)

119908

(13)

Optimizing the SW algorithm in BWA-SW is a possiblesolution to reduce the total makespanT

33 Smith-Waterman Algorithm Optimization In the SWalgorithm computation of the similarity matrix score thatis 119885 requires the maximum time Adopting a parallelizationtechnique for computation of 119885 is considered in BWASW-PMR Let us consider a query sequence 119876 and referencesequence 119877 whose sizes are 4 and 8 respectively Thereforethe matrix 119885 to be computed is of size (59) and is shown inFigure 4(a) Data dependencies that exist in computing eachelement of 119885 make the parallelization technique difficult toimplement For computation of 119885

34(shown as black lines in

Figure 4(a)) it requires that the computation of1198852311988524 and

11988533

must be completed Based on the sequences119876 and119877 andvalues of 119885

23 11988524 and 119885

33 the value of 119885

34is obtained

The computation of 119885 is more often done serially as shownthrough the grey arrows in Figure 4(a) To parallelize thecomputation of 119885 the adoption of CUDAGPU based tech-niques was considered by researchers [33 41] However theavailability and cost of such computing platforms on publicclouds are an issue Tomaximize resource utilization availablewith theVMworker node atminimal costs in BWASW-PMRwavefront based parallelization technique [42] is consideredThe parallel computation technique adopted is shown by greydiagonal lines in Figure 4(b) Computation of all cells of 119885that fall under the diagonal lines can be done in a parallelizedfashion

8 BioMed Research International

Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference

Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp

The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP

4 Evaluation Experiments

In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available

41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1

Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm

Number of bp considered for alignment

Sequence alignmentmdashexecution time

SWSW-optimized

01

1

10

100

Log

scal

ed ex

ecut

ion

time (

s)

sim5 k sim8 k sim10 k sim15 k

Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths

42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud

BioMed Research International 9

Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud

Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession

number ) Length

1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome V (BK0069392) 576874 bp

2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XI (BK0069442) 666816 bp

3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome X (BK0069432) 745751 bp

4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome II (BK0069362) 813184 bp

5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XVI (BK0069492) 948066 bp

1 2 3 4 5Experiment number

Long read alignmentmdashmakespan time

35

55

75

95

115

135

155

175

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Figure 6 Makespan observations considering 5 long read sequencealignment experiments

43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the

1000 5000 10000Length of reads

05

1015202530354045

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Long read alignmentmdashmakespan time on public cloud

Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments

execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study

Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce

10 BioMed Research International

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

2 4 6 8 10 12 14 16 180Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

5 10 15 20 250Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

5 10 15 20 25 300Makespan time (s)

Bwasw-Cloud (10000 b read)

Bwasw-Cloud (1000 b read)

Bwasw-Cloud (5000 b read)

(a)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 120Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 12 14 16 180Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

nLong read sequence aligner execution considering

5 10 15 200Makespan time (s)

BWASW-PMR (10000 b read)

BWASW-PMR (5000 b read)

BWASW-PMR (1000 b read)

(b)

Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes

function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs

The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results

obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform

5 Comparison Notes withRelated Work Counterparts

In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is

BioMed Research International 11

Table 3 Comparison of BWASW-PMR with its related work counterparts

BWASW-PMR Bwasw-Cloud[34]

MapReduce-BLAST[33] CloudAligner [13] BWA-SW

[10]Long sequence alignment support Yes Yes Yes Yes Yes

Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW

SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes

a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3

6 Conclusion and Future Work

Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long

sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study

In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)

References

[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004

[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics

[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov

[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009

[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988

12 BioMed Research International

[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997

[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002

[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010

[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012

[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1

[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999

[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004

[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010

[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010

[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014

[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics

[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us

[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics

[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-

ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014

[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012

[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015

[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005

[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with

MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service

for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009

[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009

[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011

[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014

[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010

[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011

[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981

[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982

[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000

[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011

[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995

[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml

BioMed Research International 13

[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg

[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 5: Research Article Long Read Alignment with Parallel ...downloads.hindawi.com/journals/bmri/2015/807407.pdfSmith-Waterman (SW) Algorithm. Burrows-Wheeler Aligner s Smith-Waterman (BWA-SW)

BioMed Research International 5

A NB$

$ N

$ NA

A

A

$

A

$ N

A

$

N

A

N

A

$

(a)

A NABANANA$

$ NA $ NA$

NA$$

(b)

B A N A N A $

NA

$

$

$

(c)

0 BANANA$

1 ANANA$B2 NANA$BA3 ANA$BAN4 NA$BANA5 A$BANAN6 $BANANA

String sorting

0 6 $BANAN1 5 A$BANA2 3 ANA$BA3 1 ANANA$

4 0 BANANA5 4 NA$BAN6 2 NANA$B

N

NB

$

A

A

A

(d)

Figure 2 Example of reference in a prefix trie query in prefix DAWG and the suffix array of the input string ldquoBANANA$rdquo (a) representsreference prefix trie searching for any substring of the string by staring at the root and followingmatches down the tree until being exhausted(b) represents prefix DAWG constructed by collapsing nodes with the identical suffix array interval DAWG transformed from the prefix trieof the query sequence (c) represents a constructed string suffix automaton (DAWG) (d) represents the suffix array for string ldquoBANANA$rdquoThe dollar sign $ is a regular expression that denotes the end of a line in the reference sequence

NGS techniques The BWASW-PMR adopts the MapReducecomputation model for cloud computation To support scal-ability in BWASW-PMR map and reduce worker nodes aredeployed on a cloud cluster consisting of VMs For long readalignments the BWA-SW algorithm is adopted The advan-tages of adopting the BWA-SW algorithm for long readalignments and its advantages over the existing aligners arefound in [10] The BWASW-PMR considers the genomicsequence alignment in dual phases that is map and reducephase Existing long read aligners for cloud platforms adoptthe Hadoop framework In the Hadoop framework basedsolutions the map phase is executed and then the reducephase is initiated To overcome the drawbacks of Hadoop aparallel execution strategy of the map and reduce phases isconsidered Optimization of the Smith-Waterman algorithmis an additional feature considered in BWASW-PMR Execu-tion of the map and reduce functions is modelled to run inparallel utilizing all programming cores available in workerVMs

31 Overview and Preliminaries Let 119877 represent a referencegenomic sequence and119876 the query sequence BWASW-PMR

is deployed on a cloud platform comprising a master nodemap worker nodes and reduce worker nodes The masternode of BWASW-PMR initializes 119908 map and reduce workernodes using virtual computing nodes Each computing nodeorVM is assumed to have119901CPUcores available for computa-tion LetTVM Config represent the time taken to initialize thevirtual computing platform The sequence 119877 is split into 1198771015840chunks with overlapping sections (the overlapping sectionsare depicted in grey in Figure 2) The reference chunk and 119876are sent to map worker nodes as input key value pairs Thekey value pairs considering the chunk 1198771015840 are represented as(119896119903 V119903) where 119896119903 is a key and V119903 contains the overlappingoffset data The key value pair of 119876 is represented as (119896119902 V119902)where 119896119902 is a key and V119902 is the query sequence In eachof the 119908 map workers sequence 119876 is further split into 119876

1015840

chunks and stored in the local memory available Alignmentis performed using the BWA-SW algorithm considering 1198771015840and each 119876

1015840 in a parallel fashion utilizing the 119901 coresThe Smith-Waterman algorithm in BWA-SW is optimizedby a parallelization technique to reduce execution time LetTMAP represent the average execution time of the 119908 mapworker nodes The map workers after computation provide

6 BioMed Research International

ChunkR

998400

Local memory1 2

Local memory1 2

Mapoutput

data

Mapoutput

data

Temporary cloud storageMap workers Cloud storageReduce workers

Shuffle Sort Reduce

Shuffle Sort Reduce

Chunk 1

Chunk 2

Burrows-Wheeler Alignerrsquos Smith-

Burrows-Wheeler Alignerrsquos Smith-

Smith-Waterman parallelized

Smith-Waterman parallelized

BWASW-PMR master node

Sequence alignment

results

Query genome Q

Reference genome R

Parallel execution platform

Parallel execution platform

Q998400

Q998400

Cloud storage

Waterman (BWA-SW) Alignment

Waterman (BWA-SW) Alignment

Virtual machinemdashmap worker w

Virtual machinemdashmap worker 1

Virtual machinemdashreduce worker w

Virtual machinemdashreduce worker 1

Figure 3 BWASW-PMR cloud model for long read sequence alignment

alignment locations between 119876 and chunk 1198771015840 along with thescore Multiple alignment locations and scores (ie V119898 alongthe chunk id of 1198771015840 and 119896119898) are stored in the temporary cloudmemory as list(119896119898 V119898) The map function of BWASW-PMRcan be defined as map((119896119903 V119903) (119896119902 V119902)) rarr list(119896119898 V119898)Reduce worker nodes 119908 obtain intermediate data that islist(119896119898 V119898) to perform the shuffle and sort function Inthe reduce phase aggregation of all alignment locationsthat is list(V119889) that are nonredundant and nonoverlappingis considered The reduce operation can be defined asreduce(119896119898 list(V119898)) rarr list(V119889) LetTREDUCE represent theaverage time required by 119908 reduce worker nodes to performthe shuffle sort and reduce computationThe totalmakespanof the BWASW-PMR cloud platform to align the sequence 119876against 119877 is computed as

T = TVM Config +TMAP +TREDUCE (7)

The overview of the BWASW-PMR cloud platform is shownin Figure 3

In the map phase of the BWASW-PMR cloud platformalignment locations and corresponding scores between119876 andchunk 119877

1015840 are computed The required data that is 119876 andchunk 1198771015840 is obtained from the cloud memory Let tGet Data Mrepresent the time taken to obtain the data To reducecomputation time using parallel computing techniques thegenomic sequence119876 is split into1198761015840 chunks Let t

119876119868represent

the time taken to split sequence 119876 into 1198761015840 chunks Sequencealignment considering one chunk of 1198761015840 and 1198771015840 is performedusing the BWA-SW algorithm DG(1198761015840) and PT(1198771015840) areinitially constructed The seed matches of sequences 1198761015840 and1198771015840 are computed using a dynamic programming mechanism

The seeds computed are extended to ensure rule alignment ofthe genomic sequences using the SW algorithm To reducecomputation time parallelization of the SW algorithm isconsidered in BWASW-PMR The parallelization techniqueto optimize computation is discussed in the latter subsection

Let tBWASW represent the time taken to align 1198761015840th chunkand 1198771015840 using the BWA-SW algorithmThe total time taken toalign the total 1198761015840 chunks is (1198761015840 times tBWASW) As 119901 computingcores are available with each worker node the parallelcomputation of 119901 number of chunks of 119876 is possible Thecomputation time considering all 1198761015840 chunks and utilizing119901 cores is defined as ((1198761015840 times tBWASW)119901) The alignmentlocations and scores obtained are stored in the cloud forreduce workers The time taken to store this data per mapworker is represented as tStore Data M The makespan of the119908th map worker node can be defined as

T119908 MAP = (tGet Data M + t

119876119868+

1198761015840times tBWASW119901

+ tStore Data M)

(8)

The average execution time of all the119908mapworker nodesis defined as

TMAP =sum119908

119894=1T119894 MAP

119908

(9)

32 Reduce Phase Themaster node in BWASW-PMR initial-izes119908 reduceworker nodes and119908mapworker nodes simulta-neouslyThis parallel initializationmechanism enables reduc-ing the total makespan T In the reduce phase the shuffle

BioMed Research International 7

Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19

Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29

Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39

Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49

Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59

(a)

Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19

Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29

Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39

Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49

Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59

(b)

Figure 4 Smith-Waterman algorithm (a) conventional computation technique and data dependencies and (b) wavefront based parallelizedtechnique adopted in BWASW-PMR

function obtains the intermediate data (produced by theworker nodes) from the cloud storage The alignment loca-tions obtained from the intermediate data are sorted basedon the offset data The time taken by the 119908th reduce workernode to obtain the intermediate data and perform shuffleand sort operations is represented as tGet Data R The reducefunction in BWASW-PMR is used to aggregate the alignmentlocations The overlapping and redundant alignments areneglected Parallelization of the reduce function is achievedby utilizing all 119901 computing cores available with workernodes Let tFn R119901 represent the time taken to compute thereduce function utilizing the 119901 computing cores The reducefunction provides the alignment results between sequences 119877and 119876 that are written to cloud storage for further analysisLet tStore Data R represent the time taken by the 119908th reduceworker node to write alignment results into the cloud storageThe makespan of the119908th reduce worker node can be definedas

T119908 REDUCE = (tGet Data R +

tFn R119901

+ tStore Data R) (10)

The average makespan of the 119908 reduce worker nodes that isTREDUCE is defined as

TREDUCE =sum119908

119894=1T119894 REDUCE119908

(11)

Using (11) and (9) the total makespan of the BWASW-PMRcan be defined as

T = TVM Config +sum119908

119894=1(T119894 MAP +T

119894 REDUCE)

119908

(12)

From (12) it can be observed thatT (the computing timeor makespan of BWASW-PMR) depends on the computationtime of map and reduce worker nodes The core alignmentsteps (based on the BWA-SW algorithm) are performed by

the map worker nodes that isTREDUCE ≪ TMAP The timetaken to initialize the map and reduce computing clustersbased on VMs that isTVM Config is dependent on the cloudplatform considered for deployment and can be neglectedTherefore it can be stated by the total makespan

T asymp

sum119908

119894=1(T119894 MAP)

119908

(13)

Optimizing the SW algorithm in BWA-SW is a possiblesolution to reduce the total makespanT

33 Smith-Waterman Algorithm Optimization In the SWalgorithm computation of the similarity matrix score thatis 119885 requires the maximum time Adopting a parallelizationtechnique for computation of 119885 is considered in BWASW-PMR Let us consider a query sequence 119876 and referencesequence 119877 whose sizes are 4 and 8 respectively Thereforethe matrix 119885 to be computed is of size (59) and is shown inFigure 4(a) Data dependencies that exist in computing eachelement of 119885 make the parallelization technique difficult toimplement For computation of 119885

34(shown as black lines in

Figure 4(a)) it requires that the computation of1198852311988524 and

11988533

must be completed Based on the sequences119876 and119877 andvalues of 119885

23 11988524 and 119885

33 the value of 119885

34is obtained

The computation of 119885 is more often done serially as shownthrough the grey arrows in Figure 4(a) To parallelize thecomputation of 119885 the adoption of CUDAGPU based tech-niques was considered by researchers [33 41] However theavailability and cost of such computing platforms on publicclouds are an issue Tomaximize resource utilization availablewith theVMworker node atminimal costs in BWASW-PMRwavefront based parallelization technique [42] is consideredThe parallel computation technique adopted is shown by greydiagonal lines in Figure 4(b) Computation of all cells of 119885that fall under the diagonal lines can be done in a parallelizedfashion

8 BioMed Research International

Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference

Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp

The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP

4 Evaluation Experiments

In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available

41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1

Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm

Number of bp considered for alignment

Sequence alignmentmdashexecution time

SWSW-optimized

01

1

10

100

Log

scal

ed ex

ecut

ion

time (

s)

sim5 k sim8 k sim10 k sim15 k

Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths

42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud

BioMed Research International 9

Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud

Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession

number ) Length

1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome V (BK0069392) 576874 bp

2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XI (BK0069442) 666816 bp

3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome X (BK0069432) 745751 bp

4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome II (BK0069362) 813184 bp

5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XVI (BK0069492) 948066 bp

1 2 3 4 5Experiment number

Long read alignmentmdashmakespan time

35

55

75

95

115

135

155

175

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Figure 6 Makespan observations considering 5 long read sequencealignment experiments

43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the

1000 5000 10000Length of reads

05

1015202530354045

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Long read alignmentmdashmakespan time on public cloud

Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments

execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study

Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce

10 BioMed Research International

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

2 4 6 8 10 12 14 16 180Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

5 10 15 20 250Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

5 10 15 20 25 300Makespan time (s)

Bwasw-Cloud (10000 b read)

Bwasw-Cloud (1000 b read)

Bwasw-Cloud (5000 b read)

(a)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 120Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 12 14 16 180Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

nLong read sequence aligner execution considering

5 10 15 200Makespan time (s)

BWASW-PMR (10000 b read)

BWASW-PMR (5000 b read)

BWASW-PMR (1000 b read)

(b)

Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes

function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs

The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results

obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform

5 Comparison Notes withRelated Work Counterparts

In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is

BioMed Research International 11

Table 3 Comparison of BWASW-PMR with its related work counterparts

BWASW-PMR Bwasw-Cloud[34]

MapReduce-BLAST[33] CloudAligner [13] BWA-SW

[10]Long sequence alignment support Yes Yes Yes Yes Yes

Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW

SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes

a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3

6 Conclusion and Future Work

Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long

sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study

In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)

References

[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004

[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics

[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov

[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009

[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988

12 BioMed Research International

[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997

[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002

[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010

[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012

[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1

[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999

[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004

[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010

[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010

[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014

[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics

[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us

[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics

[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-

ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014

[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012

[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015

[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005

[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with

MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service

for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009

[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009

[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011

[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014

[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010

[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011

[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981

[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982

[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000

[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011

[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995

[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml

BioMed Research International 13

[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg

[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 6: Research Article Long Read Alignment with Parallel ...downloads.hindawi.com/journals/bmri/2015/807407.pdfSmith-Waterman (SW) Algorithm. Burrows-Wheeler Aligner s Smith-Waterman (BWA-SW)

6 BioMed Research International

ChunkR

998400

Local memory1 2

Local memory1 2

Mapoutput

data

Mapoutput

data

Temporary cloud storageMap workers Cloud storageReduce workers

Shuffle Sort Reduce

Shuffle Sort Reduce

Chunk 1

Chunk 2

Burrows-Wheeler Alignerrsquos Smith-

Burrows-Wheeler Alignerrsquos Smith-

Smith-Waterman parallelized

Smith-Waterman parallelized

BWASW-PMR master node

Sequence alignment

results

Query genome Q

Reference genome R

Parallel execution platform

Parallel execution platform

Q998400

Q998400

Cloud storage

Waterman (BWA-SW) Alignment

Waterman (BWA-SW) Alignment

Virtual machinemdashmap worker w

Virtual machinemdashmap worker 1

Virtual machinemdashreduce worker w

Virtual machinemdashreduce worker 1

Figure 3 BWASW-PMR cloud model for long read sequence alignment

alignment locations between 119876 and chunk 1198771015840 along with thescore Multiple alignment locations and scores (ie V119898 alongthe chunk id of 1198771015840 and 119896119898) are stored in the temporary cloudmemory as list(119896119898 V119898) The map function of BWASW-PMRcan be defined as map((119896119903 V119903) (119896119902 V119902)) rarr list(119896119898 V119898)Reduce worker nodes 119908 obtain intermediate data that islist(119896119898 V119898) to perform the shuffle and sort function Inthe reduce phase aggregation of all alignment locationsthat is list(V119889) that are nonredundant and nonoverlappingis considered The reduce operation can be defined asreduce(119896119898 list(V119898)) rarr list(V119889) LetTREDUCE represent theaverage time required by 119908 reduce worker nodes to performthe shuffle sort and reduce computationThe totalmakespanof the BWASW-PMR cloud platform to align the sequence 119876against 119877 is computed as

T = TVM Config +TMAP +TREDUCE (7)

The overview of the BWASW-PMR cloud platform is shownin Figure 3

In the map phase of the BWASW-PMR cloud platformalignment locations and corresponding scores between119876 andchunk 119877

1015840 are computed The required data that is 119876 andchunk 1198771015840 is obtained from the cloud memory Let tGet Data Mrepresent the time taken to obtain the data To reducecomputation time using parallel computing techniques thegenomic sequence119876 is split into1198761015840 chunks Let t

119876119868represent

the time taken to split sequence 119876 into 1198761015840 chunks Sequencealignment considering one chunk of 1198761015840 and 1198771015840 is performedusing the BWA-SW algorithm DG(1198761015840) and PT(1198771015840) areinitially constructed The seed matches of sequences 1198761015840 and1198771015840 are computed using a dynamic programming mechanism

The seeds computed are extended to ensure rule alignment ofthe genomic sequences using the SW algorithm To reducecomputation time parallelization of the SW algorithm isconsidered in BWASW-PMR The parallelization techniqueto optimize computation is discussed in the latter subsection

Let tBWASW represent the time taken to align 1198761015840th chunkand 1198771015840 using the BWA-SW algorithmThe total time taken toalign the total 1198761015840 chunks is (1198761015840 times tBWASW) As 119901 computingcores are available with each worker node the parallelcomputation of 119901 number of chunks of 119876 is possible Thecomputation time considering all 1198761015840 chunks and utilizing119901 cores is defined as ((1198761015840 times tBWASW)119901) The alignmentlocations and scores obtained are stored in the cloud forreduce workers The time taken to store this data per mapworker is represented as tStore Data M The makespan of the119908th map worker node can be defined as

T119908 MAP = (tGet Data M + t

119876119868+

1198761015840times tBWASW119901

+ tStore Data M)

(8)

The average execution time of all the119908mapworker nodesis defined as

TMAP =sum119908

119894=1T119894 MAP

119908

(9)

32 Reduce Phase Themaster node in BWASW-PMR initial-izes119908 reduceworker nodes and119908mapworker nodes simulta-neouslyThis parallel initializationmechanism enables reduc-ing the total makespan T In the reduce phase the shuffle

BioMed Research International 7

Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19

Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29

Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39

Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49

Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59

(a)

Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19

Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29

Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39

Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49

Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59

(b)

Figure 4 Smith-Waterman algorithm (a) conventional computation technique and data dependencies and (b) wavefront based parallelizedtechnique adopted in BWASW-PMR

function obtains the intermediate data (produced by theworker nodes) from the cloud storage The alignment loca-tions obtained from the intermediate data are sorted basedon the offset data The time taken by the 119908th reduce workernode to obtain the intermediate data and perform shuffleand sort operations is represented as tGet Data R The reducefunction in BWASW-PMR is used to aggregate the alignmentlocations The overlapping and redundant alignments areneglected Parallelization of the reduce function is achievedby utilizing all 119901 computing cores available with workernodes Let tFn R119901 represent the time taken to compute thereduce function utilizing the 119901 computing cores The reducefunction provides the alignment results between sequences 119877and 119876 that are written to cloud storage for further analysisLet tStore Data R represent the time taken by the 119908th reduceworker node to write alignment results into the cloud storageThe makespan of the119908th reduce worker node can be definedas

T119908 REDUCE = (tGet Data R +

tFn R119901

+ tStore Data R) (10)

The average makespan of the 119908 reduce worker nodes that isTREDUCE is defined as

TREDUCE =sum119908

119894=1T119894 REDUCE119908

(11)

Using (11) and (9) the total makespan of the BWASW-PMRcan be defined as

T = TVM Config +sum119908

119894=1(T119894 MAP +T

119894 REDUCE)

119908

(12)

From (12) it can be observed thatT (the computing timeor makespan of BWASW-PMR) depends on the computationtime of map and reduce worker nodes The core alignmentsteps (based on the BWA-SW algorithm) are performed by

the map worker nodes that isTREDUCE ≪ TMAP The timetaken to initialize the map and reduce computing clustersbased on VMs that isTVM Config is dependent on the cloudplatform considered for deployment and can be neglectedTherefore it can be stated by the total makespan

T asymp

sum119908

119894=1(T119894 MAP)

119908

(13)

Optimizing the SW algorithm in BWA-SW is a possiblesolution to reduce the total makespanT

33 Smith-Waterman Algorithm Optimization In the SWalgorithm computation of the similarity matrix score thatis 119885 requires the maximum time Adopting a parallelizationtechnique for computation of 119885 is considered in BWASW-PMR Let us consider a query sequence 119876 and referencesequence 119877 whose sizes are 4 and 8 respectively Thereforethe matrix 119885 to be computed is of size (59) and is shown inFigure 4(a) Data dependencies that exist in computing eachelement of 119885 make the parallelization technique difficult toimplement For computation of 119885

34(shown as black lines in

Figure 4(a)) it requires that the computation of1198852311988524 and

11988533

must be completed Based on the sequences119876 and119877 andvalues of 119885

23 11988524 and 119885

33 the value of 119885

34is obtained

The computation of 119885 is more often done serially as shownthrough the grey arrows in Figure 4(a) To parallelize thecomputation of 119885 the adoption of CUDAGPU based tech-niques was considered by researchers [33 41] However theavailability and cost of such computing platforms on publicclouds are an issue Tomaximize resource utilization availablewith theVMworker node atminimal costs in BWASW-PMRwavefront based parallelization technique [42] is consideredThe parallel computation technique adopted is shown by greydiagonal lines in Figure 4(b) Computation of all cells of 119885that fall under the diagonal lines can be done in a parallelizedfashion

8 BioMed Research International

Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference

Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp

The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP

4 Evaluation Experiments

In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available

41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1

Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm

Number of bp considered for alignment

Sequence alignmentmdashexecution time

SWSW-optimized

01

1

10

100

Log

scal

ed ex

ecut

ion

time (

s)

sim5 k sim8 k sim10 k sim15 k

Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths

42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud

BioMed Research International 9

Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud

Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession

number ) Length

1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome V (BK0069392) 576874 bp

2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XI (BK0069442) 666816 bp

3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome X (BK0069432) 745751 bp

4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome II (BK0069362) 813184 bp

5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XVI (BK0069492) 948066 bp

1 2 3 4 5Experiment number

Long read alignmentmdashmakespan time

35

55

75

95

115

135

155

175

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Figure 6 Makespan observations considering 5 long read sequencealignment experiments

43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the

1000 5000 10000Length of reads

05

1015202530354045

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Long read alignmentmdashmakespan time on public cloud

Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments

execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study

Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce

10 BioMed Research International

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

2 4 6 8 10 12 14 16 180Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

5 10 15 20 250Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

5 10 15 20 25 300Makespan time (s)

Bwasw-Cloud (10000 b read)

Bwasw-Cloud (1000 b read)

Bwasw-Cloud (5000 b read)

(a)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 120Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 12 14 16 180Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

nLong read sequence aligner execution considering

5 10 15 200Makespan time (s)

BWASW-PMR (10000 b read)

BWASW-PMR (5000 b read)

BWASW-PMR (1000 b read)

(b)

Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes

function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs

The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results

obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform

5 Comparison Notes withRelated Work Counterparts

In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is

BioMed Research International 11

Table 3 Comparison of BWASW-PMR with its related work counterparts

BWASW-PMR Bwasw-Cloud[34]

MapReduce-BLAST[33] CloudAligner [13] BWA-SW

[10]Long sequence alignment support Yes Yes Yes Yes Yes

Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW

SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes

a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3

6 Conclusion and Future Work

Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long

sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study

In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)

References

[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004

[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics

[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov

[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009

[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988

12 BioMed Research International

[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997

[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002

[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010

[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012

[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1

[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999

[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004

[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010

[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010

[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014

[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics

[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us

[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics

[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-

ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014

[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012

[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015

[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005

[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with

MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service

for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009

[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009

[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011

[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014

[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010

[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011

[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981

[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982

[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000

[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011

[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995

[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml

BioMed Research International 13

[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg

[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 7: Research Article Long Read Alignment with Parallel ...downloads.hindawi.com/journals/bmri/2015/807407.pdfSmith-Waterman (SW) Algorithm. Burrows-Wheeler Aligner s Smith-Waterman (BWA-SW)

BioMed Research International 7

Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19

Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29

Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39

Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49

Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59

(a)

Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19

Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29

Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39

Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49

Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59

(b)

Figure 4 Smith-Waterman algorithm (a) conventional computation technique and data dependencies and (b) wavefront based parallelizedtechnique adopted in BWASW-PMR

function obtains the intermediate data (produced by theworker nodes) from the cloud storage The alignment loca-tions obtained from the intermediate data are sorted basedon the offset data The time taken by the 119908th reduce workernode to obtain the intermediate data and perform shuffleand sort operations is represented as tGet Data R The reducefunction in BWASW-PMR is used to aggregate the alignmentlocations The overlapping and redundant alignments areneglected Parallelization of the reduce function is achievedby utilizing all 119901 computing cores available with workernodes Let tFn R119901 represent the time taken to compute thereduce function utilizing the 119901 computing cores The reducefunction provides the alignment results between sequences 119877and 119876 that are written to cloud storage for further analysisLet tStore Data R represent the time taken by the 119908th reduceworker node to write alignment results into the cloud storageThe makespan of the119908th reduce worker node can be definedas

T119908 REDUCE = (tGet Data R +

tFn R119901

+ tStore Data R) (10)

The average makespan of the 119908 reduce worker nodes that isTREDUCE is defined as

TREDUCE =sum119908

119894=1T119894 REDUCE119908

(11)

Using (11) and (9) the total makespan of the BWASW-PMRcan be defined as

T = TVM Config +sum119908

119894=1(T119894 MAP +T

119894 REDUCE)

119908

(12)

From (12) it can be observed thatT (the computing timeor makespan of BWASW-PMR) depends on the computationtime of map and reduce worker nodes The core alignmentsteps (based on the BWA-SW algorithm) are performed by

the map worker nodes that isTREDUCE ≪ TMAP The timetaken to initialize the map and reduce computing clustersbased on VMs that isTVM Config is dependent on the cloudplatform considered for deployment and can be neglectedTherefore it can be stated by the total makespan

T asymp

sum119908

119894=1(T119894 MAP)

119908

(13)

Optimizing the SW algorithm in BWA-SW is a possiblesolution to reduce the total makespanT

33 Smith-Waterman Algorithm Optimization In the SWalgorithm computation of the similarity matrix score thatis 119885 requires the maximum time Adopting a parallelizationtechnique for computation of 119885 is considered in BWASW-PMR Let us consider a query sequence 119876 and referencesequence 119877 whose sizes are 4 and 8 respectively Thereforethe matrix 119885 to be computed is of size (59) and is shown inFigure 4(a) Data dependencies that exist in computing eachelement of 119885 make the parallelization technique difficult toimplement For computation of 119885

34(shown as black lines in

Figure 4(a)) it requires that the computation of1198852311988524 and

11988533

must be completed Based on the sequences119876 and119877 andvalues of 119885

23 11988524 and 119885

33 the value of 119885

34is obtained

The computation of 119885 is more often done serially as shownthrough the grey arrows in Figure 4(a) To parallelize thecomputation of 119885 the adoption of CUDAGPU based tech-niques was considered by researchers [33 41] However theavailability and cost of such computing platforms on publicclouds are an issue Tomaximize resource utilization availablewith theVMworker node atminimal costs in BWASW-PMRwavefront based parallelization technique [42] is consideredThe parallel computation technique adopted is shown by greydiagonal lines in Figure 4(b) Computation of all cells of 119885that fall under the diagonal lines can be done in a parallelizedfashion

8 BioMed Research International

Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference

Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp

The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP

4 Evaluation Experiments

In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available

41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1

Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm

Number of bp considered for alignment

Sequence alignmentmdashexecution time

SWSW-optimized

01

1

10

100

Log

scal

ed ex

ecut

ion

time (

s)

sim5 k sim8 k sim10 k sim15 k

Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths

42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud

BioMed Research International 9

Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud

Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession

number ) Length

1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome V (BK0069392) 576874 bp

2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XI (BK0069442) 666816 bp

3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome X (BK0069432) 745751 bp

4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome II (BK0069362) 813184 bp

5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XVI (BK0069492) 948066 bp

1 2 3 4 5Experiment number

Long read alignmentmdashmakespan time

35

55

75

95

115

135

155

175

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Figure 6 Makespan observations considering 5 long read sequencealignment experiments

43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the

1000 5000 10000Length of reads

05

1015202530354045

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Long read alignmentmdashmakespan time on public cloud

Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments

execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study

Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce

10 BioMed Research International

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

2 4 6 8 10 12 14 16 180Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

5 10 15 20 250Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

5 10 15 20 25 300Makespan time (s)

Bwasw-Cloud (10000 b read)

Bwasw-Cloud (1000 b read)

Bwasw-Cloud (5000 b read)

(a)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 120Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 12 14 16 180Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

nLong read sequence aligner execution considering

5 10 15 200Makespan time (s)

BWASW-PMR (10000 b read)

BWASW-PMR (5000 b read)

BWASW-PMR (1000 b read)

(b)

Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes

function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs

The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results

obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform

5 Comparison Notes withRelated Work Counterparts

In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is

BioMed Research International 11

Table 3 Comparison of BWASW-PMR with its related work counterparts

BWASW-PMR Bwasw-Cloud[34]

MapReduce-BLAST[33] CloudAligner [13] BWA-SW

[10]Long sequence alignment support Yes Yes Yes Yes Yes

Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW

SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes

a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3

6 Conclusion and Future Work

Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long

sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study

In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)

References

[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004

[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics

[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov

[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009

[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988

12 BioMed Research International

[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997

[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002

[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010

[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012

[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1

[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999

[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004

[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010

[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010

[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014

[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics

[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us

[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics

[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-

ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014

[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012

[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015

[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005

[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with

MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service

for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009

[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009

[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011

[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014

[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010

[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011

[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981

[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982

[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000

[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011

[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995

[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml

BioMed Research International 13

[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg

[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 8: Research Article Long Read Alignment with Parallel ...downloads.hindawi.com/journals/bmri/2015/807407.pdfSmith-Waterman (SW) Algorithm. Burrows-Wheeler Aligner s Smith-Waterman (BWA-SW)

8 BioMed Research International

Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference

Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp

The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP

4 Evaluation Experiments

In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available

41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1

Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm

Number of bp considered for alignment

Sequence alignmentmdashexecution time

SWSW-optimized

01

1

10

100

Log

scal

ed ex

ecut

ion

time (

s)

sim5 k sim8 k sim10 k sim15 k

Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths

42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud

BioMed Research International 9

Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud

Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession

number ) Length

1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome V (BK0069392) 576874 bp

2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XI (BK0069442) 666816 bp

3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome X (BK0069432) 745751 bp

4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome II (BK0069362) 813184 bp

5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XVI (BK0069492) 948066 bp

1 2 3 4 5Experiment number

Long read alignmentmdashmakespan time

35

55

75

95

115

135

155

175

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Figure 6 Makespan observations considering 5 long read sequencealignment experiments

43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the

1000 5000 10000Length of reads

05

1015202530354045

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Long read alignmentmdashmakespan time on public cloud

Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments

execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study

Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce

10 BioMed Research International

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

2 4 6 8 10 12 14 16 180Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

5 10 15 20 250Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

5 10 15 20 25 300Makespan time (s)

Bwasw-Cloud (10000 b read)

Bwasw-Cloud (1000 b read)

Bwasw-Cloud (5000 b read)

(a)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 120Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 12 14 16 180Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

nLong read sequence aligner execution considering

5 10 15 200Makespan time (s)

BWASW-PMR (10000 b read)

BWASW-PMR (5000 b read)

BWASW-PMR (1000 b read)

(b)

Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes

function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs

The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results

obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform

5 Comparison Notes withRelated Work Counterparts

In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is

BioMed Research International 11

Table 3 Comparison of BWASW-PMR with its related work counterparts

BWASW-PMR Bwasw-Cloud[34]

MapReduce-BLAST[33] CloudAligner [13] BWA-SW

[10]Long sequence alignment support Yes Yes Yes Yes Yes

Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW

SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes

a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3

6 Conclusion and Future Work

Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long

sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study

In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)

References

[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004

[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics

[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov

[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009

[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988

12 BioMed Research International

[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997

[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002

[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010

[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012

[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1

[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999

[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004

[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010

[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010

[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014

[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics

[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us

[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics

[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-

ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014

[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012

[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015

[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005

[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with

MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service

for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009

[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009

[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011

[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014

[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010

[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011

[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981

[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982

[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000

[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011

[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995

[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml

BioMed Research International 13

[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg

[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 9: Research Article Long Read Alignment with Parallel ...downloads.hindawi.com/journals/bmri/2015/807407.pdfSmith-Waterman (SW) Algorithm. Burrows-Wheeler Aligner s Smith-Waterman (BWA-SW)

BioMed Research International 9

Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud

Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession

number ) Length

1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome V (BK0069392) 576874 bp

2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XI (BK0069442) 666816 bp

3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome X (BK0069432) 745751 bp

4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome II (BK0069362) 813184 bp

5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae

S288c chromosome XVI (BK0069492) 948066 bp

1 2 3 4 5Experiment number

Long read alignmentmdashmakespan time

35

55

75

95

115

135

155

175

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Figure 6 Makespan observations considering 5 long read sequencealignment experiments

43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the

1000 5000 10000Length of reads

05

1015202530354045

Mak

espa

n tim

e (s)

BWASW-PMRBwasw-Cloud

Long read alignmentmdashmakespan time on public cloud

Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments

execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study

Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce

10 BioMed Research International

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

2 4 6 8 10 12 14 16 180Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

5 10 15 20 250Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

5 10 15 20 25 300Makespan time (s)

Bwasw-Cloud (10000 b read)

Bwasw-Cloud (1000 b read)

Bwasw-Cloud (5000 b read)

(a)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 120Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 12 14 16 180Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

nLong read sequence aligner execution considering

5 10 15 200Makespan time (s)

BWASW-PMR (10000 b read)

BWASW-PMR (5000 b read)

BWASW-PMR (1000 b read)

(b)

Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes

function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs

The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results

obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform

5 Comparison Notes withRelated Work Counterparts

In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is

BioMed Research International 11

Table 3 Comparison of BWASW-PMR with its related work counterparts

BWASW-PMR Bwasw-Cloud[34]

MapReduce-BLAST[33] CloudAligner [13] BWA-SW

[10]Long sequence alignment support Yes Yes Yes Yes Yes

Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW

SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes

a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3

6 Conclusion and Future Work

Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long

sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study

In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)

References

[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004

[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics

[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov

[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009

[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988

12 BioMed Research International

[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997

[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002

[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010

[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012

[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1

[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999

[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004

[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010

[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010

[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014

[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics

[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us

[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics

[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-

ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014

[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012

[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015

[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005

[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with

MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service

for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009

[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009

[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011

[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014

[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010

[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011

[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981

[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982

[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000

[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011

[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995

[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml

BioMed Research International 13

[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg

[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 10: Research Article Long Read Alignment with Parallel ...downloads.hindawi.com/journals/bmri/2015/807407.pdfSmith-Waterman (SW) Algorithm. Burrows-Wheeler Aligner s Smith-Waterman (BWA-SW)

10 BioMed Research International

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

2 4 6 8 10 12 14 16 180Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

5 10 15 20 250Makespan time (s)

Long read sequence aligner execution considering

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

5 10 15 20 25 300Makespan time (s)

Bwasw-Cloud (10000 b read)

Bwasw-Cloud (1000 b read)

Bwasw-Cloud (5000 b read)

(a)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 120Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

n

Long read sequence aligner execution considering

2 4 6 8 10 12 14 16 180Makespan time (s)

Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4

Map worker 1Map worker 2Map worker 3Map worker 4

Task

exec

utio

nLong read sequence aligner execution considering

5 10 15 200Makespan time (s)

BWASW-PMR (10000 b read)

BWASW-PMR (5000 b read)

BWASW-PMR (1000 b read)

(b)

Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes

function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs

The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results

obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform

5 Comparison Notes withRelated Work Counterparts

In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is

BioMed Research International 11

Table 3 Comparison of BWASW-PMR with its related work counterparts

BWASW-PMR Bwasw-Cloud[34]

MapReduce-BLAST[33] CloudAligner [13] BWA-SW

[10]Long sequence alignment support Yes Yes Yes Yes Yes

Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW

SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes

a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3

6 Conclusion and Future Work

Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long

sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study

In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)

References

[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004

[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics

[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov

[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009

[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988

12 BioMed Research International

[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997

[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002

[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010

[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012

[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1

[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999

[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004

[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010

[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010

[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014

[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics

[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us

[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics

[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-

ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014

[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012

[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015

[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005

[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with

MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service

for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009

[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009

[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011

[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014

[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010

[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011

[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981

[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982

[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000

[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011

[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995

[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml

BioMed Research International 13

[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg

[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 11: Research Article Long Read Alignment with Parallel ...downloads.hindawi.com/journals/bmri/2015/807407.pdfSmith-Waterman (SW) Algorithm. Burrows-Wheeler Aligner s Smith-Waterman (BWA-SW)

BioMed Research International 11

Table 3 Comparison of BWASW-PMR with its related work counterparts

BWASW-PMR Bwasw-Cloud[34]

MapReduce-BLAST[33] CloudAligner [13] BWA-SW

[10]Long sequence alignment support Yes Yes Yes Yes Yes

Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW

SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes

a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3

6 Conclusion and Future Work

Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long

sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study

In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)

References

[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004

[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics

[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov

[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009

[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012

[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988

12 BioMed Research International

[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997

[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002

[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010

[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012

[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1

[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999

[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004

[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010

[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010

[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014

[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics

[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us

[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics

[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-

ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014

[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012

[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015

[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005

[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with

MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service

for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009

[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009

[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011

[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014

[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010

[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011

[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981

[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982

[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000

[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011

[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995

[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml

BioMed Research International 13

[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg

[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 12: Research Article Long Read Alignment with Parallel ...downloads.hindawi.com/journals/bmri/2015/807407.pdfSmith-Waterman (SW) Algorithm. Burrows-Wheeler Aligner s Smith-Waterman (BWA-SW)

12 BioMed Research International

[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997

[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002

[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008

[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010

[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012

[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1

[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011

[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999

[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010

[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004

[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010

[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010

[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014

[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010

[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics

[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us

[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics

[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-

ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014

[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012

[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015

[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005

[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with

MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service

for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009

[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009

[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011

[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014

[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010

[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011

[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014

[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981

[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982

[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000

[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011

[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995

[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml

BioMed Research International 13

[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg

[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 13: Research Article Long Read Alignment with Parallel ...downloads.hindawi.com/journals/bmri/2015/807407.pdfSmith-Waterman (SW) Algorithm. Burrows-Wheeler Aligner s Smith-Waterman (BWA-SW)

BioMed Research International 13

[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg

[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 14: Research Article Long Read Alignment with Parallel ...downloads.hindawi.com/journals/bmri/2015/807407.pdfSmith-Waterman (SW) Algorithm. Burrows-Wheeler Aligner s Smith-Waterman (BWA-SW)

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology