Research Article Long Read Alignment with Parallel...
Transcript of Research Article Long Read Alignment with Parallel...
Research ArticleLong Read Alignment with Parallel MapReduce Cloud Platform
Ahmed Abdulhakim Al-Absi1 and Dae-Ki Kang2
1Department of Ubiquitous IT Graduate School Dongseo University 47 Jurye-ro Sasang-gu Busan 47011 Republic of Korea2Department of Computer amp Information Engineering Dongseo University 47 Jurye-ro Sasang-gu Busan 47011 Republic of Korea
Correspondence should be addressed to Dae-Ki Kang dkkangdongseoackr
Received 22 September 2015 Revised 18 November 2015 Accepted 18 November 2015
Academic Editor Elaine Leung
Copyright copy 2015 A A Al-Absi and D-K KangThis is an open access article distributed under theCreativeCommonsAttributionLicense which permits unrestricted use distribution and reproduction in anymedium provided the originalwork is properly cited
Genomic sequence alignment is an important technique to decode genome sequences in bioinformatics Next-GenerationSequencing technologies produce genomic data of longer reads Cloud platforms are adopted to address the problems arising fromstorage and analysis of large genomic data Existing genes sequencing tools for cloud platforms predominantly consider shortread gene sequences and adopt the HadoopMapReduce framework for computation However serial execution of map and reducephases is a problem in such systemsTherefore in this paper we introduce Burrows-Wheeler Alignerrsquos Smith-Waterman Alignmenton Parallel MapReduce (BWASW-PMR) cloud platform for long sequence alignmentThe proposed cloud platform adopts a widelyaccepted and accurate BWA-SW algorithm for long sequence alignment A customMapReduce platform is developed to overcomethe drawbacks of the Hadoop framework A parallel execution strategy of the MapReduce phases and optimization of Smith-Waterman algorithm are considered Performance evaluation results exhibit an average speed-up of 67 considering BWASW-PMRcompared with the state-of-the-art Bwasw-Cloud An average reduction of 30 in the map phase makespan is reported across allexperiments comparing BWASW-PMR with Bwasw-Cloud Optimization of Smith-Waterman results in reducing the executiontime by 918 The experimental study proves the efficiency of BWASW-PMR for aligning long genomic sequences on cloudplatforms
1 Introduction
Bioinformatics involves the biological genomic statisticsmathematics and computer science disciplines of studyAnalysis of genomic and biological sequence data is one of theimportant parts of bioinformatics [1]The analysis of genomicsequences enables us to understand the genetic structure andits bases of drug response and disease Numerous researcherswho are working in this interdisciplinary field perform genestructure and functionality analysis in order to discovernew gene sequences and understand gene origin For theanalysis purpose existing genomic databases such as GoogleGenomics [2] and NCBI [3] (generally in 100rsquos of GB in size)are used to identify the similarities Identification of sim-ilaritiesdissimilarities is achieved by sequence comparisonalgorithms Comparisons of biological sequences producematching alignments and similarity scores These similar-ity scores represent the similaritiesdissimilarities between
the considered biological sequences The matching align-ments and similarity scores are used for secondary structurepredictions and multiple sequence alignments which arehighly complex operations that rely on the accuracy of thecomparison algorithm used Applications related to cancerresearch forensics agrigenomics genetic disease identifica-tion microbial research reproductive health human whole-genome sequencing and many more rely on sequence align-ment algorithms for analysis
Genomic sequencing data is obtained from the devel-oped Next-Generation Sequencing (NGS) technologies (egfrom Illumina Solexa and Pacific Bioscience) Most ofthe genomic data available consists of millions of genomicsequences of short reads [4]With the advances in sequencingtechnologies millions of sequences with greater read lengthsgt 1000 bp are now being generated Based on genomic datathe sequencing tools can be basically classified into two cat-egories
Hindawi Publishing CorporationBioMed Research InternationalVolume 2015 Article ID 807407 13 pageshttpdxdoiorg1011552015807407
2 BioMed Research International
(1) Short read aligners used to align genomic sequenceswhose read length is between 32 bp and 200 bp thatis BWA [4] and Bowtie2 [5]
(2) Long read aligners used to align genomic sequenceswhose length is greater than 1000 bp that is Illumina
Examples of short read aligners include FASTA [6]BLAST [7] BLAT [8] SOAP [9] and Burrows-WheelerAligner (BWA) [4] For long read alignment researchershave proposed Burrows-Wheeler Alignerrsquos Smith-Waterman(BWA-SW) Alignment [10] Bowtie2 [5] Cushaw2 [11] andBWA-MEM [12]
Existing bioinformatics applications require both man-agement of huge amounts of data and heavy computationfor analysis Nearly 3 billion US dollars and 10 years wererequired to produce the initial human reference genomecontaining about 35 billion base pairs The latest develop-ments in sequencing technologies available produce enor-mous amount of data in terms of gigabytes per run [13]In NGS large sample size applications users are required towait for sufficient resources to become available in which thetime needed to complete processing becomes unpredictableAnalyzing such enormous amount of data and its storageare problems that exist To address these high computa-tion needs grid or cluster computing platforms have beenprovided to researchers [14] The provided grid or clustercomputation platforms are constrained by hardware capacityand concurrent access capacity to support multiple usersThe ever growing gap between computing capabilities andsequencing throughput is presented in [15]
Using cloud platforms one can solve the storage andcomputing problems that exist in gene sequencing Cloudcomputing platforms provide flexibility scalability and on-demand access to resources Moreover adoption of cloudcomputing technologies eliminates the costs incurred inestablishing maintaining large physical storage systemsand computing clusters Cloud users pay for the uti-lized storage and computing resources without worryingabout maintenance availability and reliability related issuesAlthough cloud computing provides scalable and flexibleinfrastructure parallel computing models on cloud plat-formsinfrastructure are required to achieve the desired goalof analyzing gene sequences One such framework for thecloud called MapReduce model [16] was introduced byGoogle Another model developed by University of Califor-nia at Berkeley proposed the Spark platform [17] for cloudcomputing The Pregel framework for cloud environmentsis presented in [18] A comprehensive survey of most recentresearch and development in cloud-based service computingsolutions in research of genome informatics is available in[19 20] In research of genomic analysis there are a numberof cloud computing providers that offer cloud-based servicessolutions such as Google Cloud Genomics [3] Amazon [21]andMicrosoft Azure [22]The necessity for cloud computingfor genomic analysis has been well discussed by leadersin bioinformatics and computational biology [23] Galaxy[24] is a powerful web-based system for performing genomeanalysis tasks and one of many models that manage bioinfor-matics databases Galaxy built on Amazonrsquos Elastic Compute
Cloud (EC2) with CloudMan framework to assist researchersexecutes their own Galaxy server on cloud infrastructureHowever Galaxy has data storage and data manipulationbottlenecks for large datasets with capability of analyzingone sample at a time and does not utilize the elastic cloudcompute capabilities completely This drawback is caused byits dependence on a single shared file system A significantbottleneck has been reported in [25] when processing largedatasets across distributed compute resources Among all thecloud computing platforms available Hadoop MapReduce isby far the most popular choice due to its ease to tune opensource nature and acceptability by industry and academicorganizations In order to utilize the full potential of cloudinfrastructure public cloud service providers offer virtualizedresources in terms of both hardware and software [26] Thevirtualization enables user specific customization flexibilityapplication execution environments and cost efficiency andminimizes power consumption [27 28]
The Hadoop framework [29] adopts a MapReduce modelfor computing a user specific application on cloud platformsIn the MapReduce model a dual phase execution approachis adopted In the initial phase input data to be processed issplit into chunks Each chunk is associated with a mapper ora map worker that provides ⟨Key | Value⟩ pairs as outputsThe outputs values are sorted on the basis of the Key valuesassociatedThe sorted values are provided to reduce workersthat is ⟨Key | SortedList(Value)⟩ Reduce workers store theresults inHadoop distributed file systemThemap and reduceworkers are generally virtual machines (VMs) in public cloudenvironments A simple MapReduce model deployed on theVM based computing environment is shown in Figure 1
Next-Generation Sequencing (NGS) tools like CloudA-ligner [13] Cloud Burst [30] SeqMapReduce [31] and Cross-bow [32] adopt the Hadoop frameworkThemajor drawbackof these alignment tools is that they are short read alignersThe short read aligners prove to be efficient when single-gapor ungapped alignment is to be computed Considering longreads these alignment algorithms exhibit performance deg-radation and affect accuracy proved by the results presentedin [10] For alignment considering long sequence readsoptimization of the BLAST algorithm and its deployment onHadoop platform is presented in [33] For long read alignersthe BWA-SW algorithm in [10] has been found to be efficientand suitable In [34] the Bwasw-Cloud algorithm consider-ing Hadoop platform is presented The Bwasw-Cloud adoptsBWA-SW algorithm for alignment Based on the literaturereviewed it is evident that limited work has been carried outconsidering long read aligners required to support analysis ofgenomic data generated from current and future sequencingtechnologies There is an increasingly urgent need to pro-vide a reliable and scalable support for the ever-increasingconvergence of NGS data However there is clearly a lackof standardized and affordable NGS management solutionson the cloud to support the growing needs of translationalgenomics research [20] This is a major motivating factor forthe authors of this paper
All the existing long read aligners for cloud environmentsconsider the Hadoop framework Genome alignment is an
BioMed Research International 3
Split
Merge
MergeMerge
Inpu
t
Out
put
Map
Map
Partition
Partition
Reduce
Reduce
Worker 1 on VM Worker 1 on VM
Worker N on VM Worker M on VM
Map stage Reduce stage
Cloud storage
Cloud storage
VM based virtualized computing environment
Cloud storage
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
Figure 1 MapReduce model deployed using VM based computing environment
iterative process and Hadoop incurs considerable perfor-mance overheads when iterative applications are hosted onthe framework [18 35]The aligners presented in [33 34] needto rely on multiway joins as genomic sequences are split andneeded to be merged prior to the reduce phase The perfor-mance of Hadoop suffers whenmultiway joins are considered[36]TheHadoop framework considers sequential processingof the map followed by reduce stage that affects performance[37]
Therefore in this paper we present Burrows-WheelerAlignerrsquos Smith-Waterman Alignment on Parallel MapRe-duce (BWASW-PMR) to perform long read alignments ona cloud platform BWASW-PMR adopts the BWA-SW pre-sented in [10] to perform long read alignments of genomicdata MapReduce model is considered for the execution onthe public cloud environment The Smith-Waterman (SW)algorithm [38 39] in BWA-SW is optimized using paral-lel computation technique To overcome the drawbacks ofHadoopMapReduce a parallel execution strategy of the mapand reduce workers is considered in BWASW-PMRThemapand reduce functions executions on the VM based workernodes are parallelized to reduce execution time The alignerpresented in [34] bears the closest similarity to the BWASW-PMR proposed and is considered for comparison The majorcontribution of our work can be summarized here
(i) Optimization of Smith-Waterman in the BWA-SWlong read sequence alignment
(ii) A custom MapReduce framework to support therequired computations for long sequence alignment
(iii) Parallel map and reduce workers execution strategy
(iv) Parallel execution of the map and reduce functions atworker nodes
This paper is organized as follows Section 2 presents relatedwork background Section 3 discusses the proposed Burrows-Wheeler Alignerrsquos Smith-Waterman Alignment on the Par-allel MapReduce BWASW-PMR The SW algorithm and itsconsidered optimization are also presented in Section 3 Theresults and the experimental study are shown in Section 4Comparison notes with related work counterparts are dis-cussed in the penultimate section The concluding remarksand future work are discussed in the last section
2 Background
21 Smith-Waterman (SW) Algorithm Burrows-WheelerAlignerrsquos Smith-Waterman (BWA-SW) Alignment relies onSW algorithm to align the seed matches of sequences Let 119902
0
and 1199021represent two genomic sequences obtained from the
seed matches The Smith-Waterman algorithm computes thesimilarity matrix score initially Using the similarity matrixand backtracking technique optimal alignment is obtainedLet X represent the size of 119902
0and Y represents the size of
1199021 For sequence 119902
0 there are X + 1 possible prefixes and
for 1199021 there areY + 1 possible prefixes including the empty
sequence Let 119902seq[1 Y] denote a prefix of Y charactersand 119902seq[X] represents the Xth character of a sequence119902seq The similarity score between prefixes 119902
0[1 119886] and
1199021[1 119887] is represented as 119902
119886119887 The similarity matrix is
denoted by 119885 and its size is (X + 1Y + 1) The first row and
4 BioMed Research International
column of 119885 are initialized to 0 The remaining elements of119885 (indexed by (119886 119887)) are computed using
119885119886119887
= max
0
119868119886119887
119869119886119887
119885119886minus1119887minus1
+119877 (119886 119887)
(1)
where 119877(119886 119887) is a function providing values of exact match ormismatch that is if 119902
0[119886] = 119902
1[119887] then119877(119886 119887) = 119883
119894(identical
characters among sequences 1199020and 1199021) otherwise 119877(119886 119887) =
119872119894(if characters among sequences 119902
0and 119902
1are unique) 119868
and 119869 represent thematrices in accordance with the affine gapmodel The matrices 119868 and 119869 are used to determine the gapsand are defined as
119868119886119887
= max
119868119886119887minus1
minus 119866119864
119885119886119887minus1
minus 119866119865
119869119886119887
= max
119869119886minus1119887
minus 119866119864
119885119886minus1119887
minus 119866119865
(2)
where 119866119864and 119866
119865represent the first and successive gap
penaltyBacktracking algorithm is used to find the optimal align-
ment between 1199020and 1199021 The backtracking begins with a cell
in 119885 that holds the highest score and proceeds till a zerovalued cell is reached Smith-Waterman algorithm providesoptimal alignments and similarity scores It must be notedthat computation ofmatrix119885 is bounded by the running timeof the slowest 119885 task due to the dependencies it exhibits
22 Burrows-Wheeler Alignerrsquos Smith-Waterman (BWA-SW)Alignment Algorithm BWA-SW algorithm constructs afull-text index in minute space (FM-index) [40] of the querysequence 119876 and the reference sequence 119877 A prefix directedacyclic word graph (119875119903119890119891119894119909 119863119860119882119866) is built using sequence119876 119901119903119890119891119894119909 119905119903119894119890 is built using the sequence 119877 The prefix trie isa tree representation of sequence 119877 The tree is constructedby concatenating all edge symbols from any node in the graphas a route to the root node providing a unique string Theunique string obtained is a substring of the sequence 119877 Eachnode of the tree is represented using 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897Traversing from nodes generates strings that are lexico-graphically sorted A node in 119901119903119890119891119894119909 119905119903119894119890 represents a stringThe 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 of the node and the string itrepresents are considered to be equivalent to each otherFrom 119901119903119890119891119894119909 119905119903119894119890 nodes exhibiting identical or similar119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 are collapsed to form 119875119903119890119891119894119909 119863119860119882119866Each node of 119875119903119890119891119894119909 119863119860119882119866 established represents oneor more substrings of 119877 Figure 2 represents 119901119903119890119891119894119909 119905119903119894119890119875119903119890119891119894119909 119863119860119882119866 and 119904119906119891119891119894119909 119886119903119903119886119910 of the input stringldquoBANANA$rdquo
Let PT(119883) and DG(119883) represent two functions thatare used to form 119901119903119890119891119894119909 119905119903119894119890 and 119875119903119890119891119894119909 119863119860119882119866 graph In
the BWA-SA algorithm PT(119877) and DG(119876) are initially com-puted Let 119886 be the root node of PT(119877) and 119887 represents theroot node of DG(119876) The best score between the sequences119876 and 119877 is computed using a dynamic programming mecha-nism Initialize 119864
119886119887= 119865119886119887= 119870119886119887= null considering the root
nodes of PT(119877) and DG(119876) For each of the parents that is119886119901 of the 119886th node in DG(119876) computation of 119865
119886119887|119886119901 119870119886119887|119886119901
and 119864
119886119887|119886119901is considered The computation of 119865
119886119887|119886119901is
119865119886119887|119886119901
= max 119865119886119901119887 119864119886119901119887
minus 119892119900 minus 119892119890 (3)
where 119892119900 is the gap open penalty and 119892119890 is the gap extensionpenalty119870
119886119887|119886119901is computed using
119870119886119887|119886119901
= max 119870119886119887119901 119864119886119887119901
minus 119892119900 minus 119892119890 (4)
where 119887119901is the parent of the 119887th node in PT(119877) The compu-
tation of 119864119886119887|119886119901
is achieved using
119864119886119887|119886119901
= max 119864119886119901119887119901
+ 119874 (119886119901 119886 119887119901 119887) 119865
119886119901119887119901 119870119886119901119887119901
0 (5)
where119874(119886119901 119886 119887119901 119887) represents the score between the symbol
on edge (119886119901 119886) and (119887
119901 119887) The computation of 119864
119886119887 119865119886119887 and
119870119886119887is defined as
(119864119886119887 119865119886119887 119870119886119887)
=
(119864119886119887|1198861015840 119865119886119887|1198861015840 119870119886119887|1198861015840) When 119864
119886119887|1198861015840 gt 0
(minusinfin minusinfin minusinfin) all other cases
(6)
where 1198861015840 = argmax119886119901isin119886119903(119886)
119864119886119887|119886119901
and 119886119903(119886) is a set containingparent nodes of 119886 Variable 119864
119886119887represents the best match-
ing score between the substrings that is substring 119886 andsubstring 119887 Consider 119864
119886119887gt 0 when the substrings 119887 and
119886 match It is known that the Smith-Waterman algorithmprovides accurate alignment results but requires large com-putation time To optimize the computations the reversepostorder traversal scheme on DG(119883) and PT(119883) is adoptedThe dynamic programming mechanism in the BWA-SWalgorithm enables identifying the seed matches of thegenomic sequences Based on the partial matches the Smith-Waterman Alignment is considered to the extended matches119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903 that is (119886 119887) is formed when the bestscore that is 119864
119886119887 is high and 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 size of 119887
is within the threshold setThe seedmatches are derived from119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903119904 by analyzing the suffix array of sequences119876 and119877Matching of the extended seedmatches using Smith-Waterman algorithm is considered only when high matchingscores are observed
3 Proposed Burrows-Wheeler AlignerrsquosSmith-Waterman Alignment on the ParallelMapReduce BWASW-PMR
BWASW-PMR provides a cloud platform to perform longread alignments considering genomic data obtained from
BioMed Research International 5
A NB$
$ N
$ NA
A
A
$
A
$ N
A
$
N
A
N
A
$
(a)
A NABANANA$
$ NA $ NA$
NA$$
(b)
B A N A N A $
NA
$
$
$
(c)
0 BANANA$
1 ANANA$B2 NANA$BA3 ANA$BAN4 NA$BANA5 A$BANAN6 $BANANA
String sorting
0 6 $BANAN1 5 A$BANA2 3 ANA$BA3 1 ANANA$
4 0 BANANA5 4 NA$BAN6 2 NANA$B
N
NB
$
A
A
A
(d)
Figure 2 Example of reference in a prefix trie query in prefix DAWG and the suffix array of the input string ldquoBANANA$rdquo (a) representsreference prefix trie searching for any substring of the string by staring at the root and followingmatches down the tree until being exhausted(b) represents prefix DAWG constructed by collapsing nodes with the identical suffix array interval DAWG transformed from the prefix trieof the query sequence (c) represents a constructed string suffix automaton (DAWG) (d) represents the suffix array for string ldquoBANANA$rdquoThe dollar sign $ is a regular expression that denotes the end of a line in the reference sequence
NGS techniques The BWASW-PMR adopts the MapReducecomputation model for cloud computation To support scal-ability in BWASW-PMR map and reduce worker nodes aredeployed on a cloud cluster consisting of VMs For long readalignments the BWA-SW algorithm is adopted The advan-tages of adopting the BWA-SW algorithm for long readalignments and its advantages over the existing aligners arefound in [10] The BWASW-PMR considers the genomicsequence alignment in dual phases that is map and reducephase Existing long read aligners for cloud platforms adoptthe Hadoop framework In the Hadoop framework basedsolutions the map phase is executed and then the reducephase is initiated To overcome the drawbacks of Hadoop aparallel execution strategy of the map and reduce phases isconsidered Optimization of the Smith-Waterman algorithmis an additional feature considered in BWASW-PMR Execu-tion of the map and reduce functions is modelled to run inparallel utilizing all programming cores available in workerVMs
31 Overview and Preliminaries Let 119877 represent a referencegenomic sequence and119876 the query sequence BWASW-PMR
is deployed on a cloud platform comprising a master nodemap worker nodes and reduce worker nodes The masternode of BWASW-PMR initializes 119908 map and reduce workernodes using virtual computing nodes Each computing nodeorVM is assumed to have119901CPUcores available for computa-tion LetTVM Config represent the time taken to initialize thevirtual computing platform The sequence 119877 is split into 1198771015840chunks with overlapping sections (the overlapping sectionsare depicted in grey in Figure 2) The reference chunk and 119876are sent to map worker nodes as input key value pairs Thekey value pairs considering the chunk 1198771015840 are represented as(119896119903 V119903) where 119896119903 is a key and V119903 contains the overlappingoffset data The key value pair of 119876 is represented as (119896119902 V119902)where 119896119902 is a key and V119902 is the query sequence In eachof the 119908 map workers sequence 119876 is further split into 119876
1015840
chunks and stored in the local memory available Alignmentis performed using the BWA-SW algorithm considering 1198771015840and each 119876
1015840 in a parallel fashion utilizing the 119901 coresThe Smith-Waterman algorithm in BWA-SW is optimizedby a parallelization technique to reduce execution time LetTMAP represent the average execution time of the 119908 mapworker nodes The map workers after computation provide
6 BioMed Research International
ChunkR
998400
Local memory1 2
Local memory1 2
Mapoutput
data
Mapoutput
data
Temporary cloud storageMap workers Cloud storageReduce workers
Shuffle Sort Reduce
Shuffle Sort Reduce
Chunk 1
Chunk 2
Burrows-Wheeler Alignerrsquos Smith-
Burrows-Wheeler Alignerrsquos Smith-
Smith-Waterman parallelized
Smith-Waterman parallelized
BWASW-PMR master node
Sequence alignment
results
Query genome Q
Reference genome R
Parallel execution platform
Parallel execution platform
Q998400
Q998400
Cloud storage
Waterman (BWA-SW) Alignment
Waterman (BWA-SW) Alignment
Virtual machinemdashmap worker w
Virtual machinemdashmap worker 1
Virtual machinemdashreduce worker w
Virtual machinemdashreduce worker 1
Figure 3 BWASW-PMR cloud model for long read sequence alignment
alignment locations between 119876 and chunk 1198771015840 along with thescore Multiple alignment locations and scores (ie V119898 alongthe chunk id of 1198771015840 and 119896119898) are stored in the temporary cloudmemory as list(119896119898 V119898) The map function of BWASW-PMRcan be defined as map((119896119903 V119903) (119896119902 V119902)) rarr list(119896119898 V119898)Reduce worker nodes 119908 obtain intermediate data that islist(119896119898 V119898) to perform the shuffle and sort function Inthe reduce phase aggregation of all alignment locationsthat is list(V119889) that are nonredundant and nonoverlappingis considered The reduce operation can be defined asreduce(119896119898 list(V119898)) rarr list(V119889) LetTREDUCE represent theaverage time required by 119908 reduce worker nodes to performthe shuffle sort and reduce computationThe totalmakespanof the BWASW-PMR cloud platform to align the sequence 119876against 119877 is computed as
T = TVM Config +TMAP +TREDUCE (7)
The overview of the BWASW-PMR cloud platform is shownin Figure 3
In the map phase of the BWASW-PMR cloud platformalignment locations and corresponding scores between119876 andchunk 119877
1015840 are computed The required data that is 119876 andchunk 1198771015840 is obtained from the cloud memory Let tGet Data Mrepresent the time taken to obtain the data To reducecomputation time using parallel computing techniques thegenomic sequence119876 is split into1198761015840 chunks Let t
119876119868represent
the time taken to split sequence 119876 into 1198761015840 chunks Sequencealignment considering one chunk of 1198761015840 and 1198771015840 is performedusing the BWA-SW algorithm DG(1198761015840) and PT(1198771015840) areinitially constructed The seed matches of sequences 1198761015840 and1198771015840 are computed using a dynamic programming mechanism
The seeds computed are extended to ensure rule alignment ofthe genomic sequences using the SW algorithm To reducecomputation time parallelization of the SW algorithm isconsidered in BWASW-PMR The parallelization techniqueto optimize computation is discussed in the latter subsection
Let tBWASW represent the time taken to align 1198761015840th chunkand 1198771015840 using the BWA-SW algorithmThe total time taken toalign the total 1198761015840 chunks is (1198761015840 times tBWASW) As 119901 computingcores are available with each worker node the parallelcomputation of 119901 number of chunks of 119876 is possible Thecomputation time considering all 1198761015840 chunks and utilizing119901 cores is defined as ((1198761015840 times tBWASW)119901) The alignmentlocations and scores obtained are stored in the cloud forreduce workers The time taken to store this data per mapworker is represented as tStore Data M The makespan of the119908th map worker node can be defined as
T119908 MAP = (tGet Data M + t
119876119868+
1198761015840times tBWASW119901
+ tStore Data M)
(8)
The average execution time of all the119908mapworker nodesis defined as
TMAP =sum119908
119894=1T119894 MAP
119908
(9)
32 Reduce Phase Themaster node in BWASW-PMR initial-izes119908 reduceworker nodes and119908mapworker nodes simulta-neouslyThis parallel initializationmechanism enables reduc-ing the total makespan T In the reduce phase the shuffle
BioMed Research International 7
Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19
Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29
Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39
Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49
Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59
(a)
Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19
Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29
Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39
Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49
Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59
(b)
Figure 4 Smith-Waterman algorithm (a) conventional computation technique and data dependencies and (b) wavefront based parallelizedtechnique adopted in BWASW-PMR
function obtains the intermediate data (produced by theworker nodes) from the cloud storage The alignment loca-tions obtained from the intermediate data are sorted basedon the offset data The time taken by the 119908th reduce workernode to obtain the intermediate data and perform shuffleand sort operations is represented as tGet Data R The reducefunction in BWASW-PMR is used to aggregate the alignmentlocations The overlapping and redundant alignments areneglected Parallelization of the reduce function is achievedby utilizing all 119901 computing cores available with workernodes Let tFn R119901 represent the time taken to compute thereduce function utilizing the 119901 computing cores The reducefunction provides the alignment results between sequences 119877and 119876 that are written to cloud storage for further analysisLet tStore Data R represent the time taken by the 119908th reduceworker node to write alignment results into the cloud storageThe makespan of the119908th reduce worker node can be definedas
T119908 REDUCE = (tGet Data R +
tFn R119901
+ tStore Data R) (10)
The average makespan of the 119908 reduce worker nodes that isTREDUCE is defined as
TREDUCE =sum119908
119894=1T119894 REDUCE119908
(11)
Using (11) and (9) the total makespan of the BWASW-PMRcan be defined as
T = TVM Config +sum119908
119894=1(T119894 MAP +T
119894 REDUCE)
119908
(12)
From (12) it can be observed thatT (the computing timeor makespan of BWASW-PMR) depends on the computationtime of map and reduce worker nodes The core alignmentsteps (based on the BWA-SW algorithm) are performed by
the map worker nodes that isTREDUCE ≪ TMAP The timetaken to initialize the map and reduce computing clustersbased on VMs that isTVM Config is dependent on the cloudplatform considered for deployment and can be neglectedTherefore it can be stated by the total makespan
T asymp
sum119908
119894=1(T119894 MAP)
119908
(13)
Optimizing the SW algorithm in BWA-SW is a possiblesolution to reduce the total makespanT
33 Smith-Waterman Algorithm Optimization In the SWalgorithm computation of the similarity matrix score thatis 119885 requires the maximum time Adopting a parallelizationtechnique for computation of 119885 is considered in BWASW-PMR Let us consider a query sequence 119876 and referencesequence 119877 whose sizes are 4 and 8 respectively Thereforethe matrix 119885 to be computed is of size (59) and is shown inFigure 4(a) Data dependencies that exist in computing eachelement of 119885 make the parallelization technique difficult toimplement For computation of 119885
34(shown as black lines in
Figure 4(a)) it requires that the computation of1198852311988524 and
11988533
must be completed Based on the sequences119876 and119877 andvalues of 119885
23 11988524 and 119885
33 the value of 119885
34is obtained
The computation of 119885 is more often done serially as shownthrough the grey arrows in Figure 4(a) To parallelize thecomputation of 119885 the adoption of CUDAGPU based tech-niques was considered by researchers [33 41] However theavailability and cost of such computing platforms on publicclouds are an issue Tomaximize resource utilization availablewith theVMworker node atminimal costs in BWASW-PMRwavefront based parallelization technique [42] is consideredThe parallel computation technique adopted is shown by greydiagonal lines in Figure 4(b) Computation of all cells of 119885that fall under the diagonal lines can be done in a parallelizedfashion
8 BioMed Research International
Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference
Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp
The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP
4 Evaluation Experiments
In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available
41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1
Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm
Number of bp considered for alignment
Sequence alignmentmdashexecution time
SWSW-optimized
01
1
10
100
Log
scal
ed ex
ecut
ion
time (
s)
sim5 k sim8 k sim10 k sim15 k
Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths
42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud
BioMed Research International 9
Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud
Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession
number ) Length
1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome V (BK0069392) 576874 bp
2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XI (BK0069442) 666816 bp
3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome X (BK0069432) 745751 bp
4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome II (BK0069362) 813184 bp
5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XVI (BK0069492) 948066 bp
1 2 3 4 5Experiment number
Long read alignmentmdashmakespan time
35
55
75
95
115
135
155
175
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Figure 6 Makespan observations considering 5 long read sequencealignment experiments
43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the
1000 5000 10000Length of reads
05
1015202530354045
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Long read alignmentmdashmakespan time on public cloud
Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments
execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study
Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce
10 BioMed Research International
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
2 4 6 8 10 12 14 16 180Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
5 10 15 20 250Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
5 10 15 20 25 300Makespan time (s)
Bwasw-Cloud (10000 b read)
Bwasw-Cloud (1000 b read)
Bwasw-Cloud (5000 b read)
(a)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 120Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 12 14 16 180Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
nLong read sequence aligner execution considering
5 10 15 200Makespan time (s)
BWASW-PMR (10000 b read)
BWASW-PMR (5000 b read)
BWASW-PMR (1000 b read)
(b)
Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes
function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs
The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results
obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform
5 Comparison Notes withRelated Work Counterparts
In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is
BioMed Research International 11
Table 3 Comparison of BWASW-PMR with its related work counterparts
BWASW-PMR Bwasw-Cloud[34]
MapReduce-BLAST[33] CloudAligner [13] BWA-SW
[10]Long sequence alignment support Yes Yes Yes Yes Yes
Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW
SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes
a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3
6 Conclusion and Future Work
Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long
sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study
In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)
References
[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004
[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics
[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov
[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009
[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012
[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988
12 BioMed Research International
[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997
[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002
[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008
[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010
[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012
[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1
[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011
[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999
[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010
[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004
[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010
[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010
[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014
[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010
[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics
[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us
[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics
[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-
ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014
[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012
[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015
[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005
[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with
MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service
for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009
[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009
[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011
[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014
[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010
[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011
[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014
[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981
[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982
[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000
[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011
[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995
[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml
BioMed Research International 13
[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg
[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
2 BioMed Research International
(1) Short read aligners used to align genomic sequenceswhose read length is between 32 bp and 200 bp thatis BWA [4] and Bowtie2 [5]
(2) Long read aligners used to align genomic sequenceswhose length is greater than 1000 bp that is Illumina
Examples of short read aligners include FASTA [6]BLAST [7] BLAT [8] SOAP [9] and Burrows-WheelerAligner (BWA) [4] For long read alignment researchershave proposed Burrows-Wheeler Alignerrsquos Smith-Waterman(BWA-SW) Alignment [10] Bowtie2 [5] Cushaw2 [11] andBWA-MEM [12]
Existing bioinformatics applications require both man-agement of huge amounts of data and heavy computationfor analysis Nearly 3 billion US dollars and 10 years wererequired to produce the initial human reference genomecontaining about 35 billion base pairs The latest develop-ments in sequencing technologies available produce enor-mous amount of data in terms of gigabytes per run [13]In NGS large sample size applications users are required towait for sufficient resources to become available in which thetime needed to complete processing becomes unpredictableAnalyzing such enormous amount of data and its storageare problems that exist To address these high computa-tion needs grid or cluster computing platforms have beenprovided to researchers [14] The provided grid or clustercomputation platforms are constrained by hardware capacityand concurrent access capacity to support multiple usersThe ever growing gap between computing capabilities andsequencing throughput is presented in [15]
Using cloud platforms one can solve the storage andcomputing problems that exist in gene sequencing Cloudcomputing platforms provide flexibility scalability and on-demand access to resources Moreover adoption of cloudcomputing technologies eliminates the costs incurred inestablishing maintaining large physical storage systemsand computing clusters Cloud users pay for the uti-lized storage and computing resources without worryingabout maintenance availability and reliability related issuesAlthough cloud computing provides scalable and flexibleinfrastructure parallel computing models on cloud plat-formsinfrastructure are required to achieve the desired goalof analyzing gene sequences One such framework for thecloud called MapReduce model [16] was introduced byGoogle Another model developed by University of Califor-nia at Berkeley proposed the Spark platform [17] for cloudcomputing The Pregel framework for cloud environmentsis presented in [18] A comprehensive survey of most recentresearch and development in cloud-based service computingsolutions in research of genome informatics is available in[19 20] In research of genomic analysis there are a numberof cloud computing providers that offer cloud-based servicessolutions such as Google Cloud Genomics [3] Amazon [21]andMicrosoft Azure [22]The necessity for cloud computingfor genomic analysis has been well discussed by leadersin bioinformatics and computational biology [23] Galaxy[24] is a powerful web-based system for performing genomeanalysis tasks and one of many models that manage bioinfor-matics databases Galaxy built on Amazonrsquos Elastic Compute
Cloud (EC2) with CloudMan framework to assist researchersexecutes their own Galaxy server on cloud infrastructureHowever Galaxy has data storage and data manipulationbottlenecks for large datasets with capability of analyzingone sample at a time and does not utilize the elastic cloudcompute capabilities completely This drawback is caused byits dependence on a single shared file system A significantbottleneck has been reported in [25] when processing largedatasets across distributed compute resources Among all thecloud computing platforms available Hadoop MapReduce isby far the most popular choice due to its ease to tune opensource nature and acceptability by industry and academicorganizations In order to utilize the full potential of cloudinfrastructure public cloud service providers offer virtualizedresources in terms of both hardware and software [26] Thevirtualization enables user specific customization flexibilityapplication execution environments and cost efficiency andminimizes power consumption [27 28]
The Hadoop framework [29] adopts a MapReduce modelfor computing a user specific application on cloud platformsIn the MapReduce model a dual phase execution approachis adopted In the initial phase input data to be processed issplit into chunks Each chunk is associated with a mapper ora map worker that provides ⟨Key | Value⟩ pairs as outputsThe outputs values are sorted on the basis of the Key valuesassociatedThe sorted values are provided to reduce workersthat is ⟨Key | SortedList(Value)⟩ Reduce workers store theresults inHadoop distributed file systemThemap and reduceworkers are generally virtual machines (VMs) in public cloudenvironments A simple MapReduce model deployed on theVM based computing environment is shown in Figure 1
Next-Generation Sequencing (NGS) tools like CloudA-ligner [13] Cloud Burst [30] SeqMapReduce [31] and Cross-bow [32] adopt the Hadoop frameworkThemajor drawbackof these alignment tools is that they are short read alignersThe short read aligners prove to be efficient when single-gapor ungapped alignment is to be computed Considering longreads these alignment algorithms exhibit performance deg-radation and affect accuracy proved by the results presentedin [10] For alignment considering long sequence readsoptimization of the BLAST algorithm and its deployment onHadoop platform is presented in [33] For long read alignersthe BWA-SW algorithm in [10] has been found to be efficientand suitable In [34] the Bwasw-Cloud algorithm consider-ing Hadoop platform is presented The Bwasw-Cloud adoptsBWA-SW algorithm for alignment Based on the literaturereviewed it is evident that limited work has been carried outconsidering long read aligners required to support analysis ofgenomic data generated from current and future sequencingtechnologies There is an increasingly urgent need to pro-vide a reliable and scalable support for the ever-increasingconvergence of NGS data However there is clearly a lackof standardized and affordable NGS management solutionson the cloud to support the growing needs of translationalgenomics research [20] This is a major motivating factor forthe authors of this paper
All the existing long read aligners for cloud environmentsconsider the Hadoop framework Genome alignment is an
BioMed Research International 3
Split
Merge
MergeMerge
Inpu
t
Out
put
Map
Map
Partition
Partition
Reduce
Reduce
Worker 1 on VM Worker 1 on VM
Worker N on VM Worker M on VM
Map stage Reduce stage
Cloud storage
Cloud storage
VM based virtualized computing environment
Cloud storage
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
Figure 1 MapReduce model deployed using VM based computing environment
iterative process and Hadoop incurs considerable perfor-mance overheads when iterative applications are hosted onthe framework [18 35]The aligners presented in [33 34] needto rely on multiway joins as genomic sequences are split andneeded to be merged prior to the reduce phase The perfor-mance of Hadoop suffers whenmultiway joins are considered[36]TheHadoop framework considers sequential processingof the map followed by reduce stage that affects performance[37]
Therefore in this paper we present Burrows-WheelerAlignerrsquos Smith-Waterman Alignment on Parallel MapRe-duce (BWASW-PMR) to perform long read alignments ona cloud platform BWASW-PMR adopts the BWA-SW pre-sented in [10] to perform long read alignments of genomicdata MapReduce model is considered for the execution onthe public cloud environment The Smith-Waterman (SW)algorithm [38 39] in BWA-SW is optimized using paral-lel computation technique To overcome the drawbacks ofHadoopMapReduce a parallel execution strategy of the mapand reduce workers is considered in BWASW-PMRThemapand reduce functions executions on the VM based workernodes are parallelized to reduce execution time The alignerpresented in [34] bears the closest similarity to the BWASW-PMR proposed and is considered for comparison The majorcontribution of our work can be summarized here
(i) Optimization of Smith-Waterman in the BWA-SWlong read sequence alignment
(ii) A custom MapReduce framework to support therequired computations for long sequence alignment
(iii) Parallel map and reduce workers execution strategy
(iv) Parallel execution of the map and reduce functions atworker nodes
This paper is organized as follows Section 2 presents relatedwork background Section 3 discusses the proposed Burrows-Wheeler Alignerrsquos Smith-Waterman Alignment on the Par-allel MapReduce BWASW-PMR The SW algorithm and itsconsidered optimization are also presented in Section 3 Theresults and the experimental study are shown in Section 4Comparison notes with related work counterparts are dis-cussed in the penultimate section The concluding remarksand future work are discussed in the last section
2 Background
21 Smith-Waterman (SW) Algorithm Burrows-WheelerAlignerrsquos Smith-Waterman (BWA-SW) Alignment relies onSW algorithm to align the seed matches of sequences Let 119902
0
and 1199021represent two genomic sequences obtained from the
seed matches The Smith-Waterman algorithm computes thesimilarity matrix score initially Using the similarity matrixand backtracking technique optimal alignment is obtainedLet X represent the size of 119902
0and Y represents the size of
1199021 For sequence 119902
0 there are X + 1 possible prefixes and
for 1199021 there areY + 1 possible prefixes including the empty
sequence Let 119902seq[1 Y] denote a prefix of Y charactersand 119902seq[X] represents the Xth character of a sequence119902seq The similarity score between prefixes 119902
0[1 119886] and
1199021[1 119887] is represented as 119902
119886119887 The similarity matrix is
denoted by 119885 and its size is (X + 1Y + 1) The first row and
4 BioMed Research International
column of 119885 are initialized to 0 The remaining elements of119885 (indexed by (119886 119887)) are computed using
119885119886119887
= max
0
119868119886119887
119869119886119887
119885119886minus1119887minus1
+119877 (119886 119887)
(1)
where 119877(119886 119887) is a function providing values of exact match ormismatch that is if 119902
0[119886] = 119902
1[119887] then119877(119886 119887) = 119883
119894(identical
characters among sequences 1199020and 1199021) otherwise 119877(119886 119887) =
119872119894(if characters among sequences 119902
0and 119902
1are unique) 119868
and 119869 represent thematrices in accordance with the affine gapmodel The matrices 119868 and 119869 are used to determine the gapsand are defined as
119868119886119887
= max
119868119886119887minus1
minus 119866119864
119885119886119887minus1
minus 119866119865
119869119886119887
= max
119869119886minus1119887
minus 119866119864
119885119886minus1119887
minus 119866119865
(2)
where 119866119864and 119866
119865represent the first and successive gap
penaltyBacktracking algorithm is used to find the optimal align-
ment between 1199020and 1199021 The backtracking begins with a cell
in 119885 that holds the highest score and proceeds till a zerovalued cell is reached Smith-Waterman algorithm providesoptimal alignments and similarity scores It must be notedthat computation ofmatrix119885 is bounded by the running timeof the slowest 119885 task due to the dependencies it exhibits
22 Burrows-Wheeler Alignerrsquos Smith-Waterman (BWA-SW)Alignment Algorithm BWA-SW algorithm constructs afull-text index in minute space (FM-index) [40] of the querysequence 119876 and the reference sequence 119877 A prefix directedacyclic word graph (119875119903119890119891119894119909 119863119860119882119866) is built using sequence119876 119901119903119890119891119894119909 119905119903119894119890 is built using the sequence 119877 The prefix trie isa tree representation of sequence 119877 The tree is constructedby concatenating all edge symbols from any node in the graphas a route to the root node providing a unique string Theunique string obtained is a substring of the sequence 119877 Eachnode of the tree is represented using 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897Traversing from nodes generates strings that are lexico-graphically sorted A node in 119901119903119890119891119894119909 119905119903119894119890 represents a stringThe 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 of the node and the string itrepresents are considered to be equivalent to each otherFrom 119901119903119890119891119894119909 119905119903119894119890 nodes exhibiting identical or similar119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 are collapsed to form 119875119903119890119891119894119909 119863119860119882119866Each node of 119875119903119890119891119894119909 119863119860119882119866 established represents oneor more substrings of 119877 Figure 2 represents 119901119903119890119891119894119909 119905119903119894119890119875119903119890119891119894119909 119863119860119882119866 and 119904119906119891119891119894119909 119886119903119903119886119910 of the input stringldquoBANANA$rdquo
Let PT(119883) and DG(119883) represent two functions thatare used to form 119901119903119890119891119894119909 119905119903119894119890 and 119875119903119890119891119894119909 119863119860119882119866 graph In
the BWA-SA algorithm PT(119877) and DG(119876) are initially com-puted Let 119886 be the root node of PT(119877) and 119887 represents theroot node of DG(119876) The best score between the sequences119876 and 119877 is computed using a dynamic programming mecha-nism Initialize 119864
119886119887= 119865119886119887= 119870119886119887= null considering the root
nodes of PT(119877) and DG(119876) For each of the parents that is119886119901 of the 119886th node in DG(119876) computation of 119865
119886119887|119886119901 119870119886119887|119886119901
and 119864
119886119887|119886119901is considered The computation of 119865
119886119887|119886119901is
119865119886119887|119886119901
= max 119865119886119901119887 119864119886119901119887
minus 119892119900 minus 119892119890 (3)
where 119892119900 is the gap open penalty and 119892119890 is the gap extensionpenalty119870
119886119887|119886119901is computed using
119870119886119887|119886119901
= max 119870119886119887119901 119864119886119887119901
minus 119892119900 minus 119892119890 (4)
where 119887119901is the parent of the 119887th node in PT(119877) The compu-
tation of 119864119886119887|119886119901
is achieved using
119864119886119887|119886119901
= max 119864119886119901119887119901
+ 119874 (119886119901 119886 119887119901 119887) 119865
119886119901119887119901 119870119886119901119887119901
0 (5)
where119874(119886119901 119886 119887119901 119887) represents the score between the symbol
on edge (119886119901 119886) and (119887
119901 119887) The computation of 119864
119886119887 119865119886119887 and
119870119886119887is defined as
(119864119886119887 119865119886119887 119870119886119887)
=
(119864119886119887|1198861015840 119865119886119887|1198861015840 119870119886119887|1198861015840) When 119864
119886119887|1198861015840 gt 0
(minusinfin minusinfin minusinfin) all other cases
(6)
where 1198861015840 = argmax119886119901isin119886119903(119886)
119864119886119887|119886119901
and 119886119903(119886) is a set containingparent nodes of 119886 Variable 119864
119886119887represents the best match-
ing score between the substrings that is substring 119886 andsubstring 119887 Consider 119864
119886119887gt 0 when the substrings 119887 and
119886 match It is known that the Smith-Waterman algorithmprovides accurate alignment results but requires large com-putation time To optimize the computations the reversepostorder traversal scheme on DG(119883) and PT(119883) is adoptedThe dynamic programming mechanism in the BWA-SWalgorithm enables identifying the seed matches of thegenomic sequences Based on the partial matches the Smith-Waterman Alignment is considered to the extended matches119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903 that is (119886 119887) is formed when the bestscore that is 119864
119886119887 is high and 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 size of 119887
is within the threshold setThe seedmatches are derived from119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903119904 by analyzing the suffix array of sequences119876 and119877Matching of the extended seedmatches using Smith-Waterman algorithm is considered only when high matchingscores are observed
3 Proposed Burrows-Wheeler AlignerrsquosSmith-Waterman Alignment on the ParallelMapReduce BWASW-PMR
BWASW-PMR provides a cloud platform to perform longread alignments considering genomic data obtained from
BioMed Research International 5
A NB$
$ N
$ NA
A
A
$
A
$ N
A
$
N
A
N
A
$
(a)
A NABANANA$
$ NA $ NA$
NA$$
(b)
B A N A N A $
NA
$
$
$
(c)
0 BANANA$
1 ANANA$B2 NANA$BA3 ANA$BAN4 NA$BANA5 A$BANAN6 $BANANA
String sorting
0 6 $BANAN1 5 A$BANA2 3 ANA$BA3 1 ANANA$
4 0 BANANA5 4 NA$BAN6 2 NANA$B
N
NB
$
A
A
A
(d)
Figure 2 Example of reference in a prefix trie query in prefix DAWG and the suffix array of the input string ldquoBANANA$rdquo (a) representsreference prefix trie searching for any substring of the string by staring at the root and followingmatches down the tree until being exhausted(b) represents prefix DAWG constructed by collapsing nodes with the identical suffix array interval DAWG transformed from the prefix trieof the query sequence (c) represents a constructed string suffix automaton (DAWG) (d) represents the suffix array for string ldquoBANANA$rdquoThe dollar sign $ is a regular expression that denotes the end of a line in the reference sequence
NGS techniques The BWASW-PMR adopts the MapReducecomputation model for cloud computation To support scal-ability in BWASW-PMR map and reduce worker nodes aredeployed on a cloud cluster consisting of VMs For long readalignments the BWA-SW algorithm is adopted The advan-tages of adopting the BWA-SW algorithm for long readalignments and its advantages over the existing aligners arefound in [10] The BWASW-PMR considers the genomicsequence alignment in dual phases that is map and reducephase Existing long read aligners for cloud platforms adoptthe Hadoop framework In the Hadoop framework basedsolutions the map phase is executed and then the reducephase is initiated To overcome the drawbacks of Hadoop aparallel execution strategy of the map and reduce phases isconsidered Optimization of the Smith-Waterman algorithmis an additional feature considered in BWASW-PMR Execu-tion of the map and reduce functions is modelled to run inparallel utilizing all programming cores available in workerVMs
31 Overview and Preliminaries Let 119877 represent a referencegenomic sequence and119876 the query sequence BWASW-PMR
is deployed on a cloud platform comprising a master nodemap worker nodes and reduce worker nodes The masternode of BWASW-PMR initializes 119908 map and reduce workernodes using virtual computing nodes Each computing nodeorVM is assumed to have119901CPUcores available for computa-tion LetTVM Config represent the time taken to initialize thevirtual computing platform The sequence 119877 is split into 1198771015840chunks with overlapping sections (the overlapping sectionsare depicted in grey in Figure 2) The reference chunk and 119876are sent to map worker nodes as input key value pairs Thekey value pairs considering the chunk 1198771015840 are represented as(119896119903 V119903) where 119896119903 is a key and V119903 contains the overlappingoffset data The key value pair of 119876 is represented as (119896119902 V119902)where 119896119902 is a key and V119902 is the query sequence In eachof the 119908 map workers sequence 119876 is further split into 119876
1015840
chunks and stored in the local memory available Alignmentis performed using the BWA-SW algorithm considering 1198771015840and each 119876
1015840 in a parallel fashion utilizing the 119901 coresThe Smith-Waterman algorithm in BWA-SW is optimizedby a parallelization technique to reduce execution time LetTMAP represent the average execution time of the 119908 mapworker nodes The map workers after computation provide
6 BioMed Research International
ChunkR
998400
Local memory1 2
Local memory1 2
Mapoutput
data
Mapoutput
data
Temporary cloud storageMap workers Cloud storageReduce workers
Shuffle Sort Reduce
Shuffle Sort Reduce
Chunk 1
Chunk 2
Burrows-Wheeler Alignerrsquos Smith-
Burrows-Wheeler Alignerrsquos Smith-
Smith-Waterman parallelized
Smith-Waterman parallelized
BWASW-PMR master node
Sequence alignment
results
Query genome Q
Reference genome R
Parallel execution platform
Parallel execution platform
Q998400
Q998400
Cloud storage
Waterman (BWA-SW) Alignment
Waterman (BWA-SW) Alignment
Virtual machinemdashmap worker w
Virtual machinemdashmap worker 1
Virtual machinemdashreduce worker w
Virtual machinemdashreduce worker 1
Figure 3 BWASW-PMR cloud model for long read sequence alignment
alignment locations between 119876 and chunk 1198771015840 along with thescore Multiple alignment locations and scores (ie V119898 alongthe chunk id of 1198771015840 and 119896119898) are stored in the temporary cloudmemory as list(119896119898 V119898) The map function of BWASW-PMRcan be defined as map((119896119903 V119903) (119896119902 V119902)) rarr list(119896119898 V119898)Reduce worker nodes 119908 obtain intermediate data that islist(119896119898 V119898) to perform the shuffle and sort function Inthe reduce phase aggregation of all alignment locationsthat is list(V119889) that are nonredundant and nonoverlappingis considered The reduce operation can be defined asreduce(119896119898 list(V119898)) rarr list(V119889) LetTREDUCE represent theaverage time required by 119908 reduce worker nodes to performthe shuffle sort and reduce computationThe totalmakespanof the BWASW-PMR cloud platform to align the sequence 119876against 119877 is computed as
T = TVM Config +TMAP +TREDUCE (7)
The overview of the BWASW-PMR cloud platform is shownin Figure 3
In the map phase of the BWASW-PMR cloud platformalignment locations and corresponding scores between119876 andchunk 119877
1015840 are computed The required data that is 119876 andchunk 1198771015840 is obtained from the cloud memory Let tGet Data Mrepresent the time taken to obtain the data To reducecomputation time using parallel computing techniques thegenomic sequence119876 is split into1198761015840 chunks Let t
119876119868represent
the time taken to split sequence 119876 into 1198761015840 chunks Sequencealignment considering one chunk of 1198761015840 and 1198771015840 is performedusing the BWA-SW algorithm DG(1198761015840) and PT(1198771015840) areinitially constructed The seed matches of sequences 1198761015840 and1198771015840 are computed using a dynamic programming mechanism
The seeds computed are extended to ensure rule alignment ofthe genomic sequences using the SW algorithm To reducecomputation time parallelization of the SW algorithm isconsidered in BWASW-PMR The parallelization techniqueto optimize computation is discussed in the latter subsection
Let tBWASW represent the time taken to align 1198761015840th chunkand 1198771015840 using the BWA-SW algorithmThe total time taken toalign the total 1198761015840 chunks is (1198761015840 times tBWASW) As 119901 computingcores are available with each worker node the parallelcomputation of 119901 number of chunks of 119876 is possible Thecomputation time considering all 1198761015840 chunks and utilizing119901 cores is defined as ((1198761015840 times tBWASW)119901) The alignmentlocations and scores obtained are stored in the cloud forreduce workers The time taken to store this data per mapworker is represented as tStore Data M The makespan of the119908th map worker node can be defined as
T119908 MAP = (tGet Data M + t
119876119868+
1198761015840times tBWASW119901
+ tStore Data M)
(8)
The average execution time of all the119908mapworker nodesis defined as
TMAP =sum119908
119894=1T119894 MAP
119908
(9)
32 Reduce Phase Themaster node in BWASW-PMR initial-izes119908 reduceworker nodes and119908mapworker nodes simulta-neouslyThis parallel initializationmechanism enables reduc-ing the total makespan T In the reduce phase the shuffle
BioMed Research International 7
Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19
Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29
Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39
Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49
Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59
(a)
Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19
Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29
Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39
Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49
Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59
(b)
Figure 4 Smith-Waterman algorithm (a) conventional computation technique and data dependencies and (b) wavefront based parallelizedtechnique adopted in BWASW-PMR
function obtains the intermediate data (produced by theworker nodes) from the cloud storage The alignment loca-tions obtained from the intermediate data are sorted basedon the offset data The time taken by the 119908th reduce workernode to obtain the intermediate data and perform shuffleand sort operations is represented as tGet Data R The reducefunction in BWASW-PMR is used to aggregate the alignmentlocations The overlapping and redundant alignments areneglected Parallelization of the reduce function is achievedby utilizing all 119901 computing cores available with workernodes Let tFn R119901 represent the time taken to compute thereduce function utilizing the 119901 computing cores The reducefunction provides the alignment results between sequences 119877and 119876 that are written to cloud storage for further analysisLet tStore Data R represent the time taken by the 119908th reduceworker node to write alignment results into the cloud storageThe makespan of the119908th reduce worker node can be definedas
T119908 REDUCE = (tGet Data R +
tFn R119901
+ tStore Data R) (10)
The average makespan of the 119908 reduce worker nodes that isTREDUCE is defined as
TREDUCE =sum119908
119894=1T119894 REDUCE119908
(11)
Using (11) and (9) the total makespan of the BWASW-PMRcan be defined as
T = TVM Config +sum119908
119894=1(T119894 MAP +T
119894 REDUCE)
119908
(12)
From (12) it can be observed thatT (the computing timeor makespan of BWASW-PMR) depends on the computationtime of map and reduce worker nodes The core alignmentsteps (based on the BWA-SW algorithm) are performed by
the map worker nodes that isTREDUCE ≪ TMAP The timetaken to initialize the map and reduce computing clustersbased on VMs that isTVM Config is dependent on the cloudplatform considered for deployment and can be neglectedTherefore it can be stated by the total makespan
T asymp
sum119908
119894=1(T119894 MAP)
119908
(13)
Optimizing the SW algorithm in BWA-SW is a possiblesolution to reduce the total makespanT
33 Smith-Waterman Algorithm Optimization In the SWalgorithm computation of the similarity matrix score thatis 119885 requires the maximum time Adopting a parallelizationtechnique for computation of 119885 is considered in BWASW-PMR Let us consider a query sequence 119876 and referencesequence 119877 whose sizes are 4 and 8 respectively Thereforethe matrix 119885 to be computed is of size (59) and is shown inFigure 4(a) Data dependencies that exist in computing eachelement of 119885 make the parallelization technique difficult toimplement For computation of 119885
34(shown as black lines in
Figure 4(a)) it requires that the computation of1198852311988524 and
11988533
must be completed Based on the sequences119876 and119877 andvalues of 119885
23 11988524 and 119885
33 the value of 119885
34is obtained
The computation of 119885 is more often done serially as shownthrough the grey arrows in Figure 4(a) To parallelize thecomputation of 119885 the adoption of CUDAGPU based tech-niques was considered by researchers [33 41] However theavailability and cost of such computing platforms on publicclouds are an issue Tomaximize resource utilization availablewith theVMworker node atminimal costs in BWASW-PMRwavefront based parallelization technique [42] is consideredThe parallel computation technique adopted is shown by greydiagonal lines in Figure 4(b) Computation of all cells of 119885that fall under the diagonal lines can be done in a parallelizedfashion
8 BioMed Research International
Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference
Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp
The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP
4 Evaluation Experiments
In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available
41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1
Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm
Number of bp considered for alignment
Sequence alignmentmdashexecution time
SWSW-optimized
01
1
10
100
Log
scal
ed ex
ecut
ion
time (
s)
sim5 k sim8 k sim10 k sim15 k
Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths
42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud
BioMed Research International 9
Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud
Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession
number ) Length
1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome V (BK0069392) 576874 bp
2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XI (BK0069442) 666816 bp
3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome X (BK0069432) 745751 bp
4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome II (BK0069362) 813184 bp
5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XVI (BK0069492) 948066 bp
1 2 3 4 5Experiment number
Long read alignmentmdashmakespan time
35
55
75
95
115
135
155
175
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Figure 6 Makespan observations considering 5 long read sequencealignment experiments
43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the
1000 5000 10000Length of reads
05
1015202530354045
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Long read alignmentmdashmakespan time on public cloud
Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments
execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study
Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce
10 BioMed Research International
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
2 4 6 8 10 12 14 16 180Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
5 10 15 20 250Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
5 10 15 20 25 300Makespan time (s)
Bwasw-Cloud (10000 b read)
Bwasw-Cloud (1000 b read)
Bwasw-Cloud (5000 b read)
(a)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 120Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 12 14 16 180Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
nLong read sequence aligner execution considering
5 10 15 200Makespan time (s)
BWASW-PMR (10000 b read)
BWASW-PMR (5000 b read)
BWASW-PMR (1000 b read)
(b)
Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes
function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs
The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results
obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform
5 Comparison Notes withRelated Work Counterparts
In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is
BioMed Research International 11
Table 3 Comparison of BWASW-PMR with its related work counterparts
BWASW-PMR Bwasw-Cloud[34]
MapReduce-BLAST[33] CloudAligner [13] BWA-SW
[10]Long sequence alignment support Yes Yes Yes Yes Yes
Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW
SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes
a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3
6 Conclusion and Future Work
Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long
sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study
In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)
References
[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004
[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics
[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov
[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009
[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012
[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988
12 BioMed Research International
[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997
[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002
[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008
[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010
[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012
[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1
[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011
[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999
[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010
[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004
[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010
[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010
[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014
[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010
[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics
[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us
[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics
[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-
ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014
[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012
[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015
[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005
[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with
MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service
for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009
[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009
[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011
[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014
[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010
[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011
[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014
[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981
[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982
[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000
[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011
[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995
[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml
BioMed Research International 13
[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg
[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
BioMed Research International 3
Split
Merge
MergeMerge
Inpu
t
Out
put
Map
Map
Partition
Partition
Reduce
Reduce
Worker 1 on VM Worker 1 on VM
Worker N on VM Worker M on VM
Map stage Reduce stage
Cloud storage
Cloud storage
VM based virtualized computing environment
Cloud storage
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
middot middot middot
Figure 1 MapReduce model deployed using VM based computing environment
iterative process and Hadoop incurs considerable perfor-mance overheads when iterative applications are hosted onthe framework [18 35]The aligners presented in [33 34] needto rely on multiway joins as genomic sequences are split andneeded to be merged prior to the reduce phase The perfor-mance of Hadoop suffers whenmultiway joins are considered[36]TheHadoop framework considers sequential processingof the map followed by reduce stage that affects performance[37]
Therefore in this paper we present Burrows-WheelerAlignerrsquos Smith-Waterman Alignment on Parallel MapRe-duce (BWASW-PMR) to perform long read alignments ona cloud platform BWASW-PMR adopts the BWA-SW pre-sented in [10] to perform long read alignments of genomicdata MapReduce model is considered for the execution onthe public cloud environment The Smith-Waterman (SW)algorithm [38 39] in BWA-SW is optimized using paral-lel computation technique To overcome the drawbacks ofHadoopMapReduce a parallel execution strategy of the mapand reduce workers is considered in BWASW-PMRThemapand reduce functions executions on the VM based workernodes are parallelized to reduce execution time The alignerpresented in [34] bears the closest similarity to the BWASW-PMR proposed and is considered for comparison The majorcontribution of our work can be summarized here
(i) Optimization of Smith-Waterman in the BWA-SWlong read sequence alignment
(ii) A custom MapReduce framework to support therequired computations for long sequence alignment
(iii) Parallel map and reduce workers execution strategy
(iv) Parallel execution of the map and reduce functions atworker nodes
This paper is organized as follows Section 2 presents relatedwork background Section 3 discusses the proposed Burrows-Wheeler Alignerrsquos Smith-Waterman Alignment on the Par-allel MapReduce BWASW-PMR The SW algorithm and itsconsidered optimization are also presented in Section 3 Theresults and the experimental study are shown in Section 4Comparison notes with related work counterparts are dis-cussed in the penultimate section The concluding remarksand future work are discussed in the last section
2 Background
21 Smith-Waterman (SW) Algorithm Burrows-WheelerAlignerrsquos Smith-Waterman (BWA-SW) Alignment relies onSW algorithm to align the seed matches of sequences Let 119902
0
and 1199021represent two genomic sequences obtained from the
seed matches The Smith-Waterman algorithm computes thesimilarity matrix score initially Using the similarity matrixand backtracking technique optimal alignment is obtainedLet X represent the size of 119902
0and Y represents the size of
1199021 For sequence 119902
0 there are X + 1 possible prefixes and
for 1199021 there areY + 1 possible prefixes including the empty
sequence Let 119902seq[1 Y] denote a prefix of Y charactersand 119902seq[X] represents the Xth character of a sequence119902seq The similarity score between prefixes 119902
0[1 119886] and
1199021[1 119887] is represented as 119902
119886119887 The similarity matrix is
denoted by 119885 and its size is (X + 1Y + 1) The first row and
4 BioMed Research International
column of 119885 are initialized to 0 The remaining elements of119885 (indexed by (119886 119887)) are computed using
119885119886119887
= max
0
119868119886119887
119869119886119887
119885119886minus1119887minus1
+119877 (119886 119887)
(1)
where 119877(119886 119887) is a function providing values of exact match ormismatch that is if 119902
0[119886] = 119902
1[119887] then119877(119886 119887) = 119883
119894(identical
characters among sequences 1199020and 1199021) otherwise 119877(119886 119887) =
119872119894(if characters among sequences 119902
0and 119902
1are unique) 119868
and 119869 represent thematrices in accordance with the affine gapmodel The matrices 119868 and 119869 are used to determine the gapsand are defined as
119868119886119887
= max
119868119886119887minus1
minus 119866119864
119885119886119887minus1
minus 119866119865
119869119886119887
= max
119869119886minus1119887
minus 119866119864
119885119886minus1119887
minus 119866119865
(2)
where 119866119864and 119866
119865represent the first and successive gap
penaltyBacktracking algorithm is used to find the optimal align-
ment between 1199020and 1199021 The backtracking begins with a cell
in 119885 that holds the highest score and proceeds till a zerovalued cell is reached Smith-Waterman algorithm providesoptimal alignments and similarity scores It must be notedthat computation ofmatrix119885 is bounded by the running timeof the slowest 119885 task due to the dependencies it exhibits
22 Burrows-Wheeler Alignerrsquos Smith-Waterman (BWA-SW)Alignment Algorithm BWA-SW algorithm constructs afull-text index in minute space (FM-index) [40] of the querysequence 119876 and the reference sequence 119877 A prefix directedacyclic word graph (119875119903119890119891119894119909 119863119860119882119866) is built using sequence119876 119901119903119890119891119894119909 119905119903119894119890 is built using the sequence 119877 The prefix trie isa tree representation of sequence 119877 The tree is constructedby concatenating all edge symbols from any node in the graphas a route to the root node providing a unique string Theunique string obtained is a substring of the sequence 119877 Eachnode of the tree is represented using 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897Traversing from nodes generates strings that are lexico-graphically sorted A node in 119901119903119890119891119894119909 119905119903119894119890 represents a stringThe 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 of the node and the string itrepresents are considered to be equivalent to each otherFrom 119901119903119890119891119894119909 119905119903119894119890 nodes exhibiting identical or similar119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 are collapsed to form 119875119903119890119891119894119909 119863119860119882119866Each node of 119875119903119890119891119894119909 119863119860119882119866 established represents oneor more substrings of 119877 Figure 2 represents 119901119903119890119891119894119909 119905119903119894119890119875119903119890119891119894119909 119863119860119882119866 and 119904119906119891119891119894119909 119886119903119903119886119910 of the input stringldquoBANANA$rdquo
Let PT(119883) and DG(119883) represent two functions thatare used to form 119901119903119890119891119894119909 119905119903119894119890 and 119875119903119890119891119894119909 119863119860119882119866 graph In
the BWA-SA algorithm PT(119877) and DG(119876) are initially com-puted Let 119886 be the root node of PT(119877) and 119887 represents theroot node of DG(119876) The best score between the sequences119876 and 119877 is computed using a dynamic programming mecha-nism Initialize 119864
119886119887= 119865119886119887= 119870119886119887= null considering the root
nodes of PT(119877) and DG(119876) For each of the parents that is119886119901 of the 119886th node in DG(119876) computation of 119865
119886119887|119886119901 119870119886119887|119886119901
and 119864
119886119887|119886119901is considered The computation of 119865
119886119887|119886119901is
119865119886119887|119886119901
= max 119865119886119901119887 119864119886119901119887
minus 119892119900 minus 119892119890 (3)
where 119892119900 is the gap open penalty and 119892119890 is the gap extensionpenalty119870
119886119887|119886119901is computed using
119870119886119887|119886119901
= max 119870119886119887119901 119864119886119887119901
minus 119892119900 minus 119892119890 (4)
where 119887119901is the parent of the 119887th node in PT(119877) The compu-
tation of 119864119886119887|119886119901
is achieved using
119864119886119887|119886119901
= max 119864119886119901119887119901
+ 119874 (119886119901 119886 119887119901 119887) 119865
119886119901119887119901 119870119886119901119887119901
0 (5)
where119874(119886119901 119886 119887119901 119887) represents the score between the symbol
on edge (119886119901 119886) and (119887
119901 119887) The computation of 119864
119886119887 119865119886119887 and
119870119886119887is defined as
(119864119886119887 119865119886119887 119870119886119887)
=
(119864119886119887|1198861015840 119865119886119887|1198861015840 119870119886119887|1198861015840) When 119864
119886119887|1198861015840 gt 0
(minusinfin minusinfin minusinfin) all other cases
(6)
where 1198861015840 = argmax119886119901isin119886119903(119886)
119864119886119887|119886119901
and 119886119903(119886) is a set containingparent nodes of 119886 Variable 119864
119886119887represents the best match-
ing score between the substrings that is substring 119886 andsubstring 119887 Consider 119864
119886119887gt 0 when the substrings 119887 and
119886 match It is known that the Smith-Waterman algorithmprovides accurate alignment results but requires large com-putation time To optimize the computations the reversepostorder traversal scheme on DG(119883) and PT(119883) is adoptedThe dynamic programming mechanism in the BWA-SWalgorithm enables identifying the seed matches of thegenomic sequences Based on the partial matches the Smith-Waterman Alignment is considered to the extended matches119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903 that is (119886 119887) is formed when the bestscore that is 119864
119886119887 is high and 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 size of 119887
is within the threshold setThe seedmatches are derived from119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903119904 by analyzing the suffix array of sequences119876 and119877Matching of the extended seedmatches using Smith-Waterman algorithm is considered only when high matchingscores are observed
3 Proposed Burrows-Wheeler AlignerrsquosSmith-Waterman Alignment on the ParallelMapReduce BWASW-PMR
BWASW-PMR provides a cloud platform to perform longread alignments considering genomic data obtained from
BioMed Research International 5
A NB$
$ N
$ NA
A
A
$
A
$ N
A
$
N
A
N
A
$
(a)
A NABANANA$
$ NA $ NA$
NA$$
(b)
B A N A N A $
NA
$
$
$
(c)
0 BANANA$
1 ANANA$B2 NANA$BA3 ANA$BAN4 NA$BANA5 A$BANAN6 $BANANA
String sorting
0 6 $BANAN1 5 A$BANA2 3 ANA$BA3 1 ANANA$
4 0 BANANA5 4 NA$BAN6 2 NANA$B
N
NB
$
A
A
A
(d)
Figure 2 Example of reference in a prefix trie query in prefix DAWG and the suffix array of the input string ldquoBANANA$rdquo (a) representsreference prefix trie searching for any substring of the string by staring at the root and followingmatches down the tree until being exhausted(b) represents prefix DAWG constructed by collapsing nodes with the identical suffix array interval DAWG transformed from the prefix trieof the query sequence (c) represents a constructed string suffix automaton (DAWG) (d) represents the suffix array for string ldquoBANANA$rdquoThe dollar sign $ is a regular expression that denotes the end of a line in the reference sequence
NGS techniques The BWASW-PMR adopts the MapReducecomputation model for cloud computation To support scal-ability in BWASW-PMR map and reduce worker nodes aredeployed on a cloud cluster consisting of VMs For long readalignments the BWA-SW algorithm is adopted The advan-tages of adopting the BWA-SW algorithm for long readalignments and its advantages over the existing aligners arefound in [10] The BWASW-PMR considers the genomicsequence alignment in dual phases that is map and reducephase Existing long read aligners for cloud platforms adoptthe Hadoop framework In the Hadoop framework basedsolutions the map phase is executed and then the reducephase is initiated To overcome the drawbacks of Hadoop aparallel execution strategy of the map and reduce phases isconsidered Optimization of the Smith-Waterman algorithmis an additional feature considered in BWASW-PMR Execu-tion of the map and reduce functions is modelled to run inparallel utilizing all programming cores available in workerVMs
31 Overview and Preliminaries Let 119877 represent a referencegenomic sequence and119876 the query sequence BWASW-PMR
is deployed on a cloud platform comprising a master nodemap worker nodes and reduce worker nodes The masternode of BWASW-PMR initializes 119908 map and reduce workernodes using virtual computing nodes Each computing nodeorVM is assumed to have119901CPUcores available for computa-tion LetTVM Config represent the time taken to initialize thevirtual computing platform The sequence 119877 is split into 1198771015840chunks with overlapping sections (the overlapping sectionsare depicted in grey in Figure 2) The reference chunk and 119876are sent to map worker nodes as input key value pairs Thekey value pairs considering the chunk 1198771015840 are represented as(119896119903 V119903) where 119896119903 is a key and V119903 contains the overlappingoffset data The key value pair of 119876 is represented as (119896119902 V119902)where 119896119902 is a key and V119902 is the query sequence In eachof the 119908 map workers sequence 119876 is further split into 119876
1015840
chunks and stored in the local memory available Alignmentis performed using the BWA-SW algorithm considering 1198771015840and each 119876
1015840 in a parallel fashion utilizing the 119901 coresThe Smith-Waterman algorithm in BWA-SW is optimizedby a parallelization technique to reduce execution time LetTMAP represent the average execution time of the 119908 mapworker nodes The map workers after computation provide
6 BioMed Research International
ChunkR
998400
Local memory1 2
Local memory1 2
Mapoutput
data
Mapoutput
data
Temporary cloud storageMap workers Cloud storageReduce workers
Shuffle Sort Reduce
Shuffle Sort Reduce
Chunk 1
Chunk 2
Burrows-Wheeler Alignerrsquos Smith-
Burrows-Wheeler Alignerrsquos Smith-
Smith-Waterman parallelized
Smith-Waterman parallelized
BWASW-PMR master node
Sequence alignment
results
Query genome Q
Reference genome R
Parallel execution platform
Parallel execution platform
Q998400
Q998400
Cloud storage
Waterman (BWA-SW) Alignment
Waterman (BWA-SW) Alignment
Virtual machinemdashmap worker w
Virtual machinemdashmap worker 1
Virtual machinemdashreduce worker w
Virtual machinemdashreduce worker 1
Figure 3 BWASW-PMR cloud model for long read sequence alignment
alignment locations between 119876 and chunk 1198771015840 along with thescore Multiple alignment locations and scores (ie V119898 alongthe chunk id of 1198771015840 and 119896119898) are stored in the temporary cloudmemory as list(119896119898 V119898) The map function of BWASW-PMRcan be defined as map((119896119903 V119903) (119896119902 V119902)) rarr list(119896119898 V119898)Reduce worker nodes 119908 obtain intermediate data that islist(119896119898 V119898) to perform the shuffle and sort function Inthe reduce phase aggregation of all alignment locationsthat is list(V119889) that are nonredundant and nonoverlappingis considered The reduce operation can be defined asreduce(119896119898 list(V119898)) rarr list(V119889) LetTREDUCE represent theaverage time required by 119908 reduce worker nodes to performthe shuffle sort and reduce computationThe totalmakespanof the BWASW-PMR cloud platform to align the sequence 119876against 119877 is computed as
T = TVM Config +TMAP +TREDUCE (7)
The overview of the BWASW-PMR cloud platform is shownin Figure 3
In the map phase of the BWASW-PMR cloud platformalignment locations and corresponding scores between119876 andchunk 119877
1015840 are computed The required data that is 119876 andchunk 1198771015840 is obtained from the cloud memory Let tGet Data Mrepresent the time taken to obtain the data To reducecomputation time using parallel computing techniques thegenomic sequence119876 is split into1198761015840 chunks Let t
119876119868represent
the time taken to split sequence 119876 into 1198761015840 chunks Sequencealignment considering one chunk of 1198761015840 and 1198771015840 is performedusing the BWA-SW algorithm DG(1198761015840) and PT(1198771015840) areinitially constructed The seed matches of sequences 1198761015840 and1198771015840 are computed using a dynamic programming mechanism
The seeds computed are extended to ensure rule alignment ofthe genomic sequences using the SW algorithm To reducecomputation time parallelization of the SW algorithm isconsidered in BWASW-PMR The parallelization techniqueto optimize computation is discussed in the latter subsection
Let tBWASW represent the time taken to align 1198761015840th chunkand 1198771015840 using the BWA-SW algorithmThe total time taken toalign the total 1198761015840 chunks is (1198761015840 times tBWASW) As 119901 computingcores are available with each worker node the parallelcomputation of 119901 number of chunks of 119876 is possible Thecomputation time considering all 1198761015840 chunks and utilizing119901 cores is defined as ((1198761015840 times tBWASW)119901) The alignmentlocations and scores obtained are stored in the cloud forreduce workers The time taken to store this data per mapworker is represented as tStore Data M The makespan of the119908th map worker node can be defined as
T119908 MAP = (tGet Data M + t
119876119868+
1198761015840times tBWASW119901
+ tStore Data M)
(8)
The average execution time of all the119908mapworker nodesis defined as
TMAP =sum119908
119894=1T119894 MAP
119908
(9)
32 Reduce Phase Themaster node in BWASW-PMR initial-izes119908 reduceworker nodes and119908mapworker nodes simulta-neouslyThis parallel initializationmechanism enables reduc-ing the total makespan T In the reduce phase the shuffle
BioMed Research International 7
Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19
Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29
Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39
Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49
Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59
(a)
Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19
Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29
Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39
Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49
Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59
(b)
Figure 4 Smith-Waterman algorithm (a) conventional computation technique and data dependencies and (b) wavefront based parallelizedtechnique adopted in BWASW-PMR
function obtains the intermediate data (produced by theworker nodes) from the cloud storage The alignment loca-tions obtained from the intermediate data are sorted basedon the offset data The time taken by the 119908th reduce workernode to obtain the intermediate data and perform shuffleand sort operations is represented as tGet Data R The reducefunction in BWASW-PMR is used to aggregate the alignmentlocations The overlapping and redundant alignments areneglected Parallelization of the reduce function is achievedby utilizing all 119901 computing cores available with workernodes Let tFn R119901 represent the time taken to compute thereduce function utilizing the 119901 computing cores The reducefunction provides the alignment results between sequences 119877and 119876 that are written to cloud storage for further analysisLet tStore Data R represent the time taken by the 119908th reduceworker node to write alignment results into the cloud storageThe makespan of the119908th reduce worker node can be definedas
T119908 REDUCE = (tGet Data R +
tFn R119901
+ tStore Data R) (10)
The average makespan of the 119908 reduce worker nodes that isTREDUCE is defined as
TREDUCE =sum119908
119894=1T119894 REDUCE119908
(11)
Using (11) and (9) the total makespan of the BWASW-PMRcan be defined as
T = TVM Config +sum119908
119894=1(T119894 MAP +T
119894 REDUCE)
119908
(12)
From (12) it can be observed thatT (the computing timeor makespan of BWASW-PMR) depends on the computationtime of map and reduce worker nodes The core alignmentsteps (based on the BWA-SW algorithm) are performed by
the map worker nodes that isTREDUCE ≪ TMAP The timetaken to initialize the map and reduce computing clustersbased on VMs that isTVM Config is dependent on the cloudplatform considered for deployment and can be neglectedTherefore it can be stated by the total makespan
T asymp
sum119908
119894=1(T119894 MAP)
119908
(13)
Optimizing the SW algorithm in BWA-SW is a possiblesolution to reduce the total makespanT
33 Smith-Waterman Algorithm Optimization In the SWalgorithm computation of the similarity matrix score thatis 119885 requires the maximum time Adopting a parallelizationtechnique for computation of 119885 is considered in BWASW-PMR Let us consider a query sequence 119876 and referencesequence 119877 whose sizes are 4 and 8 respectively Thereforethe matrix 119885 to be computed is of size (59) and is shown inFigure 4(a) Data dependencies that exist in computing eachelement of 119885 make the parallelization technique difficult toimplement For computation of 119885
34(shown as black lines in
Figure 4(a)) it requires that the computation of1198852311988524 and
11988533
must be completed Based on the sequences119876 and119877 andvalues of 119885
23 11988524 and 119885
33 the value of 119885
34is obtained
The computation of 119885 is more often done serially as shownthrough the grey arrows in Figure 4(a) To parallelize thecomputation of 119885 the adoption of CUDAGPU based tech-niques was considered by researchers [33 41] However theavailability and cost of such computing platforms on publicclouds are an issue Tomaximize resource utilization availablewith theVMworker node atminimal costs in BWASW-PMRwavefront based parallelization technique [42] is consideredThe parallel computation technique adopted is shown by greydiagonal lines in Figure 4(b) Computation of all cells of 119885that fall under the diagonal lines can be done in a parallelizedfashion
8 BioMed Research International
Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference
Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp
The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP
4 Evaluation Experiments
In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available
41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1
Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm
Number of bp considered for alignment
Sequence alignmentmdashexecution time
SWSW-optimized
01
1
10
100
Log
scal
ed ex
ecut
ion
time (
s)
sim5 k sim8 k sim10 k sim15 k
Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths
42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud
BioMed Research International 9
Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud
Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession
number ) Length
1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome V (BK0069392) 576874 bp
2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XI (BK0069442) 666816 bp
3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome X (BK0069432) 745751 bp
4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome II (BK0069362) 813184 bp
5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XVI (BK0069492) 948066 bp
1 2 3 4 5Experiment number
Long read alignmentmdashmakespan time
35
55
75
95
115
135
155
175
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Figure 6 Makespan observations considering 5 long read sequencealignment experiments
43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the
1000 5000 10000Length of reads
05
1015202530354045
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Long read alignmentmdashmakespan time on public cloud
Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments
execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study
Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce
10 BioMed Research International
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
2 4 6 8 10 12 14 16 180Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
5 10 15 20 250Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
5 10 15 20 25 300Makespan time (s)
Bwasw-Cloud (10000 b read)
Bwasw-Cloud (1000 b read)
Bwasw-Cloud (5000 b read)
(a)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 120Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 12 14 16 180Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
nLong read sequence aligner execution considering
5 10 15 200Makespan time (s)
BWASW-PMR (10000 b read)
BWASW-PMR (5000 b read)
BWASW-PMR (1000 b read)
(b)
Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes
function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs
The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results
obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform
5 Comparison Notes withRelated Work Counterparts
In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is
BioMed Research International 11
Table 3 Comparison of BWASW-PMR with its related work counterparts
BWASW-PMR Bwasw-Cloud[34]
MapReduce-BLAST[33] CloudAligner [13] BWA-SW
[10]Long sequence alignment support Yes Yes Yes Yes Yes
Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW
SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes
a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3
6 Conclusion and Future Work
Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long
sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study
In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)
References
[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004
[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics
[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov
[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009
[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012
[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988
12 BioMed Research International
[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997
[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002
[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008
[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010
[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012
[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1
[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011
[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999
[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010
[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004
[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010
[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010
[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014
[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010
[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics
[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us
[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics
[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-
ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014
[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012
[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015
[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005
[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with
MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service
for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009
[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009
[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011
[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014
[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010
[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011
[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014
[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981
[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982
[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000
[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011
[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995
[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml
BioMed Research International 13
[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg
[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
4 BioMed Research International
column of 119885 are initialized to 0 The remaining elements of119885 (indexed by (119886 119887)) are computed using
119885119886119887
= max
0
119868119886119887
119869119886119887
119885119886minus1119887minus1
+119877 (119886 119887)
(1)
where 119877(119886 119887) is a function providing values of exact match ormismatch that is if 119902
0[119886] = 119902
1[119887] then119877(119886 119887) = 119883
119894(identical
characters among sequences 1199020and 1199021) otherwise 119877(119886 119887) =
119872119894(if characters among sequences 119902
0and 119902
1are unique) 119868
and 119869 represent thematrices in accordance with the affine gapmodel The matrices 119868 and 119869 are used to determine the gapsand are defined as
119868119886119887
= max
119868119886119887minus1
minus 119866119864
119885119886119887minus1
minus 119866119865
119869119886119887
= max
119869119886minus1119887
minus 119866119864
119885119886minus1119887
minus 119866119865
(2)
where 119866119864and 119866
119865represent the first and successive gap
penaltyBacktracking algorithm is used to find the optimal align-
ment between 1199020and 1199021 The backtracking begins with a cell
in 119885 that holds the highest score and proceeds till a zerovalued cell is reached Smith-Waterman algorithm providesoptimal alignments and similarity scores It must be notedthat computation ofmatrix119885 is bounded by the running timeof the slowest 119885 task due to the dependencies it exhibits
22 Burrows-Wheeler Alignerrsquos Smith-Waterman (BWA-SW)Alignment Algorithm BWA-SW algorithm constructs afull-text index in minute space (FM-index) [40] of the querysequence 119876 and the reference sequence 119877 A prefix directedacyclic word graph (119875119903119890119891119894119909 119863119860119882119866) is built using sequence119876 119901119903119890119891119894119909 119905119903119894119890 is built using the sequence 119877 The prefix trie isa tree representation of sequence 119877 The tree is constructedby concatenating all edge symbols from any node in the graphas a route to the root node providing a unique string Theunique string obtained is a substring of the sequence 119877 Eachnode of the tree is represented using 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897Traversing from nodes generates strings that are lexico-graphically sorted A node in 119901119903119890119891119894119909 119905119903119894119890 represents a stringThe 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 of the node and the string itrepresents are considered to be equivalent to each otherFrom 119901119903119890119891119894119909 119905119903119894119890 nodes exhibiting identical or similar119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 are collapsed to form 119875119903119890119891119894119909 119863119860119882119866Each node of 119875119903119890119891119894119909 119863119860119882119866 established represents oneor more substrings of 119877 Figure 2 represents 119901119903119890119891119894119909 119905119903119894119890119875119903119890119891119894119909 119863119860119882119866 and 119904119906119891119891119894119909 119886119903119903119886119910 of the input stringldquoBANANA$rdquo
Let PT(119883) and DG(119883) represent two functions thatare used to form 119901119903119890119891119894119909 119905119903119894119890 and 119875119903119890119891119894119909 119863119860119882119866 graph In
the BWA-SA algorithm PT(119877) and DG(119876) are initially com-puted Let 119886 be the root node of PT(119877) and 119887 represents theroot node of DG(119876) The best score between the sequences119876 and 119877 is computed using a dynamic programming mecha-nism Initialize 119864
119886119887= 119865119886119887= 119870119886119887= null considering the root
nodes of PT(119877) and DG(119876) For each of the parents that is119886119901 of the 119886th node in DG(119876) computation of 119865
119886119887|119886119901 119870119886119887|119886119901
and 119864
119886119887|119886119901is considered The computation of 119865
119886119887|119886119901is
119865119886119887|119886119901
= max 119865119886119901119887 119864119886119901119887
minus 119892119900 minus 119892119890 (3)
where 119892119900 is the gap open penalty and 119892119890 is the gap extensionpenalty119870
119886119887|119886119901is computed using
119870119886119887|119886119901
= max 119870119886119887119901 119864119886119887119901
minus 119892119900 minus 119892119890 (4)
where 119887119901is the parent of the 119887th node in PT(119877) The compu-
tation of 119864119886119887|119886119901
is achieved using
119864119886119887|119886119901
= max 119864119886119901119887119901
+ 119874 (119886119901 119886 119887119901 119887) 119865
119886119901119887119901 119870119886119901119887119901
0 (5)
where119874(119886119901 119886 119887119901 119887) represents the score between the symbol
on edge (119886119901 119886) and (119887
119901 119887) The computation of 119864
119886119887 119865119886119887 and
119870119886119887is defined as
(119864119886119887 119865119886119887 119870119886119887)
=
(119864119886119887|1198861015840 119865119886119887|1198861015840 119870119886119887|1198861015840) When 119864
119886119887|1198861015840 gt 0
(minusinfin minusinfin minusinfin) all other cases
(6)
where 1198861015840 = argmax119886119901isin119886119903(119886)
119864119886119887|119886119901
and 119886119903(119886) is a set containingparent nodes of 119886 Variable 119864
119886119887represents the best match-
ing score between the substrings that is substring 119886 andsubstring 119887 Consider 119864
119886119887gt 0 when the substrings 119887 and
119886 match It is known that the Smith-Waterman algorithmprovides accurate alignment results but requires large com-putation time To optimize the computations the reversepostorder traversal scheme on DG(119883) and PT(119883) is adoptedThe dynamic programming mechanism in the BWA-SWalgorithm enables identifying the seed matches of thegenomic sequences Based on the partial matches the Smith-Waterman Alignment is considered to the extended matches119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903 that is (119886 119887) is formed when the bestscore that is 119864
119886119887 is high and 119904119906119891119891119894119909 119886119903119903119886119910 119894119899119905119890119903V119886119897 size of 119887
is within the threshold setThe seedmatches are derived from119904119890119890119889 119894119899119905119890119903V119886119897 119901119886119894119903119904 by analyzing the suffix array of sequences119876 and119877Matching of the extended seedmatches using Smith-Waterman algorithm is considered only when high matchingscores are observed
3 Proposed Burrows-Wheeler AlignerrsquosSmith-Waterman Alignment on the ParallelMapReduce BWASW-PMR
BWASW-PMR provides a cloud platform to perform longread alignments considering genomic data obtained from
BioMed Research International 5
A NB$
$ N
$ NA
A
A
$
A
$ N
A
$
N
A
N
A
$
(a)
A NABANANA$
$ NA $ NA$
NA$$
(b)
B A N A N A $
NA
$
$
$
(c)
0 BANANA$
1 ANANA$B2 NANA$BA3 ANA$BAN4 NA$BANA5 A$BANAN6 $BANANA
String sorting
0 6 $BANAN1 5 A$BANA2 3 ANA$BA3 1 ANANA$
4 0 BANANA5 4 NA$BAN6 2 NANA$B
N
NB
$
A
A
A
(d)
Figure 2 Example of reference in a prefix trie query in prefix DAWG and the suffix array of the input string ldquoBANANA$rdquo (a) representsreference prefix trie searching for any substring of the string by staring at the root and followingmatches down the tree until being exhausted(b) represents prefix DAWG constructed by collapsing nodes with the identical suffix array interval DAWG transformed from the prefix trieof the query sequence (c) represents a constructed string suffix automaton (DAWG) (d) represents the suffix array for string ldquoBANANA$rdquoThe dollar sign $ is a regular expression that denotes the end of a line in the reference sequence
NGS techniques The BWASW-PMR adopts the MapReducecomputation model for cloud computation To support scal-ability in BWASW-PMR map and reduce worker nodes aredeployed on a cloud cluster consisting of VMs For long readalignments the BWA-SW algorithm is adopted The advan-tages of adopting the BWA-SW algorithm for long readalignments and its advantages over the existing aligners arefound in [10] The BWASW-PMR considers the genomicsequence alignment in dual phases that is map and reducephase Existing long read aligners for cloud platforms adoptthe Hadoop framework In the Hadoop framework basedsolutions the map phase is executed and then the reducephase is initiated To overcome the drawbacks of Hadoop aparallel execution strategy of the map and reduce phases isconsidered Optimization of the Smith-Waterman algorithmis an additional feature considered in BWASW-PMR Execu-tion of the map and reduce functions is modelled to run inparallel utilizing all programming cores available in workerVMs
31 Overview and Preliminaries Let 119877 represent a referencegenomic sequence and119876 the query sequence BWASW-PMR
is deployed on a cloud platform comprising a master nodemap worker nodes and reduce worker nodes The masternode of BWASW-PMR initializes 119908 map and reduce workernodes using virtual computing nodes Each computing nodeorVM is assumed to have119901CPUcores available for computa-tion LetTVM Config represent the time taken to initialize thevirtual computing platform The sequence 119877 is split into 1198771015840chunks with overlapping sections (the overlapping sectionsare depicted in grey in Figure 2) The reference chunk and 119876are sent to map worker nodes as input key value pairs Thekey value pairs considering the chunk 1198771015840 are represented as(119896119903 V119903) where 119896119903 is a key and V119903 contains the overlappingoffset data The key value pair of 119876 is represented as (119896119902 V119902)where 119896119902 is a key and V119902 is the query sequence In eachof the 119908 map workers sequence 119876 is further split into 119876
1015840
chunks and stored in the local memory available Alignmentis performed using the BWA-SW algorithm considering 1198771015840and each 119876
1015840 in a parallel fashion utilizing the 119901 coresThe Smith-Waterman algorithm in BWA-SW is optimizedby a parallelization technique to reduce execution time LetTMAP represent the average execution time of the 119908 mapworker nodes The map workers after computation provide
6 BioMed Research International
ChunkR
998400
Local memory1 2
Local memory1 2
Mapoutput
data
Mapoutput
data
Temporary cloud storageMap workers Cloud storageReduce workers
Shuffle Sort Reduce
Shuffle Sort Reduce
Chunk 1
Chunk 2
Burrows-Wheeler Alignerrsquos Smith-
Burrows-Wheeler Alignerrsquos Smith-
Smith-Waterman parallelized
Smith-Waterman parallelized
BWASW-PMR master node
Sequence alignment
results
Query genome Q
Reference genome R
Parallel execution platform
Parallel execution platform
Q998400
Q998400
Cloud storage
Waterman (BWA-SW) Alignment
Waterman (BWA-SW) Alignment
Virtual machinemdashmap worker w
Virtual machinemdashmap worker 1
Virtual machinemdashreduce worker w
Virtual machinemdashreduce worker 1
Figure 3 BWASW-PMR cloud model for long read sequence alignment
alignment locations between 119876 and chunk 1198771015840 along with thescore Multiple alignment locations and scores (ie V119898 alongthe chunk id of 1198771015840 and 119896119898) are stored in the temporary cloudmemory as list(119896119898 V119898) The map function of BWASW-PMRcan be defined as map((119896119903 V119903) (119896119902 V119902)) rarr list(119896119898 V119898)Reduce worker nodes 119908 obtain intermediate data that islist(119896119898 V119898) to perform the shuffle and sort function Inthe reduce phase aggregation of all alignment locationsthat is list(V119889) that are nonredundant and nonoverlappingis considered The reduce operation can be defined asreduce(119896119898 list(V119898)) rarr list(V119889) LetTREDUCE represent theaverage time required by 119908 reduce worker nodes to performthe shuffle sort and reduce computationThe totalmakespanof the BWASW-PMR cloud platform to align the sequence 119876against 119877 is computed as
T = TVM Config +TMAP +TREDUCE (7)
The overview of the BWASW-PMR cloud platform is shownin Figure 3
In the map phase of the BWASW-PMR cloud platformalignment locations and corresponding scores between119876 andchunk 119877
1015840 are computed The required data that is 119876 andchunk 1198771015840 is obtained from the cloud memory Let tGet Data Mrepresent the time taken to obtain the data To reducecomputation time using parallel computing techniques thegenomic sequence119876 is split into1198761015840 chunks Let t
119876119868represent
the time taken to split sequence 119876 into 1198761015840 chunks Sequencealignment considering one chunk of 1198761015840 and 1198771015840 is performedusing the BWA-SW algorithm DG(1198761015840) and PT(1198771015840) areinitially constructed The seed matches of sequences 1198761015840 and1198771015840 are computed using a dynamic programming mechanism
The seeds computed are extended to ensure rule alignment ofthe genomic sequences using the SW algorithm To reducecomputation time parallelization of the SW algorithm isconsidered in BWASW-PMR The parallelization techniqueto optimize computation is discussed in the latter subsection
Let tBWASW represent the time taken to align 1198761015840th chunkand 1198771015840 using the BWA-SW algorithmThe total time taken toalign the total 1198761015840 chunks is (1198761015840 times tBWASW) As 119901 computingcores are available with each worker node the parallelcomputation of 119901 number of chunks of 119876 is possible Thecomputation time considering all 1198761015840 chunks and utilizing119901 cores is defined as ((1198761015840 times tBWASW)119901) The alignmentlocations and scores obtained are stored in the cloud forreduce workers The time taken to store this data per mapworker is represented as tStore Data M The makespan of the119908th map worker node can be defined as
T119908 MAP = (tGet Data M + t
119876119868+
1198761015840times tBWASW119901
+ tStore Data M)
(8)
The average execution time of all the119908mapworker nodesis defined as
TMAP =sum119908
119894=1T119894 MAP
119908
(9)
32 Reduce Phase Themaster node in BWASW-PMR initial-izes119908 reduceworker nodes and119908mapworker nodes simulta-neouslyThis parallel initializationmechanism enables reduc-ing the total makespan T In the reduce phase the shuffle
BioMed Research International 7
Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19
Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29
Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39
Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49
Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59
(a)
Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19
Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29
Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39
Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49
Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59
(b)
Figure 4 Smith-Waterman algorithm (a) conventional computation technique and data dependencies and (b) wavefront based parallelizedtechnique adopted in BWASW-PMR
function obtains the intermediate data (produced by theworker nodes) from the cloud storage The alignment loca-tions obtained from the intermediate data are sorted basedon the offset data The time taken by the 119908th reduce workernode to obtain the intermediate data and perform shuffleand sort operations is represented as tGet Data R The reducefunction in BWASW-PMR is used to aggregate the alignmentlocations The overlapping and redundant alignments areneglected Parallelization of the reduce function is achievedby utilizing all 119901 computing cores available with workernodes Let tFn R119901 represent the time taken to compute thereduce function utilizing the 119901 computing cores The reducefunction provides the alignment results between sequences 119877and 119876 that are written to cloud storage for further analysisLet tStore Data R represent the time taken by the 119908th reduceworker node to write alignment results into the cloud storageThe makespan of the119908th reduce worker node can be definedas
T119908 REDUCE = (tGet Data R +
tFn R119901
+ tStore Data R) (10)
The average makespan of the 119908 reduce worker nodes that isTREDUCE is defined as
TREDUCE =sum119908
119894=1T119894 REDUCE119908
(11)
Using (11) and (9) the total makespan of the BWASW-PMRcan be defined as
T = TVM Config +sum119908
119894=1(T119894 MAP +T
119894 REDUCE)
119908
(12)
From (12) it can be observed thatT (the computing timeor makespan of BWASW-PMR) depends on the computationtime of map and reduce worker nodes The core alignmentsteps (based on the BWA-SW algorithm) are performed by
the map worker nodes that isTREDUCE ≪ TMAP The timetaken to initialize the map and reduce computing clustersbased on VMs that isTVM Config is dependent on the cloudplatform considered for deployment and can be neglectedTherefore it can be stated by the total makespan
T asymp
sum119908
119894=1(T119894 MAP)
119908
(13)
Optimizing the SW algorithm in BWA-SW is a possiblesolution to reduce the total makespanT
33 Smith-Waterman Algorithm Optimization In the SWalgorithm computation of the similarity matrix score thatis 119885 requires the maximum time Adopting a parallelizationtechnique for computation of 119885 is considered in BWASW-PMR Let us consider a query sequence 119876 and referencesequence 119877 whose sizes are 4 and 8 respectively Thereforethe matrix 119885 to be computed is of size (59) and is shown inFigure 4(a) Data dependencies that exist in computing eachelement of 119885 make the parallelization technique difficult toimplement For computation of 119885
34(shown as black lines in
Figure 4(a)) it requires that the computation of1198852311988524 and
11988533
must be completed Based on the sequences119876 and119877 andvalues of 119885
23 11988524 and 119885
33 the value of 119885
34is obtained
The computation of 119885 is more often done serially as shownthrough the grey arrows in Figure 4(a) To parallelize thecomputation of 119885 the adoption of CUDAGPU based tech-niques was considered by researchers [33 41] However theavailability and cost of such computing platforms on publicclouds are an issue Tomaximize resource utilization availablewith theVMworker node atminimal costs in BWASW-PMRwavefront based parallelization technique [42] is consideredThe parallel computation technique adopted is shown by greydiagonal lines in Figure 4(b) Computation of all cells of 119885that fall under the diagonal lines can be done in a parallelizedfashion
8 BioMed Research International
Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference
Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp
The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP
4 Evaluation Experiments
In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available
41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1
Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm
Number of bp considered for alignment
Sequence alignmentmdashexecution time
SWSW-optimized
01
1
10
100
Log
scal
ed ex
ecut
ion
time (
s)
sim5 k sim8 k sim10 k sim15 k
Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths
42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud
BioMed Research International 9
Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud
Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession
number ) Length
1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome V (BK0069392) 576874 bp
2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XI (BK0069442) 666816 bp
3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome X (BK0069432) 745751 bp
4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome II (BK0069362) 813184 bp
5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XVI (BK0069492) 948066 bp
1 2 3 4 5Experiment number
Long read alignmentmdashmakespan time
35
55
75
95
115
135
155
175
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Figure 6 Makespan observations considering 5 long read sequencealignment experiments
43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the
1000 5000 10000Length of reads
05
1015202530354045
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Long read alignmentmdashmakespan time on public cloud
Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments
execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study
Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce
10 BioMed Research International
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
2 4 6 8 10 12 14 16 180Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
5 10 15 20 250Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
5 10 15 20 25 300Makespan time (s)
Bwasw-Cloud (10000 b read)
Bwasw-Cloud (1000 b read)
Bwasw-Cloud (5000 b read)
(a)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 120Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 12 14 16 180Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
nLong read sequence aligner execution considering
5 10 15 200Makespan time (s)
BWASW-PMR (10000 b read)
BWASW-PMR (5000 b read)
BWASW-PMR (1000 b read)
(b)
Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes
function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs
The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results
obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform
5 Comparison Notes withRelated Work Counterparts
In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is
BioMed Research International 11
Table 3 Comparison of BWASW-PMR with its related work counterparts
BWASW-PMR Bwasw-Cloud[34]
MapReduce-BLAST[33] CloudAligner [13] BWA-SW
[10]Long sequence alignment support Yes Yes Yes Yes Yes
Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW
SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes
a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3
6 Conclusion and Future Work
Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long
sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study
In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)
References
[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004
[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics
[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov
[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009
[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012
[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988
12 BioMed Research International
[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997
[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002
[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008
[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010
[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012
[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1
[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011
[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999
[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010
[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004
[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010
[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010
[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014
[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010
[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics
[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us
[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics
[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-
ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014
[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012
[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015
[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005
[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with
MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service
for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009
[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009
[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011
[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014
[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010
[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011
[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014
[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981
[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982
[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000
[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011
[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995
[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml
BioMed Research International 13
[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg
[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
BioMed Research International 5
A NB$
$ N
$ NA
A
A
$
A
$ N
A
$
N
A
N
A
$
(a)
A NABANANA$
$ NA $ NA$
NA$$
(b)
B A N A N A $
NA
$
$
$
(c)
0 BANANA$
1 ANANA$B2 NANA$BA3 ANA$BAN4 NA$BANA5 A$BANAN6 $BANANA
String sorting
0 6 $BANAN1 5 A$BANA2 3 ANA$BA3 1 ANANA$
4 0 BANANA5 4 NA$BAN6 2 NANA$B
N
NB
$
A
A
A
(d)
Figure 2 Example of reference in a prefix trie query in prefix DAWG and the suffix array of the input string ldquoBANANA$rdquo (a) representsreference prefix trie searching for any substring of the string by staring at the root and followingmatches down the tree until being exhausted(b) represents prefix DAWG constructed by collapsing nodes with the identical suffix array interval DAWG transformed from the prefix trieof the query sequence (c) represents a constructed string suffix automaton (DAWG) (d) represents the suffix array for string ldquoBANANA$rdquoThe dollar sign $ is a regular expression that denotes the end of a line in the reference sequence
NGS techniques The BWASW-PMR adopts the MapReducecomputation model for cloud computation To support scal-ability in BWASW-PMR map and reduce worker nodes aredeployed on a cloud cluster consisting of VMs For long readalignments the BWA-SW algorithm is adopted The advan-tages of adopting the BWA-SW algorithm for long readalignments and its advantages over the existing aligners arefound in [10] The BWASW-PMR considers the genomicsequence alignment in dual phases that is map and reducephase Existing long read aligners for cloud platforms adoptthe Hadoop framework In the Hadoop framework basedsolutions the map phase is executed and then the reducephase is initiated To overcome the drawbacks of Hadoop aparallel execution strategy of the map and reduce phases isconsidered Optimization of the Smith-Waterman algorithmis an additional feature considered in BWASW-PMR Execu-tion of the map and reduce functions is modelled to run inparallel utilizing all programming cores available in workerVMs
31 Overview and Preliminaries Let 119877 represent a referencegenomic sequence and119876 the query sequence BWASW-PMR
is deployed on a cloud platform comprising a master nodemap worker nodes and reduce worker nodes The masternode of BWASW-PMR initializes 119908 map and reduce workernodes using virtual computing nodes Each computing nodeorVM is assumed to have119901CPUcores available for computa-tion LetTVM Config represent the time taken to initialize thevirtual computing platform The sequence 119877 is split into 1198771015840chunks with overlapping sections (the overlapping sectionsare depicted in grey in Figure 2) The reference chunk and 119876are sent to map worker nodes as input key value pairs Thekey value pairs considering the chunk 1198771015840 are represented as(119896119903 V119903) where 119896119903 is a key and V119903 contains the overlappingoffset data The key value pair of 119876 is represented as (119896119902 V119902)where 119896119902 is a key and V119902 is the query sequence In eachof the 119908 map workers sequence 119876 is further split into 119876
1015840
chunks and stored in the local memory available Alignmentis performed using the BWA-SW algorithm considering 1198771015840and each 119876
1015840 in a parallel fashion utilizing the 119901 coresThe Smith-Waterman algorithm in BWA-SW is optimizedby a parallelization technique to reduce execution time LetTMAP represent the average execution time of the 119908 mapworker nodes The map workers after computation provide
6 BioMed Research International
ChunkR
998400
Local memory1 2
Local memory1 2
Mapoutput
data
Mapoutput
data
Temporary cloud storageMap workers Cloud storageReduce workers
Shuffle Sort Reduce
Shuffle Sort Reduce
Chunk 1
Chunk 2
Burrows-Wheeler Alignerrsquos Smith-
Burrows-Wheeler Alignerrsquos Smith-
Smith-Waterman parallelized
Smith-Waterman parallelized
BWASW-PMR master node
Sequence alignment
results
Query genome Q
Reference genome R
Parallel execution platform
Parallel execution platform
Q998400
Q998400
Cloud storage
Waterman (BWA-SW) Alignment
Waterman (BWA-SW) Alignment
Virtual machinemdashmap worker w
Virtual machinemdashmap worker 1
Virtual machinemdashreduce worker w
Virtual machinemdashreduce worker 1
Figure 3 BWASW-PMR cloud model for long read sequence alignment
alignment locations between 119876 and chunk 1198771015840 along with thescore Multiple alignment locations and scores (ie V119898 alongthe chunk id of 1198771015840 and 119896119898) are stored in the temporary cloudmemory as list(119896119898 V119898) The map function of BWASW-PMRcan be defined as map((119896119903 V119903) (119896119902 V119902)) rarr list(119896119898 V119898)Reduce worker nodes 119908 obtain intermediate data that islist(119896119898 V119898) to perform the shuffle and sort function Inthe reduce phase aggregation of all alignment locationsthat is list(V119889) that are nonredundant and nonoverlappingis considered The reduce operation can be defined asreduce(119896119898 list(V119898)) rarr list(V119889) LetTREDUCE represent theaverage time required by 119908 reduce worker nodes to performthe shuffle sort and reduce computationThe totalmakespanof the BWASW-PMR cloud platform to align the sequence 119876against 119877 is computed as
T = TVM Config +TMAP +TREDUCE (7)
The overview of the BWASW-PMR cloud platform is shownin Figure 3
In the map phase of the BWASW-PMR cloud platformalignment locations and corresponding scores between119876 andchunk 119877
1015840 are computed The required data that is 119876 andchunk 1198771015840 is obtained from the cloud memory Let tGet Data Mrepresent the time taken to obtain the data To reducecomputation time using parallel computing techniques thegenomic sequence119876 is split into1198761015840 chunks Let t
119876119868represent
the time taken to split sequence 119876 into 1198761015840 chunks Sequencealignment considering one chunk of 1198761015840 and 1198771015840 is performedusing the BWA-SW algorithm DG(1198761015840) and PT(1198771015840) areinitially constructed The seed matches of sequences 1198761015840 and1198771015840 are computed using a dynamic programming mechanism
The seeds computed are extended to ensure rule alignment ofthe genomic sequences using the SW algorithm To reducecomputation time parallelization of the SW algorithm isconsidered in BWASW-PMR The parallelization techniqueto optimize computation is discussed in the latter subsection
Let tBWASW represent the time taken to align 1198761015840th chunkand 1198771015840 using the BWA-SW algorithmThe total time taken toalign the total 1198761015840 chunks is (1198761015840 times tBWASW) As 119901 computingcores are available with each worker node the parallelcomputation of 119901 number of chunks of 119876 is possible Thecomputation time considering all 1198761015840 chunks and utilizing119901 cores is defined as ((1198761015840 times tBWASW)119901) The alignmentlocations and scores obtained are stored in the cloud forreduce workers The time taken to store this data per mapworker is represented as tStore Data M The makespan of the119908th map worker node can be defined as
T119908 MAP = (tGet Data M + t
119876119868+
1198761015840times tBWASW119901
+ tStore Data M)
(8)
The average execution time of all the119908mapworker nodesis defined as
TMAP =sum119908
119894=1T119894 MAP
119908
(9)
32 Reduce Phase Themaster node in BWASW-PMR initial-izes119908 reduceworker nodes and119908mapworker nodes simulta-neouslyThis parallel initializationmechanism enables reduc-ing the total makespan T In the reduce phase the shuffle
BioMed Research International 7
Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19
Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29
Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39
Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49
Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59
(a)
Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19
Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29
Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39
Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49
Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59
(b)
Figure 4 Smith-Waterman algorithm (a) conventional computation technique and data dependencies and (b) wavefront based parallelizedtechnique adopted in BWASW-PMR
function obtains the intermediate data (produced by theworker nodes) from the cloud storage The alignment loca-tions obtained from the intermediate data are sorted basedon the offset data The time taken by the 119908th reduce workernode to obtain the intermediate data and perform shuffleand sort operations is represented as tGet Data R The reducefunction in BWASW-PMR is used to aggregate the alignmentlocations The overlapping and redundant alignments areneglected Parallelization of the reduce function is achievedby utilizing all 119901 computing cores available with workernodes Let tFn R119901 represent the time taken to compute thereduce function utilizing the 119901 computing cores The reducefunction provides the alignment results between sequences 119877and 119876 that are written to cloud storage for further analysisLet tStore Data R represent the time taken by the 119908th reduceworker node to write alignment results into the cloud storageThe makespan of the119908th reduce worker node can be definedas
T119908 REDUCE = (tGet Data R +
tFn R119901
+ tStore Data R) (10)
The average makespan of the 119908 reduce worker nodes that isTREDUCE is defined as
TREDUCE =sum119908
119894=1T119894 REDUCE119908
(11)
Using (11) and (9) the total makespan of the BWASW-PMRcan be defined as
T = TVM Config +sum119908
119894=1(T119894 MAP +T
119894 REDUCE)
119908
(12)
From (12) it can be observed thatT (the computing timeor makespan of BWASW-PMR) depends on the computationtime of map and reduce worker nodes The core alignmentsteps (based on the BWA-SW algorithm) are performed by
the map worker nodes that isTREDUCE ≪ TMAP The timetaken to initialize the map and reduce computing clustersbased on VMs that isTVM Config is dependent on the cloudplatform considered for deployment and can be neglectedTherefore it can be stated by the total makespan
T asymp
sum119908
119894=1(T119894 MAP)
119908
(13)
Optimizing the SW algorithm in BWA-SW is a possiblesolution to reduce the total makespanT
33 Smith-Waterman Algorithm Optimization In the SWalgorithm computation of the similarity matrix score thatis 119885 requires the maximum time Adopting a parallelizationtechnique for computation of 119885 is considered in BWASW-PMR Let us consider a query sequence 119876 and referencesequence 119877 whose sizes are 4 and 8 respectively Thereforethe matrix 119885 to be computed is of size (59) and is shown inFigure 4(a) Data dependencies that exist in computing eachelement of 119885 make the parallelization technique difficult toimplement For computation of 119885
34(shown as black lines in
Figure 4(a)) it requires that the computation of1198852311988524 and
11988533
must be completed Based on the sequences119876 and119877 andvalues of 119885
23 11988524 and 119885
33 the value of 119885
34is obtained
The computation of 119885 is more often done serially as shownthrough the grey arrows in Figure 4(a) To parallelize thecomputation of 119885 the adoption of CUDAGPU based tech-niques was considered by researchers [33 41] However theavailability and cost of such computing platforms on publicclouds are an issue Tomaximize resource utilization availablewith theVMworker node atminimal costs in BWASW-PMRwavefront based parallelization technique [42] is consideredThe parallel computation technique adopted is shown by greydiagonal lines in Figure 4(b) Computation of all cells of 119885that fall under the diagonal lines can be done in a parallelizedfashion
8 BioMed Research International
Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference
Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp
The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP
4 Evaluation Experiments
In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available
41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1
Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm
Number of bp considered for alignment
Sequence alignmentmdashexecution time
SWSW-optimized
01
1
10
100
Log
scal
ed ex
ecut
ion
time (
s)
sim5 k sim8 k sim10 k sim15 k
Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths
42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud
BioMed Research International 9
Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud
Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession
number ) Length
1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome V (BK0069392) 576874 bp
2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XI (BK0069442) 666816 bp
3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome X (BK0069432) 745751 bp
4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome II (BK0069362) 813184 bp
5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XVI (BK0069492) 948066 bp
1 2 3 4 5Experiment number
Long read alignmentmdashmakespan time
35
55
75
95
115
135
155
175
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Figure 6 Makespan observations considering 5 long read sequencealignment experiments
43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the
1000 5000 10000Length of reads
05
1015202530354045
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Long read alignmentmdashmakespan time on public cloud
Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments
execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study
Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce
10 BioMed Research International
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
2 4 6 8 10 12 14 16 180Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
5 10 15 20 250Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
5 10 15 20 25 300Makespan time (s)
Bwasw-Cloud (10000 b read)
Bwasw-Cloud (1000 b read)
Bwasw-Cloud (5000 b read)
(a)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 120Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 12 14 16 180Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
nLong read sequence aligner execution considering
5 10 15 200Makespan time (s)
BWASW-PMR (10000 b read)
BWASW-PMR (5000 b read)
BWASW-PMR (1000 b read)
(b)
Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes
function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs
The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results
obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform
5 Comparison Notes withRelated Work Counterparts
In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is
BioMed Research International 11
Table 3 Comparison of BWASW-PMR with its related work counterparts
BWASW-PMR Bwasw-Cloud[34]
MapReduce-BLAST[33] CloudAligner [13] BWA-SW
[10]Long sequence alignment support Yes Yes Yes Yes Yes
Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW
SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes
a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3
6 Conclusion and Future Work
Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long
sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study
In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)
References
[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004
[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics
[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov
[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009
[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012
[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988
12 BioMed Research International
[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997
[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002
[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008
[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010
[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012
[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1
[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011
[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999
[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010
[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004
[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010
[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010
[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014
[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010
[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics
[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us
[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics
[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-
ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014
[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012
[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015
[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005
[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with
MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service
for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009
[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009
[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011
[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014
[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010
[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011
[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014
[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981
[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982
[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000
[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011
[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995
[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml
BioMed Research International 13
[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg
[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
6 BioMed Research International
ChunkR
998400
Local memory1 2
Local memory1 2
Mapoutput
data
Mapoutput
data
Temporary cloud storageMap workers Cloud storageReduce workers
Shuffle Sort Reduce
Shuffle Sort Reduce
Chunk 1
Chunk 2
Burrows-Wheeler Alignerrsquos Smith-
Burrows-Wheeler Alignerrsquos Smith-
Smith-Waterman parallelized
Smith-Waterman parallelized
BWASW-PMR master node
Sequence alignment
results
Query genome Q
Reference genome R
Parallel execution platform
Parallel execution platform
Q998400
Q998400
Cloud storage
Waterman (BWA-SW) Alignment
Waterman (BWA-SW) Alignment
Virtual machinemdashmap worker w
Virtual machinemdashmap worker 1
Virtual machinemdashreduce worker w
Virtual machinemdashreduce worker 1
Figure 3 BWASW-PMR cloud model for long read sequence alignment
alignment locations between 119876 and chunk 1198771015840 along with thescore Multiple alignment locations and scores (ie V119898 alongthe chunk id of 1198771015840 and 119896119898) are stored in the temporary cloudmemory as list(119896119898 V119898) The map function of BWASW-PMRcan be defined as map((119896119903 V119903) (119896119902 V119902)) rarr list(119896119898 V119898)Reduce worker nodes 119908 obtain intermediate data that islist(119896119898 V119898) to perform the shuffle and sort function Inthe reduce phase aggregation of all alignment locationsthat is list(V119889) that are nonredundant and nonoverlappingis considered The reduce operation can be defined asreduce(119896119898 list(V119898)) rarr list(V119889) LetTREDUCE represent theaverage time required by 119908 reduce worker nodes to performthe shuffle sort and reduce computationThe totalmakespanof the BWASW-PMR cloud platform to align the sequence 119876against 119877 is computed as
T = TVM Config +TMAP +TREDUCE (7)
The overview of the BWASW-PMR cloud platform is shownin Figure 3
In the map phase of the BWASW-PMR cloud platformalignment locations and corresponding scores between119876 andchunk 119877
1015840 are computed The required data that is 119876 andchunk 1198771015840 is obtained from the cloud memory Let tGet Data Mrepresent the time taken to obtain the data To reducecomputation time using parallel computing techniques thegenomic sequence119876 is split into1198761015840 chunks Let t
119876119868represent
the time taken to split sequence 119876 into 1198761015840 chunks Sequencealignment considering one chunk of 1198761015840 and 1198771015840 is performedusing the BWA-SW algorithm DG(1198761015840) and PT(1198771015840) areinitially constructed The seed matches of sequences 1198761015840 and1198771015840 are computed using a dynamic programming mechanism
The seeds computed are extended to ensure rule alignment ofthe genomic sequences using the SW algorithm To reducecomputation time parallelization of the SW algorithm isconsidered in BWASW-PMR The parallelization techniqueto optimize computation is discussed in the latter subsection
Let tBWASW represent the time taken to align 1198761015840th chunkand 1198771015840 using the BWA-SW algorithmThe total time taken toalign the total 1198761015840 chunks is (1198761015840 times tBWASW) As 119901 computingcores are available with each worker node the parallelcomputation of 119901 number of chunks of 119876 is possible Thecomputation time considering all 1198761015840 chunks and utilizing119901 cores is defined as ((1198761015840 times tBWASW)119901) The alignmentlocations and scores obtained are stored in the cloud forreduce workers The time taken to store this data per mapworker is represented as tStore Data M The makespan of the119908th map worker node can be defined as
T119908 MAP = (tGet Data M + t
119876119868+
1198761015840times tBWASW119901
+ tStore Data M)
(8)
The average execution time of all the119908mapworker nodesis defined as
TMAP =sum119908
119894=1T119894 MAP
119908
(9)
32 Reduce Phase Themaster node in BWASW-PMR initial-izes119908 reduceworker nodes and119908mapworker nodes simulta-neouslyThis parallel initializationmechanism enables reduc-ing the total makespan T In the reduce phase the shuffle
BioMed Research International 7
Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19
Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29
Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39
Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49
Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59
(a)
Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19
Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29
Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39
Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49
Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59
(b)
Figure 4 Smith-Waterman algorithm (a) conventional computation technique and data dependencies and (b) wavefront based parallelizedtechnique adopted in BWASW-PMR
function obtains the intermediate data (produced by theworker nodes) from the cloud storage The alignment loca-tions obtained from the intermediate data are sorted basedon the offset data The time taken by the 119908th reduce workernode to obtain the intermediate data and perform shuffleand sort operations is represented as tGet Data R The reducefunction in BWASW-PMR is used to aggregate the alignmentlocations The overlapping and redundant alignments areneglected Parallelization of the reduce function is achievedby utilizing all 119901 computing cores available with workernodes Let tFn R119901 represent the time taken to compute thereduce function utilizing the 119901 computing cores The reducefunction provides the alignment results between sequences 119877and 119876 that are written to cloud storage for further analysisLet tStore Data R represent the time taken by the 119908th reduceworker node to write alignment results into the cloud storageThe makespan of the119908th reduce worker node can be definedas
T119908 REDUCE = (tGet Data R +
tFn R119901
+ tStore Data R) (10)
The average makespan of the 119908 reduce worker nodes that isTREDUCE is defined as
TREDUCE =sum119908
119894=1T119894 REDUCE119908
(11)
Using (11) and (9) the total makespan of the BWASW-PMRcan be defined as
T = TVM Config +sum119908
119894=1(T119894 MAP +T
119894 REDUCE)
119908
(12)
From (12) it can be observed thatT (the computing timeor makespan of BWASW-PMR) depends on the computationtime of map and reduce worker nodes The core alignmentsteps (based on the BWA-SW algorithm) are performed by
the map worker nodes that isTREDUCE ≪ TMAP The timetaken to initialize the map and reduce computing clustersbased on VMs that isTVM Config is dependent on the cloudplatform considered for deployment and can be neglectedTherefore it can be stated by the total makespan
T asymp
sum119908
119894=1(T119894 MAP)
119908
(13)
Optimizing the SW algorithm in BWA-SW is a possiblesolution to reduce the total makespanT
33 Smith-Waterman Algorithm Optimization In the SWalgorithm computation of the similarity matrix score thatis 119885 requires the maximum time Adopting a parallelizationtechnique for computation of 119885 is considered in BWASW-PMR Let us consider a query sequence 119876 and referencesequence 119877 whose sizes are 4 and 8 respectively Thereforethe matrix 119885 to be computed is of size (59) and is shown inFigure 4(a) Data dependencies that exist in computing eachelement of 119885 make the parallelization technique difficult toimplement For computation of 119885
34(shown as black lines in
Figure 4(a)) it requires that the computation of1198852311988524 and
11988533
must be completed Based on the sequences119876 and119877 andvalues of 119885
23 11988524 and 119885
33 the value of 119885
34is obtained
The computation of 119885 is more often done serially as shownthrough the grey arrows in Figure 4(a) To parallelize thecomputation of 119885 the adoption of CUDAGPU based tech-niques was considered by researchers [33 41] However theavailability and cost of such computing platforms on publicclouds are an issue Tomaximize resource utilization availablewith theVMworker node atminimal costs in BWASW-PMRwavefront based parallelization technique [42] is consideredThe parallel computation technique adopted is shown by greydiagonal lines in Figure 4(b) Computation of all cells of 119885that fall under the diagonal lines can be done in a parallelizedfashion
8 BioMed Research International
Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference
Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp
The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP
4 Evaluation Experiments
In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available
41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1
Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm
Number of bp considered for alignment
Sequence alignmentmdashexecution time
SWSW-optimized
01
1
10
100
Log
scal
ed ex
ecut
ion
time (
s)
sim5 k sim8 k sim10 k sim15 k
Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths
42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud
BioMed Research International 9
Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud
Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession
number ) Length
1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome V (BK0069392) 576874 bp
2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XI (BK0069442) 666816 bp
3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome X (BK0069432) 745751 bp
4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome II (BK0069362) 813184 bp
5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XVI (BK0069492) 948066 bp
1 2 3 4 5Experiment number
Long read alignmentmdashmakespan time
35
55
75
95
115
135
155
175
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Figure 6 Makespan observations considering 5 long read sequencealignment experiments
43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the
1000 5000 10000Length of reads
05
1015202530354045
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Long read alignmentmdashmakespan time on public cloud
Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments
execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study
Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce
10 BioMed Research International
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
2 4 6 8 10 12 14 16 180Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
5 10 15 20 250Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
5 10 15 20 25 300Makespan time (s)
Bwasw-Cloud (10000 b read)
Bwasw-Cloud (1000 b read)
Bwasw-Cloud (5000 b read)
(a)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 120Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 12 14 16 180Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
nLong read sequence aligner execution considering
5 10 15 200Makespan time (s)
BWASW-PMR (10000 b read)
BWASW-PMR (5000 b read)
BWASW-PMR (1000 b read)
(b)
Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes
function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs
The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results
obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform
5 Comparison Notes withRelated Work Counterparts
In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is
BioMed Research International 11
Table 3 Comparison of BWASW-PMR with its related work counterparts
BWASW-PMR Bwasw-Cloud[34]
MapReduce-BLAST[33] CloudAligner [13] BWA-SW
[10]Long sequence alignment support Yes Yes Yes Yes Yes
Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW
SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes
a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3
6 Conclusion and Future Work
Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long
sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study
In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)
References
[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004
[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics
[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov
[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009
[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012
[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988
12 BioMed Research International
[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997
[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002
[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008
[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010
[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012
[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1
[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011
[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999
[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010
[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004
[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010
[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010
[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014
[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010
[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics
[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us
[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics
[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-
ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014
[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012
[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015
[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005
[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with
MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service
for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009
[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009
[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011
[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014
[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010
[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011
[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014
[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981
[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982
[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000
[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011
[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995
[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml
BioMed Research International 13
[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg
[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
BioMed Research International 7
Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19
Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29
Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39
Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49
Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59
(a)
Z11 Z12 Z13 Z14 Z15 Z16 Z17 Z18 Z19
Z21 Z22 Z23 Z24 Z25 Z26 Z27 Z28 Z29
Z31 Z32 Z33 Z34 Z35 Z36 Z37 Z38 Z39
Z41 Z42 Z43 Z44 Z45 Z46 Z47 Z48 Z49
Z51 Z52 Z53 Z54 Z55 Z56 Z57 Z58 Z59
(b)
Figure 4 Smith-Waterman algorithm (a) conventional computation technique and data dependencies and (b) wavefront based parallelizedtechnique adopted in BWASW-PMR
function obtains the intermediate data (produced by theworker nodes) from the cloud storage The alignment loca-tions obtained from the intermediate data are sorted basedon the offset data The time taken by the 119908th reduce workernode to obtain the intermediate data and perform shuffleand sort operations is represented as tGet Data R The reducefunction in BWASW-PMR is used to aggregate the alignmentlocations The overlapping and redundant alignments areneglected Parallelization of the reduce function is achievedby utilizing all 119901 computing cores available with workernodes Let tFn R119901 represent the time taken to compute thereduce function utilizing the 119901 computing cores The reducefunction provides the alignment results between sequences 119877and 119876 that are written to cloud storage for further analysisLet tStore Data R represent the time taken by the 119908th reduceworker node to write alignment results into the cloud storageThe makespan of the119908th reduce worker node can be definedas
T119908 REDUCE = (tGet Data R +
tFn R119901
+ tStore Data R) (10)
The average makespan of the 119908 reduce worker nodes that isTREDUCE is defined as
TREDUCE =sum119908
119894=1T119894 REDUCE119908
(11)
Using (11) and (9) the total makespan of the BWASW-PMRcan be defined as
T = TVM Config +sum119908
119894=1(T119894 MAP +T
119894 REDUCE)
119908
(12)
From (12) it can be observed thatT (the computing timeor makespan of BWASW-PMR) depends on the computationtime of map and reduce worker nodes The core alignmentsteps (based on the BWA-SW algorithm) are performed by
the map worker nodes that isTREDUCE ≪ TMAP The timetaken to initialize the map and reduce computing clustersbased on VMs that isTVM Config is dependent on the cloudplatform considered for deployment and can be neglectedTherefore it can be stated by the total makespan
T asymp
sum119908
119894=1(T119894 MAP)
119908
(13)
Optimizing the SW algorithm in BWA-SW is a possiblesolution to reduce the total makespanT
33 Smith-Waterman Algorithm Optimization In the SWalgorithm computation of the similarity matrix score thatis 119885 requires the maximum time Adopting a parallelizationtechnique for computation of 119885 is considered in BWASW-PMR Let us consider a query sequence 119876 and referencesequence 119877 whose sizes are 4 and 8 respectively Thereforethe matrix 119885 to be computed is of size (59) and is shown inFigure 4(a) Data dependencies that exist in computing eachelement of 119885 make the parallelization technique difficult toimplement For computation of 119885
34(shown as black lines in
Figure 4(a)) it requires that the computation of1198852311988524 and
11988533
must be completed Based on the sequences119876 and119877 andvalues of 119885
23 11988524 and 119885
33 the value of 119885
34is obtained
The computation of 119885 is more often done serially as shownthrough the grey arrows in Figure 4(a) To parallelize thecomputation of 119885 the adoption of CUDAGPU based tech-niques was considered by researchers [33 41] However theavailability and cost of such computing platforms on publicclouds are an issue Tomaximize resource utilization availablewith theVMworker node atminimal costs in BWASW-PMRwavefront based parallelization technique [42] is consideredThe parallel computation technique adopted is shown by greydiagonal lines in Figure 4(b) Computation of all cells of 119885that fall under the diagonal lines can be done in a parallelizedfashion
8 BioMed Research International
Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference
Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp
The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP
4 Evaluation Experiments
In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available
41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1
Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm
Number of bp considered for alignment
Sequence alignmentmdashexecution time
SWSW-optimized
01
1
10
100
Log
scal
ed ex
ecut
ion
time (
s)
sim5 k sim8 k sim10 k sim15 k
Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths
42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud
BioMed Research International 9
Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud
Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession
number ) Length
1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome V (BK0069392) 576874 bp
2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XI (BK0069442) 666816 bp
3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome X (BK0069432) 745751 bp
4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome II (BK0069362) 813184 bp
5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XVI (BK0069492) 948066 bp
1 2 3 4 5Experiment number
Long read alignmentmdashmakespan time
35
55
75
95
115
135
155
175
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Figure 6 Makespan observations considering 5 long read sequencealignment experiments
43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the
1000 5000 10000Length of reads
05
1015202530354045
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Long read alignmentmdashmakespan time on public cloud
Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments
execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study
Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce
10 BioMed Research International
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
2 4 6 8 10 12 14 16 180Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
5 10 15 20 250Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
5 10 15 20 25 300Makespan time (s)
Bwasw-Cloud (10000 b read)
Bwasw-Cloud (1000 b read)
Bwasw-Cloud (5000 b read)
(a)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 120Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 12 14 16 180Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
nLong read sequence aligner execution considering
5 10 15 200Makespan time (s)
BWASW-PMR (10000 b read)
BWASW-PMR (5000 b read)
BWASW-PMR (1000 b read)
(b)
Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes
function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs
The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results
obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform
5 Comparison Notes withRelated Work Counterparts
In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is
BioMed Research International 11
Table 3 Comparison of BWASW-PMR with its related work counterparts
BWASW-PMR Bwasw-Cloud[34]
MapReduce-BLAST[33] CloudAligner [13] BWA-SW
[10]Long sequence alignment support Yes Yes Yes Yes Yes
Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW
SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes
a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3
6 Conclusion and Future Work
Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long
sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study
In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)
References
[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004
[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics
[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov
[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009
[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012
[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988
12 BioMed Research International
[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997
[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002
[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008
[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010
[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012
[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1
[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011
[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999
[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010
[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004
[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010
[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010
[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014
[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010
[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics
[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us
[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics
[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-
ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014
[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012
[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015
[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005
[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with
MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service
for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009
[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009
[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011
[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014
[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010
[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011
[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014
[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981
[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982
[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000
[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011
[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995
[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml
BioMed Research International 13
[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg
[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
8 BioMed Research International
Table 1 Information of the genome sequences used as queries considering equal section lengths from the Homo sapiens chromosome 15 asreference
Representation Query genome definition (NCBI accession number ) Lengthsim5 k Adelie penguin polyomavirus isolate AdPyV Crozier 2012 (NC 0261412) 4988 bpsim8 k Rhesus monkey papillomavirus (NC 0016781) 8028 bpsim10 k Xenopus laevis endogenous retrovirus Xen1 (NC 0109551) 10207 bpsim15 k Japanese eel endothelial cells-infecting virus (NC 0151231) 15131 bp
The parallelization technique adopted to compute 119885 inSW algorithm enables reducing the makespan of map workernodes that isTMAP
4 Evaluation Experiments
In this section we study the performance of BWASW-PMR cloud platform BWASW-PMR is developed using C++and CNet and is deployed on the Azure cloud platformBWASW-PMR adopts the BWA-SW algorithm with the opti-mization of SWalgorithm Experiments have been conductedto study the performance of the optimized SW algorithmThe performance of BWASW-PMR is comparedwith Bwasw-Cloud to perform long read alignments on the cloud plat-form All genomic data considered for the experiments areobtained from NCBI database [3] that is publically available
41 SW Optimization Analysis Optimization of the SWalgorithm in BWA-SW aligner for BWASW-PMR is a novelapproach considering cloud deployment To analyze perfor-mance of the optimized SW algorithm comparison withthe standard SW algorithm (hereafter referred to as theSW algorithm in this section) available within BWA-SW isconsidered The optimization is achieved using a wavefrontparallelization technique The optimized version of SW ishereafter referred to as ldquoSW-Optimizedrdquo in this section Uni-form SW parameter settings (ie gap penalty matching andmismatching score) are considered for analysis For analysisthe sequences 119877 and119876 of equal lengths are considered Equallength is considered to maximize the computations requiredfor alignments Section of Homo sapiens chromosome 15GRCh38p2 Primary Assembly (NC 00001510) is consid-ered as the reference 119877 Query sequences considered areobtained from the influenza virus database [43] The querysequences considered are summarized in Table 1
Execution time of SW-Optimized and SW algorithm isnoted for the four alignment pairs described The resultsobtained are graphically shown in Figure 4 A logarithmic(Base 10) representation of the execution is considered inFigure 5 As sequence lengths for alignment are increasedthe execution time of SW and SW-Optimized increases Theexecution time of SW-Optimized is lower when comparedto the considered serial SW algorithm All experiments areexecuted on aQuadCore Intel i7machine with 8GB of RAMAn average reduction in the execution time of about 918 isobserved Results obtained prove SW-Optimized consideredin BWASW-PMR outperforms the classical SW algorithm
Number of bp considered for alignment
Sequence alignmentmdashexecution time
SWSW-optimized
01
1
10
100
Log
scal
ed ex
ecut
ion
time (
s)
sim5 k sim8 k sim10 k sim15 k
Figure 5 Execution time (in log scalemdashBase 10) of SW and SW-Optimized considering varied genomic sequence lengths
42 Experiments considering BWASW-PMR Cloud andBwasw-Cloud Single Computing Node To evaluate theperformance of BWASW-PMR comparison with Bwasw-Cloud is considered Deployments of Bwasw-Cloud andBWASW-PMR are considered with one computing node(ie onemap worker and one reduce worker are considered)BWASW-PMR is deployed on the Azure cloud platformBwasw-Cloud is designed using the Hadoop frameworkApache Hadoop amp YARN 240 version is used in thedeployment of the Bwasw-Cloud Uniform configurationsof the computing nodes are considered in the deploymentsBakers yeast genomic database (ie Saccharomyces cerevisiaeS288c) is considered for evaluation [44] Experiments usinga reference genomic sequence and five query sequences ofvaried lengths are considered The experiments conductedwith the reference and query genomic sequences are summa-rized in Table 2 Makespan or total execution time is notedand the results obtained are shown in Figure 6 The resultsobtained prove that the proposed BWASW-PMR cloudaligner deployed on Azure outperforms Bwasw-Clouddeployed onHadoop In experiment 1 the speed-up achievedfor BWASW-PMR is about 45 For longer sequence align-ments that is in experiment 5 the speed-up was observedto be 75 As query length increases the performance ofBWASW-PMR improves An average speed-up of 67 isachieved considering BWASW-PMR when compared to theBwasw-Cloud
BioMed Research International 9
Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud
Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession
number ) Length
1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome V (BK0069392) 576874 bp
2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XI (BK0069442) 666816 bp
3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome X (BK0069432) 745751 bp
4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome II (BK0069362) 813184 bp
5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XVI (BK0069492) 948066 bp
1 2 3 4 5Experiment number
Long read alignmentmdashmakespan time
35
55
75
95
115
135
155
175
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Figure 6 Makespan observations considering 5 long read sequencealignment experiments
43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the
1000 5000 10000Length of reads
05
1015202530354045
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Long read alignmentmdashmakespan time on public cloud
Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments
execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study
Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce
10 BioMed Research International
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
2 4 6 8 10 12 14 16 180Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
5 10 15 20 250Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
5 10 15 20 25 300Makespan time (s)
Bwasw-Cloud (10000 b read)
Bwasw-Cloud (1000 b read)
Bwasw-Cloud (5000 b read)
(a)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 120Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 12 14 16 180Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
nLong read sequence aligner execution considering
5 10 15 200Makespan time (s)
BWASW-PMR (10000 b read)
BWASW-PMR (5000 b read)
BWASW-PMR (1000 b read)
(b)
Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes
function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs
The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results
obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform
5 Comparison Notes withRelated Work Counterparts
In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is
BioMed Research International 11
Table 3 Comparison of BWASW-PMR with its related work counterparts
BWASW-PMR Bwasw-Cloud[34]
MapReduce-BLAST[33] CloudAligner [13] BWA-SW
[10]Long sequence alignment support Yes Yes Yes Yes Yes
Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW
SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes
a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3
6 Conclusion and Future Work
Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long
sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study
In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)
References
[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004
[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics
[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov
[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009
[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012
[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988
12 BioMed Research International
[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997
[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002
[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008
[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010
[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012
[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1
[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011
[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999
[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010
[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004
[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010
[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010
[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014
[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010
[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics
[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us
[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics
[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-
ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014
[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012
[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015
[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005
[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with
MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service
for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009
[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009
[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011
[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014
[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010
[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011
[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014
[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981
[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982
[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000
[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011
[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995
[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml
BioMed Research International 13
[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg
[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
BioMed Research International 9
Table 2 Experiment information considered to compare the performance of BWASW-PMR with the Bwasw-Cloud
Number Reference genome (NCBI accessionnumber) Length Query genome (NCBI accession
number ) Length
1 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome V (BK0069392) 576874 bp
2 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XI (BK0069442) 666816 bp
3 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome X (BK0069432) 745751 bp
4 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome II (BK0069362) 813184 bp
5 TPA inf Saccharomyces cerevisiae S288cchromosome IV (BK0069382) 1531933 bp TPA inf Saccharomyces cerevisiae
S288c chromosome XVI (BK0069492) 948066 bp
1 2 3 4 5Experiment number
Long read alignmentmdashmakespan time
35
55
75
95
115
135
155
175
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Figure 6 Makespan observations considering 5 long read sequencealignment experiments
43 Experiments considering BWASW-PMR Cloud andBwasw-Cloud on Azure This section discusses public clouddeployments and performance analysis of BWASW-PMRagainst Bwasw-Cloud Deployment of BWASW-PMR on theAmazon cloud andAzure cloud is possible Here deploymentof BWASW-PMR on the Azure cloud is considered andpresented The deployed BWASW-PMR considers A3 VMinstances Each A3 VM instance consists of 4 computingcores 7 GB of RAM and 120GB of local hard drive spaceThe deployment of Azure cloud consists of one master nodeand four worker nodes HDInsight enables the deploymentand provisioning of Apache Hadoop clusters on the Azurecloud platform [45] Apache Hadoop amp YARN version 260is considered in the deployment of Bwasw-Cloud Hadoopcluster of Bwasw-Cloud consists of 4 worker nodes of A3 VMinstances and one master nodeHomo sapiens chromosome 1(NC 00000111) consisting of approximately 250 million bpis considered for evaluations Overlapping of 10 k is consid-ered The query sequence segments are obtained from theHomo sapiens chromosome 1 segment reads The length ofthe queries is varied (based on the read lengths) in terms of1000 bp 5000 bp and 10000 bp A total of 150 reads per lengthare considered Generated log files obtained after the
1000 5000 10000Length of reads
05
1015202530354045
Mak
espa
n tim
e (s)
BWASW-PMRBwasw-Cloud
Long read alignmentmdashmakespan time on public cloud
Figure 7 BWASW-PMR and Bwasw-Cloud makespan time com-parisons for 150 reads of varied length considering the Azure publiccloud deployments
execution are studied to derive observations and resultsThe total makespan observed that is T is shown inFigure 7The results prove that BWASW-PMR exhibits lowermakespan time when compared to Bwasw-Cloud As lengthsof sequence reads increase execution time increases dueto the increase in alignment length sequences consideredLong sequence alignment considering BWASW-PMR gainsa speed-up of 133 when compared to the Bwasw-Cloudaligner The parallel executions of map and reduce phasesalong with SW optimization are the main contributingfactors to the speed-up observed in this study
Detailed analysis of the experimental data reveals parallelexecution of map function and SW optimization aid inreducing map makespans The major alignment computa-tions are carried out in the map phase considering bothBWASW-PMR and Bwasw-Cloud An average reduction inthe map phase makespan across all the experiments achievedby BWASW-PMR against Bwasw-Cloud is about 30 Theaccumulation of the results that is aggregation of the align-ment locations is carried out in the reduce phase No majorcomputation is carried out in the reduce phase of BWASW-PMR and Bwasw-Cloud The parallelization of the reduce
10 BioMed Research International
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
2 4 6 8 10 12 14 16 180Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
5 10 15 20 250Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
5 10 15 20 25 300Makespan time (s)
Bwasw-Cloud (10000 b read)
Bwasw-Cloud (1000 b read)
Bwasw-Cloud (5000 b read)
(a)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 120Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 12 14 16 180Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
nLong read sequence aligner execution considering
5 10 15 200Makespan time (s)
BWASW-PMR (10000 b read)
BWASW-PMR (5000 b read)
BWASW-PMR (1000 b read)
(b)
Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes
function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs
The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results
obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform
5 Comparison Notes withRelated Work Counterparts
In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is
BioMed Research International 11
Table 3 Comparison of BWASW-PMR with its related work counterparts
BWASW-PMR Bwasw-Cloud[34]
MapReduce-BLAST[33] CloudAligner [13] BWA-SW
[10]Long sequence alignment support Yes Yes Yes Yes Yes
Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW
SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes
a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3
6 Conclusion and Future Work
Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long
sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study
In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)
References
[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004
[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics
[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov
[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009
[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012
[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988
12 BioMed Research International
[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997
[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002
[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008
[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010
[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012
[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1
[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011
[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999
[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010
[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004
[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010
[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010
[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014
[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010
[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics
[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us
[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics
[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-
ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014
[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012
[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015
[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005
[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with
MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service
for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009
[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009
[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011
[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014
[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010
[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011
[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014
[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981
[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982
[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000
[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011
[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995
[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml
BioMed Research International 13
[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg
[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
10 BioMed Research International
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
2 4 6 8 10 12 14 16 180Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
5 10 15 20 250Makespan time (s)
Long read sequence aligner execution considering
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
5 10 15 20 25 300Makespan time (s)
Bwasw-Cloud (10000 b read)
Bwasw-Cloud (1000 b read)
Bwasw-Cloud (5000 b read)
(a)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 120Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
n
Long read sequence aligner execution considering
2 4 6 8 10 12 14 16 180Makespan time (s)
Reduce worker 1Reduce worker 2Reduce worker 3Reduce worker 4
Map worker 1Map worker 2Map worker 3Map worker 4
Task
exec
utio
nLong read sequence aligner execution considering
5 10 15 200Makespan time (s)
BWASW-PMR (10000 b read)
BWASW-PMR (5000 b read)
BWASW-PMR (1000 b read)
(b)
Figure 8 Long read sequence alignment execution makespans of the map and reduce worker nodes (a) Bwasw-Cloud on Hadoop clusterof 4 nodes (b) BWASW-PMR on Azure cluster of 4 nodes
function adopted in BWASW-PMR achieves an averagereduction of 93 in TREDUCE when compared to Bwasw-Cloud To prove efficiency in adopting a parallel map andreduce execution environment for the BWASW-PMR mak-espans of the tasks executed at the map and reduce workernodes are noted The results obtained are shown in Figure 8In Figure 8(a) the makespans of the worker nodes in Bwasw-Cloud deployed on the Hadoop cluster are shown Themakespans of the worker nodes in BWASW-PMR deployedon Azure are shown in Figure 8(b) From Figure 8 it isclear that the uniform tasks executed on BWASW-PMRexhibit lower makespans when compared to the execution onBwasw-Cloud In Figure 8(a) the reduce phase is initializedwhen all the map workers have completed their jobs InBWASW-PMR the reduce workers are running in parallelThe reduce phase is initiated when one or more of the mapworker nodes have completed their jobs
The SW optimization considered in BWASW-PMRreduces execution time of sequences to be aligned based onthe SW algorithm The experimental study and the results
obtained prove that BWASW-PMR is capable of aligning longsequences using the cloud computing platform
5 Comparison Notes withRelated Work Counterparts
In this section a comparison of BWASW-PMR with exist-ing cloud aligners for long sequence reads is presentedComparisons with Bwasw-Cloud [34] MapReduce-BLAST[33] CloudAligner [13] and the BWA-SW [10] alignersare considered The BWA-SW sequence aligner [10] forlong reads is memory hungry (high RAM requirements)Moreover execution on the cloud platform is not consideredTheCloudAligner [13] adopts the seed-and-extend algorithmfor sequence alignment The BWA-SW and BLAST alsoadopt similar approach Though the seed-and-extend basedaligners are fast they suffer from low accuracy [10] and aremore suitable for short read alignments To achieve accuratealignment results considering long reads the SW algorithmis adopted in the BWA-SW MapReduce-BLAST [33] is
BioMed Research International 11
Table 3 Comparison of BWASW-PMR with its related work counterparts
BWASW-PMR Bwasw-Cloud[34]
MapReduce-BLAST[33] CloudAligner [13] BWA-SW
[10]Long sequence alignment support Yes Yes Yes Yes Yes
Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW
SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes
a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3
6 Conclusion and Future Work
Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long
sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study
In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)
References
[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004
[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics
[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov
[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009
[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012
[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988
12 BioMed Research International
[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997
[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002
[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008
[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010
[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012
[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1
[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011
[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999
[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010
[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004
[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010
[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010
[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014
[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010
[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics
[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us
[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics
[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-
ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014
[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012
[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015
[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005
[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with
MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service
for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009
[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009
[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011
[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014
[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010
[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011
[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014
[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981
[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982
[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000
[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011
[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995
[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml
BioMed Research International 13
[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg
[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
BioMed Research International 11
Table 3 Comparison of BWASW-PMR with its related work counterparts
BWASW-PMR Bwasw-Cloud[34]
MapReduce-BLAST[33] CloudAligner [13] BWA-SW
[10]Long sequence alignment support Yes Yes Yes Yes Yes
Alignment algorithm adopted BWA-SW BWA-SW Blast Seed-and-extendalgorithm BWA-SW
SW for accurate alignment Yes Yes No No YesCloud MapReduce platform execution support Yes Yes Yes Yes NoMapReduce platform considered Custom Hadoop Hadoop Hadoop NoSW optimization Yes No No No YesParallelization of MapReduce phases Yes No No No NoAccuracy Yes Yes No No Yes
a parallelized version of BLAST using MapReduce Bwasw-Cloud bears the closest similarity to our work and is used forcomparison in the experimental study presented here Cloud-based aligners that is [13 33 34] consider the Hadoopframework for deployment The drawbacks of the Hadoopframework have been discussed in the first section of thepaper To overcome these drawbacks BWASW-PMR hasbeen proposed in this paper Comparison of BWASW-PMRwith its related work counterparts is summarized in Table 3
6 Conclusion and Future Work
Sequencing and analyzing of genomic data are importantprocesses in bioinformatics Rapid developments of the NGStechnologies produce humongous amount of genomic dataFor analysis of genomic data alignment tools are used Thealignment tools are classified as short read type and long readtype The existing sequence aligners predominantly considershort read genomic sequences and lacking of support incloud environment These aligners exhibit deficiencies in thealignment of long sequence genomic data that are currentlygenerated using NGS technologies In NGS sequencers usersare required to wait for sufficient resources to becomeavailable in which the time needed to complete processingbecomes unpredictable More data is increasingly being gen-erated which leads to serious issues in storing and processingThe existing long read aligners that adopt the cloud platformfor computation suffer from drawbacks that are discussed inthis paper In this paper we combined cloud infrastructureand MapReduce framework together as a solution to supportlong read sequence alignment We proposed BWASW-PMRcloud platform to align long read sequences The BWA-SW algorithm is adopted for long sequence alignment inBWASW-PMR cloud platform Optimization of the SW algo-rithm and a Parallel MapReduce execution strategy are con-sidered in BWASW-PMR Parallel execution of the map andreduce functions is adopted to maximize resource uti-lization of the VM based cloud computing platform andreduce makespans This paper highlights comparison ofthe proposed BWASW-PMR with the existing systems forlong sequence alignments The experiments presented haveproven the efficiency of the optimized SW algorithm More-over comparison with state-of-the-art Bwasw-Cloud for long
sequence alignment is presented throughout the experimen-tal study The results obtained indicate significant improve-ment considering BWASW-PMR when compared to Bwasw-Cloud Our proposed BWASW-PMR is of use to the genomiccommunity to support the required computations for longsequence alignment efficientlyTheparallel executions ofmapand reduce phases along with SW optimization are the maincontributing factors to the speed-up observed in this study
In the future we propose to undertake optimization of theBWA-SW algorithm due to its memory hungry nature andalso accelerating BWA-MEM algorithm of Burrows Wheeleraligner [12] on different platform
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work was supported by the National Research Founda-tion of Korea (NRF) grant funded by the Korea government(MEST) (no NRF-2013R1A1A2013401)
References
[1] D W Mount Bioinformatics Sequence and Genome AnalysisCold Spring Harbor Laboratory Press Cold Spring Harbor NYUSA 2004
[2] Google Cloud Genomics Platform 2015 httpscloudgooglecomgenomics
[3] National Center for Biotechnology Information 2015 httpwwwncbinlmnihgov
[4] H Li and R Durbin ldquoFast and accurate short read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 25 no14 pp 1754ndash1760 2009
[5] B Langmead and S L Salzberg ldquoFast gapped-read alignmentwith Bowtie 2rdquo Nature Methods vol 9 no 4 pp 357ndash359 2012
[6] W R Pearson and D J Lipman ldquoImproved tools for biologicalsequence comparisonrdquo Proceedings of the National Academy ofSciences of the United States of America vol 85 no 8 pp 2444ndash2448 1988
12 BioMed Research International
[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997
[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002
[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008
[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010
[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012
[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1
[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011
[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999
[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010
[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004
[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010
[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010
[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014
[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010
[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics
[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us
[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics
[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-
ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014
[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012
[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015
[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005
[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with
MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service
for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009
[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009
[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011
[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014
[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010
[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011
[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014
[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981
[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982
[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000
[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011
[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995
[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml
BioMed Research International 13
[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg
[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
12 BioMed Research International
[7] S F Altschul T L Madden A A Schaffer et al ldquoGappedBLAST and PSI-BLAST a new generation of protein databasesearch programsrdquo Nucleic Acids Research vol 25 no 17 pp3389ndash3402 1997
[8] W J Kent ldquoBLATmdashthe BLAST-like alignment toolrdquo GenomeResearch vol 12 no 4 pp 656ndash664 2002
[9] R Li Y Li K Kristiansen and J Wang ldquoSOAP short oligonu-cleotide alignment programrdquo Bioinformatics vol 24 no 5 pp713ndash714 2008
[10] H Li and R Durbin ldquoFast and accurate long-read alignmentwith Burrows-Wheeler transformrdquo Bioinformatics vol 26 no5 pp 589ndash595 2010
[11] Y Liu and B Schmidt ldquoLong read alignment based onmaximalexact match seedsrdquo Bioinformatics vol 28 no 18 pp i318ndashi3242012
[12] H Li ldquoAligning sequence reads clone sequences and assemblycontigs with BWA-MEMrdquo httparxivorgabs13033997v1
[13] T NguyenW Shi andD Ruden ldquoCloudAligner a fast and full-featured MapReduce based tool for sequence mappingrdquo BMCResearch Notes vol 4 article 171 2011
[14] M Bakery and R Buyyaz ldquoCluster computing at a glancerdquo inHigh Performance Cluster Computing Architectures and SystemR Buyyaz Ed pp 3ndash47 Prentice-Hall Upper Saddle River NJUSA 1999
[15] M C Schatz B Langmead and S L Salzberg ldquoCloud comput-ing and the DNA data racerdquoNature Biotechnology vol 28 no 7pp 691ndash693 2010
[16] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessing on large clustersrdquo in Proceedings of the 6th Symposiumon Operating SystemDesign and Implementation (OSDI rsquo04) pp137ndash150 San Francisco Calif USA December 2004
[17] M Zaharia M Chowdhury M J Franklin S Shenker and IStoica ldquoSpark cluster computingwithworking setsrdquo inProceed-ings of the 2nd USENIX Conference on Hot topics in Cloud Com-puting Boston Mass USA June 2010
[18] G Malewicz M H Austern A J C Bik et al ldquoPregel a systemfor large-scale graph processingrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo10) pp 135ndash145 ACM Indianapolis Ind USA June2010
[19] P C Church and A M Goscinski ldquoA survey of cloud-basedservice computing solutions for mammalian genomicsrdquo IEEETransactions on Services Computing vol 7 no 4 pp 726ndash7402014
[20] L D Stein ldquoThe case for cloud computing in genome informat-icsrdquo Genome Biology vol 11 no 5 article 207 2010
[21] Amazon Web Services Genomics 2015 httpsawsamazoncomhealthgenomics
[22] Microsoft Azure Cloud Computing Platform amp Services 2015httpsazuremicrosoftcomen-us
[23] Answers to genome analysis may be in the clouds 2015 httpscloudgooglecomgenomics
[24] Galaxy 2015 httpsusegalaxyorg[25] RKMadduriD Sulakhe L Lacinski et al ldquoExperiences build-
ing Globus Genomics a next-generation sequencing analysisservice using Galaxy Globus and Amazon Web ServicesrdquoConcurrency Computation Practice and Experience vol 26 no13 pp 2266ndash2279 2014
[26] J Zhang W Zhang H Wu and T Huang ldquoVMFDF a virtual-ization-based multi-level fault detection framework for highavailability computingrdquo in Proceedings of the IEEE 9th Inter-national Conference on e-Business Engineering (ICEBE rsquo12) pp367ndash373 Hangzhou China September 2012
[27] CWeng J Zhan and Y Luo ldquoTSAC enforcing isolation of vir-tual machines in cloudsrdquo IEEE Transactions on Computers vol64 no 5 pp 1470ndash1482 2015
[28] J E Smith and R NairVirtual Machines Versatile Platforms forSystems and Processes Elsevier New York NY USA 2005
[29] Hadoop httphadoopapacheorg[30] M C Schatz ldquoCloudBurst highly sensitive read mapping with
MapReducerdquo Bioinformatics vol 25 no 11 pp 1363ndash1369 2009[31] Y Li and S Zhong ldquoSeqMapReduce software and web service
for accelerating sequence mappingrdquo in Proceedings of the Criti-cal Assessment of Massive Data Anaysis (CAMDA rsquo09) ChicagoIll USA 2009
[32] B Langmead M C Schatz J Lin M Pop and S L SalzbergldquoSearching for SNPs with cloud computingrdquo Genome Biologyvol 10 article R134 2009
[33] X-L Yang Y-L Liu C-F Yuan and Y-H Huang ldquoParalleliza-tion of BLAST with MapReduce for long sequence alignmentrdquoin Proceedings of the 4th International Symposium on ParallelArchitectures Algorithms and Programming (PAAP rsquo11) pp 241ndash246 IEEE Tianjin China December 2011
[34] M Sun X Zhou F Yang K Lu andD Dai ldquoBwasw-cloud effi-cient sequence alignment algorithm for two big data withMapReducerdquo in Proceedings of the 5th International Conferenceon the Applications of Digital Information andWeb Technologies(ICADIWT rsquo14) pp 213ndash218 IEEE Bangalore India February2014
[35] J Ekanayake H Li B Zhang et al ldquoTwister a runtime for iter-ative MapReducerdquo in Proceedings of the 19th ACM Interna-tional Symposium on High Performance Distributed Computing(HPDC rsquo10) pp 810ndash818 Chicago Ill USA June 2010
[36] D Jiang A K H Tung and G Chen ldquoMAP-JOIN-REDUCEtoward scalable and efficient data analysis on large clustersrdquoIEEE Transactions on Knowledge and Data Engineering vol 23no 9 pp 1299ndash1311 2011
[37] D Dahiphale R Karve A V Vasilakos et al ldquoAn advancedMapReduce cloud MapReduce enhancements and applica-tionsrdquo IEEE Transactions on Network and Service Managementvol 11 no 1 pp 101ndash115 2014
[38] T F Smith and M S Waterman ldquoIdentification of commonmolecular subsequencesrdquo Journal of Molecular Biology vol 147no 1 pp 195ndash197 1981
[39] O Gotoh ldquoAn improved algorithm for matching biologicalsequencesrdquo Journal ofMolecular Biology vol 162 no 3 pp 705ndash708 1982
[40] P Ferragina and G Manzini ldquoOpportunistic data structureswith applicationsrdquo in Proceedings of the 41st Annual Sympo-sium on Foundations of Computer Science pp 390ndash398 IEEERedondo Beach Calif USA November 2000
[41] P D Vouzis and N V Sahinidis ldquoGPU-BLAST using graphicsprocessors to accelerate protein sequence alignmentrdquo Bioinfor-matics vol 27 no 2 Article ID btq644 pp 182ndash188 2011
[42] G F Pfister In Search of Clusters The Coming Battle in LowlyParallel Computing Prentice Hall 1995
[43] Influenza Virus Resource 2015 httpwwwncbinlmnihgovgenomesFLUFLUhtml
BioMed Research International 13
[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg
[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
BioMed Research International 13
[44] Saccharomyces genome database (SGD) 2015 httpwwwyeastgenomeorg
[45] V K Baheti ldquoWindows azure HDInsight where big data meetsthe cloudrdquo in Proceedings of the Conference on IT in BusinessIndustry and Government (CSIBIG rsquo14) pp 1ndash2 Indore IndiaMarch 2014
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology