Research Article
An Efficient Parallel Algorithm for Multiple Sequence Similarities Calculation Using a Low Complexity Method
Evandro A. Marucci,1 Geraldo F. D. Zafalon,1 Julio C. Momente,1
Leandro A. Neves,1 Carlo R. Valêncio,1 Alex R. Pinto,2 Adriano M. Cansian,1
Rogeria C. G. de Souza,1 Yang Shiyou,3 and José M. Machado1
1 Department of Computer Science and Statistics, Sao Paulo State University, Rua Cristovao Colombo 2265, 15054-000 Sao Jose do Rio Preto, SP, Brazil
2 Department of Control Engineering and Automation, Federal University of Santa Catarina, Rua Pomerode 710, 89065-300 Blumenau, SC, Brazil
3 College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China
Correspondence should be addressed to Geraldo F. D. Zafalon; zafalon@sjrp.unesp.br
Received 12 February 2014; Accepted 2 July 2014; Published 22 July 2014
Academic Editor: Tzong-Yi Lee
Copyright © 2014 Evandro A. Marucci et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
With the advance of genomic research, the number of sequences involved in comparative methods has grown immensely. Among them are methods for similarity calculation, which are used by many bioinformatics applications. Due to the huge amount of data, combining low complexity methods with parallel computing is becoming desirable. The k-mers counting method is very efficient and gives good biological results. In this work, we propose a parallel algorithm for multiple sequence similarities calculation using the k-mers counting method. Tests show that the algorithm scales very well and achieves a nearly linear speedup: a 12x speedup was obtained with 14 nodes. This algorithm can be used in the parallelization of some multiple sequence alignment tools, such as MAFFT and MUSCLE.
1. Introduction
The use of sequence comparison methods has grown remarkably in recent years, in response to the data expansion of genomic research. Consequently, methods that reduce the execution time are fundamental to the progress of this area. Many efforts have been made concerning their optimization [1, 2].
With the increase in the number of multiple sequence alignment problems, the development of methods with lower computational complexity has also increased. Nevertheless, low complexity methods alone were not enough to handle high volumes of data [3, 4]. Interest in parallel computing has grown and, hence, several parallel methods were created and embedded in many sequence comparison tools [5].
In general, sequence comparison starts with the similarity calculation between pairs of sequences. This is done by methods which may or may not require a prior alignment, varying in algorithmic complexity and level of biological accuracy. The alignment-free methods form a class of low complexity methods, with low time and space complexity in relation to the methods which require a prior alignment. Moreover, they also keep a high level of biological accuracy for divergent sequences. For this reason, they are becoming increasingly important in computational biology.
The similarity calculation methods can be classified into two main categories: methods based on the counting of words and methods that do not involve such counting. Among the methods based on the counting of words is the k-mers counting method, which was proposed by Katoh et al. [6]. This method counts the number of k-mers (words of size k) shared by a pair of sequences, using it as an approximation of the similarity level. It also uses a different alphabet, built on the basis of statistical data, maximizing the biological accuracy in relation to the previously proposed methods.

Hindawi Publishing Corporation, BioMed Research International, Volume 2014, Article ID 563016, 6 pages, http://dx.doi.org/10.1155/2014/563016
The k-mers counting method was implemented in some important multiple sequence alignment tools, such as MAFFT [6, 7] and MUSCLE [1], and to the extent of our knowledge no other work in the literature has developed and tested a parallel approach specifically for this method. Since these are important tools, giving excellent results in both biological accuracy and computational complexity [8], the development of a parallel algorithm for multiple sequence similarities calculation using the k-mers counting method is very useful. Although a parallel MUSCLE exists for shared memory systems [3], it does not include the parallelization of this stage. The parallel algorithm can be used in the development of any parallel tool that requires it in one of its stages. Tree construction and progressive multiple alignment tools are some examples.
This paper presents the k-mers counting method and the proposed parallel algorithm for multiple sequence similarity calculation based on it. The algorithm was developed for distributed memory parallel systems, using the MPI library [9].
2. Materials and Methods
2.1. Word Counting Methods. In general, word counting methods start with the mapping from sequences to vectors which store the number of occurrences of each word. These words are subsequences of length $k$, also known as k-tuples.
In order to understand the behavior of these methods, it is useful to review some statistical concepts about words. Consider a sequence $X$, of length $n$, defined as a segment of $n$ symbols of a finite alphabet $A$, of length $r$.
A segment of $k$ symbols, with $k \le n$, is a k-tuple. The set $W_k$ consists of all possible k-tuples from the alphabet set $A$ and has $N$ elements:
$$W_k = \{w_{k,1}, w_{k,2}, \ldots, w_{k,N}\}, \qquad N = r^k. \tag{1}$$
Then, we count the number of k-tuples of $W_k$ which appear in the sequence $X$. Computationally, this count is normally made by moving a window of size $k$ through the sequence, from position 1 to position $n - k + 1$. The vector $c_k^X$ stores the number of occurrences of each k-tuple in the sequence $X$:

$$c_k^X = (c_{k,1}^X, c_{k,2}^X, \ldots, c_{k,N}^X). \tag{2}$$
A frequency vector $f_k^X$ can then be obtained from the relative quantity of each k-tuple:

$$f_k^X = \frac{c_k^X}{\sum_{j=1}^{N} c_{k,j}^X} \equiv f_{k,i}^X = \frac{c_{k,i}^X}{n - k + 1}. \tag{3}$$
As an example of the use of these structures, imagine a DNA sequence, where $A = \{A, T, C, G\}$ and $r = 4$. For $k = 3$, ATC and AAA are k-tuples belonging to the set $W_3$.
Table 1: Examples of compressed alphabets.

Alphabet (N)    Classes
SE-B (14)       A, C, D, EQ, FY, G, H, IV, KR, LM, N, P, ST, W
SE-B (10)       AST, C, DN, EQ, FY, G, HW, ILMV, KR, P
SE-V (10)       AST, C, DEN, FY, G, H, ILMV, KQR, P, W
Li-A (10)       AC, DE, FWY, G, HN, IV, KQR, LM, P, ST
Li-B (10)       AST, C, DEQ, FWY, G, HN, IV, KR, LM, P
Solis-D (10)    AM, C, DNS, EKQR, F, GP, HT, IV, LY, W
Solis-G (10)    AEFIKLMQRVW, C, D, G, H, N, P, S, T, Y
Murphy (10)     A, C, DENQ, FWY, G, H, ILMV, KR, P, ST
SE-B (8)        AST, C, DHN, EKQR, FWY, G, ILMV, P
SE-B (6)        AST, CP, DEHKNQR, FWY, G, ILMV
Dayhoff (6)     AGPST, C, DENQ, FWY, HKR, ILMV
For the sequence $X$ = ATATAC, where $n = 6$, the counting and frequency vectors ($c_3^X$ and $f_3^X$, resp.) are constructed from the k-tuples of $W_3$ which are identified in the sequence $X$. The sequence $X$ is traversed with a window of size $k = 3$. The word within each window is compared with the words of $W_3$. In this case, $n - k + 1 = 4$ comparisons are necessary (ATA, TAT, ATA, TAC):

$$\begin{aligned}
W_3 &= \{\mathrm{ATA}, \mathrm{TAT}, \mathrm{TAC}, \mathrm{AAA}, \ldots\}, \\
c_3^X &= (2, 1, 1, 0, \ldots), \\
f_3^X &= (0.5, 0.25, 0.25, 0, \ldots).
\end{aligned} \tag{4}$$
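The sliding-window count described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names `count_ktuples` and `frequency_vector` are hypothetical, and the counts are kept in a sparse dictionary holding only the k-tuples actually observed, rather than a full vector with one entry per element of the k-tuple set.

```python
from collections import Counter

def count_ktuples(sequence, k):
    """Slide a window of size k over the sequence and count each k-tuple."""
    n = len(sequence)
    return Counter(sequence[i:i + k] for i in range(n - k + 1))

def frequency_vector(counts, n, k):
    """Divide each count by the number of windows, n - k + 1."""
    windows = n - k + 1
    return {word: c / windows for word, c in counts.items()}

X = "ATATAC"
k = 3
counts = count_ktuples(X, k)
freqs = frequency_vector(counts, len(X), k)
print(dict(counts))  # {'ATA': 2, 'TAT': 1, 'TAC': 1}
print(freqs)         # {'ATA': 0.5, 'TAT': 0.25, 'TAC': 0.25}
```

A k-tuple that never occurs, such as AAA, is simply absent from the dictionary; `counts["AAA"]` still evaluates to 0 because `Counter` returns zero for missing keys.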
2.2. The k-mers Counting Method. In the k-mers counting method, we use the term k-mer to represent the words, or k-tuples. This method is considerably faster than conventional methods, which require alignment [10]. Its algorithm, implemented to determine the number of k-mers shared by two sequences, is $O(n)$ for sequences of size $n$. In contrast, the conventional algorithms, which require alignment, are $O(n^2)$.
This algorithm generally uses a slightly different alphabet. In most cases, the alphabet used is a variation of the default alphabet. Known as compressed alphabets, these alphabets contain symbols that denote classes corresponding to two or more different types of residues (each residue is represented by a letter).
For amino acid sequences, a compressed alphabet $C$ of size $N$ is a partition of the default amino acid alphabet $A$, which contains 20 letters, into $N$ disjoint classes containing similar amino acids. Table 1, extracted from [10], shows some examples of compressed alphabets.
With the use of compressed alphabets the identity is highly conserved. Pairs of related sequences always have a greater or equal identity and, therefore, more k-mers in common in an alphabet smaller than the default alphabet. An example of this characteristic can be seen in Table 2.
[Figure 1 content: seven sequences and three slave processors. Values computed by other slaves are shown in gray in each panel.
Processor 1: lines 1, 4, and 7; SimVect = {80, 18, 59, 33, 89, 21, 34, 52, 71, 12, 43, 90}
Processor 2: lines 2 and 5; SimVect = {14, 92, 21, 39, 39, 32, 84}
Processor 3: lines 3 and 6; SimVect = {25, 77, 91, 42, 14, 61, 26, 35, 92}]
Figure 1: Example of how the similarities matrix calculation is distributed between the slaves.

Table 2: Comparison between the use of the alphabet A and the compressed alphabet SE-V (10).

Using the default amino acids alphabet
Seq1: sAaNiLvGEnlvcKvaDFGLARl
Seq2: aArNiLvGEnyicKvaDFGLARl
Seq3: aArNvLiGEdnvaKicDFGLARv
       * * * **    *  ******

Using the compressed alphabet SE-V (10) (each class member is represented by the first letter in alphabetical order)
Seq1: AAaNILIGENlIcKIaDFGLARI
Seq2: AArNILIGENyIcKIaDFGLARI
Seq3: AArNILIGENnIaKIcDFGLARI
      ** ******* * ** *******

In Table 2, the upper and lower alignments are the same. The difference is that the first uses A, the default amino acid alphabet, while the second uses the compressed alphabet SE-V(10), whose class members are represented by their first letters in alphabetical order; I, L, M, and V are shown as I, for example. The columns which are fully conserved are indicated with capital letters and with an asterisk below. The number of conserved columns ($k = 1$) increased from 12 in A to 19 in SE-V(10). For $k = 3$, the number of fully conserved k-mers increased from 4 in A to 10 in SE-V(10) and, for $k = 4$, from 3 to 8.
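The counts quoted for Table 2 can be reproduced with a short Python sketch. It assumes the SE-V(10) classes exactly as listed in Table 1 and maps every residue to the alphabetically first letter of its class; the names `compress` and `conserved_kmers` are hypothetical helpers introduced here, not part of any tool mentioned in the paper.

```python
# SE-V(10) classes as listed in Table 1; each residue is mapped to the
# alphabetically first letter of its class (assumption taken from the
# description of Table 2).
SE_V_10 = ["AST", "C", "DEN", "FY", "G", "H", "ILMV", "KQR", "P", "W"]
COMPRESS = {res: min(cls) for cls in SE_V_10 for res in cls}

def compress(seq):
    """Rewrite a protein sequence in the compressed alphabet."""
    return "".join(COMPRESS[res] for res in seq.upper())

def conserved_kmers(alignment, k):
    """Count windows of k consecutive columns identical in all rows."""
    conserved = [len(set(col)) == 1 for col in zip(*alignment)]
    return sum(all(conserved[i:i + k]) for i in range(len(conserved) - k + 1))

# Ungapped alignment from Table 2 (letter case only marks conservation there).
seqs = [s.upper() for s in (
    "sAaNiLvGEnlvcKvaDFGLARl",
    "aArNiLvGEnyicKvaDFGLARl",
    "aArNvLiGEdnvaKicDFGLARv",
)]
compressed = [compress(s) for s in seqs]

for k in (1, 3, 4):
    print(k, conserved_kmers(seqs, k), conserved_kmers(compressed, k))
# 1 12 19
# 3 4 10
# 4 3 8
```

The compressed strings produced here may use a different representative letter than the one printed in Table 2 for columns that were already conserved; the counts depend only on which residues share a class, not on the letter chosen.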
The choice of the alphabet and the value of $k$ is based on statistics and has a strong impact on the number of conserved identities. If the alphabet is well selected, meaning that residue replacements within the same class are highly probable and replacements between distinct classes are improbable, then we probably have an increase in the number of identities detected. Moreover, the value of $k$ confines this increase to regions of continuous identity. As the sequences diverge, the number of conserved k-mers is reduced, reaching a limit comparable to the number expected for unrelated sequences. The use of compressed alphabets increases the likelihood that this limit is reached at a greater evolutionary distance. Careful choices of the size of this alphabet and the value of $k$ can provide a better measure of similarity.
The following equation shows how we calculate the similarity between sequences $X$ and $Y$ by the k-mers counting method:

$$F(X, Y) = \frac{\sum_{\tau} \min[n_X(\tau), n_Y(\tau)]}{\min(L_X, L_Y) - k + 1}. \tag{5}$$

Here $\tau$ is a k-mer, $L_X$ and $L_Y$ are the sequence lengths, and $n_X(\tau)$ and $n_Y(\tau)$ are the number of times $\tau$ appears in $X$ and $Y$, respectively.
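Equation (5) translates almost directly into code. The following Python sketch is one possible reading of it, with hypothetical function names (`kmer_counts`, `kmer_similarity`); it works on plain strings and does not apply any compressed alphabet.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Number of occurrences of every k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_similarity(x, y, k):
    """F(X, Y) from equation (5): shared k-mers over the maximum possible."""
    if min(len(x), len(y)) < k:
        raise ValueError("both sequences must contain at least k symbols")
    nx, ny = kmer_counts(x, k), kmer_counts(y, k)
    # For each k-mer of X, take the smaller of its counts in X and Y.
    shared = sum(min(count, ny[kmer]) for kmer, count in nx.items())
    return shared / (min(len(x), len(y)) - k + 1)

print(kmer_similarity("ATATAC", "ATAC", 3))    # 1.0
print(kmer_similarity("ATATAC", "GGGGGG", 3))  # 0.0
```

In the first call the shorter sequence has two 3-mers (ATA and TAC) and both occur in the longer one, so the similarity reaches its maximum of 1. Each sequence is scanned once, which matches the linear cost stated for the method.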
2.3. Multiple Sequence Similarities Calculation. In any application that compares multiple sequences, either by multiple alignment or just by the construction of phylogenetic trees, the similarity calculation is performed on many independent pairs of sequences. Thus, the $F$ value is obtained for all pairs of sequences involved in the processing. As $F(X, Y)$ equals $F(Y, X)$, each value is calculated only once, by two nested loops, as

    for (i = 1; i < num_seq; ++i)
        for (j = 0; j < i; ++j)
            M[i,j] = F(i,j)

All obtained values of $F$ are, therefore, stored in a triangular matrix $M$.
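As a sequential baseline, the two nested loops can be written as the Python sketch below. The helper names and the toy sequences are assumptions made for illustration; the similarity function is a compact version of equation (5) and expects sequences of at least k symbols.

```python
from collections import Counter

def kmer_similarity(x, y, k):
    """Equation (5); sequences are assumed to have at least k symbols."""
    nx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    ny = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    shared = sum((nx & ny).values())  # Counter '&' keeps the minimum counts
    return shared / (min(len(x), len(y)) - k + 1)

def similarity_matrix(sequences, k):
    """Lower triangular matrix: row i holds F(i, j) for every j < i."""
    return [[kmer_similarity(sequences[i], sequences[j], k) for j in range(i)]
            for i in range(len(sequences))]

seqs = ["ATATAC", "ATAC", "GGGGGG"]  # toy data, not taken from the paper
M = similarity_matrix(seqs, 3)
print(M)  # [[], [1.0], [0.0, 0.0]]
```

For n sequences this computes n(n - 1)/2 values, so the work grows quadratically with the number of sequences even though each individual comparison is linear in the sequence length.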
2.4. Parallel Algorithm. In an application, the number of sequences under comparison can be huge, making its execution unfeasible on a single machine, even with low computational complexity methods such as the k-mers counting method. For this reason, we propose a parallel algorithm that calculates the similarities of multiple pairs of sequences using the k-mers counting method.
This algorithm dynamically divides the computation among existing processors through a master-slave approach. Parallelism is achieved by distributing the similarity calculation of sequence pairs to the available slaves. This is possible because each pairwise similarity calculation is independent of the others.
Initially, the master sends all sequences by broadcasting to all slaves. Using broadcast the master can reduce the overhead of message exchange, decreasing the communication time between processors.
[Figure 2 content: lower triangular matrix for the seven-sequence example.
       1   2   3   4   5   6   7
  1   80
  2   14  92
  3   25  77  91
  4   18  59  33  89
  5   21  39  39  32  84
  6   42  14  61  26  35  92
  7   21  34  52  71  12  43  90]
Figure 2: Similarities matrix that is built by the master processor.

[Figure 3 content: two flowcharts.
Master: Start; send all sequences to all processors via broadcast; P = 1; receive SimVect from processor P; mount from SimVect part of matrix L (*); P = P + 1; if P is greater than the number of processors, End; otherwise receive the next SimVect.
Slaves: Start; receive all sequences from master processor; get the quantity of k-tuples in common between pairs of sequences and mount a vector (SimVect) which will be sent to the master processor (**); send SimVect to master processor; End.
(*) Matrix L is a triangular matrix which contains the quantity of tuples in common between all sequences.
(**) Each processor is responsible for the calculation of specific lines of matrix L. All computed values are stored in the vector SimVect.]
Figure 3: Flowchart of the parallel algorithm for multiple sequence similarities calculation.

The tasks distribution is based on the processor identifier, assigning to each slave the calculation of specific lines of the triangular matrix of similarities. The way this matrix is obtained is exemplified in Figure 1. Each slave starts with the line corresponding to its identifier and advances in a loop with step $p$, where $p$ is the number of processors. In Figure 1, we have seven sequences and, hence, seven lines in the matrix. The first slave is responsible for the calculation of lines 1, 4, and 7, the second slave for lines 2 and 5, and the third slave for lines 3 and 6. The gray values are the results obtained in other slaves, which will only be joined in the master for the creation of the similarities matrix.
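The line assignment in this example can be expressed as a strided range. The Python sketch below uses a hypothetical helper, `rows_for_slave`, with 1-based identifiers and line numbers as in Figure 1, and assumes that line i of the triangular matrix holds i values, as drawn in the example.

```python
def rows_for_slave(slave_id, num_slaves, num_rows):
    """Lines (1-based) handled by a slave: start at its id, step by num_slaves."""
    return list(range(slave_id, num_rows + 1, num_slaves))

for slave in (1, 2, 3):
    rows = rows_for_slave(slave, 3, 7)
    # Line i is assumed to hold i values, so the load is the sum of the indices.
    print(slave, rows, sum(rows))
# 1 [1, 4, 7] 12
# 2 [2, 5] 7
# 3 [3, 6] 9
```

Interleaving the lines in this way spreads short and long lines over all slaves, whereas giving each slave a contiguous block would leave the slave holding the last lines with most of the work.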
During the calculation of the matrix lines, each slave stores the results of all its lines in a single vector (SimVect), which is sent to the master. The master receives all slave vectors and, from these vectors and the identifier of the slave that sent each one, builds the similarities matrix. For the previous example, the resulting similarities matrix is shown in Figure 2.
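The reassembly step on the master can be sketched as follows, using the values of the seven-sequence example. This is a plain Python simulation without any message passing: the function `assemble` and the dictionary layout are assumptions made for illustration, and line i is assumed to hold i values as in the example matrix.

```python
def rows_for_slave(slave_id, num_slaves, num_rows):
    """Lines (1-based) handled by a slave: start at its id, step by num_slaves."""
    return list(range(slave_id, num_rows + 1, num_slaves))

def assemble(sim_vects, num_rows):
    """Rebuild the triangular matrix from one flat SimVect per slave."""
    num_slaves = len(sim_vects)
    matrix = [None] * num_rows
    for slave_id, vect in sim_vects.items():
        pos = 0
        for row in rows_for_slave(slave_id, num_slaves, num_rows):
            matrix[row - 1] = vect[pos:pos + row]  # line i holds i values
            pos += row
    return matrix

# SimVect contents of the three slaves in the example.
sim_vects = {
    1: [80, 18, 59, 33, 89, 21, 34, 52, 71, 12, 43, 90],
    2: [14, 92, 21, 39, 39, 32, 84],
    3: [25, 77, 91, 42, 14, 61, 26, 35, 92],
}
for row in assemble(sim_vects, 7):
    print(row)
```

Because the master can recompute which lines belong to each slave from the slave identifier alone, the vectors need no per-value indices, which keeps every slave's result in a single compact message.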
Figure 3 shows the flowchart of the algorithm which performs the task illustrated in Figures 1 and 2.
[Figure 4 content: four plots of Time (vertical axis) against Number of processors (horizontal axis, ticks 1 to 15).
(a) Num.: 567, max.: 1536, avg.: 467
(b) Num.: 988, max.: 1543, avg.: 486
(c) Num.: 2000, max.: 1000, avg.: 650
(d) Num.: 20, max.: 5413, avg.: 4911]
Figure 4: Processing time, in seconds, of the parallel algorithm for 2, 4, 8, and 15 nodes executing with four different datasets.
3. Results
3.1. Performance Evaluation. Many tests were performed to measure the performance of the proposed algorithm. The datasets used in these experiments differ only in the number of sequences to be aligned and their lengths. The sequences used were extracted from the NCBI database (http://www.ncbi.nlm.nih.gov/). For each test, we describe specific information about the dataset used, namely the number of sequences and the average and maximum length of the sequences.
Tests were performed on a Beowulf cluster running Debian Linux. The cluster consists of 15 nodes, each composed of one AMD Athlon XP 2100+ processor with 1 GB of RAM. The nodes are connected by a dedicated 10/100 Fast Ethernet switch. The tests were performed with 1, 2, 4, 8, and 15 nodes. The run times were measured in standalone mode, to ensure exclusive use of the network, CPU, and memory.
In order to verify the performance of the proposed algorithm, we executed tests with four different datasets. For each dataset, we verified the scalability of the algorithm when executed on an increasing number of machines.
The first three plots ((a), (b), and (c)) in Figure 4 show an almost linear speedup for tests with more than 500 sequences. Notice that the minimum system for the parallel algorithm consists of two nodes, because the algorithm follows a master-slave model. One machine (the master) is responsible for data management. The performance difference between the sequential algorithm and the parallel one, when run on a minimum system, is almost zero. For this reason, we only show the tests performed with the parallel algorithm, from the minimum system of two nodes up to the maximum number of nodes in the cluster.
Plot (d), also in Figure 4, shows a nearly constant performance, regardless of the number of machines, for an input with few sequences. In this case, the sequential implementation of the algorithm is so fast for that input (approximately 1 second) that the time saved by dividing the tasks is close to the time spent on message exchange.
Comparing the minimum system (2 nodes, that is, one slave) with the maximum system (15 nodes, that is, 14 slaves), we obtain approximately a 12x speedup. Therefore, it can be noticed that the parallel algorithm presents excellent results with a nearly linear speedup.
4. Discussion
In this work, we presented a parallel strategy to calculate similarities between multiple pairs of sequences using the k-mers counting method, a low computational complexity method with good biological results. This calculation is used, for example, in multiple sequence alignment tools. The proposed parallel algorithm has been implemented for distributed memory systems, due to the wide use of Beowulf clusters in genomic research laboratories.
The tests performed show that the algorithm presents good scalability and a nearly linear speedup. With the use of 14 processing nodes (slaves), the system achieved a 12x speedup. This speedup is explained by the total independence of the scheduled tasks and by the good load balancing obtained with the distribution of the triangular matrix lines.
Additionally, the communication cost is minimized because the data computed in the slaves are sent to the master in a single message. Considering the 12x speedup achieved and using Amdahl's law [11], we can estimate that about 1.5% of the processing in the system is not parallelized, which reinforces the value of the optimization obtained by concentrating the communication and avoiding message overload.
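The order of magnitude of this estimate can be checked by solving Amdahl's law, S = 1 / (f + (1 - f) / N), for the serial fraction f. The Python sketch below assumes N = 14 parallel workers and S = 12; the function names are hypothetical, and the exact figure depends on whether the master node is counted among the workers.

```python
def serial_fraction(speedup, workers):
    """Solve Amdahl's law S = 1 / (f + (1 - f) / N) for the serial fraction f."""
    return (1 / speedup - 1 / workers) / (1 - 1 / workers)

def amdahl_speedup(f, workers):
    """Speedup predicted by Amdahl's law for serial fraction f."""
    return 1 / (f + (1 - f) / workers)

f = serial_fraction(12, 14)
print(round(100 * f, 2))                          # 1.28 (percent)
print(round(100 * serial_fraction(12, 15), 2))    # 1.79 (percent)
```

Under these assumptions the serial fraction comes out between roughly 1.3% (14 workers) and 1.8% (15 workers), which is consistent with the figure of about 1.5% quoted in the text.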
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper.
Acknowledgments
The authors would like to thank all of their collaboratorsand institutions for the support to the development of thepresent work. This work was partially supported by the SaoPaulo Research Foundation (FAPESP, Brazil) under Grant no.06/59592-0.
References
[1] R. C. Edgar, "MUSCLE: a multiple sequence alignment method with reduced time and space complexity," BMC Bioinformatics, vol. 5, article 113, 2004.
[2] G. F. Zafalon, E. A. Marucci, J. C. Momente, J. R. Amazonas, L. M. Sato, and J. M. Machado, "Improvements in the score matrix calculation method using parallel score estimating algorithm," Journal of Biophysical Chemistry, vol. 4, no. 2, pp. 47-51, 2013.
[3] X. Deng, E. Li, J. Shan, and W. Chen, "Parallel implementation and performance characterization of muscle," in Proceedings of the 20th IEEE Parallel and Distributed Processing Symposium (IPDPS '06), 2006.
[4] H. A. Arcuri, G. F. D. Zafalon, E. A. Marucci et al., "SKPDB: a structural database of shikimate pathway enzymes," BMC Bioinformatics, vol. 11, article 12, pp. 1-7, 2010.
[5] E. A. Marucci, G. F. Zafalon, J. C. Momente et al., "Using threads to overcome synchronization delays in parallel multiple progressive alignment algorithms," American Journal of Bioinformatics, vol. 1, no. 1, pp. 50-63, 2012.
[6] K. Katoh, K. Misawa, K. Kuma, and T. Miyata, "MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform," Nucleic Acids Research, vol. 30, no. 14, pp. 3059-3066, 2002.
[7] K. Katoh and H. Toh, "Parallelization of the MAFFT multiple sequence alignment program," Bioinformatics, vol. 26, no. 15, Article ID btq224, pp. 1899-1900, 2010.
[8] R. C. Edgar and S. Batzoglou, "Multiple sequence alignment," Current Opinion in Structural Biology, vol. 16, no. 3, pp. 368-373, 2006.
[9] M. J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill Education, New York, NY, USA, 1st edition, 2003.
[10] R. C. Edgar, "Local homology recognition and distance measures in linear time using compressed amino acid alphabets," Nucleic Acids Research, vol. 32, no. 1, pp. 380-385, 2004.
[11] G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," in Proceedings of the Spring Joint Computer Conference (AFIPS '67), 1967.