Parallelized Multiple Sequence Alignment on the Public Cloud
Presented by: Dr. G. Sudha Sadasivam, Professor, Dept of CSE,
PSG College of Technology, Coimbatore
Co-authors: Mr B. Vijayan, Mr S. Arul Prakash, Mr K.V. Hari Babu
Students, BE (CSE), Dept of CSE, PSG College of Technology, Coimbatore
Agenda
- Sequence alignment
- Introduction to Clouds
- Approaches for MSA
- Problem statement
- System Architecture
- Illustration of working of the system
- Analysis
- Experimental results
- Conclusion
What is Sequence Alignment?
The procedure of comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences.
Uses
- Sequence similarity
- Phylogenetic tree analysis
Factors: accuracy and speed
Cloud computing
Provides scalable, on-demand, real-time computing services.
Suitability of cloud for Sequence Alignment
- On-demand scalability of the cloud makes it suitable for the dynamic nature of MSA.
- Low cost of maintaining infrastructure for applications.
- Data and compute parallelism in clouds through the map-reduce paradigm facilitates energy-efficient and fast MSA.
Types of Sequence Alignment
Pair-wise Alignment: alignment of two sequences
Global, using the Needleman-Wunsch algorithm:
L G P S S K Q T G K G S _ S R A W D N
|         |     | | |       |     |
L N _ A T K S A G K G A I M R L G D A
Local, using the Smith-Waterman algorithm:
_ _ _ _ _ _ _ _ _ T G K G _ _ _ _ _ _ _ _ _ _
                    | | |
_ _ _ _ _ _ _ _ _ A G K G _ _ _ _ _ _ _ _ _ _
Multiple Sequence Alignment: alignment of more than two sequences
MSA methods
Dynamic Programming (n-dimensional matrix)
- Accurate
- Computationally complex: O(N^n)
- Exhaustive
Progressive approximation (aligns the closest sequences first; heuristic)
- Fast alignment
- Cannot be modified
- Local maxima
- Less accurate
- ClustalW, MAFFT
Iterative, probabilistic/stochastic (random)
- Slow and less accurate
- GA and HMM
N: sequence length; n: number of sequences
MSA in cloud
CloudBurst (RMAP)
- Does not split sequences to load in the cloud environment
- Not for MSA
- No automatic scale up/down of clusters
CluE: proposal from the University of Maryland
VM cloning: SnowFlock with MPI
Problem statement
Time-efficient approach to sequence alignment with quality (accuracy) in the cloud
- Using the Hadoop framework
  - Dynamic approach: accuracy
  - Data and compute parallelism in Hadoop: speed
  - Blocking and scalability of Hadoop
- Parallel transfer of sequence splits over the network to remote clusters
- Automated scale up/down of clusters based on the computational needs of the environment
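Needleman Wunsch Algorithm
Initialization

$$F(0,0) = 0, \qquad F(i,0) = -i \cdot d, \qquad F(0,j) = -j \cdot d$$

Main iteration: for each $i = 1 \dots M$ and $j = 1 \dots N$

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) & \text{case 1} \\ F(i-1,j) - d & \text{case 2} \\ F(i,j-1) - d & \text{case 3} \end{cases}$$

$$\mathrm{Ptr}(i,j) = \begin{cases} \text{DIAG} & \text{if case 1} \\ \text{UP} & \text{if case 2} \\ \text{LEFT} & \text{if case 3} \end{cases}$$

Case 1: $x_i$ aligns to $y_j$
Case 2: $x_i$ aligns to a gap
Case 3: $y_j$ aligns to a gap

$$s(x_i, y_j) = \begin{cases} +1 & \text{match} \\ -1 & \text{mismatch} \end{cases}$$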
Needleman Wunsch Algorithm
Worked example: x = "ATA" (rows, i = 0…3), y = "AGTA" (columns, j = 0…4), with s = +1 for a match, -1 for a mismatch, and d = 1.

F(i,j)     j=0    1    2    3    4
                  A    G    T    A
i=0          0   -1   -2   -3   -4
i=1   A     -1    1    0   -1   -2
i=2   T     -2    0    0    1    0
i=3   A     -3   -1   -1    0    2

$$F(1,1) = \max\{\, F(0,0) + s(x_1,y_1) = 1,\; F(0,1) - 1 = -2,\; F(1,0) - 1 = -2 \,\} = 1 \quad \text{(case 1, DIAG)}$$

$$F(1,2) = \max\{\, F(0,1) + s(x_1,y_2) = -2,\; F(0,2) - 1 = -3,\; F(1,1) - 1 = 0 \,\} = 0 \quad \text{(case 3, LEFT)}$$

Optimal Alignment
A_TA
AGTA
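The recurrence and traceback can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `needleman_wunsch`, its parameters, and the tie-breaking order (DIAG, then UP, then LEFT) are assumptions made for the example. Rows index the first argument and columns the second, matching cases 1 to 3 of the slide.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment of x (rows) and y (columns) with linear gap penalty d.

    Returns the score matrix F and the two aligned strings ('_' marks a gap).
    Ties are broken DIAG > UP > LEFT (an assumption of this sketch).
    """
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # Initialization: first column and first row are pure gap penalties.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"

    # Main iteration: take the best of the three cases for every cell.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "DIAG"),  # case 1: x_i aligns to y_j
                (F[i - 1][j] - d, "UP"),        # case 2: x_i aligns to a gap
                (F[i][j - 1] - d, "LEFT"),      # case 3: y_j aligns to a gap
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Traceback from the bottom-right cell to the origin.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, aligned_x, aligned_y = needleman_wunsch("ATA", "AGTA")
```

With the slide's inputs, `aligned_x` is `A_TA`, `aligned_y` is `AGTA`, and `F[3][4]` is the bottom-right matrix entry, 2.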
Multiple Sequence Alignment
A multiple sequence alignment is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA.
The input is a set of query sequences that are assumed to have an evolutionary relationship by which they share a lineage and are descended from a common ancestor.
From the resulting multiple sequence alignment, phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.
Dynamic Programming
- Direct method for MSA to identify the globally optimal alignment solution.
- Computational complexity: an n-dimensional equivalent of the pairwise alignment matrix is formed. The search space increases exponentially with increasing n and is strongly dependent on sequence length (N): O(N^n).
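To make the O(N^n) growth concrete, the following sketch counts the cells of the full n-dimensional matrix. The helper name `dp_cells` is hypothetical, and the count includes the extra gap row or column on each axis.

```python
def dp_cells(lengths):
    """Cells in the full dynamic-programming matrix, one axis per sequence."""
    cells = 1
    for length in lengths:
        cells *= length + 1  # +1 for the leading gap row/column on this axis
    return cells


pairwise = dp_cells([3, 4])        # the 4 x 5 matrix for "ATA" vs "AGTA"
three_way = dp_cells([100] * 3)    # three sequences of length 100
five_way = dp_cells([100] * 5)     # five sequences of length 100
```

Each added sequence multiplies the cell count by roughly N, which is why the exhaustive method becomes impractical beyond a handful of sequences.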
Progressive Alignment
Heuristic search: builds up a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related.
Stages:
1. The relationships between the sequences are represented as a tree, called a guide tree (pairwise alignment scores).
2. The MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree.
Guide tree leaves: seq 1, seq 2, seq 3, seq 4
According to the guide tree: 1) align seq 1 and seq 2; 2) align seq 3 with respect to seq 1 and 2; 3) align seq 4 to the alignment of seq 1, 2, and 3.
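The ordering step can be sketched as below. This is a simplification that stands in for a real guide tree: it starts from the most similar pair and then repeatedly adds the sequence closest to any sequence already placed. The function name `progressive_order` and the score values are hypothetical.

```python
def progressive_order(names, score):
    """Order in which sequences join the growing alignment.

    score maps frozenset({a, b}) to a pairwise similarity (higher = closer).
    """
    # Start with the most similar pair.
    pairs = [frozenset((a, b)) for i, a in enumerate(names) for b in names[i + 1:]]
    first = max(pairs, key=lambda pair: score[pair])
    order = sorted(first, key=names.index)
    remaining = [n for n in names if n not in first]

    # Repeatedly add the sequence closest to anything already aligned.
    while remaining:
        nxt = max(remaining,
                  key=lambda n: max(score[frozenset((n, m))] for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


names = ["seq1", "seq2", "seq3", "seq4"]
score = {  # made-up pairwise alignment scores for illustration only
    frozenset(("seq1", "seq2")): 9,
    frozenset(("seq1", "seq3")): 6,
    frozenset(("seq2", "seq3")): 5,
    frozenset(("seq1", "seq4")): 2,
    frozenset(("seq2", "seq4")): 3,
    frozenset(("seq3", "seq4")): 4,
}
order = progressive_order(names, score)
```

With these scores the order is seq1, seq2, seq3, seq4, the same sequence of steps as in the slide's example.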
Drawbacks
- The primary problem is that when errors are made at any stage in growing the MSA, they are propagated through to the final result. Random/iterative approaches are used to address this.
- Performance is also particularly bad when all of the sequences in the set are rather distantly related.
System Architecture
[Figure: a client-side virtual environment sends sequence fragments (AGT….CG) in parallel over the Internet to a server-side Hadoop cluster consisting of a head server (VM) and dynamically created new VMs.]
CLIENT SIDE VIRTUAL ENVIRONMENT
1. Create virtual environment
2. Split the sequences; parallel transmission of the sequence fragments over the Internet
SERVER SIDE HADOOP CLUSTER
3. Copy to HDFS
4. Forking VMs / deleting VMs
5. Perform alignment
6. Report the result
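The client-side split step, and the merge that undoes it, might look like the following sketch. The helper names `split_sequence` and `merge_blocks` and the block size are assumptions for illustration; the slides do not show the actual splitting code.

```python
def split_sequence(sequence, block_size):
    """Cut a sequence into consecutive fixed-size blocks (the last may be shorter)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [sequence[i:i + block_size]
            for i in range(0, len(sequence), block_size)]


def merge_blocks(blocks):
    """Reassemble the original sequence from its ordered blocks."""
    return "".join(blocks)


blocks = split_sequence("AGTACGTTAG", 4)
restored = merge_blocks(blocks)
```

Here `blocks` is `['AGTA', 'CGTT', 'AG']`, and merging the blocks in order restores the original string.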
Map reduce Architecture
[Figure: map-reduce data flow from input blocks through mappers (M), sort and group, and reducers (R).]
Input blocks, one mapper (M) per block: D1,B1  D2,B1  D1,B2  D1,B3  D3,B1  D2,B2  D3,B2
Map output pairs, one group per mapper:
  (K1,C1) (K2,C1) (K3,C1)
  (K2,C2) (K5,C2) (K3,C2)
  (K6,C3) (K3,C3) (K4,C3)
  (K5,C4) (K2,C4) (K4,C4)
  (K4,C5) (K1,C5) (K6,C5)
  (K6,C6) (K3,C6) (K1,C6)
  (K5,C7) (K6,C7) (K4,C7)
Sort and Group (D1): K1,[C1]  K2,[C1,C4]  K3,[C1,C3]  K4,[C4,C3]  K5,[C4]  K6,[C3]
Sort and Group (D2): K1,[C6]  K2,[C2]  K3,[C2,C6]  K5,[C2]  K6,[C6]
Reduce output, one reducer (R) per key:
  D1: K1,I  K2,I  K3,I  K4,I  K5,I  K6,I
  D2: K1,I  K2,I  K3,I  K5,I  K6,I
Tasks: Map Task 1, Map Task 2, Map Task 3; Reduce Task 1, Reduce Task 2
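The map, sort-and-group, and reduce stages of the diagram can be imitated in memory with plain Python. This is an illustrative sketch, not Hadoop code: the function names are hypothetical, and the mapper simply emits one (key, chunk) pair per key found in a chunk. The input reproduces the three D1 chunks (C1, C3, C4) from the diagram.

```python
from collections import defaultdict


def map_phase(chunks):
    """Emit one (key, chunk_id) pair for every key found in a chunk."""
    pairs = []
    for chunk_id, keys in chunks:
        for key in keys:
            pairs.append((key, chunk_id))
    return pairs


def sort_and_group(pairs):
    """Collect the values of each key into a list, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in grouped.items()}


# The three D1 chunks and their keys, as read off the diagram.
d1_chunks = [
    ("C1", ["K1", "K2", "K3"]),
    ("C3", ["K6", "K3", "K4"]),
    ("C4", ["K5", "K2", "K4"]),
]
grouped_d1 = sort_and_group(map_phase(d1_chunks))
counts_d1 = reduce_phase(grouped_d1, lambda key, values: len(values))
```

The grouping matches the "Sort and Group (D1)" row of the diagram, apart from the order of values inside a list (for example `['C3', 'C4']` for K4), which depends on the order in which mapper output arrives.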
S1 = “AGTA”; S2 = “ATA”; S3 = “GAT”

1. ALIGNMENT OF S1 & S2
          0    1    2    3    4
               A    G    T    A
0         0   -1   -2   -3   -4
1   A    -1    1    0   -1   -2
2   T    -2    0    0    1    0
3   A    -3   -1   -1    0    2
SCORE: 4
A1S1: “AGTA”; A1S2: “A_TA”

2. ALIGNMENT OF A1S1 & S3
          0    1    2    3    4
               A    G    T    A
0         0   -1   -2   -3   -4
1   G    -1   -1    0   -1   -2
2   A    -2    0   -1   -1    0
3   T    -3   -1   -1    0   -1
SCORE: -5
A2S1: “AG_TA”; A1S3: “_GAT_”

3. ALIGNMENT OF A1S2 & A1S3
          0    1    2    3    4    5
               A    _    T    A    _
0         0   -1   -2   -3   -4   -5
1   _    -1    0    0   -1   -2   -3
2   G    -2   -1   -1   -1   -2   -2
3   A    -3   -1   -1   -2    0   -1
4   T    -4   -2   -1    0   -1    0
5   _    -5   -3   -1   -1    0    0
SCORE: -3
A2S2: “A__TA_”; A2S3: “_GAT__”
Analysis
Complexity Measure     Proposed Method           Conventional Method
Score calculation      O(N)                      O(n*N)
Pairwise alignment     O(K^2)                    O(N^2)
MSA                    O[K^2 * (n(n-1)/2)]       O(N^n)
‘n’ – Number of sequences
‘N’ – Average length of a sequence
‘k’ – Average number of blocks in a sequence
‘K’ – Size of 1 block
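The two MSA expressions in the table can be compared numerically. The sketch below just evaluates the formulas as given; the function names and the sample values (N = 30, K = 2, n = 5) are illustrative assumptions, and constant factors are ignored.

```python
def proposed_cost(K, n):
    """K^2 * n(n-1)/2: one K x K block alignment per pair of sequences."""
    return K ** 2 * (n * (n - 1) // 2)


def conventional_cost(N, n):
    """N^n: full n-dimensional dynamic programming."""
    return N ** n


N, K, n = 30, 2, 5                     # hypothetical sizes for illustration
proposed = proposed_cost(K, n)         # 4 * 10
conventional = conventional_cost(N, n)  # 30 ** 5
```

For these values the proposed expression gives 40 and the conventional one 24,300,000, which shows how strongly the block size K and the pairwise formulation dominate the comparison.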
2. Parallelised data transfer
‘T’ – Time for serial sequence transfer; ‘k’ – number of blocks
T/k – Time for sequence transfer in parallel
3. Dynamic cluster creation
Advantage: computation power of the remote cluster is used optimally and not wasted
Disadvantage: time to set up the cluster
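The T/k estimate for parallelised transfer assumes that all blocks travel at the same time. A minimal sketch with Python's standard thread pool is shown below; `send_block` is a hypothetical stand-in for a real network upload and only reports how many characters it was given.

```python
from concurrent.futures import ThreadPoolExecutor


def send_block(block):
    """Hypothetical upload of one block; returns the number of characters sent."""
    return len(block)


def transfer_parallel(blocks, max_workers=4):
    """Send all blocks concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_block, blocks))


def ideal_parallel_time(T, k):
    """Best-case transfer time when a serial transfer of T is split k ways."""
    return T / k


sent = transfer_parallel(["AGTA", "CGTT", "AG"])
estimate = ideal_parallel_time(6.0, 3)
```

The T/k figure is an upper bound on the benefit: it ignores the split and merge overhead and any shared-bandwidth limit on the link.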
Experimental Setup
- Core 2 Duo processors, 2.8 GHz, 160 GB HD, 2 GB RAM
- LAN: 100 Mbps
- OS: RHEL v5
- Client virtual environment: 4 VMs
- Server cluster: 5 machines
- Hadoop DFS in fully distributed mode
- OpenVZ was used for virtualization
Effect of parallel file transfer
File Size  File Transfer  Split Time  Merge Time  C1     T1     C2     T2
(MB)       (sec)          (sec)       (sec)       (sec)  (sec)  (sec)  (sec)
100        6.23           0.02        0.03        2.13   2.18   0.73   0.78
200        9.32           0.23        0.43        2.96   3.62   1.23   1.89
300        11.43          0.85        1.64        3.84   6.33   1.16   3.65
C1: Communication time from 3 client VMs to the server without multithreading.
C2: Communication time from 3 client VMs to the server with multithreading.
T1: Total time for file transfer from client to server without multithreading.
T2: Total time for file transfer from client to server with multithreading.
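The totals in the table appear to equal communication time plus split time plus merge time. That relation is inferred from the numbers rather than stated on the slide, and the sketch below checks it and derives the T1/T2 speedup for each file size. The variable and function names are hypothetical.

```python
# (file size MB, file transfer, split, merge, C1, T1, C2, T2), all times in seconds
rows = [
    (100, 6.23, 0.02, 0.03, 2.13, 2.18, 0.73, 0.78),
    (200, 9.32, 0.23, 0.43, 2.96, 3.62, 1.23, 1.89),
    (300, 11.43, 0.85, 1.64, 3.84, 6.33, 1.16, 3.65),
]


def total_time(comm, split, merge):
    """Communication time plus the split and merge overheads."""
    return round(comm + split + merge, 2)


# Recompute T1 and T2 from their parts and compare with the reported values.
recomputed = [
    (total_time(c1, split, merge), total_time(c2, split, merge))
    for _, _, split, merge, c1, _, c2, _ in rows
]
reported = [(t1, t2) for *_, t1, _, t2 in rows]

# Speedup of the multithreaded transfer over the single-threaded one.
speedups = [round(t1 / t2, 2) for t1, t2 in reported]
```

The speedup falls as files grow because the split and merge overheads, which are not parallelised in this accounting, take up a larger share of the total.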
Time to start virtual machines
[Figure: time in sec (y-axis, 0 to 120) against number of VMs (x-axis, 1 to 4).]
Parallelised starting of VMs can be done to reduce time
Cluster performance with respect to number of VMs
30 KB sequences with 2 KB splits, up to 5 sequences
[Figure: time in sec (y-axis, 0 to 350) against number of sequences (x-axis, 1 to 10), plotted for 4 slave VMs and for 6 slave VMs.]
When the number of sequences is less than 6, a five-node Hadoop cluster is sufficient.
Dynamic scaling up/down of clusters
Static VM creation based on predicted application load (maps + reduces) compared with dynamic VM creation based on actual application load (maps + reduces). Block size: 10 MB.

File Size  Static: Time  VMs  Dynamic: Time  New VMs
(GB)       (min-sec)          (min-sec)      added
1          5-36          2    3-16           1
2          5-52          3    5-40           1
3          8-27          4    5-48           2
5          12-13         5    6-39           9

- VMs were instantiated based on the number of map-reduce tasks.
- The number of tasks was checked dynamically; new VMs were started and tasks were reallocated.
- Old VMs were destroyed if not used.
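A scaling rule of this kind can be sketched as follows. The slides do not give the actual policy, so the capacity parameter `tasks_per_vm` and both function names are assumptions; the sketch only shows the shape of the decision (start VMs when tasks exceed capacity, destroy idle ones otherwise).

```python
import math


def vms_needed(num_tasks, tasks_per_vm):
    """Smallest number of VMs that can host all tasks (at least one)."""
    return max(1, math.ceil(num_tasks / tasks_per_vm))


def scaling_action(current_vms, num_tasks, tasks_per_vm):
    """Decide whether to start VMs, destroy VMs, or keep the cluster as it is."""
    target = vms_needed(num_tasks, tasks_per_vm)
    if target > current_vms:
        return ("start", target - current_vms)
    if target < current_vms:
        return ("destroy", current_vms - target)
    return ("keep", 0)


grow = scaling_action(current_vms=2, num_tasks=10, tasks_per_vm=2)
shrink = scaling_action(current_vms=5, num_tasks=4, tasks_per_vm=2)
steady = scaling_action(current_vms=3, num_tasks=6, tasks_per_vm=2)
```

In a running system this check would be repeated as the number of map and reduce tasks changes, with tasks reallocated after each adjustment.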
Conclusion
1) The proposed MSA improves computation time while maintaining accuracy. Sequence alignment is parallelised at three levels.
   - Hadoop data grids: data and compute parallelism, and scalability
   - Dynamic programming: accuracy
2) Complexity is reduced from O(N^n) to O[K^2 * (n(n-1)/2)]
   - Combining progressive and dynamic approaches
   - Blocking in Hadoop
3) Enhancements (using clouds for MSA)
   - Automatic configuration of the cloud environment based on the computational needs
   - Efficient upload of data into the HDFS by parallel transfer of sequence fragments over the Internet
Acknowledgements
The Research has been carried out as a result of PSG-Yahoo Research programme on Grid and Cloud computing.
Sincere Thanks to
1) Dr R Rudramoorthy, Principal,
PSG College of Technology, Coimbatore.
2) Mr K V Chidambaran,
Director, Grid and Cloud Systems Group,
Yahoo, Bangalore
REFERENCES
Apache (2002). Hadoop Documentation, retrieved on September 20, 2009, from http://hadoop.apache.org/core/docs/r0.17.2/.
Tahir, N., Imitaz, S. and Shaftab, A., “Parallel Needleman-Wunsch Algorithm for Grid”, retrieved on January 19, 2009 from http://www.gridbus.org/~alchemi/files/Parallel%20Needleman%20Algo.pdf
Schatz, M. C. (2009). “CloudBurst: highly sensitive read mapping with MapReduce”, Bioinformatics, 25(11), 1363-1369.
Lee, T., “A genomic CluE for Cloud Computing”, retrieved on January 13, 2009 from http://www.eurekalert.org/pub_releases/2009-04/uom-agc042309.php
Yongli, H. and Shen, J., “Sequence analysis scale up and acceleration using Grid and Cloud Computing yield efficient analyses of HIV-1 variants and other viruses”, retrieved on February 15, 2009 from www.iscb.org/uploaded/css/43/12056.pdf.
Patchin, P., Lagar-Cavilla, H. A., de Lara, E. and Brudno, M., “Adding the easy button to the cloud with SnowFlock and MPI”, in Proceedings of the 3rd ACM Workshop on System-level Virtualization for HPC (2009), 122-127.