InCoB2007 - August 30, 2007 - HKUST
“Speedup Bioinformatics Applications on Multicore-based Processor using
Vectorizing & Multithreading Strategies”
King Mongkut’s Institute of Technology, Ladkrabang,
Thailand
National Center for Genetic Engineering and Biotechnology, Thailand
Dr. Surin Kittitornkun
Dr. Sissades Tongsima
Kridsadakorn [email protected]
Outline
Introduction
Case Study
Existing works
Speedup of our approach
Comparison
Discussion
Our strategies
Limitation
Conclusion
Motivation
New processors keep being launched. How can we make use of the new technologies?
Dual-core CPU
Quad-core CPU
Motivation [2]
What is the difference between old and new CPUs?
Dual-core: max. speedup ~2x
Quad-core: max. speedup ~4x
Problems
Is old sequential software still used?
Yes, especially scientific and bioinformatics tools.
Why do scientists still use it?
Mostly they care about novel algorithms and knowledge. They don't care about speed.
Why don't we use a PC cluster?
It is very expensive and consumes much more electric power. You don't need a PC cluster to run a small program for searching, matching or grouping data.
Our Contribution
The hardware has changed, so old sequential software should change too. To harness the power of the new multicore architecture, certain compiler techniques must be considered.
Using the popular ClustalW application as our case study, we applied optimization and multithreading techniques to speed up ClustalW.
Case Study: ClustalW
ClustalW is a general-purpose multiple alignment program for DNA or proteins.
ClustalW example

Input sequences:
S1  ALSK
S2  TNSD
S3  NASK
S4  NTSD

All pairwise alignments give the distance matrix:
      S1  S2  S3  S4
S1     0   9   4   7
S2         0   8   3
S3             0   7
S4                 0

Neighbor joining builds a tree from the distances, pairing S1 with S3 and S2 with S4.

Multiple alignment steps:
1. Align S1 with S3
2. Align S2 with S4
3. Align (S1, S3) with (S2, S4)

Result of step 1:
-ALSK
NA-SK

Result of step 2:
-TNSD
NT-SD

Multiple alignment:
-ALSK
-TNSD
NA-SK
NT-SD
Existing works
ClustalW-MPI: ClustalW analysis using distributed and parallel computing
K.B. Li, Bioinformatics 19, 2003
Parallel MSA: Parallel Multiple Sequence Alignment with Dynamic Scheduling
J. Luo, I. Ahmad, M. Ahmed and R. Paul, ITCC’05
SGI: Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL
D. Mikhailov, Haruna C., and R. Gomperts, SGI ChemBio
Speedup of our approach
Test data: 800 sequences, 1000 amino acids
Elapsed times (ms)

Running mode*  Distance matrix  Neighbor joining  Progressive alignment  Overall speedup
I              11,918,672       932,718           333,110                -
II             10,387,046       881,125           338,016                1.14
III             9,656,750       880,969           327,985                1.21
IV              7,009,875       511,047           252,984                1.70
V               5,900,891       473,359           253,188                1.98
VI              5,472,407       474,109           244,672                2.12

*Note: running modes are defined as follows:
(I) ClustalW without optimization
(II) ClustalW with optimization
(III) ClustalW with optimization and our assist
(IV) MT-ClustalW without optimization
(V) MT-ClustalW with optimization
(VI) MT-ClustalW with optimization and our assist

Data set: protein sequences from NCBI
Run time: from 3 h 40 min down to 1 h 43 min
ClustalW
Figure: Speedup of the optimized versions of ClustalW as a function of the number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids.
Axes: number of sequences (200, 400, 600, 800) against speedup (%), on a scale from 10% to 26%.
Series: len800 with only compiler optimization; len800 with optimization and our assist; len1000 with only compiler optimization; len1000 with optimization and our assist.
Multithreaded ClustalW
Figure: Speedup of the optimized versions of MT-ClustalW as a function of the number of sequences. The sequence lengths are fixed at 800 and 1000 amino acids.
Axes: number of sequences (200, 400, 600, 800) against speedup (%), on a scale from 95% to 115%.
Series: len800 with only compiler optimization; len800 with optimization and our assist; len1000 with only compiler optimization; len1000 with optimization and our assist.
Comparison
                     ClustalW-MPI  Parallel MSA  SGI                       ClustalW-MTV
Number of sequences  500           80            600                       600
Sequence length      1100          289-399       390                       400
Machine              PC cluster    PC cluster    Single PC, shared memory  Single PC, shared memory
Processors           2             2             2                         2
Speedup              1.75x         1.8x          1.8x                      2.25x

Why is the speedup over 2x?
Because of the special unit in the new CPU.
Does the special unit normally work with common software?
No, we have to activate it.
Speedup > 2x for dual-CPU? [1]
Amdahl’s Law
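$$S = \frac{1}{(1 - f) + \dfrac{f}{k}}$$
where S is the overall speedup, f is the fraction of the original program's run time that is modified, and k is the speedup of that fraction.
Original program: run time split into 1-f (unchanged part) and f (part to be modified).
Modified program: the f part runs k times faster.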
Speedup > 2x for dual-CPU? [2]
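$$\text{Speedup}_{total} = \text{Speedup}_{opt} \times \text{Speedup}_{mt}$$
$$\text{Speedup}_{total} = 1.21 \times 1.70 \approx 2.06$$
Optimization with our assist (running mode III): Speedup 1.21
Multithreading without optimization (running mode IV): Speedup 1.70
Data set: 800 sequences, 1000 amino acids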
Our strategies
Step 1: Analyzing and profiling
To find the software structure and where the bottleneck is.
Step 2: Applying the methodologies
Multithreading and vectorizing (one of the optimization methods).
Step 3: Validating
To compare the result with the original one and make sure that the result is unchanged.
Strategy: Multithreading
The proposed multithreading strategy
Improve the bottleneck of the software, which is the non-threaded part.
Raise the throughput of the program by applying the multithreading strategy.
Reduce the overhead of thread creation.
Profile the software
Profiled by Intel Thread Profiler
Distance matrix
Neighbor joining
Progressive alignment
Implementation
Apply the thread library to this loop
Trick
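Reduce thread creation overhead
4 threads share 12 parameters, so each thread works through several parameters instead of one thread being created per parameter:
T1   T2   T3   T4
P1   P2   P3   P4
P5   P6   P7   P8
P9   P10  P11  P12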
Strategy: Vectorizing
Proposed optimizing and vectorizing methodology
Find the frequently used functions in the program.
Apply the loop-optimizing methodologies.
Use the Intel C++ Compiler to optimize the code, with the vectorizing option enabled.
Frequently used functions

Function       Clockticks (%)  Methodology*
diff           33.36           A, B
prfscore       15.93           C
forward_pass   14.91           -
calc_score     12.93           D
reverse_pass   11.45           A
pdiff           5.85           -

*Note: A is loop reversal, B is loop fission, C is type casting, and D is procedure call reduction.
Profiled by Intel VTune.
Loop Reversal
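That is, to run a loop backward. Reversal is legal when no iteration depends on the result of another, so that the outcome does not depend on the order in which the index set is traversed. The initialization loop below is such a case.

Counting up:
    for (i = 1; i <= se2; i++) {
        HH[i] = -1;
        DD[i] = -1;
    }

Counting down (reversed):
    for (i = se2; i > 0; i--) {
        HH[i] = -1;
        DD[i] = -1;
    }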
Loop Fission
A single loop can be broken into two or more smaller loops. Loop fission can separate the block of conditionally executed statements from the unconditional arithmetic, which leaves a simple loop that the compiler can vectorize more easily.

Original loop:
    for (j = 0; j <= N; j++) {
        hh = HH[j] + RR[j];
        if (hh >= midh)
            if (HH[j] != DD[j] && RR[j] == SS[j]) {
                midh = hh;
                midj = j;
            }
    }

After fission:
    for (j = 0; j <= N; j++) {
        temp[j] = HH[j] + RR[j];
    }

    for (j = 0; j <= N; j++) {
        if (temp[j] >= midh)
            if (HH[j] != DD[j] && RR[j] == SS[j]) {
                midh = temp[j];
                midj = j;
            }
    }
Limitation
Available compilers and programming languages
C/C++: Intel C++ compiler (Windows, Linux, Mac)
Fortran: Intel Fortran compiler (Windows, Linux, Mac)
Available processors
CPU with Hyper-Threading Technology or above (Intel, AMD)
Conclusion
Generic compiling strategy to assist the compiler in improving the performance of bioinformatics applications written in C/C++
Proposed framework: multithreading and vectorizing strategies
Higher speedup by taking advantage of multicore architecture technology
Proposed optimization could be more appropriate than making use of parallelization on a small cluster computer
Questions?
Thank you