28 April, 2005 ISGC 2005, Taiwan
The Efficient Handling of BLAST Applications on the GRID
Hurng-Chun Lee (1) and Jakub Moscicki (2)
(1) Academia Sinica Computing Centre, Taiwan
(2) CERN IT-GD-ED, Switzerland
Outline
• The considerations of distributing BLAST jobs
• The master-worker computing model of BLAST
– mpiBLAST
• The Gridified BLAST
– mpiBLAST-g2 vs. DIANE-BLAST
• Summary
The considerations of distributing BLAST jobs
• BLAST is widely and routinely used for sequence analysis
• It is an essential component of most bioinformatics and life science applications
• Problem complexity ~ O(Sq × Sd)
– Sq: the query size
– Sd: the database size
• In most cases, Sd >> Sq
– e.g. Sq ~ O(MB), Sd ~ O(GB)
– Moving the query therefore costs less than moving the database
• Database management, storage and sharing issues
– Replication, archiving
– Privacy, security
• Other considerations for service provision
– Scalability, robustness
The master-worker model of BLAST
• Database splitting is the easiest way to distribute BLAST jobs
• Fragmenting the database avoids memory swapping
• Each sub-task can be 100% independent
• Each worker requests tasks from the master (pull model) and runs a normal BLAST search
• The individual results can be easily merged by the master process
• Report generation (BioSeq fetching)
• A multi-query BLAST search can be easily split into multiple independent single-query BLAST searches by a trivial script
– The master-worker model can also be applied within each single-query search
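The "trivial script" for splitting a multi-query search is not shown on the slide. A minimal sketch of what it could look like, assuming the queries are held in a multi-record FASTA string; the function names `split_fasta` and `to_single_query_records` are illustrative, not part of any BLAST tool:

```python
from typing import Iterator, List, Tuple


def split_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) for each record of a multi-record FASTA string."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def to_single_query_records(text: str) -> List[str]:
    """Return one single-record FASTA string per query (hypothetical helper)."""
    return [f">{header}\n{seq}\n" for header, seq in split_fasta(text)]
```

Each returned record can then be submitted as an independent single-query search, and the per-query reports concatenated afterwards.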
[Figure: master-worker diagram. Components: Database, DB Fragments, Master, Task list, workers. Steps: formatdb, job requesting, blast search, result merging, BioSeq fetching.]
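The pull model on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: threads stand in for remote workers, a toy identity-count scorer stands in for the real BLAST search, and all names (`run_master_worker`, `make_toy_search`) are hypothetical rather than taken from mpiBLAST.

```python
import queue
import threading
from typing import Callable, List, Sequence, Tuple

Record = Tuple[str, str]   # (sequence id, sequence)
Hit = Tuple[float, str]    # (score, sequence id); higher score is better


def make_toy_search(query: str) -> Callable[[Sequence[Record]], List[Hit]]:
    """Stand-in for a BLAST search: score = number of matching positions."""
    def search(fragment: Sequence[Record]) -> List[Hit]:
        return [(float(sum(a == b for a, b in zip(query, seq))), seq_id)
                for seq_id, seq in fragment]
    return search


def run_master_worker(fragments: Sequence[Sequence[Record]],
                      search: Callable[[Sequence[Record]], List[Hit]],
                      n_workers: int = 4,
                      top_k: int = 10) -> List[Hit]:
    tasks: "queue.Queue" = queue.Queue()
    for fragment in fragments:          # master builds the task list
        tasks.put(fragment)
    results: List[Hit] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                fragment = tasks.get_nowait()   # worker pulls its next task
            except queue.Empty:
                return                          # no tasks left
            hits = search(fragment)             # independent sub-task
            with lock:
                results.extend(hits)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Master merges: sort by descending score, ties broken by id.
    return sorted(results, key=lambda h: (-h[0], h[1]))[:top_k]
```

Because workers pull tasks instead of receiving a fixed share, a fast worker simply processes more fragments, which is where the load balancing comes from.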
mpiBLAST
LANL, US http://mpiblast.lanl.gov
• The MPI implementation of the BLAST master-worker model
• Advantages
– High throughput
– Load balancing
• Running in a local cluster
– Performance and problem size are still limited by local computing power
– Simultaneous I/O to a centralized database causes a performance bottleneck
– Database sharing is still difficult
mpiBLAST-g2
ASCC, Taiwan and PRAGMA http://bits.sinica.edu.tw/mpiBlast/index_en.php
• A GT2-enabled parallel BLAST that runs on the Grid
– GT2 GASSCOPY API
– MPICH-g2
• An enhancement of mpiBLAST by ASCC
• Performs cross-cluster job execution
• Performs remote database sharing
• Helper tools for
– database replication
– automatic resource specification and job submission (with static resource table)
– multi-query job splitting and result merging
• Close link with the mpiBLAST development team
– New mpiBLAST patches can be quickly applied to mpiBLAST-g2
SC2004 mpiBLAST-g2 demonstration
[Figure not preserved; surviving label: KISTI]
mpiBLAST-g2 current deployment
-- From PRAGMA GOC http://pragma-goc.rocksclusters.org
mpiBLAST-g2 Performance Evaluation (perfect case)
Database: est_human ~ 3.5 GBytes
Queries: 441 test sequences ~ 300 KBytes
• Overall speedup is approximately linear
[Figure: elapsed time and speedup plots; curves: Searching + Merging, BioSeq fetching, Overall]
mpiBLAST-g2 Performance Evaluation (worst case)
Database: drosophila NT ~ 122 MBytes
Queries: 441 test sequences ~ 300 KBytes
• The overall speedup is limited by BioSeq fetching, which does not scale
[Figure: elapsed time and speedup plots; curves: Searching + Merging, BioSeq fetching, Overall]
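Why a non-scaling stage caps the overall speedup can be seen with a simple Amdahl-style model. The sketch below assumes, purely for illustration, that searching and merging scale ideally with the number of workers while BioSeq fetching takes constant time; the timings used are hypothetical and are not the measurements plotted on the slide.

```python
def overall_speedup(t_search: float, t_fetch: float, n_workers: int) -> float:
    """Speedup when only the search part is divided among n_workers."""
    t_one = t_search + t_fetch                 # elapsed time on one worker
    t_n = t_search / n_workers + t_fetch       # fetching does not shrink
    return t_one / t_n


def speedup_ceiling(t_search: float, t_fetch: float) -> float:
    """Upper bound on the speedup as the number of workers grows."""
    return (t_search + t_fetch) / t_fetch


# Hypothetical timings: 90 s of searching, 10 s of BioSeq fetching.
for n in (1, 3, 9, 90):
    print(n, round(overall_speedup(90.0, 10.0, n), 2))
print("ceiling:", speedup_ceiling(90.0, 10.0))
```

With these made-up numbers, nine workers give a speedup of 5 rather than 9, and no number of workers can exceed 10. The smaller the database, the larger the fetching share, which matches the contrast between the two evaluation slides.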
Issues of mpiBLAST-g2
• A single error will crash the whole job
– This is the nature of MPICH
– The error might be due to a transient problem in the loosely coupled Grid environment
• An MPI job starts only when all resources are available
– Resources have different levels of availability
Error recovery is required for
– providing a robust application service on the Grid
– using the Grid resources efficiently
Asynchronous task dispatching/pulling is needed to use the available resources immediately
DIANE
http://cern.ch/diane
• DIstributed ANalysis Environment
• Lightweight distributed framework for parallel scientific applications in the master-worker model
– A perfect match for the mpiBLAST computing model
• Current applications
– BLAST for Genomic Sequence Analysis (DIANE-BLAST)
– Geant4 Simulation for Radiotherapy and Astrophysics
– Image Rendering
– Data Analysis for High Energy Physics
DIANE Features
• Rapid prototyping
– Python and CORBA
• Error recovery
– Heartbeat worker health check
– Resubmission of failed tasks
– User-defined error recovery method
• No need for outbound connectivity
– Proxy for workers with only private IPs
• Job submitters for
– Simple fork
– Condor, LSF, SGE, PBS
– GT2, LCG, gLite
[Figure: DIANE diagram; labels: Pull Model, Batch and Interactive, Distributed workers, planner, integrator]
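The heartbeat check and task resubmission listed above can be illustrated with a small in-memory sketch. It is not DIANE's API: the `Master` class, its method names, and the timeout value are all assumptions made for this example, and a real framework would exchange these messages over the network.

```python
import time
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple


class Master:
    """Hypothetical master that re-queues tasks held by silent workers."""

    def __init__(self, tasks: List[str], heartbeat_timeout: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.pending = deque(tasks)              # tasks waiting to be pulled
        self.running: Dict[str, str] = {}        # worker id -> task in progress
        self.last_seen: Dict[str, float] = {}    # worker id -> last heartbeat
        self.done: List[Tuple[str, str]] = []    # (task, result)
        self.timeout = heartbeat_timeout
        self.clock = clock

    def heartbeat(self, worker_id: str) -> None:
        self.last_seen[worker_id] = self.clock()

    def request_task(self, worker_id: str) -> Optional[str]:
        """Pull model: a worker asks for work whenever it is free."""
        self.heartbeat(worker_id)
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.running[worker_id] = task
        return task

    def report_result(self, worker_id: str, result: str) -> None:
        self.heartbeat(worker_id)
        task = self.running.pop(worker_id, None)
        if task is not None:                     # ignore results from reaped workers
            self.done.append((task, result))

    def reap_dead_workers(self) -> List[str]:
        """Resubmit tasks of workers whose heartbeat is older than the timeout."""
        now = self.clock()
        dead = [w for w, t in self.last_seen.items() if now - t > self.timeout]
        for worker_id in dead:
            del self.last_seen[worker_id]
            task = self.running.pop(worker_id, None)
            if task is not None:
                self.pending.append(task)        # another worker can pull it
        return dead
```

Unlike an MPI job, where one failed rank aborts the run, losing a worker here only delays the task it was holding.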
DIANE-BLAST implementation
• Splitting mpiBLAST-g2 into DIANE components
– Master (Planner and Integrator), Worker
• Wrapping each component with Python
– Hooking the core BLAST C libraries into Python with SWIG
• Implementing the DIANE GT2 job submitter
– For running workers on GT2-enabled clusters
• Reusing the databases deployed for mpiBLAST-g2
mpiBLAST-g2 vs. DIANE-BLAST: The Speedup
• Query
– Drosophila chromosome 4
– size: 1.2 Mbp
• DB
– Drosophila nucleotide sequence database
– size: 1170 sequences, 122 Mbp
– no. of fragments: 32
• Computing resource
– Available no. of CPUs: 12
– PIII 1.4 GHz
– 1 GByte memory
[Figure: speedup plots; panels: Speedup of mpiBLAST-g2, Speedup of DIANE-BLAST]
mpiBLAST-g2 vs. DIANE-BLAST The Worker Lifeline
DIANE-BLAST task dispatching
• Handled by DIANE’s task thread
• Due to the bugs in the current DIANE release
mpiBLAST-g2 task dispatching
• mpiBLAST-g2 task handling logic
mpiBLAST-g2 vs. DIANE-BLAST: Overall Comparisons
• mpiBLAST-g2
– Master-Worker model implemented using MPICH-g2 libraries
– Gridification efforts
  • Implementing database sharing with the GASSCOPY API
  • Recompilation with MPICH-g2 and GT2 libraries
– Error recovery
  • Needs a fault-tolerant MPI
– Cross-cluster computation
  • Requires outbound connectivity on each worker
– Performance/Throughput
  • In-cluster performance is as good as the original mpiBLAST
• DIANE-BLAST
– Pluggable application for the DIANE Master-Worker framework
– Gridification efforts
  • Through the gridified DIANE framework
– Error recovery
  • Task resubmission
  • Tracking the health of each worker
– Cross-cluster computation
  • Using a proxy for workers with private IPs
– Performance/Throughput
  • Performance can be tuned by controlling the job thread
Summary
• Two grid-enabled BLAST implementations (mpiBLAST-g2 and DIANE-BLAST) were introduced for efficiently handling BLAST jobs on the Grid
• Both implementations are based on the Master-Worker model for distributing BLAST jobs on the Grid
• mpiBLAST-g2 has good scalability and speedup in some cases
– It requires a fault-tolerant MPI implementation for error recovery
– In the unscalable cases, BioSeq fetching is the bottleneck
• DIANE-BLAST provides a flexible mechanism for error recovery
– Any master-worker workflow can be easily plugged into this framework
– The job thread control should be improved to achieve good performance and scalability
Thanks for your attention!!