

How Parallelism Is Used In Bioinformatics

Presented by: Laura L. Neureuter

April 9, 2001

Using: Three Complementary Approaches to Parallelization of Local BLAST Service on Workstation Clusters

- Braun, Pedretti, Casavant, Scheetz, Birkett, Roberts

Overview

1. A Unique Approach to Information Gathering

2. Types of Architecture Used

3. Software Packages Used

4. How Parallelism is used in BLAST

a. Background

b. Granularity

c. Sequence to Sequence Comparison

d. Parallelization of Single Query across Partitioned Database

e. Partitioned Set Of Queries Across Set of Servers

A Unique Approach to Information Gathering

ASK SOMEONE !!

Alejandro Shaffer Says…

Parallel Computing is Used to Analyze:

• Protein Sequence Data

• DNA Sequence Data

• Protein Structure Data

• Genetic Inheritance Data

- Among Others

Alejandro Shaffer Says…

Parallel Bioinformatics Computations are Run on the Following Architectures:

• Small shared memory multiprocessor

• Loosely coupled network of processors

Alejandro Shaffer Says…

The assembly of the human genome was done on a loosely coupled network of computers.

Alejandro Shaffer Says…

Two Software Packages Used by Bioinformaticists That Run On Parallel Computers:

• BLAST

• FASTLINK

BLAST

Analyzes Protein or DNA Sequences:

Takes input sequences and searches large databases for similar sequences.

FASTLINK

Used to hunt for the approximate chromosomal location of disease-causing genes.

- leaving this topic open for someone else to research.

BLAST

Basic Local Alignment Search Tool

• The most common Sequence Comparison tool.

BLAST

Three Parallel Components to BLAST

1. Sequence to Sequence Comparison Level

2. Parallelization of a single query across a distributed database

3. A set of queries is partitioned across a set of servers with either a replicated or partitioned database.

BLAST

At the time the paper was published (December 15, 1999), the only completed implementation was the third component:

Parallelizing Batch Requests

First – Some Background

“The basic nature of the entire process of gene discovery is highly parallel, heterogenous, and distributed.”

Background

At the time this paper was published, the mode used by 90% of researchers was to submit single queries comparing sequence data (300-600 characters) against one or more databases (GenBank).

Background

The paper predicted that once the human genome was finished, the frequency and intensity of inquiries against the data would increase exponentially.

We’ve all seen the graph (several times) that proves this is true.

Background

Problems:

1) Cluster of servers continues to diminish in its ability to serve the increasing number of requests.

2) Network traffic is becoming intolerable.

3) Database is growing at increasing rate.

4) Single queries are time consuming.

Refresher …

Granularity

Defined as the size of the computation between communication or synchronization points.

Coarse – Each process contains a large number of sequential instructions and takes a substantial time to execute.

Fine – Each process consists of a few instructions, or even just one.

Medium – Middle ground.

Refresher

Granularity

- Granularity is related to the number of processors being used.

Metric

Computation/Communication ratio = t_comp / t_comm

Important to maximize ratio while maintaining parallelism
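As a rough illustration of the metric above, the sketch below computes the ratio for two made-up processes. The function name and the timings are assumptions chosen for this example, not measurements from the paper.

```python
def comp_comm_ratio(t_comp: float, t_comm: float) -> float:
    """Return t_comp / t_comm for the work done between two synchronization points."""
    if t_comm <= 0:
        raise ValueError("t_comm must be positive")
    return t_comp / t_comm


# Hypothetical timings in seconds (illustrative only).
coarse = comp_comm_ratio(t_comp=120.0, t_comm=0.5)  # long computation, rare communication
fine = comp_comm_ratio(t_comp=0.002, t_comm=0.001)  # communication costs almost as much as computing
```

With these numbers the coarse-grained process has a ratio of 240 and the fine-grained one a ratio of 2, which is why coarse granularity tolerates a slow interconnect far better.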

Three Levels of Parallelism Exploitable in BLAST

Fine Grained

• Subject(s): 1 sequence

• Target(s): 1 sequence

• Parallelism: Multiple alignments on single sequence pairs

Medium Grained

• Subject(s): 1 sequence

• Target(s): M sequences (in database)

• Parallelism: Partition database; multiple targets examined at once

Coarse Grained

• Subject(s): N sequences (batch request)

• Target(s): M sequences (in database)

• Parallelism: Replicate database; partition input sets

BLAST

BLAST is a heuristic search algorithm

Heuristic:

• Process of elimination and compromise by using the “what if” theory.

• An educated guess that reduces or limits the search for solutions.

• A method of solving problems by intelligent trial and error.

BLAST

Five variations of BLAST

• blastn

• blastx

• tblastx

• blastp

• tblastn

BLAST

blastn

Compares a nucleotide sequence against a nucleotide database

(Relatively quick)

BLAST

blastx

Compares a nucleotide sequence against a protein database.

The nucleotide “subject” needs to be translated into a peptide sequence. Since there are 6 possible translations (one per reading frame), the basic BLAST algorithm must be applied 6 times.
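The six translations correspond to three codon offsets on each of the two DNA strands. The sketch below only extracts those six codon-aligned frames; it does not translate them to peptides, and the function names are assumptions made for this example rather than anything from BLAST itself.

```python
from typing import List

# Complement of each DNA base (uppercase A/C/G/T only; an assumption of this sketch).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def six_reading_frames(seq: str) -> List[str]:
    """Return the six codon-aligned substrings: offsets 0, 1, 2 on each strand."""
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            frame = strand[offset:]
            # Trim trailing bases that do not fill a complete codon.
            frames.append(frame[: len(frame) - len(frame) % 3])
    return frames


frames = six_reading_frames("ATGGCCTAA")
```

Each of the six frames would then be translated and searched separately, which is where the 6 applications of the basic algorithm come from.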

BLAST

tblastx

Compares a nucleotide sequence to a nucleotide database, but each is first translated (in all 6 reading frames) into a peptide sequence before blasting.

This is the most computationally intensive BLAST algorithm: with 6 frames on each side, it must be invoked 36 times for each sequence to sequence comparison.

BLAST

blastp

Compares a peptide sequence to a peptide database

(Relatively quick)

BLAST

tblastn

Compares a peptide sequence against a nucleotide database

Requires 6 calls to BLAST
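The call counts quoted for the five variants follow from multiplying the number of frames on the query side by the number on the database side. A minimal sketch of that arithmetic, where the table and function name are assumptions made for this example:

```python
# (query frames, database frames) per variant, as described in the slides.
# 1 means the sequence is used as-is; 6 means all six reading frames are translated.
TRANSLATIONS = {
    "blastn": (1, 1),
    "blastp": (1, 1),
    "blastx": (6, 1),
    "tblastn": (1, 6),
    "tblastx": (6, 6),
}


def base_blast_calls(variant: str) -> int:
    """Return how many times the basic algorithm runs per sequence-to-sequence comparison."""
    query_frames, db_frames = TRANSLATIONS[variant]
    return query_frames * db_frames


calls = {variant: base_blast_calls(variant) for variant in TRANSLATIONS}
```

This gives 1 call for blastn and blastp, 6 for blastx and tblastn, and 36 for tblastx.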

BLAST

Benefits of Parallelizing Local BLAST

• Reduces processing time in relation to number of compute nodes utilized.

• Reduces costs by utilizing commodity workstations and PCs.

• A locally-scheduled parallel algorithm allows prioritization and control over individual searches.

Types Of Parallelism

I. Pairwise Multiple Alignment

A fancy term for the earlier description of the variations on the BLAST algorithm. Since the comparisons are mutually independent, the parallelization of the comparisons is potentially very efficient.

Of greatest importance would be a high-speed, low-latency interconnection network to allow rapid selection and scoring of the best possible alignment.

Effective implementation would greatly benefit from specialized hardware.
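Because the frame-by-frame comparisons are independent, they can be scored concurrently and the best one selected afterwards. The sketch below shows only that structure: `toy_score` is a placeholder invented for this example (it is not BLAST's scoring), and a thread pool stands in for the specialized hardware the paper has in mind.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_score(pair):
    """Placeholder score: number of identical positions (not BLAST's scoring)."""
    a, b = pair
    return sum(x == y for x, y in zip(a, b))


def best_alignment(query_frames, target_frames, workers=4):
    """Score every frame pair concurrently and return (score, query, target)."""
    pairs = [(q, t) for q in query_frames for t in target_frames]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(toy_score, pairs))  # results come back in input order
    best = max(range(len(pairs)), key=lambda i: scores[i])
    return scores[best], pairs[best][0], pairs[best][1]


score, query, target = best_alignment(["MAK", "WPR"], ["MAR", "QQQ"])
```

The final `max` is the selection step; in a real fine-grained implementation that step is where the low-latency interconnect matters, since every worker must report its score before a winner can be chosen.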

Types of Parallelism

II. Database Partitioning

Distributing chunks of the database across a collection of compute nodes.

Master node coordinates the scheduling of jobs and collates the results from each submission.
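A minimal sketch of this master/worker arrangement, assuming a round-robin split and a stand-in scoring function (both are choices made for this example, not details from the paper):

```python
def partition(database, n_nodes):
    """Split the database into n_nodes chunks, round-robin."""
    return [database[i::n_nodes] for i in range(n_nodes)]


def search_chunk(query, chunk):
    """Stand-in for a BLAST run on one node: score = number of identical positions."""
    return [(sum(a == b for a, b in zip(query, seq)), seq) for seq in chunk]


def master_search(query, database, n_nodes):
    """Master node: send the query to every chunk, then collate the hits."""
    hits = []
    for chunk in partition(database, n_nodes):  # each iteration would run on its own node
        hits.extend(search_chunk(query, chunk))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


database = ["ACGT", "ACGA", "TTTT", "ACCA", "GGGG"]
ranked = master_search("ACGT", database, n_nodes=2)
```

Here the loop runs sequentially for clarity; the point is that each chunk is searched without reference to the others and only the collation happens on the master.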

Types of Parallelism

III. Batch Mode

Scheduling sets of queries, while keeping full copies of the database stored on each compute node.

This type of parallelism is currently in place and being used.
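In contrast to database partitioning, batch mode splits the input rather than the data. A small sketch, assuming a round-robin deal of queries to nodes (the function name and the distribution policy are assumptions for illustration):

```python
def partition_queries(queries, n_nodes):
    """Deal the batch of queries out to the nodes, round-robin."""
    return {node: queries[node::n_nodes] for node in range(n_nodes)}


queries = [f"query_{i}" for i in range(7)]
assignment = partition_queries(queries, n_nodes=3)
# Every node searches its share against its own complete copy of the database,
# so no node needs results from any other node.
```

Because each node holds the full database, the per-query results are already final and need no cross-node merging.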

Batch Mode

The foundation of the local batch BLAST system is the Portable Batch System (PBS) developed for NASA.

PBS consists of three parts:

• The Job Server

• The Scheduler

• Compute Nodes

Batch Mode

The Job Server

Responsible for managing two queues of incoming jobs – one for batch blast jobs, the other for jobs interactively submitted to local BLAST through a web interface.

Batch Mode

The Scheduler

Applies job scheduling algorithm to allocate compute nodes to jobs in the two incoming job queues.

Some nodes have several CPUs and can handle more than one simultaneous blast job. The scheduler assigns multiple jobs to such nodes.
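The slot-based assignment described above might look like the following sketch. It is not the PBS scheduling algorithm: the function, the node names, and the choice to drain the interactive queue before the batch queue are all assumptions made for this example.

```python
from collections import deque


def schedule(interactive, batch, node_cpus):
    """Assign queued jobs to free CPU slots, taking interactive jobs first.

    node_cpus maps a node name to its CPU count; a node with several CPUs
    receives several simultaneous jobs.
    """
    interactive, batch = deque(interactive), deque(batch)
    assignment = {node: [] for node in node_cpus}
    for node, cpus in node_cpus.items():
        for _ in range(cpus):
            queue = interactive if interactive else batch
            if not queue:
                return assignment  # nothing left to place
            assignment[node].append(queue.popleft())
    return assignment


plan = schedule(["web1"], ["b1", "b2", "b3"], {"n1": 2, "n2": 1})
```

In this run the two-CPU node `n1` receives `web1` and `b1`, the single-CPU node `n2` receives `b2`, and `b3` stays queued until a slot frees up.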

Batch Mode

Compute Nodes

• Each node has a monitor that communicates with the job server.

• Each node has its own set of sequence databases.

Job Types

1) Batch jobs – can be executed at any time and restarted if necessary.

2) Interactive jobs – time critical and should have priority over batch jobs.

Job Types

At the time the paper was published, the current implementation was as follows:

• 75% of compute nodes execute batch jobs.

• 25% are always available for interactive web jobs.

• If there are no batch jobs, 100% of the nodes are available for web jobs. Neither type of job is starved of resources with this approach.
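The allocation rule can be written down as a small function. This is a sketch of the policy as stated on the slide, with assumptions labeled: the function name, the integer rounding of the 25% share, and the treatment of web jobs as capped at 25% whenever batch work is pending are choices made for this example.

```python
def nodes_for(job_type, total_nodes, batch_pending):
    """Return how many nodes a job type may use under the 75% / 25% split."""
    reserved_for_web = total_nodes // 4  # 25% always kept for interactive jobs
    if job_type == "batch":
        return total_nodes - reserved_for_web  # the remaining 75%
    if job_type == "web":
        # With no batch work queued, interactive jobs may use every node.
        return reserved_for_web if batch_pending else total_nodes
    raise ValueError(f"unknown job type: {job_type}")


batch_share = nodes_for("batch", total_nodes=16, batch_pending=True)
web_share = nodes_for("web", total_nodes=16, batch_pending=True)
web_idle = nodes_for("web", total_nodes=16, batch_pending=False)
```

On a hypothetical 16-node cluster this yields 12 nodes for batch jobs, 4 for web jobs while batch work is queued, and all 16 for web jobs otherwise.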

Issues with Batch Mode

• All replicated databases must be updated periodically to reflect the most recent contents of the globally shared database.

• All nodes' copies must be consistent with one another.

Otherwise, results of the query would depend on which compute node processed it.
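One way to detect inconsistent replicas is to compare a checksum of each node's copy against a reference copy. The slides do not say how consistency is enforced, so the approach, the function names, and the sample data below are assumptions made for illustration.

```python
import hashlib


def digest(database_bytes: bytes) -> str:
    """Return a SHA-256 fingerprint of one database copy."""
    return hashlib.sha256(database_bytes).hexdigest()


def stale_nodes(reference: bytes, replicas: dict) -> list:
    """Return the names of nodes whose copy differs from the reference copy."""
    expected = digest(reference)
    return sorted(node for node, data in replicas.items() if digest(data) != expected)


reference = b">seq1\nACGT\n>seq2\nTTGA\n"
replicas = {"node1": reference, "node2": b">seq1\nACGT\n", "node3": reference}
outdated = stale_nodes(reference, replicas)
```

Here `node2` is missing a sequence, so it is the only node reported and would need a refresh before accepting further queries.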

Considerations…

A Networked File System is being considered where there would be several I/O servers in a system, each with a complete copy of database. Compute nodes would rely on these I/O servers for access to database.

Next Step…

The partitioned database implementation will utilize many of the concepts developed for the coarse-grained implementation, but the scheduler would need to know which nodes had which section of the database.

Next Step…

Outputs would then need to be combined into a single output file.

This is Non-Trivial

Merge program must parse, sort, and correct data from nodes, and E values must be corrected to reflect larger database size.
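A sketch of the merge step, assuming the E value scales linearly with the size of the database searched (as it does in the standard form E = K·m·n·e^(−λS), where n is the database length). The data layout, function name, and sample numbers are hypothetical, and edge-effect length corrections are ignored.

```python
def merge_hits(partition_results, total_db_size):
    """Combine per-node hit lists and rescale E values to the full database.

    partition_results: list of (partition_size, hits), where hits is a list of
    (subject_id, e_value) pairs computed against that partition only.
    Sizes are assumed to be in the same unit (e.g. residues) throughout.
    """
    merged = []
    for partition_size, hits in partition_results:
        scale = total_db_size / partition_size
        merged.extend((subject, e_value * scale) for subject, e_value in hits)
    return sorted(merged, key=lambda hit: hit[1])  # smallest (most significant) E first


results = [
    (1000, [("seqA", 0.001), ("seqB", 0.5)]),
    (3000, [("seqC", 0.03)]),
]
merged = merge_hits(results, total_db_size=4000)
```

A hit found in a partition one quarter the size of the whole database has its E value multiplied by 4, so `seqA` moves from 0.001 to 0.004 before the combined list is sorted.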

Questions ???

One of My Own

Since this paper was published in 1999, have all three levels of parallelism described here been exploited by now?

- haven’t found the answer.