1 Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari Nargess Memarsadeghi CMSC 838...


Parallel EST Clustering by

Kalyanaraman, Aluru, and Kothari

Nargess Memarsadeghi, CMSC 838 Presentation

http://www.cs.umd.edu/~tseng/chauwenben.jpg

CMSC 838T – Presentation 2

Talk Overview

Overview of talk:
• Motivation
• Background
• Techniques
• Evaluation
• Related work
• Observations


Motivation: EST Clustering

Problem: EST clustering
• Cluster fragments of cDNA
• Related to the ‘fragment assembly’ problem: detecting overlapping fragments

Overlaps can be computed with:
• Pairwise alignment algorithm
• Dynamic programming

Alternative:
• Approximate overlap detection algorithms
• Dynamic programming
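As an illustration of overlap detection by dynamic programming, here is a minimal sketch of a local alignment score (Smith-Waterman with a linear gap penalty). The function name and the scoring values (match=2, mismatch=-1, gap=-2) are assumptions chosen for the example, not parameters taken from the paper.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Smith-Waterman recurrence with a linear gap penalty; only two rows
    of the table are kept, so memory is O(len(b)).
    The scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                   # start a new local alignment
                          prev[j - 1] + step,  # align a[i-1] with b[j-1]
                          prev[j] + gap,       # gap in b
                          curr[j - 1] + gap)   # gap in a
            best = max(best, curr[j])
        prev = curr
    return best
```

The running time is proportional to len(a) × len(b) per pair, which is why aligning every pair of ESTs becomes expensive for large data sets.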


Motivation

Common tools:
• Take too long: days for 100,000 ESTs
• Run out of memory

This paper: PaCE (Parallel Clustering of ESTs)
• Efficient parallel EST clustering
• Space-efficient algorithm
• Reduce total work
• Reduce run-time


Background: EST Clustering Tools

Three traditional software tools, originally designed for fragment assembly:
• TIGR Assembler
• Phrap
• CAP3

One parallel software tool:
• UICLUSTER: assumes ESTs come from the 3’ end


EST Clustering Tools

Basic approach:
• Find pairs of similar sequences
• Align similar pairs using dynamic programming

Quality of EST clustering:
• Phrap: fastest; avoids dynamic programming, relies on approximation, lower quality
• CAP: least # of erroneous clusters


EST Clustering Tools’ Performance

With 50,000 maize ESTs, using a PC with dual Pentium 450 MHz processors and 512 MB RAM:
• TIGR: ran out of memory
• Phrap: 40 min
• CAP: > 24 hours

With 100,000 maize ESTs:
• All ran out of memory
• CAP would require 4 days


Goal

• Space-efficient algorithm: space requirement linear in the size of the input data set
• Reduce total work without sacrificing quality of clustering
• Reduce run-time and facilitate the clustering of large data sets:
  – Through parallel processing
  – Scale memory with # of processors


Approach

Expense: pairwise alignment (time + memory)
• Promising pairs ≈ $(\#\,\mathrm{ESTs})^2$

Common string: $|s| = w$
• Cost: if the common string has length $|s| = l > w$, then the match repeats $l - w + 1$ times
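The repeat count can be checked with a small sketch: if two sequences share a common string of length l and matches are detected on fixed windows of length w, the same overlap is reported l - w + 1 times. The function name and the two example sequences are hypothetical.

```python
from collections import defaultdict


def count_wmer_matches(s, t, w):
    """Count position pairs (i, j) with s[i:i+w] == t[j:j+w]."""
    positions = defaultdict(list)
    for i in range(len(s) - w + 1):
        positions[s[i:i + w]].append(i)
    # .get avoids creating empty entries for windows that occur only in t
    return sum(len(positions.get(t[j:j + w], ()))
               for j in range(len(t) - w + 1))


# The two sequences share the string "ACGTTCA" (l = 7).
# With w = 4 the overlap is detected 7 - 4 + 1 = 4 times.
matches = count_wmer_matches("GGACGTTCAGG", "TTACGTTCATT", 4)
```

Reporting each pair once, on its maximal common substring, avoids this redundant work.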


Approach (Cont ..)

Approach:
• Use trie structure
• Identify promising pairs
• Merge clusters with strong overlaps
• Avoid storing/testing all similar pairs

Parallel EST clustering software:
• Generalized Suffix Tree (GST)
• Multiple processors:
  – One maintains and updates EST clusters
  – Others generate batches of promising pairs and perform alignment


Approach (Cont …)


Tries

1) Index for each char
2) N leaves
3) Height N


Suffix Tries (Cont ..)

1) TRIM suffix trie



1) Indices
2) Storage O(n), though the constant is high
3) Common string
4) Longest common substring



[Figure: suffix tree with edge labels including ab, ab$, b, and $, and leaves numbered 1 to 5]

Given a pattern P = ab we traverse the tree according to the pattern.
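The traversal can be sketched with an uncompacted suffix trie stored as nested dictionaries. The input string "abab", the "leaf" key, and both function names are assumptions made for illustration; a real suffix tree would compact single-child paths to reach O(n) storage.

```python
def build_suffix_trie(text):
    """Uncompacted suffix trie of text + '$' as nested dicts.

    Each leaf stores the 1-based start position of its suffix under the
    key "leaf" (a hypothetical convention; it cannot collide with the
    single-character edge keys).
    """
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def find_occurrences(trie, pattern):
    """Follow pattern from the root, then collect all leaves below."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("abab")
hits = find_occurrences(trie, "ab")  # start positions of "ab" in "abab"
```

Matching the pattern costs time proportional to its length; the leaves under the node where the traversal ends give every position where the pattern occurs.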


Parallel Generation of GST

GST: Generalized Suffix Tree
• Compacted trie
• Longest common prefix found in constant time
• Used for on-demand pair generation
• Sequential: O(nl)
• Parallel: O(nl/p)


Parallel Generation of GST (Cont …)

Previous implementations:
• CRCW/CREW PRAM model
• Work-optimal
• Involve alphabetical ordering of characters
• Unrealistic assumptions:
  – Synchronous operation of processors
  – Infinite network bandwidth
  – No memory contention
• Not practically efficient



Paper’s approach:
• ESTs equally distributed among processors
• Each processor partitions the suffixes of its ESTs into $|\Sigma|^w$ buckets
• Buckets are distributed to the processors:
  – All suffixes in a bucket are allocated to the same processor
  – Total # of suffixes allocated to a processor ≈ $O(nl/p)$



Each bucket’s processor:
• Computes the compacted trie of all its suffixes
• Cannot use sequential construction: the suffixes of a string are not all in the same bucket

Each bucket:
• Subtree in the GST

Nodes:
• Depth-first search traversal of the trie
• Pointer to the rightmost child
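The bucketing step can be sketched as follows, assuming buckets are keyed by the first w characters of each suffix. The greedy assignment of buckets to processors is an assumption made for this example and is not claimed to be the paper's load-balancing rule; both function names are hypothetical.

```python
from collections import defaultdict


def bucket_suffixes(ests, w):
    """Group suffixes, identified as (est_index, offset), by their first
    w characters. Suffixes shorter than w are skipped in this sketch."""
    buckets = defaultdict(list)
    for index, est in enumerate(ests):
        for offset in range(len(est) - w + 1):
            buckets[est[offset:offset + w]].append((index, offset))
    return buckets


def assign_buckets(buckets, p):
    """Greedy assignment (an assumption): the largest remaining bucket
    goes to the currently least-loaded of the p processors."""
    loads = [0] * p
    owner = {}
    for key in sorted(buckets, key=lambda k: (-len(buckets[k]), k)):
        proc = min(range(p), key=lambda r: loads[r])
        owner[key] = proc
        loads[proc] += len(buckets[key])
    return owner, loads


buckets = bucket_suffixes(["ACGTAC", "CGTACG"], w=2)
owner, loads = assign_buckets(buckets, p=2)
```

Because every suffix in a bucket shares the same first w characters, each bucket corresponds to one subtree of the GST and can be built independently by its owner.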


On-demand Pair Generation

A pair should be generated if the two ESTs:
• Share a substring of length ≥ threshold
• The shared substring is maximal
• Are leaves in a common node: share a substring of length = depth of node

Parallel algorithm:
• Each processor works with its trie if the depth of its root in the GST < threshold


On-demand Pair Generation

To process:
• Sort internal nodes in decreasing order of depth

Lists of a node:
• Generated after the node is processed
• Removed after its parent is processed
• Limits space to O(nl)

Run time ≈ # of pairs generated + cost of sorting
• Rejected pairs increase run-time by a factor of 2
• Eliminating duplicates reduces run-time
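The selection criterion and the decreasing order can be illustrated with a brute-force stand-in. This sketch does not use a suffix tree: it computes the longest common substring of every pair directly, which is exactly the quadratic work the tree-based generator is meant to avoid. All names and the example sequences are hypothetical.

```python
from itertools import combinations


def longest_common_substring(a, b):
    """Length of the longest common substring of a and b (simple DP)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def promising_pairs(ests, threshold):
    """Yield (i, j, length) for pairs sharing a substring of at least
    threshold characters, longest shared substring first."""
    scored = []
    for i, j in combinations(range(len(ests)), 2):
        length = longest_common_substring(ests[i], ests[j])
        if length >= threshold:
            scored.append((length, i, j))
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    for length, i, j in scored:
        yield i, j, length


ests = ["AAACGTACGG", "TTACGTACTT", "CCCCGTAGGG"]
pairs = list(promising_pairs(ests, threshold=4))
```

Emitting pairs with the longest shared substrings first means the pairs most likely to overlap strongly are aligned earliest.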


Parallel Clustering

Master-slave paradigm

Master processor:
• Maintains and updates clusters using a union-find data structure
• Receives messages from slave processors:
  – A batch of the next promising pairs generated by the slave
  – Results of the pairwise alignment
• Determines which ones to explore
• Determines if merging should occur

Slave processors:
• Generate pairs on demand
• Perform pairwise alignments of pairs dispatched by the master processor
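The master's bookkeeping can be sketched sequentially with a union-find structure: a promising pair whose ESTs are already in the same cluster is not explored, so no alignment is spent on it. The `overlaps_strongly` callback stands in for the pairwise alignment done by the slaves; it and the other names are hypothetical, and no message passing is modeled.

```python
class UnionFind:
    """Disjoint sets with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return False
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]
        return True


def cluster(n, promising_pairs, overlaps_strongly):
    """Merge clusters for pairs that align well; skip pairs that are
    already clustered together. Returns the structure and the number
    of alignments actually performed."""
    clusters = UnionFind(n)
    alignments = 0
    for i, j in promising_pairs:
        if clusters.find(i) == clusters.find(j):
            continue  # already in the same cluster: no alignment needed
        alignments += 1
        if overlaps_strongly(i, j):
            clusters.union(i, j)
    return clusters, alignments


clusters, alignments = cluster(
    4,
    [(0, 1), (1, 2), (0, 2), (2, 3)],
    lambda i, j: (i, j) != (2, 3),  # hypothetical alignment outcome
)
```

In this run the pair (0, 2) is skipped because ESTs 0 and 2 were already merged through EST 1, so three alignments are performed for four promising pairs.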


Parallel Clustering (Cont…)

Organization of Parallel Clustering Software

[Figure: one master processor (Master P) exchanging messages with several slave processors (Slave P)]

• Batch of promising pairs generated + results of pairwise alignment

• Batchsize or fewer # of pairs + results of pairwise alignment on each pair


Parallel Clustering (Cont..)

To start, Slave P begins with 3 × batchsize pairs:
• Sends the 3rd batch to Master P
• Starts alignment on the 1st batch
• Sends results on the 1st batch + a newly generated batch
• While waiting to receive results from Master P, aligns the 2nd batch

A processor always has the next batch to work on between:
– Submitting the results of the previous batch
– Receiving another set of pairs


Parallel Clustering (Cont..)

Improve and control quality

Parameters:
• Match and mismatch scores
• Gap penalties

Post-processing:
• Detection of alternative splicing
• Consulting protein databases
• Organism-specific


Experimental environment

Used C and MPI

Tested:
• Quality of software: Arabidopsis thaliana (due to availability of its genome)
• Run-time behavior: 50,000 maize ESTs with a 32-processor IBM SP
  – # of processors
  – Data size
  – (# of promising pairs) vs data size
  – Batchsize vs (# of processors)
  – # of clusters
  – Master processor’s time


Quality Assessment

To assess quality: a data set and its correct clustering
• ESTs from the plant Arabidopsis thaliana
• Splice program: align ESTs to the genome
• Discard ESTs that:
  – Don’t align
  – Align in multiple spots


Quality Assessment (Cont …)

False negative: a pair in the correct clustering is not paired in the output
• 5%

False positive: a pair not in the correct clustering appears in the results
• Negligible (< 0.04%)
• Due to the conservative nature of the algorithm
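These pair-based definitions can be sketched as below. The denominators (false negatives over pairs in the correct clustering, false positives over pairs in the output) are an assumption, since the slide does not state them; the function names and the toy clusterings are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Set of unordered pairs of items that share a cluster."""
    pairs = set()
    for members in clusters:
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pair_error_rates(correct, predicted):
    """Return (false_negative_rate, false_positive_rate).

    Assumed denominators: pairs in the correct clustering for false
    negatives, pairs in the predicted clustering for false positives.
    """
    truth = co_clustered_pairs(correct)
    output = co_clustered_pairs(predicted)
    fn_rate = len(truth - output) / len(truth) if truth else 0.0
    fp_rate = len(output - truth) / len(output) if output else 0.0
    return fn_rate, fp_rate


fn_rate, fp_rate = pair_error_rates(
    correct=[{1, 2, 3}, {4, 5}],
    predicted=[{1, 2}, {3, 4, 5}],
)
```

A conservative algorithm that merges only on strong evidence tends to split true clusters (false negatives) rather than join unrelated ESTs (false positives), which matches the pattern reported above.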


Quality Assessment

Cluster results    Singleton clusters    Non-singleton clusters
Benchmark          10,803                18,727
CAP3               17,930                17,556
PaCE               14,802                19,536

Distribution of the number of singleton and non-singleton clusters for the benchmark set of 168,200 Arabidopsis ESTs.


Quality Assessment (Cont..)


Run-time Assessment

Experiment with 50,000 maize ESTs:
• 32-processor IBM SP-2
• 16 minutes


Run-time Assessment (Cont …)

p     Preprocessing    Clustering    Total
4     273              102           375
8     119              50            169
16    61               26            87
32    38               15            53
64    29               10            39

Run-time (in seconds) spent in various components of PaCE for 20,000 ESTs. p, number of processors.
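The scaling in the table can be summarized as relative speedup and efficiency. This sketch uses the Total column and takes the smallest processor count (p = 4) as the baseline, since no single-processor time is listed; the function name is hypothetical.

```python
# Total run-time in seconds for 20,000 ESTs, keyed by processor count.
total_seconds = {4: 375, 8: 169, 16: 87, 32: 53, 64: 39}


def relative_speedup(timings):
    """Rows of (p, speedup, efficiency) relative to the smallest p.

    speedup    = T(base) / T(p)
    efficiency = speedup / (p / base)
    """
    base = min(timings)
    rows = []
    for p in sorted(timings):
        speedup = timings[base] / timings[p]
        efficiency = speedup / (p / base)
        rows.append((p, round(speedup, 2), round(efficiency, 2)))
    return rows


rows = relative_speedup(total_seconds)
```

Going from 4 to 64 processors cuts the total time from 375 s to 39 s, a relative speedup of about 9.6 for 16 times as many processors.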


Run-time Assessment (Cont ..)

Run-time as a function of batchsize:
• Small batchsize: increase in communication overhead
• Large batchsize:
  – Slaves less responsive to the need for generating pairs
  – Slave does not use the latest clustering results
• Optimal batchsize: determined by experiment

Master processor’s time (fixed batchsize, increasing # of processors):
• Gradual increase in Master P’s time
• With 32 processors, increase < 1%
• Using 1 master processor is not a bottleneck


Results

• Space linear in the size of the input data set
• Reduced total work without sacrificing quality
• Reduced run-time:
  – Parallel processors
  – Eliminating pairs
• Facilitate clustering: scale memory with # of processors


Observations

PaCE:
• Approaches the EST clustering problem directly
• Better than CAP3, Phrap, and TIGR Assembler

Compare time/quality with:
• TIGICL (TIGR Indices Clustering Tool): support for PVM, MegaBlast
• STACK

Large data sets, lots of processors:
• Can clustering time be improved?
• Clustering algorithm


References

http://www.cs.berkeley.edu/~kubitron/courses/cs258-S02/lectures/eval10-logp.pdf

A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber, and U. Vishkin. Parallel construction of a suffix tree with applications. Algorithmica, 3:347–365, 1988.


