1 Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari Nargess Memarsadeghi CMSC 838...
Date posted: 21-Dec-2015
Parallel EST Clustering
by Kalyanaraman, Aluru, and Kothari
Nargess Memarsadeghi
CMSC 838 Presentation
CMSC 838T – Presentation 2
Talk Overview
Overview of talk
• Motivation
• Background
• Techniques
• Evaluation
• Related work
• Observations
Motivation: EST Clustering
• Problem: EST clustering
  – Cluster fragments of cDNA
• Related to the ‘fragment assembly’ problem
  – Detecting overlapping fragments
• Overlaps can be computed:
  – Pairwise alignment algorithm
     – Dynamic programming
  – Alternative: approximate overlap detection algorithms
     – Dynamic programming
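As an illustration of overlap detection by dynamic programming, the following is a minimal sketch of a suffix-prefix (overlap) alignment score. It is not taken from the paper or from any of the tools named in this talk; the function name and the default scores (match = 1, mismatch = -1, gap = -1) are assumptions chosen for the example.

```python
def best_overlap_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score of aligning some suffix of `a` with some prefix of `b`.

    Scoring values are illustrative assumptions, not the paper's parameters.
    """
    n, m = len(a), len(b)
    # Row 0: no character of `a` consumed yet, so a prefix of `b` of
    # length j can only be aligned against j gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # cur[0] = 0: the alignment may start at any position in `a` for free.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = cur[j - 1] + gap   # b[j-1] aligned to a gap
            cur[j] = max(diag, up, left)
        prev = cur
    # The alignment must reach the end of `a` but may stop anywhere in `b`.
    return max(prev)
```

For example, best_overlap_score("GGGACGT", "ACGTTTT") scores the shared "ACGT" overlap. The table has len(a) × len(b) cells, which is why running this on every pair of ESTs is the expensive step.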
Motivation
• Common tools:
  – Take too long
     – Days for 100,000 ESTs
  – Run out of memory
• This paper: PaCE (Parallel Clustering of ESTs)
  – Efficient parallel EST clustering
  – Space-efficient algorithm
  – Reduce total work
  – Reduce run-time
Background: EST Clustering Tools
• Three traditional software packages, originally designed for fragment assembly:
  – TIGR Assembler
  – Phrap
  – CAP3
• One parallel software package:
  – UICLUSTER: assumes ESTs are from the 3’ end
EST Clustering Tools
• Basic approach
  – Find pairs of similar sequences
  – Align similar pairs
     – Dynamic programming
• Quality of EST clustering
  – Phrap: fastest
     – Avoids dynamic programming
     – Relies on approximation, lower quality
  – CAP: least # of erroneous clusters
EST Clustering Tools’ Performance
• With 50,000 maize ESTs, using a PC with dual Pentium 450 MHz processors and 512 MB RAM:
  – TIGR: ran out of memory
  – Phrap: 40 min
  – CAP: > 24 hours
• With 100,000 maize ESTs, all ran out of memory
  – CAP would require 4 days
Goal
• Space-efficient algorithm
  – Space requirement linear in the size of the input data set
• Reduce total work
  – Without sacrificing quality of clustering
• Reduce run-time and facilitate the clustering of large data sets
  – Through parallel processing
  – Scale memory with # of processors
Approach
• Expense: pairwise alignment (time + memory)
  – Promising pairs ≈ (# EST)^2
• Common string: |s| = w
  – Cost: if the common string has length |s| = l > w, then the same pair repeats l - w + 1 times
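The redundancy of fixed-length matching can be seen in a small sketch. This is an illustration only, not the paper's method: it builds a plain lookup table of all substrings of length w (a hypothetical helper, count_wmer_hits) and counts how many times each pair of sequences is reported. The two example sequences are made up so that they share exactly one common substring of length l = 5.

```python
from collections import defaultdict
from itertools import combinations

def count_wmer_hits(seqs, w):
    """Count how often each pair of sequences is reported by a w-mer index."""
    index = defaultdict(set)  # w-mer -> set of (sequence id, position)
    for sid, s in enumerate(seqs):
        for pos in range(len(s) - w + 1):
            index[s[pos:pos + w]].add((sid, pos))
    hits = defaultdict(int)   # (smaller id, larger id) -> number of reports
    for occurrences in index.values():
        for (a, _), (b, _) in combinations(sorted(occurrences), 2):
            if a != b:        # ignore repeats inside a single sequence
                hits[(a, b)] += 1
    return dict(hits)

# Made-up sequences sharing only "ACGTA" (l = 5); with w = 3 the pair
# is reported l - w + 1 = 3 times.
example_hits = count_wmer_hits(["TTACGTATT", "GGACGTAGG"], 3)
```

Every one of those repeated reports would trigger the same expensive alignment unless duplicates are filtered, which motivates generating pairs from maximal common substrings instead.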
Approach (Cont ..)
• Approach:
  – Use trie structure
  – Identify promising pairs
  – Merge clusters with strong overlaps
  – Avoid storing/testing all similar pairs
• Parallel EST clustering software:
  – Generalized Suffix Tree (GST)
  – Multiple processors:
     – One maintains and updates the EST clusters
     – Others generate batches of promising pairs and perform alignment
Approach (Cont …)
Tries
1) Index for each char
2) N leaves
3) Height N
Suffix Tries (Cont ..)
1) TRIM suffix trie
Suffix Tries (Cont ..)
1) Indices
2) Storage O(n), constant is high though
3) Common string
4) Longest common substring
Suffix Tries (Cont ..)
[Figure: suffix tree with edges labeled ab, b, ab$, and $, and leaves numbered 1 to 5]
Given a pattern P = ab we traverse the tree according to the pattern.
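The traversal can be sketched with a naive, uncompacted suffix trie. This is a teaching sketch rather than the compacted structure used in the paper; the function names and the example text "abab$" are assumptions for illustration, and positions are 0-based (so they are one less than 1-based leaf numbers).

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts; `text` should end in a unique '$'."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        # Multi-character key, so it cannot collide with a single character.
        node["leaf"] = start
    return root

def find_occurrences(trie, pattern):
    """Walk down the trie along `pattern`, then collect every leaf below."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []          # pattern does not occur in the text
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

trie = build_suffix_trie("abab$")
positions = find_occurrences(trie, "ab")  # start positions of "ab"
```

Matching the pattern costs time proportional to its length, and every leaf under the node reached is one occurrence. This naive trie needs quadratic space in the worst case, which is what compaction avoids.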
Parallel Generation of GST
• GST: Generalized Suffix Tree
  – Compacted trie
  – Longest common prefix found in constant time
  – Used for on-demand pair generation
• Sequential: O(nl)
• Parallel: O(nl/p)
Parallel Generation of GST (Cont …)
• Previous implementations:
  – CRCW/CREW PRAM model
  – Work-optimal
  – Involves alphabetical ordering of characters
  – Unrealistic assumptions:
     – Synchronous operation of processors
     – Infinite network bandwidth
     – No memory contention
  – Not practically efficient
Parallel Generation of GST (Cont …)
• Paper’s approach:
  – ESTs equally distributed among processors
  – Each processor partitions the suffixes of its ESTs into |Σ|^w buckets
  – Distribute buckets to the processors:
     – All suffixes in a bucket are allocated to the same processor
     – Total # of suffixes allocated to a processor ≈ O(nl/p)
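A sequential sketch of the bucketing idea follows. The slides do not say how buckets are assigned to processors, so the greedy "largest bucket to the least-loaded processor" rule here is an assumption, as are the function names. Suffixes are bucketed by their first w characters, and suffixes shorter than w are ignored for simplicity.

```python
from collections import defaultdict
import heapq

def bucket_suffixes(ests, w):
    """Group every suffix of length >= w by its first w characters."""
    buckets = defaultdict(list)  # w-character prefix -> [(est id, start)]
    for est_id, s in enumerate(ests):
        for pos in range(len(s) - w + 1):
            buckets[s[pos:pos + w]].append((est_id, pos))
    return buckets

def assign_buckets(buckets, p):
    """Greedy balancing (an assumed policy): give the largest remaining
    bucket to the processor that currently holds the fewest suffixes."""
    heap = [(0, proc) for proc in range(p)]  # (current load, processor id)
    heapq.heapify(heap)
    owner = {}
    for prefix in sorted(buckets, key=lambda k: (-len(buckets[k]), k)):
        load, proc = heapq.heappop(heap)
        owner[prefix] = proc
        heapq.heappush(heap, (load + len(buckets[prefix]), proc))
    return owner

example_buckets = bucket_suffixes(["ACGTAC", "CGTACG"], 2)
example_owner = assign_buckets(example_buckets, 2)
```

Because a whole bucket goes to a single processor, each processor can build the subtree for its buckets without consulting the others.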
Parallel Generation of GST (Cont …)
• Each bucket’s processor:
  – Computes the compacted trie of all its suffixes
  – Cannot use sequential construction
     – Suffixes of a string are not all in the same bucket
• Each bucket: a subtree in the GST
• Nodes:
  – Depth-first search traversal of the trie
  – Pointer to the rightmost child
On-demand Pair Generation
• A pair should be generated if the two ESTs:
  – Share a substring of length ≥ threshold
  – Maximal
  – Leaves in a common node
     – Share a substring of length = depth of node
• Parallel algorithm
  – Each processor works with its trie if
     – Depth of its root in GST < threshold
On-demand Pair Generation
• To process:
  – Sort internal nodes
     – Decreasing order of depth
• Lists of a node:
  – Generated after the node is processed
  – Removed after its parent is processed
  – Limits space to O(nl)
• Run time ≈ # of pairs generated + cost of sorting
• Rejected pairs increase run-time by a factor of 2
• Eliminating duplicates reduces run-time
Parallel Clustering
• Master-slave paradigm
• Master processor:
  – Maintains and updates clusters
     – Using a union-find data structure
  – Receives messages from slave processors:
     – A batch of next promising pairs generated by the slave
     – Results of the pairwise alignment
  – Determines which ones to explore
  – Determines if merging should occur
• Slave processors:
  – Generate pairs on demand
  – Perform pairwise alignments of pairs dispatched by the master processor
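The master's bookkeeping can be sketched with a standard union-find structure. This is a single-process illustration with no message passing; the helper names and the min_score acceptance rule are hypothetical and stand in for whatever merge criterion the real software applies.

```python
class UnionFind:
    """Disjoint sets over EST ids 0..n-1, with union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False          # already in the same cluster
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True

def select_pairs_to_align(uf, batch):
    """Keep only pairs whose ESTs are still in different clusters."""
    return [(a, b) for a, b in batch if uf.find(a) != uf.find(b)]

def apply_alignment_results(uf, results, min_score):
    """Merge clusters for pairs scoring at least `min_score` (assumed rule)."""
    for a, b, score in results:
        if score >= min_score:
            uf.union(a, b)

clusters = UnionFind(4)
apply_alignment_results(clusters, [(0, 1, 10), (2, 3, 2)], min_score=5)
to_align = select_pairs_to_align(clusters, [(0, 1), (1, 2)])
```

Filtering a batch against the current clusters is how the master avoids dispatching alignments whose outcome could no longer change the clustering.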
Parallel Clustering (Cont…)
Organization of Parallel Clustering Software
[Figure: one master processor (Master P) exchanging messages with several slave processors (Slave P)]
• Batch of promising pairs generated + results of pairwise alignment
• Batchsize or fewer # of pairs + results of pairwise alignment on each pair
Parallel Clustering (Cont..)
• To start: Slave P starts with 3 × batchsize pairs
  – Sends the 3rd batch to Master P
  – Starts alignment on the 1st batch
  – Sends results on the 1st batch + a newly generated batch
  – While waiting to receive results from Master P, aligns the 2nd batch
• A processor always has the next batch to work on between:
  – Submitting the results of the previous batch
  – Receiving another set of pairs
Parallel Clustering (Cont..)
• Improve and control quality
• Parameters:
  – Match and mismatch scores
  – Gap penalties
• Post-processing:
  – Detection of alternative splicing
  – Consulting protein databases
  – Organism-specific
Experimental environment
• Used C and MPI
• Tested:
  – Quality of software: Arabidopsis thaliana (due to availability of its genome)
  – Run-time behavior: 50,000 maize ESTs with a 32-processor IBM SP
     – # of processors
     – Data size
     – (# of promising pairs) vs. data size
     – Batchsize vs. (# of processors)
     – # of clusters
     – Master processor’s time
Quality Assessment
• To assess quality:
  – A data set and its correct clustering
  – ESTs from the plant Arabidopsis thaliana
• Splice program:
  – Align ESTs to the genome
  – Discard ESTs that:
     – Don’t align
     – Align in multiple spots
Quality Assessment (Cont …)
• False negative: a pair in the correct clustering is not paired in the output
  – 5%
• False positive: a pair not in the correct clustering appears in the results
  – Negligible (< 0.04%)
  – Due to the conservative nature of the algorithm
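The two pair-based error types can be counted directly from two clusterings. The sketch below is an illustration with made-up cluster contents and hypothetical function names; it returns raw counts only, because the slides do not state which denominator the reported percentages use.

```python
from itertools import combinations

def co_clustered_pairs(clusters):
    """All unordered pairs of items that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs

def pair_errors(correct, predicted):
    """Count pair-level false negatives and false positives."""
    truth = co_clustered_pairs(correct)
    output = co_clustered_pairs(predicted)
    return {
        "false_negatives": len(truth - output),   # in truth, missing in output
        "false_positives": len(output - truth),   # in output, not in truth
        "correct_pairs": len(truth),
        "output_pairs": len(output),
    }

# Made-up example: item 3 is placed in the wrong cluster.
errors = pair_errors(correct=[[1, 2, 3], [4, 5]], predicted=[[1, 2], [3, 4, 5]])
```

Here the misplaced item costs two false negatives, (1, 3) and (2, 3), and two false positives, (3, 4) and (3, 5).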
Quality Assessment
Cluster results   Number of singleton clusters   Number of non-singleton clusters
Benchmark         10,803                         18,727
CAP3              17,930                         17,556
PaCE              14,802                         19,536
Distribution of the number of singleton and non-singleton clusters for the benchmark set of 168,200 Arabidopsis ESTs.
Quality Assessment (Cont..)
Run-time Assessment
• Experiment with 50,000 maize ESTs:
  – 32-processor IBM SP-2
  – 16 minutes
Run-time Assessment (Cont …)
p     Preprocessing   Clustering   Total
4     273             102          375
8     119             50           169
16    61              26           87
32    38              15           53
64    29              10           39
Run-time (in seconds) spent in the various components of PaCE for 20,000 ESTs; p is the number of processors.
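The scaling in the table can be summarized as relative speedup and efficiency. Since no single-processor time is reported, the sketch below measures both against the 4-processor run, which is a choice made here rather than one from the paper; the function names are likewise illustrative.

```python
# Total run-times in seconds for 20,000 ESTs, copied from the table.
total_seconds = {4: 375, 8: 169, 16: 87, 32: 53, 64: 39}

def relative_speedup(times, base_p):
    """Speedup of each run relative to the run on `base_p` processors."""
    base = times[base_p]
    return {p: base / t for p, t in times.items()}

def relative_efficiency(times, base_p):
    """Relative speedup divided by the increase in processor count."""
    speedup = relative_speedup(times, base_p)
    return {p: speedup[p] / (p / base_p) for p in times}

speedup = relative_speedup(total_seconds, 4)
efficiency = relative_efficiency(total_seconds, 4)
```

Going from 4 to 64 processors is a 16-fold increase in processors for roughly a 9.6-fold reduction in total time, so relative efficiency at p = 64 is about 0.6.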
Run-time Assessment (Cont ..)
• Run-time as a function of batchsize
  – Small batchsize:
     – Increase in communication overhead
  – Large batchsize:
     – Slaves less responsive to the need of generating pairs
     – Slave does not use the latest clustering results
  – Optimal batchsize:
     – Determined by experiment
• Master processor’s time
  – Fixed batchsize, increase in # of processors:
     – Gradual increase in Master P’s time
     – With 32 processors, increase < 1%
     – Using 1 master processor is not a bottleneck
Results
• Space: linear in the size of the input data set
• Reduced total work without sacrificing quality
• Reduced run-time
  – Parallel processors
  – Eliminating pairs
• Facilitates clustering
  – Scale memory with # of processors
Observations
• PaCE:
  – Approaches the EST clustering problem directly
  – Better than:
     – CAP3
     – Phrap
     – TIGR Assembler
• Compare time/quality:
  – TIGICL (TIGR Indices Clustering Tool)
     – Support for PVM
  – MegaBlast
  – STACK
• Large data sets
  – Lots of processors
• Can clustering time be improved?
  – Clustering algorithm
References
http://www.cs.berkeley.edu/~kubitron/courses/cs258-S02/lectures/eval10-logp.pdf
A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber, and U. Vishkin. Parallel construction of a suffix tree with applications. Algorithmica, 3:347–365, 1988.