Faculty of Computer Science, Dalhousie University, Canada; Andrew Rau-Chaplin, www.cs.dal.ca/~arc
Parallel Computational Biochemistry
Proteins, DNA, etc.
• DNA encodes the information necessary to produce proteins
• Proteins are the main molecular building blocks of life (for example, structural proteins, enzymes)
• Proteins are formed from a chain of molecules called amino acids
• The DNA sequence encodes the amino acid sequence that constitutes the protein
• There are twenty amino acids found in proteins, denoted by A, C, D, E, F, G, H, I, ...
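As a small illustration of how a DNA sequence encodes an amino acid sequence, the sketch below translates a made-up DNA string three letters (one codon) at a time. The codon table is deliberately partial: it holds only the standard-code codons needed for this example, and the input string is hypothetical.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "ATG": "M",   # methionine, also the usual start codon
    "GCT": "A",
    "TGC": "C",
    "GAT": "D",
    "GAA": "E",
    "TTT": "F",
    "GGC": "G",
    "CAT": "H",
    "ATT": "I",
    "TAA": None,  # stop codon
}


def translate(dna):
    """Translate a DNA string codon by codon until a stop codon is reached."""
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid is None:  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical input, chosen so the result spells the one-letter codes listed above.
print(translate("ATGGCTTGCGATGAATTTGGCCATATTTAA"))  # prints MACDEFGHI
```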
Multiple Sequence Alignment
Databases of Biological Sequences
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSGDLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDESKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYHWPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDEYSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGIKSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITRGNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVSLAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPYYLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNTKRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
NCBI: 14,976,310 sequences; 15,849,921,438 nucleotides
Swiss-Prot: 104,559 sequences; 38,460,707 residues
PDB: 17,175 structures
Sequence comparison
• Compare one sequence (target) to many sequences (database search)
• Compare more than two sequences simultaneously
Applications
• Phylogenetic analysis
• Identification of conserved motifs and domains
• Structure prediction
Phylogenetic Analysis
Structure Prediction
Genomic sequences
> RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSGDLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDESKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYHWPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDEYSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGIKSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITRGNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVSLAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPYYLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNTKRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
Protein sequences
Protein structures
Clustal W
Progressive Alignment
Pairwise distance matrix:
                   [1]    [2]    [3]    [4]
S. cerevisiae [1]
C. elegans    [2]  0.640
Drosophila    [3]  0.634  0.327
Human         [4]  0.630  0.408  0.420
Mouse         [5]  0.619  0.405  0.469  0.289
Guide tree leaves: S. cerevisiae, C. elegans, Drosophila, Mouse, Human
1. Do pairwise alignment of all sequences and calculate the distance matrix
2. Create a guide tree based on this pairwise distance matrix
3. Align progressively following the guide tree
   • start by aligning the most closely related pairs of sequences
   • at each step align two sequences, or one sequence to an existing subalignment
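Step 2 can be sketched on the five-species distance matrix above. The code below is a minimal illustration only: it swaps in plain average-linkage (UPGMA-style) clustering for simplicity, whereas Clustal W itself builds its guide tree with neighbour-joining. The function names are hypothetical.

```python
from itertools import combinations

names = ["S.cerevisiae", "C.elegans", "Drosophila", "Human", "Mouse"]

# Lower-triangular pairwise distances from the slide; row i holds d(i, 0..i-1).
lower = [
    [],
    [0.640],
    [0.634, 0.327],
    [0.630, 0.408, 0.420],
    [0.619, 0.405, 0.469, 0.289],
]


def leaf_distance(i, j):
    """Distance between two individual sequences."""
    return lower[max(i, j)][min(i, j)]


def average_distance(a, b):
    """Average-linkage distance between two clusters of sequence indices."""
    return sum(leaf_distance(i, j) for i in a for j in b) / (len(a) * len(b))


def guide_tree_merges():
    """Repeatedly merge the two closest clusters; return the merge order."""
    clusters = [(i,) for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((sorted(names[i] for i in a),
                       sorted(names[i] for i in b)))
    return merges


for left, right in guide_tree_merges():
    print(left, "+", right)
```

With these distances the first merge joins Human and Mouse (distance 0.289), the most closely related pair, which is also where progressive alignment would start.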
Parallel Clustal
• Parallel pairwise (PW) alignment matrix
• Parallel guide tree calculation
• Parallel progressive alignment
Parallel Clustal - Improvements
• Optimization of input parameters
  – scoring matrices, gap penalties
  – requires many repeated Clustal W runs with various input parameters
• Minimum Vertex Cover
  – use minimum vertex cover to remove erroneous sequences and to identify clusters of highly similar sequences
Minimum Vertex Cover
Conflict Graph
– vertex: sequence
– edge: conflict (e.g. alignment with very poor score)
TASK: remove the smallest number of gene sequences that eliminates all conflicts
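The conflict graph can be sketched as follows. This is an illustration under assumptions: the pairwise scores are made up, and "conflict" is taken to mean an alignment score below a chosen threshold, which the slides do not define precisely. Removing a set of sequences eliminates all conflicts exactly when that set is a vertex cover of the graph.

```python
def conflict_graph(scores, threshold):
    """Edges join sequence pairs whose alignment score is below the threshold."""
    return [pair for pair, score in scores.items() if score < threshold]


def is_vertex_cover(edges, removed):
    """True if removing these sequences eliminates every conflict edge."""
    return all(u in removed or v in removed for u, v in edges)


# Hypothetical pairwise alignment scores for four sequences.
scores = {
    ("s1", "s2"): 42, ("s1", "s3"): 3, ("s2", "s3"): 5,
    ("s1", "s4"): 38, ("s2", "s4"): 40, ("s3", "s4"): 7,
}

edges = conflict_graph(scores, threshold=10)
print(edges)                           # every conflict involves s3
print(is_vertex_cover(edges, {"s3"}))  # True: removing s3 alone suffices
print(is_vertex_cover(edges, {"s1"}))  # False: s2-s3 and s3-s4 remain
```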
FPT Algorithms
• Phase 1: Kernelization
Reduce problem to size f(k)
• Phase 2: Bounded Tree Search
Exhaustive tree search; exponential in f(k)
Kernelization
Buss's Algorithm for k-vertex cover
• Let G=(V,E) and let S be the subset of vertices with degree greater than k. Every vertex in S must belong to any vertex cover of size at most k.
• Remove S and all incident edges:
G -> G', k -> k' = k - |S|.
• IF G' has more than k · k' edges THEN no k-vertex cover exists
ELSE start bounded tree search on G'
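The kernelization step can be sketched as below. This is a minimal single-pass version, not the authors' code: it treats a vertex as forced into the cover when its degree is greater than k (the standard reading of Buss's rule), and the function name and return convention are assumptions.

```python
from collections import Counter


def buss_kernel(edges, k):
    """One pass of Buss's reduction.

    Returns (forced_vertices, kernel_edges, k_prime), or None when no vertex
    cover of size at most k can exist.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # A vertex with more than k neighbours must be in any cover of size <= k.
    forced = {v for v, d in degree.items() if d > k}
    k_prime = k - len(forced)
    if k_prime < 0:
        return None

    # Remove the forced vertices and all edges incident to them.
    kernel = [(u, v) for u, v in edges if u not in forced and v not in forced]

    # Each remaining vertex covers at most k edges, so k' vertices cover <= k * k'.
    if len(kernel) > k * k_prime:
        return None
    return forced, kernel, k_prime


# Star centred on vertex 0 plus one separate edge, with k = 2.
star = [(0, 1), (0, 2), (0, 3), (0, 4), (5, 6)]
print(buss_kernel(star, 2))   # ({0}, [(5, 6)], 1)

# Five disjoint edges cannot be covered by 2 vertices.
matching = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
print(buss_kernel(matching, 2))   # None
```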
Bounded Tree Search
[Figure: search tree rooted at VC={}; each branch extends the cover (VC+=...)]
Case 1: simple path of length 3
in graph G': vertices v, v1, v2, v3
search tree: VC={...} branches into
– VC+={v,v2}
– VC+={v1,v2}
– VC+={v1,v3}
remove selected vertices from G'; k' -= 2
Case 2: 3-cycle
in graph G': vertices v, v1, v2
search tree: VC={...} branches into
– VC+={v,v1}
– VC+={v1,v2}
– VC+={v,v2}
remove selected vertices from G'; k' -= 2
Case 3: simple path of length 2
in graph G': vertices v, v1, v2
search tree: VC={...} has a single branch
– VC+={v1}
remove v1, v2 from G'; k' -= 1
Case 4: simple path of length 1
in graph G': vertices v, v1
search tree: VC={...} has a single branch
– VC+={v}
remove v, v1 from G'; k' -= 1
Sequential Tree Search
Depth first search
– backtrack when k' = 0 and G' ≠ {} ("dead end")
– stop when a solution is found (G' = {}, k' >= 0)
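The backtrack and stop conditions can be sketched with a deliberately simplified search. Instead of the four path and cycle cases from the preceding slides, this version uses plain two-way branching on one edge (either endpoint must be in the cover); the function name is hypothetical.

```python
def vertex_cover(edges, k):
    """Depth-first bounded search; returns a cover of size <= k, or None."""
    if not edges:          # G' = {}: solution found, stop
        return set()
    if k == 0:             # k' = 0 but edges remain: dead end, backtrack
        return None
    u, v = edges[0]
    for chosen in (u, v):  # one endpoint of this edge must be in the cover
        rest = [e for e in edges if chosen not in e]
        sub = vertex_cover(rest, k - 1)
        if sub is not None:
            return sub | {chosen}
    return None


path = [(1, 2), (2, 3), (3, 4)]
print(vertex_cover(path, 2))   # a cover with two vertices
print(vertex_cover(path, 1))   # None: one vertex cannot cover the path
```

The recursion depth is at most k and each node has two children, so the tree has at most 2^k leaves; the case analysis on the slides is a refinement of this basic scheme.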
Parallel Tree Search
Basic Idea:
– Build the top log p levels of the search tree (T')
– every processor starts a depth-first search at one leaf of T'
– randomize the depth-first search by selecting a random child
[Figure: search tree with its top log p levels labelled T']
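The basic idea can be sketched as below, under several assumptions: the search uses simple two-way edge branching rather than the case analysis from the earlier slides, the p processors are simulated by an ordinary loop (no MPI), and all function names are hypothetical.

```python
import random


def solve(edges, k, rng):
    """Randomized depth-first bounded search from one subproblem."""
    if not edges:
        return set()
    if k == 0:
        return None
    branches = list(edges[0])
    rng.shuffle(branches)              # visit the children in random order
    for chosen in branches:
        rest = [e for e in edges if chosen not in e]
        sub = solve(rest, k - 1, rng)
        if sub is not None:
            return sub | {chosen}
    return None


def top_levels(edges, k, p):
    """Expand the tree level by level until it has at least p leaves (T')."""
    frontier = [(edges, k, frozenset())]
    while len(frontier) < p:
        next_level, expanded = [], False
        for sub_edges, sub_k, partial in frontier:
            if not sub_edges or sub_k == 0:
                next_level.append((sub_edges, sub_k, partial))  # cannot split
                continue
            expanded = True
            for chosen in sub_edges[0]:
                rest = [e for e in sub_edges if chosen not in e]
                next_level.append((rest, sub_k - 1, partial | {chosen}))
        frontier = next_level
        if not expanded:
            break
    return frontier


def parallel_search(edges, k, p, seed=0):
    """Each leaf of T' would go to one processor; here they run in turn."""
    for rank, (sub_edges, sub_k, partial) in enumerate(top_levels(edges, k, p)):
        result = solve(sub_edges, sub_k, random.Random(seed + rank))
        if result is not None:
            return set(partial) | result
    return None


graph = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
print(parallel_search(graph, k=2, p=4))   # {1, 3}
print(parallel_search(graph, k=1, p=4))   # None
```

In a real run every processor would search its own leaf concurrently and the whole computation would stop as soon as any one of them found a cover.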
Analysis: Balls-in-bins
sequential depth-first search path: total length L, number of solutions m
expected sequential time (rand. distr.): L/(m+1)
parallel search path
expected parallel time (rand. distr.): p + L/(p(m+1))
expected speedup: p / (1 + (m+1)/L)
if m << L then expected speedup = p
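The balls-in-bins picture can be checked with a small Monte Carlo sketch. The model here is an assumption, not the authors' simulation: m solutions are placed uniformly at random on a search path of length L, the sequential search walks the path from the start, and each of p processors walks its own contiguous segment of length L/p. The additive start-up term in the parallel time is ignored.

```python
import random


def simulate(L, m, p, trials=2000, seed=1):
    """Return (mean sequential steps, mean parallel steps) to the first solution."""
    rng = random.Random(seed)
    segment = L // p                     # path length searched by each processor
    seq_total = 0
    par_total = 0
    for _ in range(trials):
        solutions = rng.sample(range(L), m)   # m distinct random positions
        # Sequential: steps until the first solution on the whole path.
        seq_total += min(solutions) + 1
        # Parallel: steps until any processor reaches a solution in its segment.
        par_total += min(pos % segment for pos in solutions) + 1
    return seq_total / trials, par_total / trials


seq, par = simulate(L=1_000_000, m=10, p=8)
print(f"sequential ~ {seq:.0f}, parallel ~ {par:.0f}, speedup ~ {seq / par:.1f}")
```

For m much smaller than L the sequential mean should land near L/(m+1) and the ratio of the two means near p, which is the behaviour the analysis predicts.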
Simulation Experiment
[Figure: predicted speedup vs. number of processors (both axes 0 to 200) for L = 1,000,000, with one curve each for m = 10, m = 100, m = 1,000, m = 10,000, m = 100,000]
Implementation
• test platform:
  – 32 node Beowulf cluster
  – each node: dual 1.4 GHz Intel Xeon, 512 MB RAM, 60 GB disk
  – gcc and LAM/MPI on Red Hat Linux 7.2
• code-s: Sequential k-vertex cover
• code-p: Parallel k-vertex cover
HPCVL
High Performance Computing Virtual Laboratory - HPCVL (www.hpcvl.org)
Created by parallel computing researchers from
– Carleton U. (Comp. Sci.)
– Queen's (Engineering)
– Ottawa U. (Life Sci./Hospital)
Obtained $30M+ in Federal (CFI) and Ontario (OIT, ORDCF) grants
Test Data
• Protein sequences
• Same protein from several hundred species
• Each protein sequence a few hundred amino acid residues in length
• Obtained from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)
Test Data
• Somatostatin
– neuropeptide involved in the regulation of many functions in different organ systems
– Clustal Threshold = 10, |V| = 559, |E| = 33652, k = 273, k' = 255
Test Data
• WW
– small protein domain that binds proline rich sequences in other proteins and is involved in cellular signaling
– Clustal Threshold = 10, |V| = 425, |E| = 40182, k = 322, k' = 318
Test Data
• Kinase
– large family of enzymes involved in cellular regulation
– Clustal Threshold = 16, |V| = 647, |E| = 113122, k = 497, k' = 397
Test Data
• SH2 (src-homology domain 2)
– involved in targeting proteins to specific sites in cells by binding to phosphotyrosine
– Clustal Threshold = 10, |V| = 730, |E| = 95463, k = 461, k' = 397
Test Data
• Thrombin
– protease in the blood coagulation cascade that promotes blood clotting by converting fibrinogen to fibrin
– Clustal Threshold = 15, |V| = 646, |E| = 62731, k = 413, k' = 413
Test Data
• PH (pleckstrin homology domain)
– involved in cellular signaling
– Clustal Threshold = 10, |V| = 670, |E| = 147054, k = 603, k' = 603
Test Data
• Random Graph
|V| = 220, |E| = 2155, k = 122, k' = 122
• Grid Graph
|V| = 289, |E| = 544, k = 145, k' = 145
Test Data
|VC| ~ |V| / 2, k' = k
Sequential Times
Kinase, SH2, Thrombin: n/a
Code-p on Virtual Proc.
Parallel Times
Speedup: Somatostatin
Speedup: WW
Speedup: Rand. Graph
Speedup: Grid Graph
Thank You!
• Questions?