Approximation of Protein Structure for Fast Similarity Measures
![Page 1: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/1.jpg)
Approximation of Protein Structure for Fast Similarity Measures
Fabian Schwarzer, Itay Lotan
Stanford University
![Page 2: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/2.jpg)
Comparing Protein Structures
Same protein, different conformations:
• Analysis of MDS and MCS trajectories (http://folding.stanford.edu)
• Structure prediction applications
  • Evaluating decoy sets
  • Clustering predictions (Shortle et al, Biophysics ’98)
• Graph-based methods
  • Stochastic Roadmap Simulation (Apaydin et al, RECOMB ’02)
![Page 3: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/3.jpg)
k Nearest-Neighbors Problem
Given a set S of conformations of a protein and a query conformation c, find the k conformations in S most similar to c.
Can be done in $O(N(\log k + L))$ time, where
N – size of S
L – time to compare two conformations
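A minimal sketch of a brute-force k-nearest-neighbor scan, assuming a caller-supplied comparison function. The names `k_nearest`, `euclidean`, and `dist` are illustrative and do not come from the slides. The scan makes one comparison per conformation in S and keeps the best k candidates in a heap of size k.

```python
import heapq
import math


def euclidean(u, v):
    """Toy comparison function; a real one would be cRMS or dRMS."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def k_nearest(S, c, k, dist):
    """Return the k items of S most similar to c as (distance, index) pairs,
    nearest first. `dist` is any function comparing two conformations."""
    heap = []  # max-heap on distance, emulated by storing negated distances
    for idx, s in enumerate(S):
        d = dist(c, s)                      # one comparison per conformation
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif -d > heap[0][0]:               # closer than the current k-th best
            heapq.heapreplace(heap, (-d, idx))
    return sorted((-neg_d, idx) for neg_d, idx in heap)
```

Usage: `k_nearest(S, c, 10, dist)` with `dist` set to whichever similarity measure is being used.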
![Page 4: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/4.jpg)
k Nearest-Neighbors Problem
What if needed for all c in S ?
$O(N^2(\log k + L))$: too much time
Can be improved by:
1. Reducing L
2. A more efficient algorithm
![Page 5: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/5.jpg)
Our Solution
Reduce structure description → approximate but fast similarity measures
Reduce description further → efficient nearest-neighbor algorithms can be used
![Page 6: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/6.jpg)
Description of a Protein’s Structure
3n coordinates of Cα atoms (n – Number of residues)
![Page 7: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/7.jpg)
Similarity Measures - cRMS
The RMS of the distances between corresponding atoms after the two conformations are optimally aligned:
$$cRMS(P,Q) = \min_T \sqrt{\frac{1}{n}\sum_{i=1}^{n} \lVert p_i - Tq_i \rVert_2^2}$$
where T ranges over rigid-body transformations
Computed in O(n) time
![Page 8: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/8.jpg)
Similarity Measures - dRMS
The Euclidean distance between the intra-molecular distance matrices of the two conformations:
$$dRMS(P,Q) = \sqrt{\frac{2}{n(n-1)}\sum_{i=2}^{n}\sum_{j=1}^{i-1}\left(d^P_{ij} - d^Q_{ij}\right)^2}$$
Computed in $O(n^2)$ time
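As an illustration of the definition above, here is a small sketch that computes dRMS directly from two coordinate lists. The function names `point_distance` and `drms` are hypothetical. Each conformation is assumed to be a list of n (x, y, z) tuples in corresponding order. The double loop over pairs is where the quadratic cost comes from.

```python
import math


def point_distance(a, b):
    """Euclidean distance between two 3-D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def drms(P, Q):
    """dRMS of two conformations given as equal-length lists of 3-D points."""
    n = len(P)
    if n != len(Q) or n < 2:
        raise ValueError("need two conformations with the same n >= 2")
    total = 0.0
    for i in range(1, n):          # every unordered pair (i, j) with j < i
        for j in range(i):
            d_p = point_distance(P[i], P[j])   # intra-molecular distance in P
            d_q = point_distance(Q[i], Q[j])   # same pair in Q
            total += (d_p - d_q) ** 2
    return math.sqrt(2.0 * total / (n * (n - 1)))
```

No alignment step is needed, because intra-molecular distances do not change under rigid-body motion.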
![Page 9: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/9.jpg)
m-Averaged Approximation
• Cut chain into m pieces
• Replace each sequence of n/m Cα atoms by its centroid
3n coordinates → 3m coordinates
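The reduction can be sketched in a few lines. This is an illustrative version, not the authors' code: `m_average` is a hypothetical name, and the way pieces are sized when m does not divide n (lengths differing by at most one) is an assumption the slides do not specify.

```python
def m_average(coords, m):
    """Replace a chain of n points by the centroids of m consecutive pieces.

    coords: list of n (x, y, z) tuples (e.g. C-alpha positions along the chain).
    Returns a list of m centroid tuples.
    """
    n = len(coords)
    if not 1 <= m <= n:
        raise ValueError("m must satisfy 1 <= m <= n")
    centroids = []
    for piece in range(m):
        start = piece * n // m           # assumed split rule: near-equal pieces
        end = (piece + 1) * n // m
        block = coords[start:end]
        centroids.append(tuple(sum(axis) / len(block) for axis in zip(*block)))
    return centroids
```

The reduced chains can then be passed to the same similarity measure as the full chains, with m playing the role of n.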
![Page 10: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/10.jpg)
Why m-Averaging?
Averaging reduces description of random chains with small error
Demonstrated through Haar wavelet analysis
Protein backbones behave on average like random chains:
• Chain topology
• Limited compactness
![Page 11: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/11.jpg)
Evaluation: Test Sets
1. Decoy sets: conformations from the Park-Levitt set (Park & Levitt, JMB ’96), N = 10,000
2. Random sets: conformations generated by the program FOLDTRAJ (Feldman & Hogue, Proteins ’00), N = 5000
9 structurally diverse proteins of size 38–76 residues
![Page 12: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/12.jpg)
Decoy Sets Correlation
| m | cRMS | dRMS |
|---|---|---|
| 4 | 0.37 – 0.73 | 0.40 – 0.86 |
| 8 | 0.84 – 0.98 | 0.70 – 0.94 |
| 12 | 0.98 – 0.99 | 0.92 – 0.96 |
| 16 | 0.98 – 0.99 | 0.92 – 0.98 |
| 20 | 0.98 – 0.99 | 0.93 – 0.97 |

Higher correlation for random sets!
![Page 13: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/13.jpg)
Speed-up for Decoy Sets
• Between 5X and 8X for cRMS (m = 8)
• Between 9X and 36X for dRMS (m = 12)
with very small error
For random sets the speed-up for dRMS was between 25X and 64X (m = 8)
![Page 14: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/14.jpg)
Efficient Nearest-Neighbor Algorithms
There are efficient nearest-neighbor algorithms, but they are not compatible with these similarity measures:
cRMS is not a Euclidean metric
dRMS uses a space of dimensionality n(n-1)/2
![Page 15: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/15.jpg)
Further Dimensionality Reduction of dRMS
kd-trees require dimension ≤ 20
m-averaging with dRMS is not enough (m(m-1)/2 dimensions remain)
Reduce further using SVD
SVD: A tool for principal component analysis. Computes directions of greatest variance.
![Page 16: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/16.jpg)
Reduction Using SVD
1. Stack m-averaged distance matrices as vectors
2. Compute the SVD of entire set
3. Project onto most important singular vectors
dRMS is thus reduced to 20 dimensions
Without m-averaging SVD can be too costly
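One way to write the three steps above in matrix form is sketched below. The symbols $x_c$, $X$, $V_r$, $y_c$, $D$ and $r$ are notation introduced here for illustration and do not appear in the slides. Whether the set is mean-centered before the SVD is not stated; a common shift would not change distances between conformations.

```latex
\begin{align*}
% Step 1: each m-averaged conformation c becomes a vector of its pairwise distances
x_c &\in \mathbb{R}^{D}, \qquad D = \tfrac{m(m-1)}{2} \\
% Consequence of the dRMS definition applied to the m centroids
\lVert x_P - x_Q \rVert_2 &= \sqrt{D}\; dRMS(P, Q) \\
% Step 2: SVD of the N x D matrix X whose rows are the vectors x_c
X &= U \Sigma V^{T} \\
% Step 3: keep the r right singular vectors with the largest singular values
y_c &= V_r^{T} x_c \in \mathbb{R}^{r}, \qquad r = 20 \\
% Projection onto orthonormal directions never increases a distance
\lVert y_P - y_Q \rVert_2 &\le \lVert x_P - x_Q \rVert_2
\end{align*}
```

Because the reduced vectors $y_c$ are compared with plain Euclidean distance, they can be indexed by a kd-tree.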
![Page 17: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/17.jpg)
Testing the Method
• Use decoy sets (N = 10,000)
• m-averaging with m = 16
• Project onto 20 largest PCs (more than 95% of variance)
• Each conformation represented by 20 numbers
![Page 18: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/18.jpg)
Results
For k = 10, 25, 100:
• Decoy sets: ~80% correct; furthest NN off by 10% - 20% (0.7Å – 1.5Å)
• 1CTF, with N = 100,000: similar results
• Random sets: 90% correct with smaller error (5% - 10%)
When precision is important use as pre-filter with larger k than needed
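The pre-filter idea can be sketched as a two-stage search: pick more candidates than needed using the cheap reduced vectors, then re-rank only those candidates with the exact measure. Everything named here (`prefiltered_knn`, `k_candidates`, `exact_dist`, the parallel lists `full` and `reduced`) is hypothetical and chosen for illustration.

```python
import heapq
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def prefiltered_knn(q, full, reduced, exact_dist, k, k_candidates):
    """Indices of the k nearest neighbors of item q, nearest first.

    full[i]    : full description of conformation i (input to exact_dist)
    reduced[i] : short approximate vector for conformation i
    k_candidates >= k is the number of candidates kept by the cheap first stage.
    """
    others = [i for i in range(len(full)) if i != q]
    # Stage 1: cheap approximate ranking on the reduced vectors.
    candidates = heapq.nsmallest(
        k_candidates, others, key=lambda i: euclidean(reduced[q], reduced[i]))
    # Stage 2: exact measure, evaluated only on the surviving candidates.
    return heapq.nsmallest(
        k, candidates, key=lambda i: exact_dist(full[q], full[i]))
```

A larger `k_candidates` costs more exact comparisons but lowers the chance that a true neighbor is dropped in the first stage.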
![Page 19: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/19.jpg)
Running Time
N = 100,000
k = 100, for each conformation
• Brute-force: ~84 hours
• Brute-force + m-averaging: ~4.8 hours
• Brute-force + m-averaging + SVD: 41 minutes
• kd-tree + m-averaging + SVD: 19 minutes
kd-trees will have more impact for larger sets
![Page 20: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/20.jpg)
Structural Classification
Computing the similarity between structures of two different proteins is more involved (e.g. 1IRD vs. 2MM1).
The correspondence problem:
Which parts of the two structures should be compared?
![Page 21: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/21.jpg)
STRUCTAL (Gerstein & Levitt ’98)
1. Compute optimal correspondence using dynamic programming
2. Optimally align the corresponding parts in space to minimize cRMS
3. Repeat until convergence
Result depends on initial correspondence!
$O(n_1 n_2)$ time
![Page 22: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/22.jpg)
STRUCTAL + m-averaging
Compute similarity for structures of same SCOP super-family with and without m-averaging
| n/m | correlation | speed-up |
|---|---|---|
| 3 | 0.60 – 0.66 | ~7 |
| 5 | 0.44 – 0.58 | ~19 |
| 8 | 0.35 – 0.57 | ~46 |

NN results were disappointing
![Page 23: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/23.jpg)
Conclusion
• Fast computation of similarity measures
• Trade-off between speed and precision
• Exploits chain topology and limited compactness of proteins
• Allows use of efficient nearest-neighbor algorithms
• Can be used as pre-filter when precision is important
![Page 24: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/24.jpg)
Random Chains
[Figure: a random chain with vertices c0, c1, c2, ..., cn-1]
• The dimensions are uncorrelated
• Average behavior can be approximated by normal variables:
$$c_{i+1} = c_i + l \cdot N(0,1)$$
![Page 25: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/25.jpg)
1-D Haar Wavelet Transform
Recursive averaging and differencing of the values
| Level | Averages | Detail Coefficients |
|---|---|---|
| 3 | [ 9 7 2 6 5 1 4 6 ] | |
| 2 | [ 8 4 3 5 ] | [ 1 -2 2 -1 ] |
| 1 | [ 6 4 ] | [ 2 -1 ] |
| 0 | [ 5 ] | [ 1 ] |

[ 9 7 2 6 5 1 4 6 ] → [ 5 1 2 -1 1 -2 2 -1 ]
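The recursive averaging and differencing can be sketched as follows. The function name `haar_transform` is hypothetical, and the convention assumed here is the unnormalized one: for each pair (a, b), the average is (a + b) / 2 and the detail coefficient is (a - b) / 2.

```python
def haar_transform(values):
    """Unnormalized 1-D Haar transform of a sequence whose length is a power of two.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]   # differencing
        averages = [(a + b) / 2 for a, b in pairs]        # averaging
        details = level_details + details                 # coarser levels go in front
    return averages + details
```

Usage: `haar_transform([9, 7, 2, 6, 5, 1, 4, 6])` runs three levels of averaging and differencing on the example sequence.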
![Page 26: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/26.jpg)
Haar Wavelets and Compression
Compress by discarding smallest coefficients
When discarding detail coefficients the approximation error is the root of the sum of the squares of the discarded coefficients
![Page 27: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/27.jpg)
Transform of Random Chains
m-averaging ($m = 2^v$) corresponds to discarding the lowest $\log n - v$ levels of detail coefficients
For random chains the pdf of the detail coefficients is:
$$d_j \sim N\left(0, O(4^j)\right)$$
Coefficients expected to be ordered!
Discard coefficients starting at lowest level
![Page 28: Approximation of Protein Structure for Fast Similarity Measures](https://reader035.fdocuments.net/reader035/viewer/2022062806/56814e8d550346895dbc2f9b/html5/thumbnails/28.jpg)
Random Chains and Proteins