Approximation of Protein Structure for Fast Similarity Measures

Fabian Schwarzer, Itay Lotan

Stanford University

Comparing Protein Structures

Same protein: one conformation vs. another

Analysis of MDS and MCS trajectories (http://folding.stanford.edu)

Structure prediction applications
• Evaluating decoy sets
• Clustering predictions (Shortle et al, Biophysics ’98)

Graph-based methods
• Stochastic Roadmap Simulation (Apaydin et al, RECOMB ’02)

k Nearest-Neighbors Problem

Given a set S of conformations of a protein and a query conformation c, find the k conformations in S most similar to c.

Can be done in $O(N(\log k + L))$ time, where

N – size of S

L – time to compare two conformations
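The brute-force search can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `k_nearest` and the `dist` callback are hypothetical, and a bounded heap is assumed as the way to keep the k best candidates. Each of the N comparisons costs L, and each heap update costs at most log k.

```python
import heapq
import math

def k_nearest(S, c, k, dist):
    """Return indices of the k conformations in S most similar to c, closest first."""
    heap = []  # max-heap emulated with negated distances; holds at most k entries
    for idx, s in enumerate(S):
        d = dist(c, s)                          # one comparison, cost L
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))     # O(log k)
        elif -d > heap[0][0]:                   # closer than the current furthest
            heapq.heapreplace(heap, (-d, idx))  # O(log k)
    # Sorting by negated distance in descending order yields closest first.
    return [idx for _, idx in sorted(heap, reverse=True)]

# Toy usage with 1-D "conformations" and Euclidean distance as a stand-in measure.
S = [(0.0,), (5.0,), (1.0,), (3.0,)]
print(k_nearest(S, (0.9,), 2, math.dist))
```

Any similarity measure with the signature `dist(a, b) -> float` can be plugged in.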

k Nearest-Neighbors Problem

What if needed for all c in S ?

$O(N^2(\log k + L))$ - too much time

Can be improved by:

1. Reducing L

2. A more efficient algorithm

Our Solution

Reduce structure description

Approximate but fast similarity measures

Efficient nearest-neighbor algorithms can be used

Reduce description further

Description of a Protein’s Structure

3n coordinates of Cα atoms (n – Number of residues)

Similarity Measures - cRMS

The RMS of the distances between corresponding atoms after the two conformations are optimally aligned

$$\mathrm{cRMS}(P, Q) = \min_{T} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \lVert p_i - T q_i \rVert_2^2}$$

Computed in O(n) time
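A minimal sketch of a cRMS computation, assuming NumPy is available and using the Kabsch method (centroid superposition followed by an SVD of a 3×3 covariance matrix) to find the optimal alignment. The slides do not say which alignment method was used, so the choice of Kabsch and the name `crms` are assumptions.

```python
import numpy as np

def crms(P, Q):
    """cRMS between two (n, 3) coordinate arrays whose rows correspond."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)                  # optimal translation: match centroids
    Qc = Q - Q.mean(axis=0)
    H = Qc.T @ Pc                            # 3x3 covariance matrix, built in O(n)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # -1 would mean a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # proper rotation taking Qc onto Pc
    diff = Pc - Qc @ R.T
    return float(np.sqrt((diff ** 2).sum() / len(P)))

# A rotated and translated copy has cRMS of (numerically) zero.
P = np.array([[0.0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3]])
Rz = np.array([[0.0, -1, 0], [1, 0, 0], [0, 0, 1]])
print(crms(P, P @ Rz.T + np.array([5.0, -2.0, 1.0])))
```

The SVD acts on a fixed 3×3 matrix, so the total cost stays linear in n.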

Similarity Measures - dRMS

The Euclidean distance between the intra-molecular distance matrices of the two conformations

$$\mathrm{dRMS}(P, Q) = \sqrt{\frac{2}{n(n-1)} \sum_{i=2}^{n} \sum_{j=1}^{i-1} \left( d_{ij}^{P} - d_{ij}^{Q} \right)^2}$$

Computed in $O(n^2)$ time
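A minimal pure-Python sketch of dRMS, assuming conformations are given as sequences of 3-D points. The helper name `distance_vector` is hypothetical; it flattens the lower triangle of the distance matrix into a vector of length n(n-1)/2.

```python
import math
from itertools import combinations

def distance_vector(P):
    """All n(n-1)/2 intra-molecular distances, one per unordered pair of atoms."""
    return [math.dist(a, b) for a, b in combinations(P, 2)]

def drms(P, Q):
    """dRMS between two conformations with the same number of atoms."""
    n = len(P)
    dP, dQ = distance_vector(P), distance_vector(Q)
    squared = sum((x - y) ** 2 for x, y in zip(dP, dQ))
    return math.sqrt(2.0 * squared / (n * (n - 1)))

P = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
mirror = [(x, y, -z) for x, y, z in P]
print(drms(P, mirror))  # 0.0: distances are unchanged by reflection
```

Because only internal distances enter, no alignment step is needed, but a conformation and its mirror image are indistinguishable under this measure.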

m-Averaged Approximation

• Cut chain into m pieces
• Replace each sequence of n/m Cα atoms by its centroid

3n coordinates → 3m coordinates
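The averaging step can be sketched in a few lines. The function name `m_average` is hypothetical, and the handling of chains whose length is not a multiple of m (pieces differing by at most one atom) is an assumption, since the slides do not specify it.

```python
def m_average(coords, m):
    """Replace each of m consecutive pieces of the chain by its centroid."""
    n = len(coords)
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    reduced = []
    for piece in range(m):
        start = piece * n // m        # integer boundaries: piece sizes differ by <= 1
        end = (piece + 1) * n // m
        block = coords[start:end]
        reduced.append(tuple(sum(axis) / len(block) for axis in zip(*block)))
    return reduced

# 8 atoms reduced to 4 centroids: 24 coordinates become 12.
chain = [(float(i), 0.0, 2.0 * i) for i in range(8)]
print(m_average(chain, 4))
```

The reduced chain can be passed unchanged to any cRMS or dRMS routine, which is where the speed-up comes from.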

Why m-Averaging?

Averaging reduces description of random chains with small error

Demonstrated through Haar wavelet analysis

Protein backbones behave on average like random chains
• Chain topology
• Limited compactness

Evaluation: Test Sets

1. Decoy sets: conformations from the Park-Levitt set (Park & Levitt, JMB ’96), N = 10,000

2. Random sets: conformations generated by the program FOLDTRAJ (Feldman & Hogue, Proteins ’00), N = 5000

9 structurally diverse proteins of size 38–76 residues

Decoy Sets Correlation

m     cRMS           dRMS
4     0.37 – 0.73    0.40 – 0.86
8     0.84 – 0.98    0.70 – 0.94
12    0.98 – 0.99    0.92 – 0.96
16    0.98 – 0.99    0.92 – 0.98
20    0.98 – 0.99    0.93 – 0.97

Higher correlation for random sets!

Speed-up for Decoy Sets

• Between 5X and 8X for cRMS (m = 8)
• Between 9X and 36X for dRMS (m = 12)

with very small error

For random sets the speed-up for dRMS was between 25X and 64X (m = 8)

Efficient Nearest-Neighbor Algorithms

There are efficient nearest-neighbor algorithms, but they are not directly compatible with these similarity measures:

cRMS is not a Euclidean metric

dRMS uses a space of dimensionality n(n-1)/2

Further Dimensionality Reduction of dRMS

kd-trees require dimension ≤ 20

m-averaging with dRMS is not enough: m(m-1)/2 dimensions remain (120 for m = 16)

Reduce further using SVD

SVD: A tool for principal component analysis. Computes directions of greatest variance.

Reduction Using SVD

1. Stack m-averaged distance matrices as vectors

2. Compute the SVD of the entire set

3. Project onto the most important singular vectors

dRMS is thus reduced to 20 dimensions

Without m-averaging, SVD can be too costly
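The three steps above can be sketched with NumPy. This is an illustration under stated assumptions: the function name `svd_reduce` is hypothetical, and subtracting the mean before the SVD (the usual principal-component convention) is an assumption the slides do not confirm.

```python
import numpy as np

def svd_reduce(X, dims=20):
    """Project the rows of X (N conformations x D stacked distances) onto the top PCs."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)                        # assumption: center before the SVD
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    dims = min(dims, len(s))
    basis = Vt[:dims]                            # directions of greatest variance
    explained = (s[:dims] ** 2).sum() / (s ** 2).sum()
    return (X - mean) @ basis.T, basis, mean, explained

# Synthetic rank-2 data: two components capture essentially all of the variance.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 10))
Y, basis, mean, explained = svd_reduce(X, dims=2)
print(Y.shape, round(float(explained), 6))
```

Euclidean distance between two rows of the reduced matrix approximates the Euclidean distance between the stacked distance vectors, which equals dRMS up to the constant factor sqrt(2 / (m(m-1))). That is what makes Euclidean nearest-neighbor structures such as kd-trees applicable.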

Testing the Method

• Use decoy sets (N = 10,000)
• m-averaging with m = 16
• Project onto the 20 largest PCs (more than 95% of variance)
• Each conformation represented by 20 numbers

Results

For k = 10, 25, 100:

• Decoy sets: ~80% correct; furthest NN off by 10%–20% (0.7 Å – 1.5 Å)
• 1CTF, with N = 100,000: similar results
• Random sets: 90% correct with smaller error (5%–10%)

When precision is important, use as a pre-filter with a larger k than needed
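The pre-filter idea can be sketched as a two-stage search: shortlist more candidates than needed with the fast approximate measure, then re-rank only the shortlist with the exact one. The function name, the `factor` parameter, and its default value are hypothetical; the slides do not state how much larger k should be.

```python
import heapq

def knn_with_prefilter(query, candidates, k, fast_dist, exact_dist, factor=3):
    """Shortlist factor*k candidates with fast_dist, then re-rank with exact_dist."""
    shortlist = heapq.nsmallest(
        factor * k, range(len(candidates)),
        key=lambda i: fast_dist(query, candidates[i]))
    return heapq.nsmallest(
        k, shortlist, key=lambda i: exact_dist(query, candidates[i]))

# Toy usage: numbers stand in for conformations, and a coarse bucketed distance
# stands in for the approximate measure.
exact = lambda a, b: abs(a - b)
coarse = lambda a, b: abs(a // 10 - b // 10)
values = [12, 18, 31, 9, 55, 14]
print(sorted(knn_with_prefilter(13, values, 2, coarse, exact, factor=2)))
```

The exact measure is evaluated only `factor * k` times per query instead of N times.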

Running Time

N = 100,000

k = 100, for each conformation

• Brute-force: ~84 hours
• Brute-force + m-averaging: ~4.8 hours
• Brute-force + m-averaging + SVD: 41 minutes
• kd-tree + m-averaging + SVD: 19 minutes

kd-trees will have more impact for larger sets

Structural Classification

Computing the similarity between structures of two different proteins is more involved:

The correspondence problem:

Which parts of the two structures should be compared?

1IRD vs. 2MM1

STRUCTAL (Gerstein & Levitt ’98)

1. Compute optimal correspondence using dynamic programming

2. Optimally align the corresponding parts in space to minimize cRMS

3. Repeat until convergence

Result depends on initial correspondence!

$O(n_1 n_2)$ time
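Step 1 can be illustrated with a generic global-alignment dynamic program over a matrix of pairwise scores. This is only a sketch of the technique: the scoring matrix, the linear gap penalty, and the function name are hypothetical and are not STRUCTAL's actual scoring scheme. The table has (n1 + 1)(n2 + 1) cells, which gives the quadratic running time.

```python
def dp_correspondence(score, gap=1.0):
    """Global alignment over score[i][j]; returns (total score, matched index pairs)."""
    n1, n2 = len(score), len(score[0])
    best = [[0.0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        best[i][0] = -gap * i
    for j in range(1, n2 + 1):
        best[0][j] = -gap * j
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            best[i][j] = max(best[i - 1][j - 1] + score[i - 1][j - 1],  # match i, j
                             best[i - 1][j] - gap,                      # skip i
                             best[i][j - 1] - gap)                      # skip j
    # Trace back from the corner to recover which residues were matched.
    pairs, i, j = [], n1, n2
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    return best[n1][n2], pairs[::-1]

# Residue 0 pairs with 0, residue 1 pairs with 2; position 1 of the second chain is a gap.
print(dp_correspondence([[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]))
```

In an iterative scheme like the one listed above, the scores would be recomputed from inter-atom distances after each spatial alignment.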

STRUCTAL + m-averaging

Compute similarity for structures of same SCOP super-family with and without m-averaging

n/m    correlation     speed-up
3      0.60 – 0.66     ~7
5      0.44 – 0.58     ~19
8      0.35 – 0.57     ~46

NN results were disappointing

Conclusion

Fast computation of similarity measures

• Trade-off between speed and precision
• Exploits chain topology and limited compactness of proteins
• Allows use of efficient nearest-neighbor algorithms
• Can be used as pre-filter when precision is important

Random Chains

[Figure: a random chain with vertices c0, c1, c2, …, cn-1]

• The dimensions are uncorrelated
• Average behavior can be approximated by normal variables:

$$\mathbf{c}_{i+1} = \mathbf{c}_i + l \cdot N(0, 1)$$
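A random chain of this kind can be generated with a short sketch, assuming each coordinate of each step is an independent standard normal variable scaled by a step length. The function name and the `step` and `seed` parameters are hypothetical.

```python
import random

def random_chain(n, step=1.0, seed=None):
    """Generate n chain vertices; every coordinate takes an independent Gaussian step."""
    rng = random.Random(seed)
    chain = [(0.0, 0.0, 0.0)]                 # assumption: chain starts at the origin
    for _ in range(n - 1):
        prev = chain[-1]
        chain.append(tuple(x + step * rng.gauss(0.0, 1.0) for x in prev))
    return chain

chain = random_chain(64, seed=1)
print(len(chain), chain[0])
```

Because the three coordinates evolve independently, each one can be analyzed as a separate 1-D signal.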

1-D Haar Wavelet Transform

Recursive averaging and differencing of the values

Level   Averages                 Detail coefficients
3       [ 9 7 2 6 5 1 4 6 ]
2       [ 8 4 3 5 ]              [ 1 -2 2 -1 ]
1       [ 6 4 ]                  [ 2 -1 ]
0       [ 5 ]                    [ 1 ]

[ 9 7 2 6 5 1 4 6 ] → [ 5 1 2 -1 1 -2 2 -1 ]
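The recursive averaging and differencing can be written as a short function. This sketch assumes the convention average = (a + b) / 2 and detail = (a - b) / 2 for each adjacent pair, and an input length that is a power of two; the function name is hypothetical.

```python
def haar_transform(values):
    """Unnormalized 1-D Haar transform: [overall average, details coarse to fine]."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Coarser levels are computed later, so they go in front of finer ones.
        details = [(a - b) / 2 for a, b in pairs] + details
        averages = [(a + b) / 2 for a, b in pairs]
    return averages + details

print(haar_transform([9, 7, 2, 6, 5, 1, 4, 6]))
```

Under this convention the eight input values transform to 5, 1, 2, -1, 1, -2, 2, -1 (returned as floats): the overall average, then one, two, and four detail coefficients.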

Haar Wavelets and Compression

Compress by discarding smallest coefficients

When discarding detail coefficients the approximation error is the root of the sum of the squares of the discarded coefficients
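The error statement can be checked numerically. Note that it holds exactly for the orthonormal scaling of the Haar transform (sums and differences divided by sqrt(2)), which is an assumption here; with plain averaging and differencing the coefficients would need per-level weights. The function names are hypothetical, and the input length is assumed to be a power of two.

```python
import math

def haar_normalized(values):
    """Orthonormal Haar transform: [scaling coefficient, details coarse to fine]."""
    coeffs = list(values)
    length = len(coeffs)
    while length > 1:
        half = length // 2
        head = coeffs[:length]
        sums = [(head[2 * i] + head[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        diffs = [(head[2 * i] - head[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:length] = sums + diffs
        length = half
    return coeffs

def haar_normalized_inverse(coeffs):
    """Invert haar_normalized by rebuilding one level at a time."""
    values = list(coeffs)
    length = 1
    while length < len(values):
        merged = []
        for s, d in zip(values[:length], values[length:2 * length]):
            merged += [(s + d) / math.sqrt(2), (s - d) / math.sqrt(2)]
        values[:2 * length] = merged
        length *= 2
    return values

x = [9, 7, 2, 6, 5, 1, 4, 6]
c = haar_normalized(x)
kept = c[:4] + [0.0] * 4                    # discard the four finest details
approx = haar_normalized_inverse(kept)      # equals the pairwise averages of x
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, approx)))
predicted = math.sqrt(sum(d ** 2 for d in c[4:]))
print(error, predicted)
```

Both printed values agree up to rounding, because an orthonormal transform preserves Euclidean length.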

Transform of Random Chains

m-averaging ($m = 2^v$) ⇔ discarding the lowest $\log n - v$ levels of detail coefficients

For random chains the pdf of the detail coefficients is:

$$d_j \sim N\left(0, O(4^j)\right)$$

Coefficients expected to be ordered!

Discard coefficients starting at lowest level

Random Chains and Proteins
