Structure superposition

Structure superposition ≠ Structure alignment

Lecture 11: Chapter 16, Gu and Bourne, “Structural Bioinformatics”

Why?

A. Study the conformational changes of the same protein with or without ligands

-- Same protein sequences

B. Study the effect of mutations on protein structure -- Highly similar protein sequences

C. Assessment of protein structure prediction -- How accurate are the predicted models? -- Same protein sequences

D. Remote homolog detection. Structures generally are preserved better than sequences over the course of evolution.

e.g., myoglobin and β-hemoglobin are homologous and have similar structures, but the sequence identity can be as low as 8.5%!

E. Classification of protein folds

• Structures may align well even if their sequence similarity is low.

• For example, consider an optimal superposition of myoglobin and β-hemoglobin, which are structural neighbors.

• However, their sequence identity is only 8.5%!

Why? Structure conservation > sequence conservation

Receiver Operating Characteristic

[Figure: ROC curve; y-axis: true positive rate (%), x-axis: false positive rate (%). Credit line on slide: Chothia and Lesk]

ROC experiment:

- For each pair P of proteins in dataset, perform alignment and record score: S(P)

- Rank all pairs according to their scores, from highest to lowest.

- Scan ranked pairs, and record the rate of true positives and false positives.

ASIDE: Making sense of a ROC curve

[Figure: ROC curve; y-axis: true positive rate (%), x-axis: false positive rate (%)]

Prediction   Benchmark
1.00         Yes
0.99         Yes
0.98         Yes
0.97         Yes
0.96         No
0.95         No
0.93         Yes
0.91         Yes
0.89         No
0.87         No
0.85         No
0.83         No
0.83         Yes
0.81         No
0.77         No
0.74         No
0.73         No
0.70         No
0.69         No
0.67         Yes
0.62         No
0.56         No
0.54         No
0.53         No
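The ranked Prediction/Benchmark list on this slide can be turned into ROC points by scanning from the highest score down. A minimal sketch in Python, assuming "Yes" marks a true pair and "No" a false one, and that the tie at 0.83 is processed in the order listed; the function names `roc_points` and `auc` are illustrative and not from the lecture:

```python
# Ranked (score, benchmark) pairs transcribed from the slide; True = "Yes".
ranked = [
    (1.00, True), (0.99, True), (0.98, True), (0.97, True),
    (0.96, False), (0.95, False), (0.93, True), (0.91, True),
    (0.89, False), (0.87, False), (0.85, False), (0.83, False),
    (0.83, True), (0.81, False), (0.77, False), (0.74, False),
    (0.73, False), (0.70, False), (0.69, False), (0.67, True),
    (0.62, False), (0.56, False), (0.54, False), (0.53, False),
]


def roc_points(ranked_pairs):
    """Return (FPR %, TPR %) after each item, scanning from the top score down."""
    n_pos = sum(1 for _, is_true in ranked_pairs if is_true)
    n_neg = len(ranked_pairs) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_true in ranked_pairs:
        if is_true:
            tp += 1  # a true pair moves the curve up
        else:
            fp += 1  # a false pair moves the curve right
        points.append((100.0 * fp / n_neg, 100.0 * tp / n_pos))
    return points


def auc(points):
    """Area under the curve (trapezoid rule), as a fraction of the full square."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / (100.0 * 100.0)


points = roc_points(ranked)
print(points[:6])
print(round(auc(points), 3))
```

For this list the first four steps are vertical (four true pairs at the top of the ranking), so the curve reaches a 50% true positive rate before any false positive appears.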

Alignment vs. Superposition

• Structural alignment attempts to establish homology between two or more polymer structures based on their shape and 3D structure.

• Structural alignment requires no a priori knowledge of equivalent positions.

• Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques.

• Conversely, simple structural superposition uses knowledge of at least some equivalent residues to guide a rigid body superposition.

• The most basic possible comparison between protein structures makes no attempt to align the input structures.

• Instead, it requires a precalculated alignment as input to determine which residues are to be considered in the RMSD calculation.

Structural superposition of two CheY orthologs

In pairwise structure superposition, a correspondence set of residue pairs is established by a pairwise sequence alignment.

• Superposition algorithms optimize the orientation and spatial position of the two molecules with respect to each other.

• Superposition usually starts with a sequence comparison, which establishes the one-to-one relationships between pairs of atoms from which the RMSD is computed.

• This is typically a good assumption at appreciable pairwise sequence identity, but breaks down in the Twilight Zone.

• Once atom-to-atom relationships between two structures are established, the task of the algorithm is to achieve an optimal superposition with the smallest possible RMSD. It is usually impossible to achieve perfect overlap of all atom pairs, even for structures with 100% identical sequences.

• Overlaying one pair of atoms perfectly may push another pair of atoms further apart.

• Also, as in sequence alignment, there is a tension between global and local matching that must be considered.

Pairwise structure superposition

Global similarity ≠ local similarity

[Figure panels: Global alignment; Local alignment; Structural motif. Images and content from Patrice Koehl at UC Davis]

Choosing an appropriate description of structure

Structure comparisons can be done at several different levels

Individual atoms --disadvantages?

Residue positions, which can be specified by the coordinates of Cα, Cβ, and the center of mass of the side chains

What are advantages and disadvantages of using different residue representations?

Small fragments

Secondary structure elements (SSE)


• Only when the structures to be aligned are highly similar or even identical is it meaningful to align side-chain atom positions.

-- In which case the RMSD reflects not only the conformation of the protein backbone but also the rotameric states of the side chains.

• Other comparison criteria that reduce noise and bolster positive matches include:

-- Secondary structure assignment
-- Native contact maps or residue interaction patterns
-- Measures of side chain packing
-- Measures of hydrogen bond retention

Contact map

Structure superposition requires minimizing the error within the framework of some objective function. Which one?

• Torsion angle comparison
• Distance matrices
• Structure superposition (RMSD, TM-score, etc.): most obvious and most common
• Secondary structure superposition (SHEBA)

This decision must also be made for structure alignment, since superposition is used (many times over) in the harder problem.

Choosing an objective function to extremize

Torsion angles (φ, ψ) are:
- local by nature
- invariant upon rotation and translation of the molecule
- compact (O(n) angles for a protein of n residues)

[Figure: Add 1 degree to all φ, ψ]

But…


Torsion angles
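The "add 1 degree" slide presumably makes the point that torsion angles are local, so small changes at every position accumulate along the chain. A minimal sketch of that effect, using a planar 2D chain as a stand-in for a real backbone (an assumption made here for simplicity; the bond length of 3.8, the 50 joints, and the function name `chain_coords` are all illustrative):

```python
import math


def chain_coords(turn_angles_deg, bond_length=3.8):
    """Build a planar chain: each joint turns the heading by the given angle."""
    x, y, heading = 0.0, 0.0, 0.0
    coords = [(x, y)]
    for turn in turn_angles_deg:
        heading += math.radians(turn)
        x += bond_length * math.cos(heading)
        y += bond_length * math.sin(heading)
        coords.append((x, y))
    return coords


n_joints = 50
original = chain_coords([0.0] * n_joints)   # hypothetical straight chain
perturbed = chain_coords([1.0] * n_joints)  # add 1 degree at every joint

# Distance each point moved; this grows steadily along the chain.
displacement = [math.dist(p, q) for p, q in zip(original, perturbed)]
print(round(displacement[1], 2), round(displacement[-1], 2))
```

Every local angle differs by only one degree, yet the far end of the chain moves by many bond lengths, which is why a small difference in torsion space can correspond to a large difference in Cartesian space.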


[Figure: four points labeled 1 to 4, with distances 6.0 (1 to 3), 8.1 (1 to 4), and 5.9 (2 to 4) marked]

     1    2    3    4
1    0    3.8  6.0  8.1
2    3.8  0    3.8  5.9
3    6.0  3.8  0    3.8
4    8.1  5.9  3.8  0

• Advantages
- invariant with respect to rotation and translation
- can be used to compare proteins

• Disadvantages
- the distance matrix is O(n²) for a protein with n residues
- comparing distance matrices is a difficult problem
- insensitive to chirality

Distance matrices
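Two of the properties listed for distance matrices can be checked numerically: invariance under rotation and translation, and insensitivity to chirality (a mirror image gives the same matrix). A minimal sketch, assuming made-up 3D coordinates for four points; none of the names or numbers below come from the slide:

```python
import math


def distance_matrix(coords):
    """All pairwise Euclidean distances: an n x n matrix for n points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def rotate_z_then_shift(coords, angle_deg, shift):
    """Rigid motion: rotate about the z axis, then translate by `shift`."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]


def mirror(coords):
    """Reflect through the yz plane, which flips chirality."""
    return [(-x, y, z) for x, y, z in coords]


def max_abs_diff(m1, m2):
    """Largest element-wise difference between two matrices."""
    return max(abs(a - b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))


# Hypothetical, non-planar coordinates for four points.
points = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (6.2, 4.8, 3.4)]

dm = distance_matrix(points)
dm_moved = distance_matrix(rotate_z_then_shift(points, 40.0, (10.0, -3.0, 7.0)))
dm_mirror = distance_matrix(mirror(points))

print(max_abs_diff(dm, dm_moved) < 1e-9)   # rigid motion leaves the matrix unchanged
print(max_abs_diff(dm, dm_mirror) < 1e-9)  # so does mirroring
```

The second check is the disadvantage in practice: a structure and its mirror image cannot be told apart from distances alone.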

Scoring DM similarity (or in this case, contact map)

[Figure panel: Introduce a gap]

In superposition, gap location is defined by an alignment!
In alignment, different gap positions are tried until the best overlap is identified.

• The most common parameter that expresses the difference between two protein structures is RMSD, or root mean squared deviation (distance), in atomic positions between the two structures.

• RMSD can be calculated as a function of all atoms or as a function of some subset of the atoms, such as the backbone or CA atoms.

• Using a subset of the protein atoms is common because it is likely that, when two protein structures are compared, they will not be identical to each other in sequence, and therefore the only atoms between which one-to-one comparison in position can be made will be the backbone atoms.

Root mean squared deviation (RMSD)
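For N corresponding atom pairs separated by distances d_i, the RMSD is sqrt((1/N) Σ d_i²). A minimal sketch of that calculation, assuming the two coordinate lists are already in one-to-one correspondence and already superimposed; the function name `rmsd` and the sample coordinates are illustrative:

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between paired 3D coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Hypothetical CA coordinates for three residues in two structures.
structure_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
structure_b = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]

print(rmsd(structure_a, structure_b))
```

Passing only backbone or CA coordinates gives the subset RMSD described in the bullets; passing every atom gives the all-atom value.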

[Figure: two five-residue chains (1 to 5) with distances d1, d2, d3, d4, d5 between corresponding residues]

RMSD calculation

The two structures must first be superimposed to calculate a meaningful RMSD value because they are currently in different coordinate systems !!!

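[Figure: a five-residue chain and a four-residue chain with distances d1, d2, d4, d5 between corresponding residues; residue 3 has no partner]

RMSD calculation (with a gap)

Blue: 1 – 2 – 3 – 4 – 5
Red: 1 – 2 – x – 4 – 5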
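With a gap, only the aligned columns contribute to the RMSD. A minimal sketch of how a gapped pairwise alignment can be converted into a correspondence set and then used to restrict the calculation; the residue letters, the coordinates, and the names `correspondence_set` and `rmsd_over_pairs` are hypothetical:

```python
import math


def correspondence_set(aligned_a, aligned_b, gap="-"):
    """Index pairs (i, j) for alignment columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    pairs = []
    i = j = 0  # running residue indices in the ungapped sequences
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a != gap and res_b != gap:
            pairs.append((i, j))
        if res_a != gap:
            i += 1
        if res_b != gap:
            j += 1
    return pairs


def rmsd_over_pairs(coords_a, coords_b, pairs):
    """RMSD computed only over the paired residues."""
    squared = sum(math.dist(coords_a[i], coords_b[j]) ** 2 for i, j in pairs)
    return math.sqrt(squared / len(pairs))


# Blue has five residues; red lacks a partner for blue's residue 3.
pairs = correspondence_set("ABCDE", "AB-DE")
print(pairs)

blue = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
red = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (3.0, 1.0, 0.0), (4.0, 1.0, 0.0)]
print(rmsd_over_pairs(blue, red, pairs))
```

The unpaired residue simply drops out of the sum, so N in the RMSD formula is the size of the correspondence set, not the length of either chain.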

Estimating RMSD by averaging distances generally gets better as the correspondence set size increases.

However, the RMSD can never be smaller than the average distance <dis>; the two are equal only when all distances are identical.

RMSD vs. average D as a function of n
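The relationship between RMSD and the plain average distance follows from the fact that a root mean square is never smaller than the arithmetic mean of the same non-negative numbers. A minimal sketch with made-up distances (the values and function names are illustrative only):

```python
import math


def mean_distance(distances):
    """Arithmetic mean of the per-pair distances."""
    return sum(distances) / len(distances)


def rmsd_from_distances(distances):
    """RMSD computed directly from per-pair distances."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


# Hypothetical per-residue distances: three close pairs and one outlier.
distances = [0.5, 1.0, 1.5, 4.0]

print(mean_distance(distances))
print(rmsd_from_distances(distances))

# With identical distances the two measures coincide.
uniform = [2.0, 2.0, 2.0, 2.0]
print(mean_distance(uniform), rmsd_from_distances(uniform))
```

Because the distances are squared before averaging, a single large deviation raises the RMSD more than it raises the mean.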

[Figure: toy model of two five-residue chains (1 to 5)]

Using RMSD to find the optimal superposition

[Figure: several trial placements of the two five-residue chains]

Superposition is too complicated for manual optimization

• Simplified problem (compared to structure alignment): we know the correspondence between set A and set B.

• We wish to compute the rigid transformation T that best aligns a1 with b1, a2 with b2, …, aN with bN.

• The error to minimize is defined above.

Old problem, solved in Statistics, Robotics, Medical Image Analysis, etc.



• A rigid-body transformation T is a combination of a translation t and a rotation R, thus: T(x) = Rx + t.

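• The quantity to be minimized is the sum of squared distances between corresponding points after the transformation: $E(R, t) = \sum_{i=1}^{N} \lVert R a_i + t - b_i \rVert^2$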

• The algorithm includes a fair amount of linear algebra (and a little bit of calculus) that is outside the scope of this class.

• Believe it or not, the algorithm is O(n)!

Representation of 6 “trivial” DOF

Images and content from Patrice Koehl at UC Davis


Pseudocode: Superposition algorithm in reality

1.) Define error function (RMSD)

2.) Determine correspondence set (pairwise sequence alignment)

3.) Translation = align centers of mass (COM)

4.) Rotation = use matrix methods to solve for rotation that minimizes the error function (variety of methods available)

5.) Evaluate the resultant superposition

6.) Refine the superposition (b/c COM to COM may not be best translation)

7.) Iterate till convergence
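Steps 3 to 5 of the list above can be sketched for a 2D toy model, where the least-squares rotation angle has a closed form. This is a simplified illustration and not the method used in the lecture: for real 3D coordinates the rotation is typically obtained with matrix methods such as a singular value decomposition. The correspondence set is assumed to be given, and all names and coordinates are hypothetical:

```python
import math


def centroid(points):
    """Center of mass of equally weighted 2D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def superpose_2d(mobile, target):
    """Least-squares rigid superposition of `mobile` onto `target` (2D only)."""
    cm, ct = centroid(mobile), centroid(target)
    # Step 3: translation, by moving both centers of mass to the origin.
    a = [(x - cm[0], y - cm[1]) for x, y in mobile]
    b = [(x - ct[0], y - ct[1]) for x, y in target]
    # Step 4: rotation angle that minimizes the sum of squared distances.
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    cross = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(a, b))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    fitted = [(c * ax - s * ay + ct[0], s * ax + c * ay + ct[1]) for ax, ay in a]
    return fitted, theta


def rmsd(coords_a, coords_b):
    """Step 5: evaluate the superposition."""
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Hypothetical toy data: the target is the mobile chain rotated by 30 degrees and shifted.
mobile = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5), (3.0, 1.5), (4.0, 3.0)]
ang = math.radians(30.0)
target = [(math.cos(ang) * x - math.sin(ang) * y + 5.0,
           math.sin(ang) * x + math.cos(ang) * y - 2.0) for x, y in mobile]

fitted, theta = superpose_2d(mobile, target)
print(rmsd(mobile, target), rmsd(fitted, target), math.degrees(theta))
```

The cost is linear in the number of corresponding points: one pass for the centroids and one pass for the two sums that determine the angle.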


Back to our toy model…

1.) Generate pairwise alignment

1 2 3 - 4 5
1 2 3 4 5 6

2.) Find optimal superposition
- Translation
- Rotation

[Figure: toy model of a five-residue chain and a six-residue chain, shown during translation and rotation]

Sequence identity = 83%
RMSD = 1.0 Å

Superposition of a pair of CuZnSOD structures

<Sequence identity> = 68% ± 35%
<RMSD> = 1.6 Å ± 0.6 Å

Superposition of several CuZnSOD structures

Global vs. local superposition in Calmodulin

[Figure panels: Ligand free; Complexed with trifluoperazine]

Global alignment: RMSD = 15 Å (143 residues)

Local alignment: RMSD = 0.9 Å (62 residues)

RMSD = 0.0 Å
Aligned = 95
Z-score = 17.3



By itself, RMSD is not a very useful error function

For example, consider a series of fragments all generated from the blue structure…

Up-weighting secondary structure, etc.

Based on the assumption that secondary structure elements should match up better than coil, we can easily modify the RMSD calculation to reflect that.

That is, a multiplier is applied (where x1 > x2) to up-weight the important stuff.

For example, assuming the red dots correspond to secondary structures in the figure above, RMSD’ < RMSD, which might be expected to be a more accurate reflection of the similarity between the pair.
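One way to write such an up-weighted RMSD is as a weighted mean of squared distances. A minimal sketch, assuming the weights are normalized by their sum (the slide's exact formula is not reproduced in this transcript, so this normalization is an assumption); the distances, the secondary structure flags, and the values x1 = 2.0 and x2 = 1.0 are hypothetical:

```python
import math


def weighted_rmsd(distances, in_sse, x1=2.0, x2=1.0):
    """RMSD with weight x1 for secondary structure pairs and x2 for coil pairs."""
    weights = [x1 if flag else x2 for flag in in_sse]
    numerator = sum(w * d * d for w, d in zip(weights, distances))
    return math.sqrt(numerator / sum(weights))


# Hypothetical per-residue distances: three well-matched SSE pairs, two loop pairs.
distances = [0.5, 0.6, 0.4, 3.0, 3.5]
in_sse = [True, True, True, False, False]

plain = weighted_rmsd(distances, in_sse, x1=1.0, x2=1.0)  # ordinary RMSD
weighted = weighted_rmsd(distances, in_sse)               # SSE pairs up-weighted
print(round(plain, 3), round(weighted, 3))
```

With these numbers the up-weighted value is smaller than the plain RMSD because the well-matched secondary structure pairs count for more than the poorly matched loop pairs.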

Template Modeling Score (TM-score)

• The TM-score is a measure of similarity between two protein structures with different tertiary structures, which is intended as a more accurate measure of the quality of full-length protein structures than the often used RMSD measures.

• The TM-score expresses the similarity between two structures as a score in the interval (0,1], where 1 indicates a perfect match between the two structures.

• Generally, scores below 0.20 correspond to randomly chosen unrelated proteins, whereas structures with a score higher than 0.50 assume roughly the same fold.

• The TM-score is designed to be independent of protein lengths.

d0 = normalization factor
di = distance between the i-th residue pair
Lxxx = lengths of the target protein and of the alignment
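In the cited paper (Zhang and Skolnick, 2004) the score for one superposition is (1/L_target) Σ 1/(1 + (d_i/d0)²), summed over the aligned residue pairs, with d0 = 1.24 (L_target − 15)^(1/3) − 1.8; the reported TM-score is the maximum of this quantity over superpositions. A minimal sketch that evaluates the sum for a single, fixed superposition only (the search for the maximum is omitted, and the function names and sample distances are illustrative):

```python
def d0(l_target):
    """Length-dependent normalization distance from Zhang and Skolnick (2004)."""
    if l_target < 19:
        # Below this length the formula gives a non-positive d0; this sketch does not handle it.
        raise ValueError("this sketch needs l_target >= 19")
    return 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8


def tm_score_fixed_superposition(distances, l_target):
    """TM-score sum for one superposition; `distances` holds d_i for aligned pairs."""
    scale = d0(l_target)
    return sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances) / l_target


# Hypothetical example: a 100-residue target with 80 aligned pairs.
l_target = 100
distances = [1.0] * 60 + [6.0] * 20
print(round(d0(l_target), 3))
print(round(tm_score_fixed_superposition(distances, l_target), 3))
```

Two features of the formula are visible here: unaligned residues lower the score because the sum is divided by the target length, and distant pairs contribute little instead of dominating the result as they do in RMSD.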

Y. Zhang, J. Skolnick, Scoring function for automated assessment of protein structure template quality, Proteins, 2004 57: 702-710

Protein length dependence

The TM-score is designed to be independent of protein lengths.

RMSD vs TM-score

RMSD: 12.1 Å
TM-score: 0.81

RMSD: 12.5 Å
TM-score: 0.22

Images from Dr. Zhang at KU


Two structure pairs have similar topology in core regions, with TM-score = 0.7 and 0.67 respectively. However, the tail/loop variations can result in significant differences in RMSD (from 1.9 to 10.5 Å). From Figure 5 of Nucleic Acids Res (2005) 33, 2303-2309.

How significant is TM-Score = 0.5?

Combining the results from the three different datasets…, it seems quite safe to assign TM-score = 0.5 as a rough but quantitative cutoff for protein Fold/Topology definition, i.e. most of proteins with TM-score >0.5 can be considered as of the same topology whereas most proteins with a TM-score <0.5 should be of different topology. Surely, this cutoff may vary slightly with the different definitions of ‘Same Fold’. For CATH, a TM-score at 0.5 indicates that the structures have a 37% probability being in the same Topology family; when TM-score increases to 0.6, this probability increases to 80%. As for SCOP, a TM-score of 0.5 only corresponds to a 13% probability for the structures to be in the same Fold family; but a TM-score = 0.7 has the posterior probability rapidly increased to 90%

Xu and Zhang (2010), Bioinformatics, 26:889-895.
