Lecture: Algorithmic Bioinformatics (19710)

Winter semester 2013/2014

Week 12 - Wednesday

Tim Conrad

AG Computational Proteomics

Institut für Mathematik & Informatik, Freie Universität Berlin

Lecture topics

Part 1: Background Basics (4)

1. The Nucleic Acid World

2. Protein Structure

3. Dealing with Databases

Part 2: Sequence Alignments (3)

4. Producing and Analyzing Sequence Alignments

5. Pairwise Sequence Alignment and Database Searching

6. Patterns, Profiles, and Multiple Alignments

Part 3: Evolutionary Processes (3)

7. Recovering Evolutionary History

8. Building Phylogenetic Trees

Part 4: Genome Characteristics (4)

9. Revealing Genome Features

10. Gene Detection and Genome Annotation

Part 5: Secondary Structures (4)

11. Obtaining Secondary Structure from Sequence

12. Predicting Secondary Structures

Part 6: Tertiary Structures (4)

13. Modeling Protein Structure

14. Analyzing Structure-Function Relationships

Part 7: Cells and Organisms (7)

15. Proteome and Gene Expression Analysis

16. Clustering Methods and Statistics

17. Systems Biology


Book: 14.2

Knowledge Based Approaches

• Homology Modelling – Need homologues of known protein structure

– Backbone modelling

– Side chain modelling

– Fail in absence of homology

• Threading Based Methods – New way of fold recognition

– Sequence is tried to fit in known structures

– Motif recognition

– Loop & Side chain modelling

– Fail in absence of known example


Some Rosetta-Predicted Structures

• Native indicates the real structure

• Model indicates the predicted structure

• The rightmost structures in cases (B) and (C) show similar structures identified by searching a structure database with the model


How do we know that a result is good?

How to compare to other structures? (E.g. result of a prediction with a known structure)


Why structural alignment?

• Structural similarity can point to a remote evolutionary relationship

• Shared structural motifs among proteins suggest similar biological function

• Getting insight into sequence-structure mapping (e.g., which parts of the protein structure are conserved among related organisms, such as binding sites)


Example alignment: human myoglobin (pdb:2mm1) vs. human hemoglobin alpha-chain (pdb:1jebA)

Sequence identity: 27%

Structural identity: 90%



http://www.ebi.ac.uk/interpro/entry/IPR023413

The G2 domain of nidogen contains a beta-can structure that exhibits extraordinary similarity to GFP, even though their sequences show only low sequence identity [PMID: 11427896]. Nidogen is a component of basement membranes, whose interactions with other basement membrane proteins contribute to the assembly and function of the basement membrane. The G2 domain serves as a protein-binding module. The structure is similar enough between GFP and the G2 domain of nidogen to suggest a common ancestral origin.

All by All Structural Alignment

Example: Green Fluorescent Protein GFP

• Nidogen-1 (NID-1): similar 11-stranded beta-barrel and internal helices

• 3 Å RMSD, only 9% sequence identity

• NID-1 is a component of the basement membrane and has no chromophore

• GFP and NID-1 may share a common ancestor

Objective: Identify novel architectures or unexpected structural similarities in the absence of sequence similarity.

Representative chains from 40% sequence identity clusters are aligned with jFATCAT


Sequence alignment

Based on residue identity, sometimes with a modified alphabet

--AARNEDDDGKMPSTF-L

E-AARNFG-DGK--STFIL

Algorithms: Dynamic programming + heuristics

Applications: BLAST, FASTA, FLASH and others

Used for:

evolution studies

protein function analysis

guessing on structure similarity

Structure alignment

Based on geometrical equivalence of residue positions, residue type disregarded

Used for:

protein function analysis

some aspects of evolution studies

Algorithms: Dynamic programming, graph theory, MC, geometric hashing and others

Applications: DALI, VAST, CE, MASS, SSM and others

Sequence and structure alignment



Structure alignment may be defined as identification of residues occupying “equivalent” geometrical positions

Unlike in sequence alignment, residue type is neglected

Used for measuring the structural similarity

protein classification and functional analysis

database searches

Structural Alignment


Structure Alignment



Global versus Local

Global alignment


Local Alignment

motif


How to get the alignment?


What is the best transformation that superimposes the unicorn on the lion?


Solution

Regard the shapes as sets of points and try to “match” these sets using a transformation


Not so good result


Good result


Possible transformations

• Rotation

• Translation

• Scaling

and more….


Translation

Rotation

Scale (not used in protein structural alignment)

We represent a protein as a geometric object in three-dimensional space.

The object consists of points represented by coordinates (x, y, z).

[Figure: residues Thr, Lys, Met, Gly, Glu, Ala drawn as points]

Approach


The aim:

• Given two proteins: find the transformation that produces the best superimposition of one protein onto the other

Aim


Correspondence is Unknown

Given two configurations of points in three-dimensional space:

Find those rotations and translations of one of the point sets which produce “large” superimpositions of corresponding 3-D points

Idea


The best transformation


Simple case – two closely related proteins with the same number of amino acids.

Structure alignment


Question: how do we assess the quality of the transformation?

How to score an alignment?


Structure Alignment

• How to score an alignment – Sequences: e.g. percentage of matching residues

– Structure: rmsd (root mean square deviation)


Root Mean Square Deviation

• What is the distance between two points a with coordinates (xa, ya, za) and b with coordinates (xb, yb, zb)?

– Euclidean distance:

$$d(a,b) = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2 + (z_a - z_b)^2}$$


Root Mean Square Deviation

• In a structure alignment, the score measures how far either

– (1) the aligned atoms or

– (2) all backbone atoms

are from each other on average
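As a minimal sketch of the averaging described above, the RMSD can be computed as follows. It assumes that the atom correspondence is already known and that both structures have already been superimposed. The function name `rmsd` and the toy coordinates are illustrative assumptions, not data from the lecture.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) points.

    Point i of coords_a is assumed to correspond to point i of coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # squared Euclidean distance of one corresponding atom pair
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical toy coordinates (not real protein data):
# every point of b is displaced by exactly 1 along z.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
print(rmsd(a, b))
```

Because every atom in the toy example is shifted by the same distance, the RMSD equals that distance; identical structures give an RMSD of 0.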


Quality of Alignment and Example

• Unit of RMSD => e.g. Ångstroms

– Identical structures => RMSD = “0”

– Similar structures => RMSD is small (1 – 3 Å)

– Distant structures => RMSD > 3 Å

• Structural superposition of gamma-chymotrypsin and Staphylococcus aureus epidermolytic toxin A



Find a transformation to achieve the best superposition


Find a 3-D rigid transformation T* such that:

$$\mathrm{rmsd}(T^*(P), Q) = \min_T \sqrt{\frac{1}{n} \sum_{i=1}^{n} \lVert T(p_i) - q_i \rVert^2}$$

(from scoring function)

RMS Superposition: Optimal Movement of one Structure to Minimize the RMS

Methods of Solution:

springs (F ~ kx)

SVD

Kabsch

E.g. by SVD, see http://nghiaho.com/?page_id=671
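The SVD route can be sketched as follows. This is a minimal illustration of a Kabsch-style superposition, assuming NumPy is available and that row i of P corresponds to row i of Q. The function name `kabsch_superpose` and the test points are hypothetical.

```python
import numpy as np

def kabsch_superpose(P, Q):
    """Return (R, t, rmsd) such that P @ R.T + t is the least-squares fit of P onto Q.

    P, Q: arrays of shape (n, 3); row i of P corresponds to row i of Q.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    P0, Q0 = P - p_mean, Q - q_mean          # move both centroids to the origin
    H = P0.T @ Q0                            # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1),
    # not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    P_fit = P @ R.T + t
    fit_rmsd = float(np.sqrt(((P_fit - Q) ** 2).sum() / len(P)))
    return R, t, fit_rmsd

# Toy check: Q is P rotated by 90 degrees about the z axis and then shifted.
P = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([1.0, 2.0, 3.0])
Q = P @ R_true.T + t_true
R, t, fit_rmsd = kabsch_superpose(P, Q)
print(np.round(R, 6))
print(np.round(t, 6), round(fit_rmsd, 6))
```

Centering both point sets first separates the translation from the rotation, so the SVD only has to solve for the rotation.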


Pitfalls of RMSD

• All atoms are treated equally

(residues on the surface have a higher degree of freedom than those in the core)

• Best alignment does not always mean minimal RMSD

• Does not take into account the attributes of the amino acids



Formalizing the structure alignment problem

Given two sets of points A = (a1, a2, …, an) and B = (b1, b2, …, bm) in Cartesian space, find the optimal subsets A(P) and B(Q) with |A(P)| = |B(Q)|, and find the optimal rigid body transformation T between the two subsets A(P) and B(Q) that minimizes a given distance metric D over all possible rigid body transformations G, i.e.

$$\min_{T \in G} D\big(A(P) - T(B(Q))\big)$$

The two subsets A(P) and B(Q) define a “correspondence”, and p = |A(P)| = |B(Q)| is called the correspondence length. Naturally, the correspondence length is maximal when A(P) and B(Q) are similar. Therefore there are essentially two problems in structure alignment:

(i.) Find the correspondence set (which is NP-hard), and

(ii.) Find the alignment transform (which is O(n^2)).


Flexible alignment vs. Rigid alignment

Rigid alignment Flexible alignment


Alignment Algorithms


• DALI: Uses 2D distance matrices between Cα atoms to represent each structure. Conceptually, the alignment problem is then straightforward: the two matrices must be overlaid so that they agree as much as possible.

Holm and Sander. Protein structure comparison by alignment of distance matrices. J Mol Biol 1993, 233:123-138.

• CE (Combinatorial Extension): Uses characteristics of local geometry to seed structural alignments and then joins these regions of local similarity (called aligned fragment pairs, AFPs) into an “optimal” path for the full alignment. Bottom-up approach.

Shindyalov and Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Prot Eng 1998, 11:739-747.

• SSAP (Sequential Structure Alignment Program): Uses a “double dynamic programming” algorithm with high-level and low-level matrices. Used in the CATH classification.

Taylor WR, Orengo CA. 1989b. Protein structure alignment. J Mol Biol 208:1-22.

• VAST (Vector Alignment Search Tool), TM-align (Template Modeling) and many more…

Structure alignment methods


The DALI Algorithm


• Distance mAtrix aLIgnment

• Liisa Holm and Chris Sander, “Protein structure comparison by alignment of distance matrices”, Journal of Molecular Biology Vol. 233, 1993.

• “Mapping the protein universe”, Science Vol. 273, 1996.

• “Alignment of three-dimensional protein structures: network server for database searching”, Methods in Enzymology Vol. 266, 1996.

How does DALI work?

• Based on the fact that similar 3D structures have similar intra-molecular distances.

• Background idea

– Represent each protein as a 2D matrix storing intra-molecular distances.

– Place one matrix on top of another and slide vertically and horizontally until a common sub-matrix with the best match is found.

• Actual implementation

– Break each matrix into small sub-matrices of fixed size.

– Pair up similar sub-matrices (one from each protein).

– Assemble the sub-matrix pairs to get the overall alignment.

[Figure: Protein A, Protein B]


Structure Representation of DALI

• 3D shape is described with a distance matrix which stores all intra-molecular distances between the Cα atoms.

• The distance matrix is independent of the coordinate frame.

• Contains enough information to reconstruct the 3D coordinates (up to a mirror image).

Distance matrix for Protein A (residues 1 to 4):

      1     2     3     4
1     0     d12   d13   d14
2     d12   0     d23   d24
3     d13   d23   0     d34
4     d14   d24   d34   0

[Figures: Protein A; distance matrix for Protein A; distance matrix for 2drpA and 1bbo]
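A distance matrix of this kind can be computed directly from Cα coordinates. The sketch below uses made-up coordinates and a hypothetical helper `distance_matrix`. It also illustrates that the matrix does not change when the whole structure is translated, which is what "independent of the coordinate frame" means in practice.

```python
import math

def distance_matrix(ca_coords):
    """Return the symmetric matrix of pairwise Euclidean distances between points."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical C-alpha positions (not real protein data).
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
# The same "structure" moved to a different place in space.
shifted = [(x + 10.0, y - 2.0, z + 7.0) for x, y, z in coords]

D = distance_matrix(coords)
D_shifted = distance_matrix(shifted)
for row in D:
    print(row)
```

A rotation of the coordinates would leave the matrix unchanged in the same way, since rigid motions preserve all pairwise distances.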


Intra-molecular distance for myoglobin


DALI Algorithm

1. Decompose the distance matrix into elementary contact patterns (sub-matrices of fixed size)

• Use hexapeptide-hexapeptide contact patterns.

2. Compare contact patterns (pair-wise), and store the matching pairs in a pair list.

3. Assemble pairs in the correct order to yield the overall alignment.


Overview of the Dali Algorithm

Starting with a contact map… Dali attempts to maximize the overlap of the contact maps; however, doing so globally is NP-hard, so the methods focus on local comparisons.

Image from Amy Keating at MIT

Image from Mark Maciejewski at UConn


The DALI (Distance matrix alignment) algorithm is based on distance matrix comparison methods.

Similarity score:

$$S = \sum_{i=1}^{L} \sum_{j=1}^{L} \phi(i, j)$$

[Figure: residues iA, jA in Structure A with distance $d^A_{ij}$, and residues iB, jB in Structure B with distance $d^B_{ij}$]

i and j are equivalent residues in A and B; L is the number of such pairs, i.e. the size of the substructure; $\phi$ is the similarity measure based on the CA distances $d^A_{ij}$ and $d^B_{ij}$.

Overview of the Dali Algorithm


1. Compute distance matrices for both protein A and B

2. Extract a full set of overlapped hexapeptide (6x6) sub-matrices (also called contact patterns) from each matrix

3. Each 6x6 distance matrix from protein A is compared with each 6x6 distance matrix in protein B by taking the element-wise difference $d^A_{ij} - d^B_{ij}$ of the two 6x6 CA distance matrices.

For example: 6.2 - 12.7 = -6.5

Dali – step by step


Consider protein A with 100 residues, meaning we have 100 - 5 = 95 hexapeptides and (95^2)/2 ≈ 4,512 contact pattern matrices.

Consider protein B with 150 residues, meaning 150 - 5 = 145 hexapeptides and (145^2)/2 ≈ 10,512 contact pattern matrices.

Even for these two relatively small proteins, there would be 4,512 x 10,512 = 47,430,144 comparisons between A and B.

Step 1: For each hexapeptide, a distance matrix compares it to every other hexapeptide within its structure (fill matrix).

Step 2: All distance matrices created in step 1 for one protein are compared with all those created for the other protein.
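The counting argument can be checked with a few lines of code. This is only a sketch of the arithmetic on the slide; the helper names are hypothetical, and the division by two follows the slide's estimate (half of the hexapeptide-by-hexapeptide matrix, rounded down).

```python
def n_hexapeptides(n_residues, k=6):
    """Number of overlapping windows of length k in a chain of n_residues."""
    return n_residues - k + 1

def n_contact_patterns(n_residues, k=6):
    """Slide's estimate: half of all window-by-window sub-matrices, rounded down."""
    h = n_hexapeptides(n_residues, k)
    return h * h // 2

patterns_a = n_contact_patterns(100)
patterns_b = n_contact_patterns(150)
print(n_hexapeptides(100), patterns_a)
print(n_hexapeptides(150), patterns_b)
print(patterns_a * patterns_b)
```

The product grows with the fourth power of the chain length, which is why the later steps prune and sample the contact patterns instead of comparing all of them.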

“And again…”



(4) Each contact pattern in protein A is paired with its most similar pattern in protein B, a process that generates a pair list

(5) The list is sorted based on the strength of pair similarity of contact patterns (i, j label pairs of matched residues; L number of these pairs)

A note about the similarity measure $S = \sum_{i=1}^{L} \sum_{j=1}^{L} \phi(i, j)$: We want to maximize the number of equivalent residues while minimizing structural variations; it is a tradeoff. That is, if the criteria are so strict that minor structural deviations are not allowed, then the number of matching contact patterns is likely to be very small.

Image from Amy Keating at MIT

Note that unmatched residues do not contribute to the overall similarity score S.
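The slides give the double sum but not the form of the similarity measure. As a sketch, the code below uses the "elastic" similarity function as described by Holm and Sander (1993), with a 20% relative-deviation threshold and a 20 Å distance envelope. These parameter values and all function names are assumptions recalled from the literature and should be checked against the paper. Only matched residue pairs enter the sum, which is why unmatched residues do not contribute to S.

```python
import math

THETA_E = 0.20   # assumed similarity threshold (20% relative deviation)
ALPHA = 20.0     # assumed envelope length in Angstrom, down-weights long distances

def phi_elastic(d_a, d_b):
    """Elastic similarity of one pair of intra-molecular distances."""
    d_mean = 0.5 * (d_a + d_b)
    if d_mean == 0.0:            # i == j: a residue compared with itself
        return THETA_E
    relative_deviation = abs(d_a - d_b) / d_mean
    return (THETA_E - relative_deviation) * math.exp(-(d_mean / ALPHA) ** 2)

def dali_score(dist_a, dist_b, pairs):
    """Sum phi over all pairs (i, j) of matched residues.

    pairs: list of (index in A, index in B) equivalences.
    """
    total = 0.0
    for ia, ib in pairs:
        for ja, jb in pairs:
            total += phi_elastic(dist_a[ia][ja], dist_b[ib][jb])
    return total

# Hypothetical 3-residue distance matrices (Angstrom).
dist_a = [[0.0, 3.8, 6.0], [3.8, 0.0, 3.8], [6.0, 3.8, 0.0]]
dist_b = [[0.0, 3.8, 6.5], [3.8, 0.0, 3.8], [6.5, 3.8, 0.0]]
pairs = [(0, 0), (1, 1), (2, 2)]
print(dali_score(dist_a, dist_a, pairs))   # identical structures
print(dali_score(dist_a, dist_b, pairs))   # slightly deformed structure
print(phi_elastic(6.2, 12.7))              # the mismatch from the 6x6 example is penalized
```

With this form, small relative deviations give a positive contribution and large ones a negative contribution, which expresses the tradeoff between alignment size and structural deviation.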



6. Merging contact patterns to form chains and reduce complexity

The search space is reduced because only the central contact pattern is retained (actually, the one that gives the smallest average intra-pattern distance).



7.) After removing the overlapping patterns, we are still left with far too many contact patterns to exhaustively compare all possible pairs.

Start comparing pairs at random:

• Keep a list of positive scores (discard negative scores)

• Keep comparing until the list has 80,000 positive scores

Sort the list and keep the best 40,000 contact pattern matches.

8.) End game: Need to find the optimal alignment of the 40,000 contact patterns such that the alignment occurs over as wide a range of the structural pair as possible.



Assembly of Alignments

• Non-trivial combinatorial problem.

• Assembled in the manner (AB) – (A’B’), (BC) – (B’C’), . . . (i.e., having one overlapping segment with the previous alignment)

• Available alignment methods:

– Monte Carlo optimization

– Branch-and-bound

– Neighbor walk

• Using Markov Chain Monte Carlo (MCMC), start with a random contact pattern from the list of 40,000, and then “walk” to another overlapping pattern (must extend the contact pattern by 4 residues) using the standard Metropolis criterion.


Monte Carlo Optimization

• Used in the earlier versions of DALI.

• Algorithm

– Compute a similarity score for the current alignment.

– Make a random trial change to the current alignment (adding a new pair or deleting an existing pair).

– Compute the change in the score (ΔS).

– If ΔS > 0, the move is always accepted.

– If ΔS <= 0, the move is accepted with probability exp(β * ΔS), where β is a parameter.

– Once a move is accepted, the change in the alignment becomes permanent.

– This procedure is iterated until there is no further change in the score, i.e., the system has converged.
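The acceptance rule can be sketched as follows. This is a generic Metropolis loop on a toy problem, not DALI's actual implementation: the "alignment" is stood in for by a set of integers, the score and proposal functions are made up, and a fixed number of steps replaces the convergence test.

```python
import math
import random

def metropolis_accept(delta_s, beta, rng):
    """Accept improving moves always, others with probability exp(beta * delta_s)."""
    if delta_s > 0:
        return True
    return rng.random() < math.exp(beta * delta_s)

def monte_carlo(score, propose, state, beta, steps, seed=0):
    """Run a fixed number of Metropolis steps and return (final state, final score)."""
    rng = random.Random(seed)
    current = score(state)
    for _ in range(steps):
        candidate = propose(state, rng)          # random trial change
        delta_s = score(candidate) - current     # change in the score
        if metropolis_accept(delta_s, beta, rng):
            state, current = candidate, current + delta_s
    return state, current

# Toy stand-in for an alignment: a set of "pairs" 0..9,
# where even pairs are good (+1) and odd pairs are bad (-1).
def toy_score(state):
    return sum(1 if i % 2 == 0 else -1 for i in state)

def toy_propose(state, rng):
    # add the chosen pair if absent, delete it if present
    return state ^ frozenset([rng.randrange(10)])

final_state, final_score = monte_carlo(toy_score, toy_propose, frozenset(),
                                       beta=2.0, steps=2000)
print(sorted(final_state), final_score)
```

A small β accepts many score-lowering moves and explores broadly; a large β behaves almost like a greedy search.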


Branch-and-bound method

• Used in the later versions of DALI.

• Based on Lathrop and Smith’s (1996) threading (sequence-structure alignment) algorithm.

• Solution space consists of all possible placements of residues in protein A relative to the segment of residues of protein B.

• The algorithm recursively splits the part of the solution space that yields the highest upper bound of the similarity score until there is a single alignment trace left.


Statistical significance of Dali alignments

Dali uses a Z-score to show the significance of the alignment.

A common and practical approach to the problem of assessing alignment significance is to determine if the alignment score is better than one could expect by chance. Dali compares each alignment score against an all-to-all protein structure comparison (normalized by length), which defines the Z-score:

$$Z = \frac{S - \langle S \rangle}{\sigma}$$

where S is the raw score, $\langle S \rangle$ is the average score, and $\sigma$ is the standard deviation.

• Dali Z-scores > 2 are thought to be meaningful.
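The Z-score itself is a one-line computation once a background distribution is available. The sketch below uses a small made-up list of background scores. In Dali the background comes from an all-to-all comparison and is normalized by length, which this toy example does not attempt to reproduce.

```python
import math
import statistics

def z_score(raw_score, background_scores):
    """Number of standard deviations by which raw_score exceeds the background mean."""
    mean = statistics.mean(background_scores)
    sigma = statistics.pstdev(background_scores)   # population standard deviation
    return (raw_score - mean) / sigma

# Hypothetical background scores (not real Dali data).
background = [10.0, 12.0, 14.0, 16.0, 18.0]
z = z_score(20.0, background)
print(round(z, 3), "meaningful" if z > 2 else "not meaningful")
```

The threshold of 2 in the last line is the rule of thumb quoted above, applied to the toy numbers only.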


Schematic View of DALI Algorithm


• In the first step of the algorithm, similar submatrices of size six in two proteins are found by comparing their distance matrices.

• These comparisons result in alignments of size six between two proteins. Then, compatible alignments are merged to obtain larger alignments called seeds.


3D (Spatial) 2D (Distance Matrix) 1D (Sequence)


Tim Conrad

AG Medical Bioinformatics

www.mecicalbioinformatics.de

More information online at

medicalbioinformatics.de/teaching

Thank you!

Further questions
