VL Algorithmische BioInformatik (19710)
WS2013/2014
Week 12 - Wednesday
Tim Conrad
AG Computational Proteomics
Institut für Mathematik & Informatik, Freie Universität Berlin
Lecture topics
Part 1: Background Basics (4)
1. The Nucleic Acid World
2. Protein Structure
3. Dealing with Databases
Part 2: Sequence Alignments (3)
4. Producing and Analyzing Sequence Alignments
5. Pairwise Sequence Alignment and Database Searching
6. Patterns, Profiles, and Multiple Alignments
Part 3: Evolutionary Processes (3)
7. Recovering Evolutionary History
8. Building Phylogenetic Trees
Part 4: Genome Characteristics (4)
9. Revealing Genome Features
10. Gene Detection and Genome Annotation
Part 5: Secondary Structures (4)
11. Obtaining Secondary Structure from Sequence
12. Predicting Secondary Structures
Part 6: Tertiary Structures (4)
13. Modeling Protein Structure
14. Analyzing Structure-Function Relationships
Part 7: Cells and Organisms (7)
15. Proteome and Gene Expression Analysis
16. Clustering Methods and Statistics
17. Systems Biology
Tim Conrad, VL Algorithmische Bioinformatik, WS2013/2014 2
Book: 14.2
Knowledge Based Approaches
• Homology Modelling
– Needs homologues of known protein structure
– Backbone modelling
– Side chain modelling
– Fails in the absence of homology
• Threading Based Methods
– New way of fold recognition
– Tries to fit the sequence into known structures
– Motif recognition
– Loop & side chain modelling
– Fails in the absence of a known example
Some Rosetta-Predicted Structures
• Native indicates the real structure
• Model indicates the predicted structure
• The rightmost structures in cases (B) and (C) show similar structures identified by searching a structure database with the model
How do we know that a result is good?
How to compare to other structures? (E.g. result of a prediction with a known structure)
Why structural alignment?
• Structural similarity can point to remote evolutionary relationship
• Shared structural motifs among proteins suggest similar biological function
• Getting insight into sequence-structure mapping (e.g., which parts of the protein structure are conserved among related organisms, such as binding sites etc.)
Example Alignment
• Human Myoglobin (pdb:2mm1) vs. Human Hemoglobin alpha-chain (pdb:1jebA)
• Sequence identity: 27%
• Structural identity: 90%
http://www.ebi.ac.uk/interpro/entry/IPR023413
The G2 domain of nidogen contains a beta-can structure that exhibits extraordinary similarity to GFP, even though their sequences show only low sequence identity [PMID: 11427896]. Nidogen is a component of basement membranes, whose interactions with other basement membrane proteins contribute to the assembly and function of the basement membrane. The G2 domain serves as a protein-binding module. The structure is similar enough between GFP and the G2 domain of nidogen to suggest a common ancestral origin.
All by All Structural Alignment
Example: Green Fluorescent Protein GFP
• Nidogen-1 (NID-1): similar 11-stranded beta-barrel and internal helices
• 3 Å RMSD, only 9% sequence identity
• NID-1 is a component of the basement membrane, no chromophore
• GFP and NID-1 may share a common ancestor
Objective: Identify novel architectures or unexpected structural similarities in the absence of sequence similarity.
Representative chains from 40% sequence identity clusters are aligned with jFATCAT
Sequence alignment
Based on residue identity, sometimes with a modified alphabet
--AARNEDDDGKMPSTF-L
E-AARNFG-DGK--STFIL
Algorithms: Dynamic programming + heuristics
Applications: BLAST, FASTA, FLASH and others
Used for:
evolution studies
protein function analysis
inferring structural similarity
Structure alignment
Based on geometrical equivalence of residue positions, residue type disregarded
Used for:
protein function analysis
some aspects of evolution studies
Algorithms: Dynamic programming, graph theory, Monte Carlo (MC), geometric hashing and others
Applications: DALI, VAST, CE, MASS, SSM and others
Sequence and structure alignment
Structural Alignment
• Structure alignment may be defined as the identification of residues occupying “equivalent” geometrical positions
• Unlike in sequence alignment, residue type is neglected
• Used for:
– measuring structural similarity
– protein classification and functional analysis
– database searches
Structure Alignment
[Figure: superposition of two structures]
Global versus Local
Global alignment
Local Alignment
motif
How to get the alignment?
What is the best transformation that superimposes the unicorn on the lion?
Solution
Regard the shapes as sets of points and try to “match” these sets using a transformation
Not so good result
Good result
Possible transformations
• Rotation
• Translation
• Scaling
• and more…
Translation [Figure: X/Y axes]
Rotation [Figure: X/Y axes]
Scale [Figure: X/Y axes] (not used in protein structural alignment)
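As a rough illustration of the transformations above, the following sketch applies a rotation and a translation to a small point set with NumPy and checks that all pairwise distances stay the same, whereas scaling changes them. The function names and the example points are made up for illustration only.

```python
import numpy as np

def rotation_z(angle_rad):
    """Rotation matrix for a rotation about the z-axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def rigid_transform(points, rotation, translation):
    """Apply x -> R x + t to every row of an (n, 3) array."""
    return points @ rotation.T + translation

def pairwise_distances(points):
    """All pairwise Euclidean distances of an (n, 3) array."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Three made-up points, not taken from a real structure.
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [1.0, 2.0, 0.0]])
moved = rigid_transform(points, rotation_z(np.pi / 2), np.array([5.0, -1.0, 2.0]))
scaled = 2.0 * points

# Rigid motion keeps the distances, scaling does not.
print(np.allclose(pairwise_distances(points), pairwise_distances(moved)))
print(np.allclose(pairwise_distances(points), pairwise_distances(scaled)))
```

This is why only rotation and translation are candidates for superimposing proteins: they move a structure without deforming it.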
Approach
We represent a protein as a geometric object in three-dimensional space.
The object consists of points represented by coordinates (x, y, z).
[Figure: residues Thr, Lys, Met, Gly, Glu, Ala shown as points]
Aim
• Given two proteins, find the transformation that produces the best superposition of one protein onto the other
Correspondence is Unknown
Given two configurations of points in three-dimensional space:
Idea
Find those rotations and translations of one of the point sets that produce “large” superpositions of corresponding 3-D points.
The best transformation
[Figure: transformation T]
Structure alignment
Simple case: two closely related proteins with the same number of amino acids.
How to score an alignment?
Question: how do we assess the quality of the transformation?
Structure Alignment
• How to score an alignment
– Sequences: e.g. percentage of matching residues
– Structure: RMSD (root mean square deviation)
Root Mean Square Deviation
• What is the distance between two points a with coordinates (xa, ya, za) and b with coordinates (xb, yb, zb)?
– Euclidean distance:
$d(a,b) = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2 + (z_a - z_b)^2}$
Root Mean Square Deviation
• In a structure alignment the score measures how far either
– (1) the aligned atoms or
– (2) all backbone atoms
are from each other on average
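A minimal sketch of this score in plain Python, assuming the two structures are already superimposed and given as lists of (x, y, z) tuples with atoms listed in the same order. Note that the "average" is a root of the mean squared distance, not a mean of distances. The coordinates are made up for illustration.

```python
import math

def euclidean_distance(a, b):
    """Distance between two 3-D points given as (x, y, z) tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def rmsd(coords_a, coords_b):
    """Root mean square deviation over paired atoms (same order in both lists)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = [euclidean_distance(a, b) ** 2 for a, b in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical coordinates: structure_b is structure_a shifted by 1 Å along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
print(rmsd(structure_a, structure_b))  # every atom is 1 Å away, so the RMSD is 1.0
```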
Quality of Alignment and Example
• Unit of RMSD => e.g. Ångströms
– Identical structures => RMSD = 0
– Similar structures => RMSD is small (1 – 3 Å)
– Distant structures => RMSD > 3 Å
• Structural superposition of gamma-chymotrypsin and Staphylococcus aureus epidermolytic toxin A
Structure alignment
Simple case: two closely related proteins with the same number of amino acids.
Find a transformation T to achieve the best superposition
RMS Superposition: Optimal Movement of one Structure to Minimize the RMS
Find a 3-D rigid transformation T* such that:
$\mathrm{rmsd}(T^*(P), Q) = \min_T \sqrt{\frac{1}{n} \sum_{i=1}^{n} \lVert T(p_i) - q_i \rVert^2}$
(from scoring function)
Methods of Solution:
• springs (F ~ kx)
• SVD
• Kabsch
E.g. by SVD, see http://nghiaho.com/?page_id=671
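A sketch of the SVD-based solution with NumPy, assuming the correspondence is already known (row i of P is paired with row i of Q). The function names and the test data are illustrative; this is a generic Kabsch-style superposition, not code from the lecture.

```python
import numpy as np

def kabsch_superpose(P, Q):
    """Return rotation R and translation t minimizing the RMSD between R p_i + t and q_i.

    P and Q are (n, 3) arrays whose rows are already paired (p_i <-> q_i).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    centroid_p = P.mean(axis=0)
    centroid_q = Q.mean(axis=0)
    P0 = P - centroid_p              # center both point sets
    Q0 = Q - centroid_q

    H = P0.T @ Q0                    # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = centroid_q - R @ centroid_p
    return R, t

def rmsd(A, B):
    """RMSD between two paired (n, 3) arrays."""
    return float(np.sqrt(((A - B) ** 2).sum(axis=1).mean()))

# Made-up test case: Q is P rotated about z and shifted.
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
Q = P @ R_true.T + np.array([3.0, -2.0, 1.0])

R, t = kabsch_superpose(P, Q)
P_fit = P @ R.T + t
print(rmsd(P_fit, Q))  # close to zero, up to floating-point error
```

With the pairing fixed, the work is dominated by building the 3x3 covariance matrix; the hard part of structure alignment is finding the pairing in the first place.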
Pitfalls of RMSD
• All atoms are treated equally
(residues on the surface have a higher degree of freedom than those in the core)
• Best alignment does not always mean minimal RMSD
• Does not take into account the attributes of the amino acids
Formalizing the structure alignment problem
Given two sets of points A = (a1, a2, …, an) and B = (b1, b2, …, bm) in Cartesian space, find the optimal subsets A(P) and B(Q) with |A(P)| = |B(Q)|, and find the optimal rigid body transformation T between the two subsets A(P) and B(Q) that minimizes a given distance metric D over all possible rigid body transformations G, i.e.
$\min_{T \in G} D\big(A(P) - T(B(Q))\big)$
The two subsets A(P) and B(Q) define a “correspondence”, and p = |A(P)| = |B(Q)| is called the correspondence length. Naturally, the correspondence length is maximal when A(P) and B(Q) are similar.
Therefore there are essentially two problems in structure alignment:
(i) Find the correspondence set (which is NP-hard), and
(ii) Find the alignment transform (which is O(n^2)).
Flexible alignment vs. Rigid alignment
Rigid alignment Flexible alignment
Alignment Algorithms
Structure alignment methods
• DALI: Uses 2D distance matrices between Cα atoms to represent each structure. Conceptually, the alignment problem is then straightforward: overlay the matrices so that their overlap is maximal.
Holm and Sander. Protein structure comparison by alignment of distance matrices. J Mol Biol 1993, 233:123-138.
• CE (Combinatorial Extension): Uses characteristics of local geometry to seed structural alignments and then joins these regions of local similarity (called aligned fragment pairs, AFPs) into an “optimal” path for the full alignment. Bottom-up approach.
Shindyalov and Bourne. Protein structure alignment by incremental combinatorial extension (CE) of optimal path. Prot Eng 1998, 11:739-747.
• SSAP (Sequential Structure Alignment Program): Uses a “double dynamic programming” algorithm with high-level and low-level matrices. Used in the CATH classification.
Taylor and Orengo. Protein structure alignment. J Mol Biol 1989, 208:1-22.
• VAST (Vector Alignment Search Tool), TM (Template Modelling)-align and many more…
The DALI Algorithm
• Distance mAtrix aLIgnment
• Liisa Holm and Chris Sander, “Protein structure comparison by alignment of distance matrices”, Journal of Molecular Biology Vol. 233, 1993.
• Liisa Holm and Chris Sander, “Mapping the protein universe”, Science Vol. 273, 1996.
• Liisa Holm and Chris Sander, “Alignment of three-dimensional protein structures: network server for database searching”, Methods in Enzymology Vol. 266, 1996.
How does DALI work?
• Based on the fact that similar 3D structures have similar intra-molecular distances.
• Background idea
– Represent each protein as a 2D matrix storing intra-molecular distances.
– Place one matrix on top of the other and slide it vertically and horizontally until a common sub-matrix with the best match is found.
• Actual implementation
– Break each matrix into small sub-matrices of fixed size.
– Pair up similar sub-matrices (one from each protein).
– Assemble the sub-matrix pairs to get the overall alignment.
[Figure: Protein A, Protein B]
Structure Representation of DALI
• The 3D shape is described with a distance matrix which stores all intra-molecular distances between the Cα atoms.
• The distance matrix is independent of the coordinate frame.
• It contains enough information to reconstruct the 3D coordinates (except for overall chirality).
     1    2    3    4
1    0    d12  d13  d14
2    d12  0    d23  d24
3    d13  d23  0    d34
4    d14  d24  d34  0
[Figure panels: Protein A; distance matrix for Protein A; distance matrices for 2drpA and 1bbo]
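The following sketch builds such a distance matrix with NumPy and checks that it does not change when the same points are expressed in a rotated and shifted coordinate frame. The four Cα positions are invented for illustration and do not come from a real structure.

```python
import numpy as np

def distance_matrix(ca_coords):
    """Symmetric matrix of all pairwise distances between C-alpha positions."""
    ca = np.asarray(ca_coords, dtype=float)
    diff = ca[:, None, :] - ca[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Four made-up C-alpha positions (in Å).
ca_coords = np.array([[0.0, 0.0, 0.0],
                      [3.8, 0.0, 0.0],
                      [3.8, 3.8, 0.0],
                      [0.0, 3.8, 0.0]])
d = distance_matrix(ca_coords)

# The same points in another coordinate frame: 90 degree rotation about z plus a shift.
rotation = np.array([[0.0, -1.0, 0.0],
                     [1.0,  0.0, 0.0],
                     [0.0,  0.0, 1.0]])
shifted = ca_coords @ rotation.T + np.array([10.0, -4.0, 7.0])
d_shifted = distance_matrix(shifted)

print(np.round(d, 2))
print(np.allclose(d, d_shifted))  # the matrix is independent of the coordinate frame
```

Because the matrix is frame-independent, two proteins can be compared without first superimposing them.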
Intra-molecular distance for myoglobin
DALI Algorithm
1. Decompose the distance matrix into elementary contact patterns (sub-matrices of fixed size)
• Use hexapeptide-hexapeptide contact patterns.
2. Compare contact patterns (pair-wise), and store the matching pairs in a pair list.
3. Assemble pairs in the correct order to yield the overall alignment.
Overview of the Dali Algorithm
Starting with a contact map, Dali attempts to maximize the overlap of the contact maps; however, doing so globally is NP-hard, so the method focuses on local comparisons.
Image from Amy Keating at MIT
Image from Mark Maciejewski at UConn
Overview of the Dali Algorithm
The DALI (Distance matrix alignment) algorithm is based on distance matrix comparison methods.
Similarity score:
$S = \sum_{i=1}^{L} \sum_{j=1}^{L} \phi(i, j)$
• i and j are equivalent residues in A and B
• L is the number of such pairs, i.e. the size of the substructure
• $\phi(i, j)$ is the similarity measure based on the Cα distances $d^A_{ij}$ and $d^B_{ij}$
[Figure: Structure A with residues iA, jA and distance $d^A_{ij}$; Structure B with residues iB, jB and distance $d^B_{ij}$]
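To make the double sum concrete, here is a small sketch. The slides do not spell out the similarity measure itself, so the code assumes a simple form, a threshold minus the absolute distance deviation, with an illustrative threshold of 1.5 Å. Both the form and the value are assumptions for this example, as are the distance matrices.

```python
import numpy as np

THETA = 1.5  # assumed similarity threshold in Å (illustrative value)

def similarity_score(dist_a, dist_b, equivalences, theta=THETA):
    """S = sum over i, j of phi(i, j) for matched residue pairs.

    equivalences is a list of (index_in_A, index_in_B) tuples of length L.
    phi(i, j) = theta - |dA_ij - dB_ij| is an assumed form of the similarity
    measure; unmatched residues never enter the sum.
    """
    idx_a = [a for a, _ in equivalences]
    idx_b = [b for _, b in equivalences]
    sub_a = dist_a[np.ix_(idx_a, idx_a)]   # L x L distances in A
    sub_b = dist_b[np.ix_(idx_b, idx_b)]   # L x L distances in B
    phi = theta - np.abs(sub_a - sub_b)
    return float(phi.sum())

# Made-up 3-residue distance matrices; only the 1-3 distance differs (6.0 vs 7.0 Å).
dist_a = np.array([[0.0, 3.8, 6.0],
                   [3.8, 0.0, 3.8],
                   [6.0, 3.8, 0.0]])
dist_b = np.array([[0.0, 3.8, 7.0],
                   [3.8, 0.0, 3.8],
                   [7.0, 3.8, 0.0]])
equivalences = [(0, 0), (1, 1), (2, 2)]
print(similarity_score(dist_a, dist_b, equivalences))
```

Under this assumed form, identical substructures score L² times the threshold, and each deviating distance lowers the score; a pair that deviates by more than the threshold contributes negatively.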
Dali – step by step
1. Compute distance matrices for both protein A and B
2. Extract a full set of overlapping hexapeptide (6x6) sub-matrices (also called contact patterns) from each matrix
3. Each 6x6 distance matrix from protein A is compared with each 6x6 distance matrix in protein B.
[Figure: 6x6 Cα distance matrices $d^A_{ij}$ and $d^B_{ij}$ and their element-wise difference $d^A_{ij} - d^B_{ij}$]
For example: 6.2 – 12.7 = -6.5
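Steps 2 and 3 can be sketched as follows. The helper names are hypothetical, and the toy "protein" is just eight points on a straight line, so this only illustrates how the 6x6 sub-matrices are cut out of a distance matrix and subtracted element by element.

```python
import numpy as np

def contact_patterns(dist, size=6):
    """Yield (i, j, submatrix) for every pair of hexapeptide start positions i <= j.

    The submatrix holds the distances between residues i..i+5 and j..j+5.
    """
    n = dist.shape[0]
    for i in range(n - size + 1):
        for j in range(i, n - size + 1):
            yield i, j, dist[i:i + size, j:j + size]

def pattern_difference(sub_a, sub_b):
    """Element-wise difference dA_ij - dB_ij of two contact patterns."""
    return sub_a - sub_b

# Toy chain: 8 residues spaced 3.8 Å apart on a line (not a real structure).
positions = 3.8 * np.arange(8)
dist = np.abs(positions[:, None] - positions[None, :])

patterns = list(contact_patterns(dist))
print(len(patterns))            # 3 hexapeptides -> 6 pairs with i <= j
print(patterns[0][2].shape)     # each contact pattern is a 6x6 matrix

# The slide's single-entry example, applied to whole matrices.
diff = pattern_difference(np.full((6, 6), 6.2), np.full((6, 6), 12.7))
print(np.round(diff[0, 0], 1))
```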
Dali – step by step
Consider protein A with 100 residues, meaning we have 100 - 5 = 95 hexapeptides and (95^2)/2 ≈ 4,512 contact pattern matrices.
Consider protein B with 150 residues, meaning 150 - 5 = 145 hexapeptides and (145^2)/2 ≈ 10,512 contact pattern matrices.
Even for these two relatively small proteins, there would be 4,512 x 10,512 = 47,430,144 comparisons between A and B.
Step 1: For each hexapeptide, a distance matrix compares it to every other hexapeptide within its structure (fill matrix).
Step 2: All distance matrices created in step 1 for one protein are compared with all those of the other protein.
“And again…”
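The arithmetic above can be reproduced in a few lines. The function names are made up, and the estimate simply follows the slide's rule of thumb (number of hexapeptides squared, divided by two, rounded down).

```python
def hexapeptide_count(n_residues, size=6):
    """Number of overlapping windows of length `size` in a chain."""
    return n_residues - size + 1

def contact_pattern_estimate(n_residues):
    """Slide estimate: about (number of hexapeptides)^2 / 2 contact patterns."""
    h = hexapeptide_count(n_residues)
    return h * h // 2

patterns_a = contact_pattern_estimate(100)   # 95 hexapeptides
patterns_b = contact_pattern_estimate(150)   # 145 hexapeptides
print(patterns_a, patterns_b, patterns_a * patterns_b)  # 4512 10512 47430144
```

The number of comparisons grows roughly with the fourth power of the chain lengths' product scale, which is why the later steps prune and sample instead of comparing everything.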
Dali – step by step
4. Each contact pattern in protein A is paired with its most similar pattern in protein B, a process that generates a pair list
5. The list is sorted by the strength of pair similarity of contact patterns:
$S = \sum_{i=1}^{L} \sum_{j=1}^{L} \phi(i, j)$
(i, j label pairs of matched residues; L is the number of these pairs; $\phi(i, j)$ is the similarity measure)
A note about the similarity measure $\phi$: we want to maximize the number of equivalent residues while minimizing structural variation, which is a tradeoff. That is, if the criteria are so strict that minor structural deviations are not allowed, then the number of matching contact patterns is likely to be very small.
Note that unmatched residues do not contribute to the overall similarity score S.
Image from Amy Keating at MIT
6. Merging contact patterns to form chains and reduce complexity
The search space is reduced because only the central contact pattern is retained (actually, the one that gives the smallest average intra-pattern distance).
Dali – step by step
7. After removing the overlapping patterns, we are still left with far too many contact patterns to compare all possible pairs exhaustively.
Start comparing pairs at random:
• Keep a list of positive scores (discard negative scores)
• Keep comparing until the list has 80,000 positive scores
• Sort the list and keep the best 40,000 contact pattern matches.
8. End game: find the optimal alignment of the 40,000 contact patterns such that the alignment covers as wide a range of the structural pair as possible.
Dali – step by step
Assembly of Alignments
• Non-trivial combinatorial problem.
• Assembled in the manner (AB) – (A’B’), (BC) – (B’C’), . . . (i.e., having one overlapping segment with the previous alignment)
• Available alignment methods:
– Monte Carlo optimization
– Branch-and-bound
– Neighbor walk
• Using Markov Chain Monte Carlo (MCMC), start with a random contact pattern from the list of 40,000, and then “walk” to another overlapping pattern (which must extend the contact pattern by 4 residues) using the standard Metropolis criterion.
Monte Carlo Optimization
• Used in the earlier versions of DALI.
• Algorithm
– Compute a similarity score for the current alignment.
– Make a random trial change to the current alignment (adding a new pair or deleting an existing pair).
– Compute the change in the score (ΔS).
– If ΔS > 0, the move is always accepted.
– If ΔS <= 0, the move may be accepted with probability exp(β * ΔS), where β is a parameter.
– Once a move is accepted, the change in the alignment becomes permanent.
– This procedure is iterated until there is no further change in the score, i.e., the system has converged.
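The acceptance rule above can be sketched generically. This is not DALI's implementation: the toy "alignment" is a set of residue pairs with invented per-pair scores, the `score` and `propose` functions are hypothetical hooks, and a fixed number of steps stands in for the convergence test.

```python
import math
import random

def metropolis_accept(delta_s, beta, rng=random):
    """Metropolis criterion for a score that should be maximized."""
    if delta_s > 0:
        return True
    return rng.random() < math.exp(beta * delta_s)

def monte_carlo_optimize(initial, score, propose, beta=1.0, steps=10_000, seed=0):
    """Repeat random trial moves and keep the best alignment seen."""
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        if metropolis_accept(candidate_score - current_score, beta, rng):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score

# Toy problem: invented scores for four candidate residue pairs.
pair_scores = {("a1", "b1"): 2.0, ("a2", "b2"): 1.5,
               ("a3", "b7"): -3.0, ("a4", "b4"): 0.5}
pairs = sorted(pair_scores)

def score(alignment):
    return sum(pair_scores[p] for p in alignment)

def propose(alignment, rng):
    pair = rng.choice(pairs)
    return alignment ^ {pair}   # add the pair if absent, delete it if present

best, best_score = monte_carlo_optimize(frozenset(), score, propose, beta=2.0, steps=2000)
print(sorted(best), best_score)
```

Occasionally accepting moves that lower the score is what lets the search leave local optima; a larger β makes such moves rarer.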
Branch-and-bound method
• Used in the later versions of DALI.
• Based on Lathrop and Smith’s (1996) threading (sequence-structure alignment) algorithm.
• The solution space consists of all possible placements of residues in protein A relative to the segment of residues of protein B.
• The algorithm recursively splits the part of the solution space that yields the highest upper bound of the similarity score until there is a single alignment trace left.
Statistical significance of Dali alignments
Dali uses a Z-score to show the significance of an alignment.
A common and practical approach to the problem of assessing alignment significance is to determine if the alignment score is better than one could expect by chance. Dali compares each alignment score against an all-to-all protein structure comparison (normalized by length), which defines the Z-score:
$Z = \frac{S - \bar{S}}{\sigma}$
S: raw score
$\bar{S}$: average score
$\sigma$: standard deviation
• Dali Z-scores > 2 are thought to be meaningful.
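A small sketch of the Z-score computation using only the standard library. The background scores are invented, and Dali's length normalization is not reproduced here; the code only shows how a raw score is compared with the mean and standard deviation of a background distribution.

```python
import statistics

def z_score(raw_score, background_scores):
    """Z = (S - mean) / standard deviation of the background scores."""
    mean = statistics.mean(background_scores)
    sd = statistics.pstdev(background_scores)
    if sd == 0:
        raise ValueError("background scores have zero spread")
    return (raw_score - mean) / sd

# Hypothetical background of comparison scores and one raw alignment score.
background = [10.0, 12.0, 14.0, 16.0, 18.0]
z = z_score(20.0, background)
print(round(z, 2), z > 2)
```

Here the raw score lies about 2.1 standard deviations above the background mean, so it would just pass the Z > 2 rule of thumb.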
Schematic View of DALI Algorithm
• In the first step of the algorithm, similar submatrices of size six in two proteins are found by comparing their distance matrices.
• These comparisons result in alignments of size six between two proteins. Then, compatible alignments are merged to obtain larger alignments called seeds.
[Figure: schematic view of the DALI algorithm with three panels: 3D (Spatial), 2D (Distance Matrix), 1D (Sequence)]
14.3
Tim Conrad
AG Medical Bioinformatics
www.mecicalbioinformatics.de
More information online at
medicalbioinformatics.de/teaching
Thank you!
Further questions