1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

  • date post

    22-Dec-2015

Page 1: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

1

Seminar in structural bioinformatics

Pairwise Structural Alignment

Presented by: Dana Tsukerman

Page 2: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

2

Outline

• Definitions.
• What is structural alignment?
• Why structural alignment? Structural alignment vs. sequence alignment.
• Problem definition.
• Background: preparing the ground for the algorithm.
• The algorithm.

Page 3: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

3

Outline - cont.

• Implementation of the algorithm, with an example of using real software based on the algorithm that will be presented.
• Method results.
• Method discussion.
• Method summary.
• Extensions and additional features - a look ahead.
• Lecture summary.

Page 4: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

4

Definitions

• Sequence alignment (reminder from last lecture): unambiguously distinguishes between protein pairs of similar structure and of non-similar structure only when the pairwise sequence identity is high.

• Structure alignment: compares the three-dimensional structures directly. It is the precise arrangement of the amino acid side chains in the three-dimensional structure of the protein that dictates its function.

Page 5: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

5

Quick review - Basic terms

• Primary structure refers to the order (sequence) of amino acids along one chain.
• Some regions form regular local structure (folding patterns):
  • Alpha helices
  • Beta strands
  • Collectively called secondary structure elements (SSEs).
  • Regions connecting SSEs are loops.

• Secondary structure is the description of the type and locations of the SSEs.

• Tertiary structure is the 3-D coordinates of the atoms in a chain.

• Quaternary structure describes the spatial packing of several folded chains (not all proteins have a quaternary structure).

Page 6: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

6

Quick review - Basic terms

[Figure: regular hydrogen bond patterns of backbone atoms]

Page 7: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

7

3D observation of proteins

[Figure labels: alpha-helix, beta-sheet, loop and turn, turn or coil]

Page 8: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

8

3D observation of proteins

“If one looks at the collection of protein structures, one is reminded of the works of an Origami artist: Certain basic folding patterns are used over and over again and cleverly modified by minor adaptations to generate a wide variety of different protein structures. Where one such folding units is insufficient to generate the required complexity, multiple domains can be combined, such as in the camel or giraffe structure on this picture.”

Page 9: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

9

Comparison in 3D

• Starting from an example:

[Figure: two structures, each composed of substructures labeled A, B, C, D and E]

Page 10: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

10

Comparison in 3D

• Rotation and translation coordinates - 6 degrees of freedom.

• The method is independent of the amino acid sequence. What does it mean? The method is insensitive to insertions, deletions and displacements of equivalent substructures between the molecules being compared.

• Proteins with similar sequences adopt very similar structures.

Page 11: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

11

Why 3D comparison?

Page 12: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

12

Why 3D comparison?

Wait a minute - isn’t sequence comparison enough?

Page 13: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

13

Why 3D comparison?

• Structures are more conserved than sequences.
• Detection of distant evolutionary relationships.
• Structural alignment can imply a functional similarity that isn’t detectable from a sequence alignment.
• The protein docking problem.
• Structure based drug design.
• Applications and implications to the protein folding problem.

Page 14: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

14

Why 3D comparison? Cont.

• For homologous proteins, this provides the “gold standard” for sequence alignment.

• For nonhomologous proteins, it allows us to identify common substructures of interest.

• Allows us to classify proteins into clusters, based on structural similarity.

• Design and engineering of synthetic proteins.

Page 15: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

15

Problem Definition

• Input: 3-D coordinate data of the structures to be compared.

• Output: regions of structural similarity (possibly more than one) that lead to the “best” alignment.

• NP-Hard.

What’s “best”? The most atoms matched with the lowest RMSD.

Page 16: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

16

Our goal

Find the correspondence between the structures and the transformation T that superimposes them.

Page 17: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

17

Preparing the ground

• Transformation: definition.
• How can we evaluate the match we found? RMSD: review from the opening lecture.
• Other methods besides the one we will discuss, and why our method is better.
• Progression rule: definition.
• PDB: functionality review.
• Geometric hashing: introduction.

Page 18: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

18

3-D Transformation

• Rotation - the movement of a body in such a way that any given point of that body remains at a constant distance from some other fixed point. Will be denoted by R.

• Translation - the transformation of moving every point by a fixed distance in the same direction (addition of a constant vector to every point). Will be denoted by T.

• What is preserved under translation and rotation? relative distances within an object (e.g. Shapes)

• In total, the 3-D transformation has 6 degrees of freedom: 3 for rotation and 3 for translation.

$x \mapsto Rx + T, \quad x \in \mathbb{R}^3$
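The claim that rotation and translation preserve relative distances can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper names (`rotation_z`, `apply_rigid`, `pairwise_distances`) and the sample coordinates are made up for illustration and do not come from the lecture.

```python
import numpy as np

def rotation_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def apply_rigid(points, R, T):
    """Apply x -> R x + T to every row of an (n, 3) array."""
    return points @ R.T + T

def pairwise_distances(points):
    """All pairwise Euclidean distances as an (n, n) array."""
    diff = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Four hypothetical atom positions (arbitrary units).
points = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [3.8, 3.8, 0.0],
                   [0.0, 0.0, 5.0]])

R = rotation_z(np.pi / 3)           # 1 of the 3 rotational degrees of freedom
T = np.array([10.0, -2.0, 4.5])     # 3 translational degrees of freedom
moved = apply_rigid(points, R, T)

# The coordinates change, but the distances inside the object do not.
distances_preserved = np.allclose(pairwise_distances(points),
                                  pairwise_distances(moved))
```

Only a rotation about one axis is shown; a general rotation composes three such rotations, which is where the 3 rotational degrees of freedom come from.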

Page 19: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

19

RMSD - review

• A tool we use to evaluate the correspondence we found.
• RMSD - Root Mean Square Deviation:

$$\mathrm{RMSD}(n; x, y) = \sqrt{\frac{\sum_i \|x_i - y_i\|^2}{n}}$$

where
• n = number of atoms
• x, y = the proteins we want to compare (structures)

• We want to find the 3-D transformation T* such that the RMSD will be minimal, i.e.:

$$\mathrm{RMSD}(n; T^*(x), y) = \min_T \sqrt{\frac{\sum_i \|T(x_i) - y_i\|^2}{n}}$$

• We know how to do that in O(n).
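The slide does not say which procedure finds the optimal transformation. One standard closed-form choice is the Kabsch method (an SVD of a 3x3 covariance matrix), which runs in time linear in the number of paired atoms. The sketch below uses it as an assumption, not as the lecture's stated method; the function names and the test data are hypothetical.

```python
import numpy as np

def rmsd(x, y):
    """RMSD between two (n, 3) arrays whose rows are already paired."""
    return float(np.sqrt(np.sum((x - y) ** 2) / len(x)))

def best_superposition(x, y):
    """Return (R, T) minimizing the RMSD between R x_i + T and y_i (Kabsch method)."""
    cx, cy = x.mean(axis=0), y.mean(axis=0)
    H = (x - cx).T @ (y - cy)                # 3x3 covariance, built in O(n)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # avoid returning a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    T = cy - R @ cx
    return R, T

# Hypothetical data: y is a rotated and translated copy of x.
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 3))
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
y = x @ R_true.T + np.array([1.0, 2.0, 3.0])

R, T = best_superposition(x, y)
aligned = x @ R.T + T
rmsd_before = rmsd(x, y)
rmsd_after = rmsd(aligned, y)    # close to zero for an exact rigid copy
```

Note that this assumes the correspondence between atoms is already known; finding that correspondence is the hard part the rest of the lecture addresses.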

Page 20: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

20

RMSD - Example

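$X_i^1 = (x_{i1}^1, x_{i2}^1, x_{i3}^1)$

$X_j^2 = (x_{j1}^2, x_{j2}^2, x_{j3}^2)$

[Figure: atom i of the first structure and atom j of the second structure]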

Page 21: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

21

Other methods for structural alignment

• Dynamic programming - building a score matrix, with a score for each pair of residues; complexity of order n^5 or n^4.
• Other improvements of that method.
• Simplifying the problem by moving from 3-D space to 2-D space, sacrificing the optimal result for speed.
• Comparing secondary structure elements (SSEs).
• Our method allows access to problems that couldn’t be approached previously by sequence-order-dependent structural comparison methods, like the docking problem.

Page 22: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

22

Progression rule

• Rule definition: for elements i and k from one sequence and elements j and l from the other sequence, if element i is matched to element j and element k is matched to element l, and if k is to the right of i in the first sequence, then l must also be to the right of j in the second sequence.

• For example, the structures we saw at the beginning couldn’t be found similar by a progression-rule-based (sequence-dependent) method.
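The progression rule amounts to requiring that a list of matched pairs is order-preserving. A minimal sketch of that check follows; the function name and the two sample match lists are hypothetical.

```python
def obeys_progression_rule(matches):
    """matches: list of (i, j) pairs, i indexing the first sequence, j the second.

    Returns True if, after sorting by i, the j values are strictly increasing,
    i.e. whenever k is to the right of i, its partner l is to the right of j.
    """
    ordered = sorted(matches)
    js = [j for _, j in ordered]
    return all(a < b for a, b in zip(js, js[1:]))

# A match that keeps the sequential order in both chains.
sequential = [(0, 2), (1, 3), (4, 7)]
# A match where the last element of chain 1 maps to the start of chain 2.
permuted = [(0, 5), (1, 6), (4, 1)]

sequential_ok = obeys_progression_rule(sequential)
permuted_ok = obeys_progression_rule(permuted)
```

A sequence-order-independent method accepts both lists as candidate matches; a progression-rule-based method can only ever report the first kind.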

Page 23: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

23

PDB - Protein Data Bank

http://www.rcsb.org/pdb/index.html

• International structure database.
• Archive of experimentally determined 3-D structures of biological macromolecules, together with extensive annotation.
• Established at Brookhaven National Laboratories in 1971. In the beginning it held 7 structures.

• In 2003, 4,831 structures were deposited to the PDB archive.

• January 2004 snapshot: 23,792 released atomic coordinate entries.

Page 24: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

24

Geometric hashing

• Introduced for model-based object recognition in computer vision.

• Goal: identify and locate in an image all the instances of models which appear in the system’s database.

• Represents the objects to be compared in a translation- and rotation-invariant fashion.

• The first step of the algorithm presented today is based on it.

Page 25: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

25

Geometric hashing - cont.

• We look for a way to represent objects such that moving them doesn’t change the representation.

HOW? Building triangles!

For n nodes: n^3 triangles!

The lengths of a triangle’s sides don’t change when we translate or rotate it, and are thus invariant!
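The invariance of triangle side lengths can be illustrated directly. This is a small sketch assuming NumPy, with made-up coordinates; `triangle_sides` is a hypothetical helper. It enumerates unordered triples of points, which is of the same cubic order as the count quoted on the slide.

```python
from itertools import combinations

import numpy as np

def triangle_sides(points, i, j, k):
    """Sorted side lengths of the triangle with vertices i, j, k."""
    sides = (np.linalg.norm(points[i] - points[j]),
             np.linalg.norm(points[j] - points[k]),
             np.linalg.norm(points[k] - points[i]))
    return np.sort(sides)

# Five hypothetical nodes.
points = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [5.0, 3.5, 0.5],
                   [2.0, 6.0, 1.5],
                   [-1.0, 4.0, 3.0]])

# Rotate about the x axis and translate.
angle = 0.9
rot_x = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(angle), -np.sin(angle)],
                  [0.0, np.sin(angle),  np.cos(angle)]])
moved = points @ rot_x.T + np.array([4.0, -7.0, 2.5])

triples = list(combinations(range(len(points)), 3))
invariant = all(
    np.allclose(triangle_sides(points, *t), triangle_sides(moved, *t))
    for t in triples
)
```

Because the side lengths survive any rigid motion, they can serve as lookup keys that do not depend on how either molecule happens to be positioned.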

Page 26: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

26

now please pay attention…

Page 27: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

27

The algorithm – major steps

1. Find (relatively small) subsets of the structures that form an initial match;

2. Find clusters in initial matches that represent similar transformations;

3. Extend the clusters to contain additional matching pairs of residues.

Page 28: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

28

Motivation reminder

Page 29: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

29

Step 1 - Finding seed matches

• Goal: search through the structures to find candidate initial matches.

• Remember what we talked about in geometric hashing?

• How to represent?

• Extensive search of the structures.

• Most difficult and time consuming step.

• These will be referred to as seed matches.

Page 30: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

30

Finding seed matches - cont.

• Seed match - list of matching pairs of atoms.

• Model & Target.

• Assumption: the structures to be compared are described by sets of interest points and their 3-D coordinates (for example: Cα atoms).

• Pair - correspondence between atoms from different structures.

Page 31: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

31

Finding seed matches - cont.

• Redefinition of the problem: is there a rotated and translated subset of the interest points of the target which matches those of the model?

• Two phases:
  • Preprocessing
  • Recognition

Page 32: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

32

Preprocessing - intro.

• Goal: represent the information about the atoms of the model molecule in a rotation and translation invariant manner.

• Off-line. Why?
• This information will be later used in the recognition phase.
• 3 non-collinear atoms specify a unique orthonormal reference frame (unique coordinate system). This will be a full reference frame.

Page 33: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

33

Preprocessing - intro.

• We won’t use a full reference frame: only 2 atoms (not unique). These 2 atoms will be called the reference set.

• Each atom b in the molecule is represented by the triplet of distances of the sides of the triangle formed between b and the atoms of the reference set.

[Figure: triangle formed by atom b and the reference set (c, a)]

Page 34: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

34

Reference frames - clarification

Same shape, different reference frames

Note: the example is in the 2-D case (basic ideas the same as the 3-D case)

Page 35: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

35

Preprocessing

How to store the information efficiently?

Page 36: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

36

Preprocessing

• Hash table:
  • Holds a representation of each model atom.
  • Key: the triplet of distances (from the atom to the reference pair).
  • Value: the corresponding reference pair and the atom which obtained this key.
• Note:
  • Each atom has a redundant representation in all possible reference sets.
  • Many triangles can occupy the same hash table entry.
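The preprocessing phase described above can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: the distances are quantized into bins of width `bin_size` (a hypothetical parameter) so that they can be used as dictionary keys, every ordered pair of atoms is tried as a reference set, and the coordinates are invented.

```python
from collections import defaultdict
from itertools import permutations

import numpy as np

def distance_key(points, a, c, b, bin_size=1.0):
    """Quantized side lengths of the triangle formed by atom b and reference set (a, c)."""
    sides = (np.linalg.norm(points[a] - points[c]),
             np.linalg.norm(points[a] - points[b]),
             np.linalg.norm(points[c] - points[b]))
    return tuple(int(s // bin_size) for s in sides)

def build_hash_table(model, bin_size=1.0):
    """Map each distance triplet to the list of (reference set, atom) that produced it."""
    table = defaultdict(list)
    n = len(model)
    for a, c in permutations(range(n), 2):      # every ordered reference set
        for b in range(n):
            if b in (a, c):
                continue
            table[distance_key(model, a, c, b, bin_size)].append(((a, c), b))
    return table

# Five hypothetical model atoms.
model = np.array([[0.0, 0.0, 0.0],
                  [3.5, 0.0, 0.0],
                  [3.5, 2.5, 1.0],
                  [0.5, 4.0, 2.0],
                  [-2.0, 1.5, 3.5]])

table = build_hash_table(model)
stored = sum(len(entries) for entries in table.values())  # n * (n - 1) * (n - 2) triangles
```

The triple loop makes the cubic cost visible: each atom is stored once for every reference set, which is the redundancy noted on the slide. With fixed-width bins, two nearly equal distances can still fall on opposite sides of a bin boundary, so a practical version would also look at neighboring bins.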

Page 37: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

37

Preprocessing Complexity Discussion

• The complexity is highly dependent on the invariants we use for hashing.

• Complexity: O(n^3)
  • n is the # of atoms in the model.

But… We can do better!

We will later see an optimization that will reduce the complexity to O(n^2).

Page 38: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

38

Preprocessing example

• Reference frame here is a pair of coordinates.

• For instance, in cell (3, 2) we find point #2, in both reference frames, and so we store those reference frames in the hash table H(3, 2).

Note: the example is in the 2-D case (basic ideas the same as the 3-D case)

Page 39: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

39

Recognition - intro.

• Goal: discover candidate matching substructures in the target and model molecules.

• Reference set - pair of atoms.
• Each such matching substructure is based on a certain reference set, which appears both in the model and target molecules.

Page 40: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

40

Recognition algorithm

• For each reference set of the target:
  • Hold a vote counter for each reference set appearing in the hash table.
    • Any ideas what it will hold?
    • Of course, it will hold the current number of matching atoms, and the list of matching pairs.
    • We will call this list the vote list.
  • In the beginning: the list is initialized with null.
  • Pick a target atom (take a predefined threshold distance into consideration).

Page 41: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

41

Recognition - cont.

  • Use the 3 sides of the triangle formed to compute their hash table key.
  • Access the hash table at this key.
  • Extract all the model triangles in this entry.
  • For each triangle:
    • Vote_counter++;
    • Vote_list.add(current_triangle);
  • Go back to picking another atom, until we have considered all of them.

Page 42: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

42

Recognition - cont.

• Check the vote counters of all the entries and consider the ones with a large # of votes.
• Verification.
• Choose another reference set in the target molecule and go back to the beginning.
• Complexity: O(n^3 * k)
  • k indicates the # of triangles in each hash table entry.
  • Can be of order O(n^2) after optimizing preprocessing.
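The recognition loop can be sketched end to end as below. It is a simplified illustration, not the lecture's code: keys are quantized distances with a hypothetical `bin_size`, no distance thresholds or verification step are applied, and the vote counter is simply the length of the vote list. The sample coordinates and the quarter-turn rotation are chosen so that every distance is computed exactly and lands in the same bin for model and target.

```python
from collections import defaultdict
from itertools import permutations

import numpy as np

def distance_key(points, a, c, b, bin_size=1.0):
    """Quantized side lengths of the triangle formed by atom b and reference set (a, c)."""
    sides = (np.linalg.norm(points[a] - points[c]),
             np.linalg.norm(points[a] - points[b]),
             np.linalg.norm(points[c] - points[b]))
    return tuple(int(s // bin_size) for s in sides)

def triangles(points):
    """Yield ((a, c), b) for every ordered reference set and every other atom b."""
    n = len(points)
    for a, c in permutations(range(n), 2):
        for b in range(n):
            if b != a and b != c:
                yield (a, c), b

def preprocess(model, bin_size=1.0):
    """Hash table from distance triplet to the model triangles that produced it."""
    table = defaultdict(list)
    for ref, b in triangles(model):
        table[distance_key(model, *ref, b, bin_size)].append((ref, b))
    return table

def recognize(table, target, bin_size=1.0):
    """Vote lists keyed by (target reference set, model reference set)."""
    votes = defaultdict(list)
    for t_ref, t_b in triangles(target):
        key = distance_key(target, *t_ref, t_b, bin_size)
        for m_ref, m_b in table.get(key, []):
            votes[(t_ref, m_ref)].append((m_b, t_b))   # pair (model atom, target atom)
    return votes

model = np.array([[0.0, 0.0, 0.0],
                  [3.5, 0.0, 0.0],
                  [3.5, 2.5, 1.0],
                  [0.5, 4.0, 2.0],
                  [-2.0, 1.5, 3.5]])
quarter_turn = np.array([[0.0, -1.0, 0.0],
                         [1.0,  0.0, 0.0],
                         [0.0,  0.0, 1.0]])
target = model @ quarter_turn.T + np.array([10.0, -2.0, 4.0])

votes = recognize(preprocess(model), target)
best_pair, best_list = max(votes.items(), key=lambda kv: len(kv[1]))
same_reference_votes = votes[((0, 1), (0, 1))]
```

Each high-scoring entry of `votes` is a candidate seed match: a pair of reference sets together with the list of atom pairs that voted for it. Coarse bins can also produce spurious votes, which is why a verification step follows on the slide.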

Page 43: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

43

Recognition example

For instance, let’s look at point #f: its coordinates are (0, 4), so this is the key into H. H(0,4) contains the reference frame (1,3), thus its counter will be increased (a vote for the base pairs in H) and the pair (7, f) will be added to the matched list.

Why (7, f)?

Note: the example is in the 2-D case (basic ideas the same as the 3-D case)

Page 44: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

44

Step 2 - Clustering

• Goal: clustering the seed matches that represent almost identical transformations.

• Why clustering? Many of the seed matches obtained in step 1 represent the same transformation (but contain different pairs of matching atoms).

• We use the lists of matching atoms to compute the 3-D rotation and translation, which gives us the minimal least squares distance between the target and the model.

Page 45: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

45

Clustering - cont.

• The computed 3-D transformation has 6 parameters: 3 for rotation (angles) and 3 for translation (distances).
• Join similar transformations into new groups.
• What’s similar?
  • Small 6-D distance between the parameter vectors of the transformations.
• Clustering algorithm (iterative):
  • At the beginning, each seed match forms a group represented by the 6 parameters of its transformation.

Page 46: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

46

Clustering - cont.

  • The pair of groups having the minimal distance between their transformations is chosen and a new group is formed by merging these two groups.
  • What will be the parameters of the new group?
  • A threshold is defined to determine an end to the algorithm.
• What do we have so far?
  • A # of groups, each representing one transformation obtained by averaging the individual transformations that were joined to the group.

Page 47: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

47

Clustering - cont.

• The seed match of a group is obtained by choosing matching pairs from the original seed matches that composed the group.
• But, we don’t take the union of all pairs!
• Improve accuracy by choosing pairs that appear in at least a certain percentage of the seed matches.
• The new correspondence lists are considered more reliable than in step 1.
• Complexity: O(m^2), where m = # of seed matches to be clustered.
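The iterative merging described in the clustering slides can be sketched as follows. Several things here are assumptions for illustration: the 6 parameters are averaged component-wise (which is only reasonable when the rotation angles are close and do not wrap around), the distance is plain Euclidean distance in 6-D, and the names `cluster_transformations`, `consensus_pairs` and `min_fraction` are invented. This naive version rescans all pairs on every merge, so it does not reach the quadratic bound quoted above.

```python
from collections import Counter

import numpy as np

def cluster_transformations(params, threshold):
    """Merge the closest groups until the closest pair is farther than threshold.

    params: (m, 6) array of transformation parameters, one row per seed match.
    Returns a list of (mean parameter vector, member indices).
    """
    groups = [(np.asarray(p, dtype=float), [i]) for i, p in enumerate(params)]
    while len(groups) > 1:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                d = np.linalg.norm(groups[a][0] - groups[b][0])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        members = groups[a][1] + groups[b][1]
        mean = np.mean([params[i] for i in members], axis=0)
        groups = [g for k, g in enumerate(groups) if k not in (a, b)]
        groups.append((mean, members))
    return groups

def consensus_pairs(seed_matches, members, min_fraction=0.5):
    """Keep pairs that appear in at least min_fraction of the group's seed matches."""
    counts = Counter(pair for i in members for pair in set(seed_matches[i]))
    return sorted(p for p, c in counts.items() if c / len(members) >= min_fraction)

# Four hypothetical seed matches: (3 angles, 3 translations) and their pair lists.
params = np.array([[0.10, 0.20, 0.30, 1.0, 2.0, 3.0],
                   [0.12, 0.21, 0.29, 1.1, 2.0, 3.1],
                   [1.50, 0.00, -0.40, 10.0, 0.0, 5.0],
                   [1.52, 0.02, -0.41, 10.2, 0.1, 5.0]])
seed_matches = [[(0, 0), (1, 1)], [(1, 1), (2, 2)], [(5, 7)], [(5, 7), (6, 8)]]

groups = cluster_transformations(params, threshold=1.0)
memberships = sorted(sorted(members) for _, members in groups)
shared = consensus_pairs(seed_matches, [0, 1], min_fraction=1.0)
```

Mixing angles and distances in one Euclidean norm implicitly weights them equally; a real implementation would likely scale the two kinds of parameters before comparing them.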

Page 48: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

48

Step 3 - Extending

• Goal: extend the correspondence lists from step 2 to contain additional matching pairs.

• Remember! The transformation representing each group was computed by taking the average of the initial transformations.

• How can we find more matches?
• Compute again a transformation which gives the minimal least squares distance between the matched pairs.

• The pairs that survive the second transformation are candidate additional matches.

Page 49: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

49

Extending - intro.

• $N_{it}$: # of iterations to extend each seed match (small constant).
• ε - maximum allowed distance.
• At iteration i we extend the match to contain pairs of atoms that lie at a maximum distance of $\frac{i}{N_{it}}\,\varepsilon$.

Page 50: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

50

Extending - algorithm

• For iteration i:
  • Find the transformation of the current match using a least squares procedure.
  • Transform the target according to this transformation.
  • Remove pairs from the current match that lie at a distance larger than $\frac{i}{N_{it}}\,\varepsilon$.
  • Extend the match by a heuristic matching algorithm (given a threshold value).
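The iteration above can be sketched as a short loop. The lecture does not specify the least squares procedure or the heuristic matching algorithm, so this sketch substitutes the Kabsch method for the former and a greedy closest-pair matching for the latter; both are assumptions, as are the names `superpose`, `extend_match`, `eps` and `n_it` and the random test data. It also assumes the current match always keeps at least 3 non-collinear pairs.

```python
import numpy as np

def superpose(x, y):
    """Least squares rigid transformation (R, T) taking rows of x onto rows of y (Kabsch)."""
    cx, cy = x.mean(axis=0), y.mean(axis=0)
    U, _, Vt = np.linalg.svd((x - cx).T @ (y - cy))
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cy - R @ cx

def extend_match(model, target, pairs, eps=3.0, n_it=3):
    """Iteratively refit, prune and extend a list of (model index, target index) pairs."""
    pairs = list(pairs)
    for i in range(1, n_it + 1):
        limit = i * eps / n_it                      # allowed distance grows with i
        m_idx = [m for m, _ in pairs]
        t_idx = [t for _, t in pairs]
        R, T = superpose(target[t_idx], model[m_idx])
        moved = target @ R.T + T                    # transform the whole target
        # Remove pairs that lie farther apart than the current limit.
        pairs = [(m, t) for m, t in pairs
                 if np.linalg.norm(model[m] - moved[t]) <= limit]
        used_m = {m for m, _ in pairs}
        used_t = {t for _, t in pairs}
        # Greedy extension: closest unmatched pairs first (a stand-in heuristic).
        candidates = sorted(
            (np.linalg.norm(model[m] - moved[t]), m, t)
            for m in range(len(model)) if m not in used_m
            for t in range(len(target)) if t not in used_t
        )
        for dist, m, t in candidates:
            if dist > limit:
                break
            if m in used_m or t in used_t:
                continue
            pairs.append((m, t))
            used_m.add(m)
            used_t.add(t)
    return sorted(pairs)

# Hypothetical data: the target is a rigidly moved copy of the model.
rng = np.random.default_rng(1)
model = 5.0 * rng.normal(size=(8, 3))
theta = 1.1
R_true = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(theta), 0.0, np.cos(theta)]])
target = model @ R_true.T + np.array([-3.0, 8.0, 1.0])

seed = [(0, 0), (1, 1), (2, 2)]
extended = extend_match(model, target, seed)
```

The all-pairs candidate search in this sketch is quadratic per iteration; the complexity of the real procedure depends on which heuristic matching algorithm is used.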

Page 51: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

51

Extending - cont.

• After $N_{it}$ iterations, repeat the first 3 steps to refine the last matching.
• Complexity: as the heuristic matching algorithm (O(n) or O(n^3)).
• Output: the best extended matches.
• A reminder: what is “best”?
  • # of matching pairs.
  • Minimal RMSD between the matching atoms.

Page 52: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

52

Preprocessing Optimization

• We can do better (complexity wise)!

• Conclusion: the triangles we will consider are those composed of three atoms whose atom-to-atom distances are below certain threshold.

• Assumption: there is spatial proximity between the atoms of the relevant matching substructures.

Page 53: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

53

Preprocessing Optimization - complexity discussion

• Maximum allowed distance between the atoms of the reference set: r1 = 5Å ($1\,\text{Å} = 10^{-10}$ m).
• Maximum allowed distance between a third point and the atoms of a reference set: r2 = 20Å.
• Theoretically, the complexity is now O(n).
• Practically, O(n^2).
• Example: 138 residues, 13,359 triangles.

[Figure: reference set atoms within distance r1 of each other, third atom within distance r2]

Page 54: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

54

Implementation - Examples

http://bioinfo3d.cs.tau.ac.il/c_alpha_match/prog.html

6LYZ vs. 2LZM

Result 1

Page 55: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

55

Implementation - Examples

1pmy vs. 1pza 1pmy vs. 1aaj

Page 56: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

56

“Rasmol” example

Page 57: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

57

Results of the algorithm

• 3-D comparison method that isn’t constrained by linear order of the amino acid chain.

• Self comparison - outputs the best match besides the trivial one. This could not be obtained by a sequence-dependent method.

• Successful on a wide range of protein comparison problems.

Page 58: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

58

Method discussion - cont.

• 2 factors in structural comparison (might be conflicting):
  • Sequential order conservation.
  • Geometric pattern conservation.
• Most known methods place a strict constraint on the search: sequential order conservation.
  • Much easier (structural alignment is NP-Hard).
• Linear order conservation isn’t necessarily undesirable:
  • When comparing proteins whose evolutionary relatedness is certain.
• But neither is it necessarily desirable:
  • If the exact evolutionary relationship between the structures is unknown.
  • Possible genetic mutations could have occurred.

Page 59: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

59

Method discussion

• Sequence independent:
  • Helps find common 3-D folding units.
  • Deals with the question of convergence to a similar structure or divergence from a common ancestor.
• Classical example: TIM barrel proteins.
  • Demonstrates that a strictly linear match is not the best geometrical match between two barrel structures.

Page 60: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

60

Method summary

• Based on the geometric hashing paradigm.
• Pure 3-D approach (sequence-independent).
• No a priori knowledge of the motifs or initial alignment is required.
• Not sensitive to insertions, deletions, gaps or displacements of equivalent substructures between the molecules being compared.
• Efficient and fully automated.
• Seconds for typical pairwise comparisons.
• Successful on a wide range of protein comparison problems.

Page 61: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

61

Method summary - cont.

• In most of the examples, the best match corresponds to a linear alignment match.

• Provides a way to compare proteins without the bias of other methods (sequence dependent).

• Capable of discovering partial structural similarities.

• Sole criterion: geometry!
• Complexity: O(n^3)

Page 62: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

62

Extensions and additional features - a look ahead

• The method can be extended to allow simultaneous and efficient comparison of a target structure with a database of many model structures.

• Protein and amino acid properties can be exploited in the definition of the reference frame and thus taken into consideration in the algorithm.

• Different choices of interest points.
• Strategies to reduce the # of triangles.
• Assigning weights to the matches according to certain factors (recognition phase change).
• Extending and adapting the technique to be used in the docking problem.

Page 63: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

63

Lecture summary

• 3D observation of proteins.
• Why structural alignment?
  • Studies of catalogued motifs can aid in understanding the evolutionary relationship between the proteins.
  • The method presented allows addressing the question of the # of protein structural classes found in nature.
  • In particular, the availability of such a library is expected to aid in the investigation of the protein folding problem.
• Sequence alignment vs. structure alignment.
• Geometric hashing and its use in the algorithm.
• The algorithm and its implementation.
• Extensions and additional features - a look ahead.

Page 64: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

64

Page 65: 1 Seminar in structural bioinformatics Pairwise Structural Alignment Presented by: Dana Tsukerman.

65

That’s it…