Surface motifs by a computer vision technique: Searches, detection, and implications for...

15
PROTEINS Structure, Function, and Genetics 16278-292 (1993) Surface Motifs by a Computer Vision Technique: Searches, Detection, and Implications for Protein-Ligand Recognition Daniel Fischer,'" Raquel Norel,' Haim Wolfson,' and Ruth N u s s i n o ~ ~ ~ ~ 'Computer Science Department, School of Mathematical Sciences, and 'Sackler Institute of Molecular Medicine, Faculty of Medicine, Tel Aviv University, Tel Aviv 69978, Israel, and 3Laboratory of Mathematical Biology, PRIiDyncorp, NCI-FCRF, Frederick, Maryland 21 702 ABSTRACT We describe the application of a method geared toward structural and sur- face comparison of proteins. The method is based on the Geometric Hashing Paradigm adapted from Computer Vision. It allows for comparison of any two sets of 3-D coordinates, such as protein backbones, protein core or pro- tein surface motifs, and small molecules such as drugs. Here we apply our method to 4 types of comparisons between pairs of molecules: (1) comparison of the backbones of two protein do- mains; (2) search for a predefined 3-D c, motif within the full backbone of a domain; and in particular, (3) comparison of the surfaces of two receptor proteins; and (4) comparison of the surface of a receptor to the surface of a ligand. These aspects complement each other and can contribute toward a better understanding of protein structure and biomolecular recogni- tion. Searches for 3-D surface motifs can be car- ried out on either receptors or on ligands. The latter may result in the detection of pharma- cophoric patterns. If the surfaces of the binding sites of either the receptors or of the ligands are relatively similar, surface superpositioning may aid significantly in the docking problem. Currently, only distance invariants are used in the matching, although additional geometric surface invariants are considered. The speed of our Geometric Hashing algorithm is encourag- ing, with a typical surface comparison taking only seconds or minutes of CPU time on a SUN 4 SPARC workstation. The direct application of this method to the docking problem is also dis- cussed. We demonstrate the success of this method in its application to two members of the globin family and to two dehydrogenases. 0 1993 Wiley-Liss, Inc. INTRODUCTION Protein-protein and the more general receptor- ligand interactions occur via molecular recognition. Such recognition involves matchings of their sur- faces. There are two critical aspects to such surface matchings, geometry and chemistry. Geometry is fundamental as the surface atoms of the two mole- cules have to be positioned within acceptable dis- tances from each other. Furthermore, the simple geometric criteria should not be violated, that is, atoms belonging to the two, complexed molecules cannot occupy the same positions in space. Within this framework of geometry,the chemical properties of the atoms and atomic groups play a critical role. It is often the case that a given receptor molecule can bind different ligands. The converse, however, also holds, namely, different receptor molecules can bind the same ligand albeit with either similar or different affinities. If we assume that a given ligand binds different receptors through the same patch of the ligand's surface, then those receptors are ex- pected to possess similar surfaces. Thus, finding a recurring surface motif in different proteins can in- dicate either (1) that the protein-receptors interact with the same protein-ligand (or peptide) and thus that the surface motif can conceivably play a similar role or have the same mechanism in the proteins containing it, (2) that the protein receptors could be inhibited by the same protein-ligand, (3) that the receptor-protein can bind a similar DNA or RNA structure, or (4) that the receptor proteins could be blocked by the same drug. Searches for surface motifs can follow two ap- proaches: we can look for surfaces which are geomet- rically similar or for surfaces possessing similar chemical properties. Whereas the former is based on shape descriptors, the latter considers charges, hy- Key words: protein structural comparison, 3-D protein motifs, surface motifs, docking, computer vision, geomet- ric hashing 0 1993 WILEY-LISS, INC. Received September 4,1992; revision accepted January 29, 1993. Address reprint requests to Dr. Ruth Nussinov, Laboratory of Mathematical Biology, PRIiDyncorp, National Cancer Insti- tute-FCRF, Bldg. 469, room 151, Frederick, MD 21702.

Transcript of Surface motifs by a computer vision technique: Searches, detection, and implications for...

Page 1: Surface motifs by a computer vision technique: Searches, detection, and implications for protein–ligand recognition

PROTEINS Structure, Function, and Genetics 16278-292 (1993)

Surface Motifs by a Computer Vision Technique: Searches, Detection, and Implications for Protein-Ligand Recognition

Daniel Fischer,'" Raquel Norel,' Haim Wolfson,' and Ruth N u s s i n o ~ ~ ~ ~ 'Computer Science Department, School of Mathematical Sciences, and 'Sackler Institute of Molecular Medicine, Faculty of Medicine, Tel Aviv University, Tel Aviv 69978, Israel, and 3Laboratory of Mathematical Biology, PRIiDyncorp, NCI-FCRF, Frederick, Maryland 21 702

ABSTRACT We describe the application of a method geared toward structural and sur- face comparison of proteins. The method is based on the Geometric Hashing Paradigm adapted from Computer Vision. It allows for comparison of any two sets of 3-D coordinates, such as protein backbones, protein core or pro- tein surface motifs, and small molecules such as drugs. Here we apply our method to 4 types of comparisons between pairs of molecules: (1) comparison of the backbones of two protein do- mains; (2) search for a predefined 3-D c, motif within the full backbone of a domain; and in particular, (3) comparison of the surfaces of two receptor proteins; and (4) comparison of the surface of a receptor to the surface of a ligand. These aspects complement each other and can contribute toward a better understanding of protein structure and biomolecular recogni- tion. Searches for 3-D surface motifs can be car- ried out on either receptors or on ligands. The latter may result in the detection of pharma- cophoric patterns. If the surfaces of the binding sites of either the receptors or of the ligands are relatively similar, surface superpositioning may aid significantly in the docking problem. Currently, only distance invariants are used in the matching, although additional geometric surface invariants are considered. The speed of our Geometric Hashing algorithm is encourag- ing, with a typical surface comparison taking only seconds or minutes of CPU time on a SUN 4 SPARC workstation. The direct application of this method to the docking problem is also dis- cussed. We demonstrate the success of this method in its application to two members of the globin family and to two dehydrogenases. 0 1993 Wiley-Liss, Inc.

INTRODUCTION Protein-protein and the more general receptor-

ligand interactions occur via molecular recognition. Such recognition involves matchings of their sur- faces. There are two critical aspects to such surface matchings, geometry and chemistry. Geometry is fundamental as the surface atoms of the two mole- cules have to be positioned within acceptable dis- tances from each other. Furthermore, the simple geometric criteria should not be violated, that is, atoms belonging to the two, complexed molecules cannot occupy the same positions in space. Within this framework of geometry, the chemical properties of the atoms and atomic groups play a critical role.

It is often the case that a given receptor molecule can bind different ligands. The converse, however, also holds, namely, different receptor molecules can bind the same ligand albeit with either similar or different affinities. If we assume that a given ligand binds different receptors through the same patch of the ligand's surface, then those receptors are ex- pected to possess similar surfaces. Thus, finding a recurring surface motif in different proteins can in- dicate either (1) that the protein-receptors interact with the same protein-ligand (or peptide) and thus that the surface motif can conceivably play a similar role or have the same mechanism in the proteins containing it, (2) that the protein receptors could be inhibited by the same protein-ligand, (3) that the receptor-protein can bind a similar DNA or RNA structure, or (4) that the receptor proteins could be blocked by the same drug.

Searches for surface motifs can follow two ap- proaches: we can look for surfaces which are geomet- rically similar or for surfaces possessing similar chemical properties. Whereas the former is based on shape descriptors, the latter considers charges, hy-

Key words: protein structural comparison, 3-D protein motifs, surface motifs, docking, computer vision, geomet- ric hashing

0 1993 WILEY-LISS, INC.

Received September 4, 1992; revision accepted January 29, 1993.

Address reprint requests to Dr. Ruth Nussinov, Laboratory of Mathematical Biology, PRIiDyncorp, National Cancer Insti- tute-FCRF, Bldg. 469, room 151, Frederick, MD 21702.

Page 2: Surface motifs by a computer vision technique: Searches, detection, and implications for protein–ligand recognition

SURFACE MOTIFS BY A COMPUTER VISION TECHNIQUE 279

drogen-bond formation, exposeaburied surface area, and hydrophobic interactions. In this work we focus on geometry.

Thus, the problem we are faced with is as follows. Given the three-dimensional coordinates of two (or more) receptors (or ligands) find similar 3-D patches of their surfaces. There are four major difficulties in geometric surface matching. The first of these con- cerns the fact that the problem is a three-dimen- sional pattern matching. The data we possess are the atomic coordinates. If the 3-D pattern is pre- defined and is small, then the problem is easier, and can be approached using, for example, the subgraph isomorphism technique.’ If, however, we are con- cerned with the more general problem, where we assume no such knowledge, then the problem is much more complex. In such a case, we need to com- pare every segment of the surface of one protein against every surface segment of the second protein. Furthermore, the size of the pattern is also un- known, which leads to the second difficulty: the com- binatorics of the search for (a priori unknown) sim- ilarity between two surfaces is large since the number of surface atoms in proteins is considerable. Thus, the size of the surface representation used as input to the matching algorithm is critical. The third difficulty concerns the decision as to which shape descriptors to use in the matching. Several alternatives are available. We can use the “raw” co- ordinates of the surface atoms. We can calculate the solvent accessible surface and use the coordinates of some surface dot^^,^ or we could choose special sur- face dots4 such as those at maximal or minimal cur- vature. Using these we can describe the surface in terms of “knobs” and “holes” enabling the matching of “knobs” with “knobs” and “holes” with “holes.” Alternatively, the molecular surface can be divided into domains with different curvature profiles facil- itating focusing only on “interesting” surface ~ a t c h e s . ~ In particular, when comparing the sur- faces of the active sites of two proteins, we can choose dots close to the surface that presuably are on, or close to the active sites.6

The fourth major difficulty is embedded in the fact that the surfaces of proteins are not rigid. Pairs of connected atoms rotate about their mutual bonds. Owing to this flexibility, the coordinates of corre- sponding atoms in similar proteins may be shifted. Furthermore, surface motifs are more variable than interior, volume motifs.

To enable matching of surfaces, what is needed is an efficient, automated algorithm for routine scan- ning of large 3-D databases. Such an algorithm should match the shape descriptors of two (or more) molecules of any size, where every piece of one is compared with every piece of the other indepen- dently of the rotation and translation of the pieces with respect to the full proteins.

In general, in order to compare two 3-D struc-

tures, one is required to determine an equivalence between analogous atoms and find the transforma- tion that best superimposes the structures. Auto- mated methods for finding the best superposition (least squares fitting) given an initial alignment are available.’ However, the tedious determination of analogous atoms in the two sets is done manually. Since such equivalencing is not obvious, this is fre- quently a very difficult task.

To date, automated methods for protein structural comparison have focused on the comparison of the protein backbones These methods find the best superposition of the backbone of one protein onto the backbone of the other by equivalencing pairs of atoms that occupy equivalent positions in each protein. An absolute requirement of all these methods is that the sequential order of the amino acids be conserved. This constraint is called the “progression rule.”* Techniques which allow for some violations of the progression rule have recently been Nevertheless, as the compari- son is performed by the matching of contiguous frag- ments,lo*’’ or by the matching of the secondary structure elernents,l2 a strict linear order is re- quired within the matching units. Linear, backbone comparison is of interest in the cases where the glo- bal folds of the two proteins are relatively similar and large segments are superimposable. Neverthe- less, if we are searching for structural similarity be- tween two proteins that are (relatively) dissimilar or if we are comparing protein surfaces, such methods are unsuitable. Approaches to the general pattern matching problem in proteins have been pro- po~ed,’~.’~ but due to their large complexity, they have only been applied to searches of small patterns. In particular, we are interested in (1) comparing any set of atoms from two proteins in a sequence-inde- pendent way and in (2) searching for 3-D recurring motifs that are composed of noncontiguous or iso- lated residues (e.g., surface motifs).

We have developed a technique for sequence-inde- pendent protein structural comparison, which is based on the Geometric Hashing Paradigm for model- based object recognition in Computer Vision.15 Our method allows for 3-D structural comparison of any set of coordinate data. It searches for nonpredefined, sequence order-independent matches, requires nei- ther prior knowledge of the structures nor an initial alignment of the proteins. The method can discover any common 3-D substructures between two sets of coordinates. It is efficient (below a third degree com- plexity in the input size), fully automated, and is not constrained to local or linear sequence-dependent

*For elements i and k from one sequence and elementsj and 1 from the other sequence, if element i is matched to elementj and element k is matched to element I , and if k is to the right of I in the first sequence, then 1 must also be to the right ofj in the second sequence.

Page 3: Surface motifs by a computer vision technique: Searches, detection, and implications for protein–ligand recognition

280 D. FISCHER ET AL.

motifs. The method can find 3-D motifs contributed by different segments or by isolated single amino acids. It may thus find motifs in protein active sites, on surfaces, in cores, etc. It requires no manual in- tervention and produces every piece of structural similarity between the proteins being compared. Thus, our procedure is efficient for both nonpre- defined motifs, or when searching for occurrences of a known motif in new proteins, as well as for pair- wise structural comparsions. Being unbiased by the sequence of the chain, our approach is particularly useful when dealing with the question of divergence versus convergence, especially for relatively dissim- ilar structures. Extensive application of our method to the comparison of protein backbones has been re- ported. 16-18

When the algorithm is applied to the comparison of two protein surfaces, it can aid in (1) the detection of similar geometries in the proteins’ active sites and in (2) finding the correct docking orientation of a ligand surface onto a receptor surface. In addition, the latter can be realized if the orientation of a ligand onto the surface of one of the receptors is known, and if a similar surface is detected in the second receptor (see Figure 1 and Discussion).

Our method thus provides a solution for the first two of the above mentioned difficulties. The choice of surface representation, given as input to our method, is based on previously developed tech- niques.’P6 In this paper, we will not deal with the fourth difficulty. We handle only rigid body match- ing.

Some of the current methods for solving the dock- ing problem include a geometric search for similar- ity between two surfaces.2*6 Clearly, the geometric search phase of already existing algorithms devel- oped for docking might also serve as a basis for solv- ing the surface comparison problem. However, these methods have not been applied to searches for sur- face similarity between two proteins. Furthermore, our method is extremely fast and efficient, requiring only seconds or minutes for typical comparisons con- taining hundreds of atoms.

Below, we describe the application of our method to a range of different types of comparisons between two proteins: (1) backbone comparison, (2) search for recurring 3-D motifs, and in particular (3) surface comparison of two proteins and (4) comparison of the surface of a ligand to the surface of a receptor. These types of comparisons complement each other. Apply- ing them to pairs of molecules contributes to the understanding of their structural similarity and the mechanisms of their action.

The problem of comparing the surface of a ligand with the surface of a receptor is directly related to the problem of docking a ligand onto a receptor. There are, however, two basic differences between searches for surface similarities or surface motifs and docking. First, in docking favorablelunfavorable

chemical interactions are needed to be taken into account. Second, when comparing two protein sur- faces, maximum overlap is desired, whereas in dock- ing, overlap which results in atomic collision, is dis- allowed. In addition, in docking, geometric complementarity is sought rather than similarity. When searching for similarity between two surfaces, the orientation (i.e., the normals) of the surfaces must be similar, whereas in docking, the orientation must be in opposite direction. However, since cur- rently our matching stage is based on intersphere distances only (no surface orientation is used), the latter difference does not apply.

Here we show that in some ligand-receptor com- plexes whose surfaces are relatively spatially simi- lar, our method can efficiently find such similarity and can determine the best orientation of a ligand‘s surface with respect to the receptor’s surface. Cur- rent docking methods have larger complexities than ours. Thus, inclusion of chemical and geometric sur- face properties in a method based on a similar tech- nique may provide an efficient docking algorithm. Such work is currently in progress.

METHODS The Matching Algorithm

The structure comparison problem can be stated as follows. Given the 3 - 0 coordinates of the atoms of two different molecules, find a rigid transformation (rotation and translation) in space, so that a “suffi- cient” number of atoms of one molecule matches the atoms of the other molecule. Note that we do not constrain ourselves to the linear order of the amino acid sequence.

We assume that the structures to be compared are described by sets of interest points and their 3-D co- ordinates. For example, one may consider C, or sur- face atoms of the proteins as interest points. One of the molecules to be compared will be referred to as the “model” and the other as the “target.”

Next, we briefly outline the three major steps of our approach (for a detailed description see ref. 17).

1. Finding seed matches. The first step of the al- gorithm searches through the structures to find (rel- atively small) candidate initial matches which we call “seed matches.” This step requires an extensive search on the structures. An algorithm based on the Geometric Hashing Paradigm is applied to generate these seed matches. In order to restrict the search in this first step, each molecule is covered by a set of “balls” of a prespecified radius, and matching struc- tures are discovered only within single “balls.” A seed match is represented by a list of matching pairs of atoms and by a 3-D rigid transformation (rotation and translation). The number of the matching pairs should be above a threshold (minimal score), which is either a static or a dynamic parameter of the al- gorithm. Each pair in the list specifies a correspon-

Page 4: Surface motifs by a computer vision technique: Searches, detection, and implications for protein–ligand recognition

SURFACE MOTIFS BY A COMPUTER VISION TECHNIQUE 281

dence between an atom from one structure and an atom from the other structure. The transformation represents the 3-D rotation and translation, which superimposes the atoms of the first structure onto the corresponding atoms of the second structure. Note that this step may produce hundreds or thou- sands of candidate seed matches, some of them hav- ing (almost) the same transformation, obtained from different pairs of ?balls.” In the Appendix we include a further description of this step.

2. Clustering of the seed matches. In the second step those seed matches that represent almost iden- tical transformations are clustered, and their corre- spondence lists are merged. Since a 3-D rigid motion can be described by six parameters, three for rota- tion and three for translation, we use these param- eters to cluster the candidate seed matches.

The clustering algorithm iteratively joins trans- formations into groups according to the proximity of their parameters. The distance between two seed matches (transformations) is defined as a six-dimen- sional distance between their parameter vectors.

At the end we are left with a relatively small number of significant clusters. Each cluster repre- sents one transformation obtained from the individ- ual transformations that were joined into the group. The seed match of a group is obtained by choosing matching pairs from the original seed matches that compose the group. To improve accuracy we choose only pairs that appear at least in a certain percent- age of the seed matches.

3. Extending the seed matches. The relatively re- liable correspondence lists of the seed matches ob- tained by Step 2 are extended to contain additional matching pairs. This is done by first transforming one of the structures according to the transforma- tion specified by the seed match of the cluster. Then, pairs of atoms that lie “close” enough after the transformation are candidate additional matches. Because proteins are usually quite dense in space, each atom from one structure may have several “close” neighbors in the other structure. To choose the appropriate pairs a heuristic iterative matching algorithm is applied which obtains the best compro- mise between minimizing the sum of the distances between all the matched pairs and maximizing the number of matching atom pairs.

At the end, the best extended matches are re- ported. The quality of the match is determined by the number of the matching pairs of atoms and by the least squares distance between these matching atoms.

Backbone vs. Surface Comparisons When comparing protein backbones we may select

any backbone atom combination of each protein. Here we used C, atoms only. Typically, Step 1 of our algorithm obtains hundreds of seed matches. The

clustering procedure helps to reduce the problem of local displacements in the structures. After cluster- ing, we extend the clusters. Usually, only one “best” solution is obtained (i.e., all the other solutions con- tain considerably less matched pairs of atoms and have a larger rms).

In the surface comparisons, we chose a subset of the surface atoms of each protein. This choice is de- scribed below. For surface comparisons, Step 1 of our algorithm usually obtains thousands of seed matches. In this case, we do not cluster the seed matches before extending them. Each seed match is extended and outputed as soon as it is produced. Thus, we obtain thousands of solutions. We rank all these solutions according to the number of matched pairs and their rms.

Surface representation In addressing the problem of searching for similar

surface patterns (motifs) or comparing the surfaces of a ligand and a receptor, we use Connolly’s Molec- ular Surface (MS) program2 which is based on Rich- a r d ~ ’ ’ ~ definition of the solvent excluded molecular surface. In this method a probe sphere (a water mol- ecule) is rolled over the protein. The points where the probe touches the van der Waals spheres define the molecular surface. This surface representation usually contains thousands of surface points at a density specified by the user. Since this representa- tion may contain several points per atom the num- ber of surface points is too large to handle effi- ciently. Furthermore, this is a global surface representation describing the whole surface of the protein. Kuntz’s SPHGEN program6 reduces the number of surface points and focuses on local invagi- nated surface patches. This program constructs spheres touching two molecular surface points. The number of spheres is subsequently reduced and the remaining intersecting ones are clustered. Potential active sites are often described by the largest clus- ter. This clustered spheres surface representation results in only one point per atom and usually its size is in the low hundreds. We have chosen this representation as input to our program because of two main advantages. First, the surface is obtained by a fully automated process. Second, the spheres’ centers have a Cartesian representation. This suits our current implementation, since our current matching procedure uses distance invariants only. Recently, a new subclustering procedure has been developed20z21 where the subclusters contain spheres with smaller radii, further focusing on local regions of invaginations. As this procedure is not fully automated, here we use the former represen- tation. Nevertheless, these representations lack any direct curvature information of the surfaces. In ad- dition, flat, or patches of surface having low curva- ture cannot be described by intersecting sphere clus- ters. Other choices of surface representation,

Page 5: Surface motifs by a computer vision technique: Searches, detection, and implications for protein–ligand recognition

282 D. FISCHER ET AL.

coupled with a matching algorithm using geometric surface invariants, are under development.

It is important to note here that our method ac- cepts any set of coordinates as input and is not con- strained to the above chosen surface representation. We followed Kuntz’s work in our choice of the sur- face representation. This is a convenient choice when comparing the surface of a ligand with that of a receptor. Nevertheless, when comparing the sur- faces of two proteins, we chose a slightly different approach. We still use the SPHGEN program to de- tect the largest protein invagination, but in order to compare the original structures, instead of using the coordinates of the spheres in the SPHGEN output, we used the PDB coordinates of those atoms selected in the clustering procedure of SPHGEN.

RESULTS Results for backbone comparisons have been re-

ported.16-” In order to test the potential accuracy and efficiency of our method we have examined typ- ical structure comparison problems and compared our results with published ones of previous methods. In these examples, the matches produced by our method are equivalent to those of conventional alignment techniques. The fact that in these cases our 3-D matching algorithm produces equivalent re- sults to previous alignments demonstrates that in these examples, where the structures are generally similar, the best structural match corresponds to a sequential alignment. It should be noted, however, that as no information about the order of the resi- dues in the primary chain has been exploited by the algorithm, this is not trivial. In other examples where the structures are less similar, our results show some novel “real” three-dimensional, se- quence-independent matches (i.e., not conserving the progression rule). In addition, we have searched for particular structural motifs within a protein. The motif can be a sequential motif or a real 3-D, nonsequential one. In these examples our method correctly detects the motif within the protein. To our best knowledge there is no efficient method in the literature that could execute such 3-D motif searches.

Here we show the application of our method to 4 types of comparisons between pairs of structures: (1) backbone comparison between two domains, (2) search for a predefined 3-D motif within the full backbone of a domain, (3) comparison of the surfaces of two domains, and (4) comparison of the surface of a receptor to the surface of a ligand. Each type of comparison focuses on different aspects of the pro- teins’ similarities. These aspects complement each other and contribute to a better understanding of protein structure-function and of biomolecular rec- ognition. We apply these comparisons to two mem- bers of the globin family and to two dehydrogenases.

TABLE Ia. Backbone Comparison of 4MBN and 4HHB*

Model: 4MBN backbone Target: 4HHB backbone (153 mints) (141 noints)

TRAN 2.9, 14.7, -5.9 ROT -0.35, 0.51, -1.30 rms 1.32 Pairs 130 *A summary of the comparison of the backbone of myoglobin (PDB code: 4MBN) with the backbone of the a-subunit of he- moglobin (PDB code: 4HHB). ROT and TRAN are the ROTa- tion and TRANslation given in radians and in A, respectively. The points of the model and target are the C, atoms. This comparison required 17.2 see of CPU time.

Globins The globin family is a classical example for pro-

tein comparison. Since the beginning of this century the globins have been an object of constant study. These studies range from analysis of crystals of he- moglobin, to sequence comparisons and during the last decades, to structural comparison of X-ray crys- tallographic data (for a review and an exhaustive analysis and comparison of the atomic structures of nine different globins, see ref. 22).

The known globins have different amino acid se- quences but remarkably similar secondary and ter- tiary structures. They are mainly composed of a-he- lices which assemble in a common pattern, enclosing the heme group in pockets of similar ge- ometry made up from homologous portions of the molecules.

Table I is a summary of the results from the four types of comparisons between two hemoglobin do- mains: the a subunit of hemoglobin (Protein Data Bank,23 PDB code: 4HHB) and myoglobin (PDB code: 4MBN). In what follows we discuss each of the comparisons which we have carried out.

Backbone comparison We compared the C, atoms of 4HHB with those of

4MBN. 4HHB contains 141 C, atoms, whereas 4MBN has 153. Our results (see Table Ia) match 130 C, atoms with an rms of 1.32 A. Our equivalences are equivalent to those reported:’ except for slightly minor differences in some matched pairs of residues between the helices.

Additional comparisons between pairs of globin domains have also been carried out. Our results are equivalent to previously published ones (results not shown). It should be noted that as no information about the order of the residues in the primary chain has been exploited by the algorithm, these results are not trivial. Obtaining an alignment that con- serves the sequence order without using sequence information demonstrates that the structural simi- larity between the proteins is one that conserves the

Page 6: Surface motifs by a computer vision technique: Searches, detection, and implications for protein–ligand recognition

SURFACE MOTIFS BY A COMPUTER VISION TECHNIQUE 283

TABLE Ib. Motif Search From 4MBN Within 4HHB, 2LHB. and 2DHB Backbones*

Model 4MBN Target 4HHB Target BLHB Target 2DHB motif backbone backbone backbone (21 points) (141 points) (150 points) (141 points)

142 I

138 F

107 I

104 L 103 Y

99 I 98 K 97 H

93 H 92 S

89 L

72 L 71 A

68 V 67 T

64 H

46 F 45 R 44D 43 F 42 K

..... 135 V .....

132 V

101 L

98 F 97 N

93 v 92 R 91 L

87 H 86 L

83 L

66 L 65 A 64 D 63 A 62 V 61 K

58 H

46 F 45 H 44P 43 F 42 Y

.....

.....

.....

.....

.....

.....

.....

.....

..... TRAN 3.4, 14.9, -6.9 ROT -0.45, 0.54, -1.33 r m s 1.32 Pairs 21

..... 144 L .....

140 M

119 L

116 F 115 Y

111 v 110 Q 109 F

105 H 104 K

101 L

81 V 80 A 79 N 78 I 77 I 76 R

73 H

55 F 54 K 53 P 52 F 51 F

.....

.....

.....

.....

.....

.....

.....

.....

..... TRAN -12.7, 28.4, -21.6 ROT 1.96, 0.17, 1.47 rms 1.01 Pairs 21

..... 135 V .....

132 V

101 L

98 F 97 N

93 v 92 R 91 L

87 H 86 L

83 L

66 V 65 G 64 D 63 A 62 V 61 K

58 H

46 F 45 H

.....

.....

.....

.....

.....

.....

.....

.....

43 F 42 K .....

TRAN 20.0, 21.5, -21.0 ROT 2.7, 0.94, 2.38 rms 1.12 Pairs 20

*Search for the C, atoms of the heme pocket of the a-subunit of hemoglobin (PDB code: IHHB), sea lamprey hemoglobin (PDB code: BLHB), and horse hemoglobin (PDB code: BDHB) using the 3-D motif defined from myoglobin (PDB code: IMBN). This motif is composed of 21 C, atoms from the heme pocket of IMBN. The table shows the residues from the targets that were matched to the heme pocket of IMBN, thus identifying the heme pocket within IHHB, SLHB, and IMBN, respectively. The points of the model are the selected C, atoms (see text) and the target points are all the C, atoms. These comparisons required about 5 sec of CPU time each. Symbols as in Table Ia.

order in the chains. Thus, these results confirm the evolutionary relationship between the globins.

Searching for a predefined m o t i f within a f i l l domuin

We use the C, atoms from the heme pocket of 4MBN as a predefined motif. The 3-D motif is com- posed of 21 isolated C, atoms that form the heme pocket of 4MBN. The 21 C, atoms belong to the fol- lowing residues: 42-K, 43-F, 44-D, 45-R, 46-F, 64-H,

99-1, 103-Y, 104-L, 107-1, 138-F, and 142-1. 67-T, 68-V, 71-A, 72-L, 89-L, 92-S, 93-H, 97-H, 98-K,

To search for the recurrence of this motif within 4HHB, we applied our algorithm with the 21-atom motif as model and all 141 C, atoms from 4HHB as target.

The residues from the 4HHB domain that match residues from the motif define the recurring motif within the full domain, thus identifying the heme pocket within it. This motif search could hardly be obtained using linear methods. Our results (see Ta- ble Ib) are identical to those observed (by visual in- spection) by Lesk and Chothia.22 Our method cor- rectly matched the 21 atoms from the motif to the

Page 7: Surface motifs by a computer vision technique: Searches, detection, and implications for protein–ligand recognition

284 D. FISCHER ET AL.

TABLE Ic. Comparison of the Surfaces of 4MBN and 4HHBt

Model: 4MBN surface Target: 4HHB surface (88 points) (79 points)

No. of solutions obtained: 60 10 toD rankine matches

sol. no. ROT TRAN R T Score IlnS

1 -1.6 -0.1 -2.5 24.7 23.7 -24.2 2.72 42.00 61 2.5 2 -3.1 -0.6 1.7 34.5 32.9 -13.9 2.62 49.79 60 2.3 3* -0.4 0.5 -1.3 3.8 14.2 -5.8 1.29 15.92 60 1.9 4 -1.9 -1.0 -0.4 34.5 19.5 -5.8 2.28 40.10 59 2.4 5* -0.3 0.4 -1.1 3.6 13.0 -3.8 1.18 14.07 59 2.0 6 -1.6 -0.0 -2.5 24.1 24.7 -24.5 2.73 42.39 58 2.4 7 0.3 0.6 2.5 -1.0 35.7 -9.1 2.44 36.94 58 2.5 8* -0.4 0.5 -1.3 4.0 14.6 -6.3 1.34 16.47 57 1.8 9* -0.4 0.4 -1.2 4.5 13.8 -5.4 1.28 15.57 57 1.8

10* -0.4 0.6 -1.5 3.5 17.5 -8.5 1.49 19.83 56 2.0 +The 10 top ranking matches from the comparison of the surface of myoglobin (PDB code: 4MBN) with the surface of the a-subunit of hemoglobin (PDB code: 4HHB). ROT and TRAN are the ROTation and TRANslation, respectively. R is the arccos [tr(M) - 11/2, where M is the rotation matrix and t r is the trace (diagonal) of M. T is the square root of the sum of the squares of the three translations. Score is the number of matched pairs between both surfaces. rms is the root mean square of the match. Five out of the 10 top ranking results are close to the transformation that best superimposed the surfaces (compare with Tables Ia and Ib). These are marked with an asterisk. The points of the model and target are the selected surface atoms (see text). This comparison required 12.9 sec of CPU time.

TABLE Id. Comparison of the Surfaces of the Heme of 4MBN With the Surface of 4HHBt

Model: 4MBN heme surface Target 4HHB surface (43 points) (79 points)

No. of solutions obtained: 3,389 10 to^ rankine: matches

Sol. no. ROT TRAN R T Score IlnS

1* 2* 3* 4* 5* 6* 7* 8 9

10*

-0.4 0.5 -0.4 0.5 -0.4 0.5 -0.4 0.4 -0.4 0.4 -0.4 0.4 -0.4 0.4

3.0 0.0 2.6 0.3

-0.4 0.5

-1.3 -1.3 -1.1 -1.2 -1.1 -1.1 -1.1 -0.2 -2.0 -1.4

3.5 13.3 2.8 13.9 3.1 13.6 4.0 12.4 3.8 12.8 3.8 12.7 3.5 12.8

22.6 33.5 14.9 7.5 3.5 13.4

-7.0 -7.3 -4.2 -4.3 -4.1 -4.0 -3.9 17.5 4.2

-7.5

1.37 1.38 1.21 1.20 1.18 1.18 1.17 2.97 3.08 1.40

15.45 16.02 14.63 13.80 13.99 13.94 13.89 44.10 17.24 15.78

42 42 42 42 42 42 42 42 42 41

1.8 1.8 1.7 1.7 1.7 1.7 1.6 1.5 1.7 1.8

'Top 10 results from the comparison of the surfaces of the heme in the myoglobin (PDB code: 4MBN) with the surface of the a-subunit of hemoglobin (PDB code: 4HHB). Symbols are as in Table Ic. Eight out of the 10 top ranking results are close to the transformation that best superimposes the surfaces (compare with Tables Ia, Ib, and Ic). The points of the model and target are the selected surface atoms (see text). This comparison required 124 sec of CPU time.

corresponding atoms of the 4HHB domain with an rms of 1.32 A.

Note that the transformation and rms obtained in the search of this motif are very similar to the trans- formation and rms obtained when comparing the full set of C, atoms (see above).

Additional searches for this motif were also car- ried out on several other globin domains (see Table Ib). In all cases, our method correctly identified the heme pocket in each domain.

Comparing the active site surfaces of two globins

In this comparison, we compare two surface mo- tifs. The surface motifs are composed of the atoms lying on the surface of the heme pockets of 4HHB and 4MBN.

The molcular surface was computed as described in Methods. We first applied Connolly's MS pro- gram' and next, to select the atoms on the surface of the heme pockets we used the SPHGEN program,6

Page 8: Surface motifs by a computer vision technique: Searches, detection, and implications for protein–ligand recognition

SURFACE MOTIFS BY A COMPUTER VISION TECHNIQUE 285

which identifies invaginations in a molecular sur- face. The coordinates of the atoms corresponding to the clustered spheres were taken from the original PDB files.

These are real 3-D motifs, containing only a col- lection of surface atoms (except for hydrogens) with- out any sequential order. The 4HHB surface is com- posed of 79 atoms chosen from 34 residues of the a-hemoglobin and the 4MBN surface contains 88 at- oms chosen from 35 residues of myoglobin.

Several transformations were produced by our method. We ranked these solutions according to the number of matched pairs and their rms. Table Ic shows the 10 top ranking solutions. Five of these transformations are very similar to the one obtained from the comparisons of the full domains and the heme-pocket search described above. Most of the matched pairs of atoms in each of the top ranking solutions contain equivalent atoms from each sur- face. Note also that the correct solutions have the best rms distance.

This result demonstrates that the similarity be- tween these domains discovered by the backbone comparisons above also extends to the surfaces of their active sites. Finding such similarity indicates that the structure of the a-hemoglobin can geomet- rically accommodate the same ligand as myoglobin, despite the conformational differences between the crystals of 4MBN and 4HHB. Consequently, our next step is to search for the docking orientation of the heme of the 4MBN into a-hemoglobin. The transformation needed to be applied to the heme of 4MBN in order to dock it into a-hemoglobin should be similar to the one obtained for the active sites comparisons.

Comparing the heme surface of myoglobin with the active site surface of hemoglobin

In this example, the surface of the heme in the myoglobin complex (4MBN) is compared to the sur- face of the a-hemoglobin (4HHB). We use the surface representations provided by the SPHGEN program. The surface representation of the heme of the myo- globin complex has 43 atoms and the surface of 4HHB has 79 points (as in the above comparison). In this case, finding a good superposition of the two surfaces is equivalent to finding the best orientation of the docking of the heme into the active site of hemoglobin.

Geometrically, there are several different orienta- tions that superimpose these surfaces well. Table Id shows the 10 best scored results of this comparison. Eight of these transformations are similar to the correct docking orientation which can be identified as follows. From the comparisons of the surfaces of 4MBN and 4HHB above, we know what transforma- tion best superimposes the two active sites. We could expect that such transformation, if applied to the heme of 4MBN, would bring the heme to a favorable

TABLE IIa. Backbone Comparison of GADH and 3LDH*

Model: GADH backbone Target: 3LDH backbone (374 points) (329 points)

TRAN 12.9, 37.0, 34.3 ROT 3.02, -0.11, -0.54 rms 1.70 Pairs 124 *Summary of the comparison of the backbones of alcohol dehy- drogenase (PDB code: GADH) and lactate dehydrogenase (PDB code: BLDH). The points of the model and target are the C, atoms. This comparison required 765 sec of CPU time. Symbols as in Table Ia.

docking orientation into 4HHB. Such transforma- tions are shown in Table Id. Because of the twofold axis of symmetry of the heme, transformations plac- ing the heme in such symmetric orientations were also found, but with lower scores.

As an additional example, we compared the sur- face of the heme from 4MBN to its complexed recep- tor: myoglobin itself. As both receptor and ligand are taken from the same file containing the complex, and the coordinates are given under the same refer- ence frame, the correct docking orientation is a zero transformation. Our method succeeds in finding such a transformation for this comparison too.

Dehydrogenases The dinucleotide-binding fold of many dehydroge-

nases is an open, parallel six-stranded P-sheet with helices on both sides of the sheet. It is formed by two symmetrical halves, each with two P-cl-P motifs. Many painvise comparisons have been carried out between members of this family.24,25

Here we compare two dehydrogenases in the four types of comparisons noted above. Table 11 summa- rizes the results of these comparisons. We chose two dehydrogenase complexes: chain A of alcohol dehy- drogenase (PDB code: 6ADH) and lactate dehydro- genase (PDB code: 3LDH). Both dehydrogenases are complexed with the NAD ligand.

Backbone comparison We compared the C, atoms of GADH with those of

3LDH. GADH contains 374 C, atoms, whereas 3LDH has 329. Our results (see Table IIa) match 124 C, atoms with an rms of 1.7 A. Our equivalences are almost identical to those obtained by previous com- parisons.

Searching for a predefined motif within a full domain

In order to define a recurring 3-D motif in the dehydrogenase family, we first picked GADH as a typical dehydrogenase. Then, the motif was ob- tained by selecting the common core from three pair- wise comparisons between three dehydogenases:

Page 9: Surface motifs by a computer vision technique: Searches, detection, and implications for protein–ligand recognition

286 D. FISCHER ET AL.

TABLE IIb. Motif Search From GADH Within 3LDH Backbone*

Model GADH motif (56 points) (329 points)

Target SLDH backbone

316 G 315 K 314 W 313 T

294 V 293 G 292 V 291 I 290 V 289 S 288 V 287 G

269 I 268 V 267 E 266 F 265 S 264 F 263 D 262 V

242 N 241 V 240 C 239 E 238 T

226 K 225 N 224 I 223 D 222 v 221 G 220 I 219 I 218 R

..... 162 G 161 I 160 I 159 R

140 E 139 P 138 H 137 L 136 E 135 K 134 L 133 C

98 A 97 T 96 I 95 v 94 v 93 L 92 K 91 s 83 K 81 G 80 S 79 v 78 I

56 E 55 M 54 v 53 D 52 V 51 L 50 A 49 v 48 E

.....

.....

.....

.....

..... 210 G 40 V

208 I 207 V 206 S 205 L 204 G 203 v 202 G 201 G 200 L 199 G 198 F 197 V 196 A 195 C 194 T 193 S

..... 37 A 36 D 35 A 34 M 33 G 32 V 31 A 30 D 29 C 28 G 27 V 26 V 25 T 24 I 23 K 22 N .....

TRAN 12.9, 37.1, 33.4 ROT 3.02, -0.11, -0.50 ms 1.11 Pairs 51

'3-D motif search from alcohol dehydrogenase (PDB code: 6ADH) in lactate dehydrogenase (PDB code: 3LDH). Symbols as in Table Ib. The points of the model are the selected C, atoms (see text) and the target points are all the C, atoms. The table shows the residues from SLDH that were matched to the motif from 6ADH. This comparison required 71.1 sec of CPU time.

GADH, malate dehydrogenase and lactate dehydro- genase." The motif is composed of the first (Y helix of GADH and 6 segments of contiguous residues corre- sponding to the 6 S-strands of the sheet. The motif contains a total of 56 C, atoms (see Table IIb).

To search for the recurrence of our motif within 3LDH, we applied our algorithm with the 56-atom motif as model and all 329 C, atoms from 3LDH as target.

The residues from the 3LDH domain that match residues from the motif define the recurring motif within the full domain, thus identifying the recur- ring motif within 3LDH. This motif search could hardly be obtained using linear methods. Our re- sults (see Table IIb) are identical to those observed by previous comparisons. Our method correctly matched 51 atoms from the motif to the correspond- ing atoms of the 3LDH domain with an rms of 1.11

Note that the transformation obtained in the search of our motif is very similar to the transfor- mation obtained when comparing the full set of C,, atoms (see above). Also, the 51 atoms matched here, were matched to the same atoms in the full back- bone comparison above.

Extensive comparisons of some dehydrogenase backbones have also been carried out. Results simi- lar to those reported previously by sequence-depen- dent techniques have been obtained. As our program is not sequentially biased (it does not use any se- quence information whatsoever), we can conclude that the best 3-D similarity between the proteins in these cases, is such that it conserves their sequential order. This conclusion is trivial if the two sequences are known to be closely related, but it is not trivial when their degree of relatedness is unknown. In ad- dition, in comparisons between several a /p domains, structural similarities have been detected showing novel, out-of-sequential-order equivalences.18

Comparing the active site surfaces of two dehydrogenases

In this comparison, we compare two surface motifs composed of the atoms lying on the surfaces of the receptors from GADH and 3LDH. In this comparison we used the receptors' coordinates only (the NAD ligands were removed).

The surface, atoms were selected using the SPHGEN program as described in Methods and as in the previous comparisons. With the default pa- rameters, SPHGEN produces 168 spheres for GADH and 529 spheres for the 3LDH surface. Although these surface representations contain atoms belong- ing to the active sites, they also contain too many additional atoms, far away from the active sites. Af- ter more than 1 hr of CPU time and tens of thou- sands of generated orientations, we interrupted the run. Between all the orientations generated, several correct orientations were found. The large sizes of

A.

Page 10: Surface motifs by a computer vision technique: Searches, detection, and implications for protein–ligand recognition

SURFACE MOTIFS BY A COMPUTER VISION TECHNIQUE

TABLE IIc. Comparison of the Surfaces of GADH and 3LDH*

Model: GADH surface

287

Target: 3LDH surface (108 points) (216 points)

No. of solutions obtained: 18,718 7 out of the 1,000 top ranking solutions

Sol. no. ROT TRAN R T Score llllS

5 3.1 -0.2 -0.3 17.0 35.2 30.0 3.08 49.28 114 2.3 8 2.9 -0.3 -0.3 10.8 37.9 31.3 2.90 50.33 111 2.2

37 2.9 -0.2 -0.4 11.6 36.1 34.5 2.90 51.28 107 2.3 83 3.0 -0.0 -0.4 13.7 37.0 31.6 2.98 50.56 104 2.1

157 2.9 -0.4 -0.5 13.4 36.3 35.5 2.81 52.51 101 2.2 269 2.8 -0.2 -0.5 13.0 36.1 33.8 2.79 51.14 99 2.2 674 3.1 -0.0 -0.4 17.0 33.5 36.0 3.06 52.00 95 2.4 *Results from the comparison of the surface of alcohol dehydrogenase (PDB code: 6ADH) with the surface of lactate dehydrogenase (PDB code: 3LDH). Legend as in Table Ic. Seven out of the 1,000 top ranking results are close to the transformation that best superimposes the surfaces (compare with Tables IIa and IIb). In this table only the closest solutions to the correct one are listed. The rank of each solution (according to Score, where Score is the number of matched pairs between both surfaces) is indicated. The points of the model and target are the selected surface atoms (see text). This comparison required 16 min of CPU time.

TABLE IId. Comparison of the Surfaces of the NAD of the GADH Complex With the Surface of 3LDH*

Model: GADH NAD surface Target: 3LDH surface (40 points) (528 points)

No. of solutions obtained: 5,566 9 correct solutions

Sol. no. ROT TRAN R T Score ITllS

75 3.0 -0.1 -0.6 15.0 36.2 36.0 3.03 53.24 49 1.8 116 3.1 -0.0 -0.6 15.6 36.0 35.6 3.06 52.95 48 1.8 192 3.1 -0.1 -0.6 16.3 36.6 35.5 3.08 53.47 47 1.9 197 2.0 -0.0 -0.4 11.8 38.4 30.8 2.92 50.63 47 1.6 198 2.9 0.1 -0.4 14.0 37.2 33.5 2.97 51.94 47 1.8 265 3.1 -0.1 -0.6 16.1 36.4 34.5 3.07 52.69 46 1.7 277 2.9 0.1 -0.4 13.9 37.4 32.6 2.97 51.51 46 1.8 358 3.0 -0.2 -0.4 11.0 38.5 30.6 2.95 50.43 45 2.0 360 3.0 -0.0 -0.5 15.1 36.6 34.0 3.04 52.15 45 1.6

*The 9 closest solutions to the correct one from the comparison of the surface of the NAD from the alcohol dehydrogenase complex (PDB code: 6ADH) with the surface of lactate dehydrogenase (PDB code: 3LDH). Symbols as in Table IIc. These 9 solutions were found within the top 360 ranking solutions. In this table only the closest solutions to the correct one are listed. The rank of each solution is indicated. The correct solution is the transformation that best superimposes the surfaces (compare with Tables IIa, IIb, and IIc). The points of the model and target are the selected surface atoms (see text). This comparison required 269 sec of CPU time.

these surfaces and the relatively small size of the similar patches indicate that the above representa- tions are unnecessarily large. We reran this compar- ison with smaller sets as input by selecting those atoms of the receptors lying at a distance of 12 A or less from their respective NAD ligands. The result- ing sets contained 108 and 216 atoms for GADH and 3LDH, respectively.

In this example 18,718 transformations were ob- tained. We ranked the solutions obtained by our method as in the globin surface comparison. Table IIc shows the 7 closest solutions to the correct one obtained within the 1,000 top ranking solutions. The transformations of these 7 solutions are very similar

this experiment more transformations were ob- tained than in the comparison of hemoglobin with myoglobin. In addition, the correct solutions ranked lower. This indicates that the shapes of the binding surfaces of hemoglobin and myoglobin are more sim- ilar than those of the dehydrogenases. In addition, the NAD ligands of the dehydrogenases differ more from each other than the heme ligands of the hemo- globins.

This result demonstrates that the similarity be- tween these domains discovered by the backbone comparisons above extends to the surfaces of their active sites. Finding such a similarity indicates that the structure of the 3LDH can geometrically accom-

to the one obtained from the comparisons of the full domains and the motif search described above. In

modate the same ligand as GADH, despite any conformational differences between their crystals.

Page 11: Surface motifs by a computer vision technique: Searches, detection, and implications for protein–ligand recognition

288 D. FISCHER ET AL.

Consequently, we next search for the docking orien- tation of the NAD of GADH onto the 3LDH receptor (without its complexed NAD). The transformation needed to be applied to the NAD of GADH in order to dock it into 3LDH should be similar to the one ob- tained for the active sites comparisons.

Comparing the NAD surface of 6ADH to the active site surface of 3LDH

In this example, the surface of the NAD in the GADH complex is compared to the surface of 3LDH. We use the surface representations provided by the SPHGEN program. The surface representation of the NAD from the GADH complex has 40 atoms and the surface of the receptor from the 3LDH complex has 528 points (as in the above comparison). In this case, finding a good superposition of the two surfaces is equivalent to finding the best orientation of the docking of the NAD (from 6ADH) into the active site of 3LDH.

Geometrically, there are several different orienta- tions that superimpose these surfaces well. Table IId shows the 9 solutions closest to the correct one. These solutions are within the first 360 best scores out of a total 5,566 orientations computed. The cor- rect solution can be identified as follows. From the comparisons of the surfaces of GADH and 3LDH above, we know what transformation best superim- poses the two active sites. We could expect that such transformation, if applied to the NAD of GADH, would bring the NAD to a favorable docking orien- tation within the receptor of 3LDH (without the NAD). Indeed, such transformations were obtained.+

Here, we focus only on generating geometric matches between two surfaces. Many of the orienta- tions obtained produce overlap between the two mol- ecules. Such orientations should be discarded as they are physically unacceptable. To this end, a post- matching phase analysis for overlap should follow. In addition, a subsequent analysis of chemical prop- erties, will enable discriminating better the correct orientations from all others. This is out of the scope of this paper.

We also searched for the docking orientation of each NAD onto its corresponding receptor (i.e., 3LDHs NAD with 3LDHs receptor and 6ADHs NAD with 6ADHs receptor). A zero transformation (which is the correct orientation, as both ligands and receptors were taken from their complexed files) was obtained within the 1,000 top ranking scores (not shown). In addition, we searched for the docking ori- entation of the NAD from 3LDH onto the surface of 6ADH. In this case also, our method succeeded in

'Another way to obtain the correct solution is to compute the transformation that best superimposes the NAD of GADH onto the NAD of 3LDH. Such transformation is very similar to the one used above.

finding the correct docking orientation, although with lower scores.

A more interesting experiment is to compare the surface of an unbound alcohol dehydrogenase (e.g., 7ADH) instead of the surface of the complexed 6ADH. Because of the conformational differences between bound and unbound dehydrogenases, the geometrical complementarity of the molecules in this experiment is less precise. The correct solution was found among thousands of orientations (results not shown), but it did not rank within the top scores.

Running times

The running times of the described experiments have been very short. For example, in the globin surface example above, with each surface having about 80 atoms, the comparison took 12.9 CPU sec- onds on a SUN Sparc processor. The search of a 21 atom predefined motif (pattern) in a molecule of about 150 atoms took about 5 CPU seconds.

DISCUSSION Being able to compare molecular surfaces, or sur-

face atoms, has several direct and indirect implica- tions. Molecules possessing similar patches of sur- face are likely to bind the same ligand. Binding of the same ligand may, in turn, suggest that the two receptors play a similar role or have a similar mode of action. Furthermore, it may also point to an evo- lutionary relationship between them. Detection of surface similarity between protein molecules, may thus give us a deeper understanding into biomolec- ular recognition. If, in turn, one of these receptor molecules is known to bind a given ligand at that patch of surface, then one may expect that the other receptor molecules, containing this recurring sur- face motif, would also bind that same ligand at this site. A similar argument holds for searches of the surfaces of ligands. When the ligand is a short pep- tide or a drug, then such searches are equivalent to searches of pharmacophoric patterns. Detection of a pharmacophoric pattern in a set of drug molecules, suggests that the drugs possessing it are recognized by the same receptor. The approach and methodol- ogy presented here apply to surfaces of a whole range of receptor and ligand molecules.

There is an additional practical aspect to searches for molecular surface similarity, that is its applica- bility to the problem of docking. One of the problems encountered in calculations of matching geometries during the process of the docking, is that many po- tential solutions of complementary, matching sur- faces are obtained. Purely geometric matching algo- rithms based on distance invariants, as ours, produce thousands of plausible strictly geometric orientations, from which the correct one must be dis- criminated. Some of the current methods include ad- ditional information to discriminate the correct so- lution in subsequent filtering steps. In order to find

Page 12: Surface motifs by a computer vision technique: Searches, detection, and implications for protein–ligand recognition

SURFACE MOTIFS BY A COMPUTER VISION TECHNIQUE 289

A B C

n

D

Fig. 1. (A) Complex A between the receptor A, and the ligand A,. (6) Protein receptor 6, containing a patch of surface similar to the binding site of A, (shaded). (C) Rest superposition of A, onto 6, obtained by comparing the surfaces of 6, and A, (without A,).

the correct, optimal, solutions, two procedures are normally carried out. First, the transformation for each potential, docked structure is calculated, using a least square fitting of the geometrically matched regions. Next, one of the molecules is transformed, i.e., translated and rotated with respect to the other. The two molecules are then scanned for overlapping atoms. If the docked solutions, obtained in the pre- vious step are geometrically unacceptable, they are rejected. The remaining, “good” solutions are then scored for detailed, favorable chemical contacts. Finding the optimal transformation may thus con- stitute a complex, time consuming problem.

Here, we suggest that when surface similarity be- tween corresponding receptors (or, alternatively, ligands) is known, the number of potential transfor- mations from a docking search can be drastically reduced. There are two types of cases where such similarity between surfaces can be useful. Consider the case where the coordinates of a complex, called here A, of the receptor A, and the ligand A, are known (Fig. 1). Searches of a three-dimensional da- tabase detected a second protein receptor, B,, con-

E

(D) A, is transformed using the same transformation that was applied to A, in C. A, is d&ed onto A, as in A and is now posi- tioned in the active site of 6, (E) By removing A,, A, is in a favor- able docking orientation with regards to 6,

taining a patch of surface similar to that of A,. If that patch of surface is a t the binding site, one might expect that B, would recognize and bind to A,. Dock- ing of A, onto B, would produce a large number of potential matchings, and thus, a large number of transformations. Among these transformations, the one similar to that which best superimposes the sur- face of A, onto the surface of B, are the correct ones. This also suggests that in such a case one may not need to carry out the full docking procedure. It suf- fices to transform the atoms of A, by the same trans- formation that best superimposes the two receptor surfaces. Ligand A, would now be positioned in the active site of B,. Clearly, because of surface flexibil- ity, optimization of the conformations of the docked molecules is still required. This paper demonstrates the existence of a simple way to discriminate be- tween the correct docking orientation and thousands of additional solutions using the transformation that superimposes two receptor surfaces.

The discussion above thus suggests that in the cases where the similarity between the surfaces of two receptors (ligands) is known, this knowledge can

Page 13: Surface motifs by a computer vision technique: Searches, detection, and implications for protein–ligand recognition

290 D. FISCHER ET AL.

aid in the docking problem. Surface comparison be- tween two receptor proteins may prove particularly useful if there is no global backbone similarity be- tween the two receptors. In such cases, conventional backbone, sequence-dependent, superpositioning techniques cannot aid in finding the common sur- face patch. Surface comparison of two receptors is however a difficult problem. Clearly, a n important factor to consider is the nature of the matched sur- face geometries. So far, we have succeeded in finding the surface similarity between relatively similar re- ceptors.

Our experimentation thus far, suggests that the surface representation given as input to the match- ing algorithm is critical. Connolly2 gives a descrip- tion of the complete molecular surface. Kuntz et aL6 elegantly select invaginated surface patches and in addition reduce the number of descriptors (dots) for the surface. Nevertheless, important surface infor- mation is lost such as surface curvature. If the bind- ing site is f lat , the clustered spheres approach might fail. In addition, the largest cluster may include many spheres corresponding to atoms far away from the active site. A more informative surface repre- sentation would allow a better classification and quantization of the shapes of protein surfaces de- scribing a surface as a collection of patches rather than a collection of 3-D coordinates from selected atoms. In such a case our Computer Vision algo- rithm would also use surface invariants rather than solely distance invariants.

In the docking problem, absence of overlap be- tween the ligand and the receptor is indispensable. Our current results show that overlap analysis elim- inates many plausible generated orientations. On the other hand, in the surface comparison problem overlap between the matched molecules may be de- sirable. Thus, in this case, the problem of discimi- nating between all the orientations generated still remains. Nevertheless, our method succeeds in find- ing the best superposition of two surfaces, where the surface similarity is relatively large. For more dis- similar surfaces, our method is able to discover the subset of atoms that forms a common shape, but thousands of different subsets are also produced. Some of these are random “matches” obtained be- cause proteins are relatively dense in space. Dis- criminating between all the results to identify the meaningful matches is a difficult task. Still, a s in docking, inclusion of the two components, surface and chemical properties both in the input sets and in the matching algorithm itself, may dramatically re- duce the number of generated orientations.

CONCLUSION We have presented a n application of a method for

protein structural comparison that uses strictly the 3-D atomic coordinates. Our method searches for the

best geometric match between given sets of points. The coordinates can be those of C, or C , atoms, sur- face atoms or in general any set of points in space. Using C,, atoms only implies searching for the best 3-D geometric match between the backbones of the structures without constraining i t to a sequential match (i.e., the sequential order of the chain need not be conserved). The ability to match a n uncon- nected set of atoms allows also protein surface com- parison and a search for the correct orientation of a ligand surface onto a receptor surface. Each type of comparison focuses on different aspects of the pro- tein structure and biomolecular recognition. These aspects complement each other and contribute to a better understanding of protein structure and func- tion.

ACKNOWLEDGMENTS We would like to thank Drs. D. Covell, R. Jerni-

gan, J . Maize], and L. Young for discussions and suggestions. The research of R. Nussinov has been sponsored by the National Cancer Institute, DHHS, under Contract 1-CO-74102 with Program Re- sources, Inc. The contents of this publication do not necessarily reflect the views or policies of the DHHS, nor does mention of trade names, commeri- cal products, or organizations imply endorsement by the U.S. Government. The research of H.J. Wolfson has been supported in part by Grant 83-00481 from the U.S.-Israel Binational Science Foundation (BSF), Jerusalem, Israel. This work formed part of the Ph.D. Thesis of D. Fischer, University of Tel Aviv.

1.

2.

3.

4.

5.

6.

7.

8.

9.

10.

11.

12.

REFERENCES Mitchel, E.M., Artymiuk, P.J., Rice, D.W., Willet, P. Use of techniques derived from graph theory to compare second- ary structure motifs in proteins. J. Mol. Biol. 212:151-166, 1989. Connolly, M.L. Analytical molecular surface calculation. J. Appl. Cryst. 16:548-558, 1983. Connolly, M.L. Solvent-accesible surfaces of protein and nucleic acids. Science 221:709-713, 1983. Connolly, M.L. Shape complementarity a t the hemoglobin al@, subunit interface. Biopolymers 25:1229-1247, 1986. Zachmann, C.D., Heiden, W., Schlenkrich, M., Brickmann J. Topological analysis of complex molecular surfaces. J. Comp. Chem. 13(1):76-84, 1992. Kuntz, I.D., Blaney, J.M., Oatley, S.J., Langridge, R., Fer- rin, T.E. A geometric approach to macromolecule-ligand interactions. J. Mol. Biol. 161269-288, 1982. MacLachan, A.D. Gene duplications in the structural ev- olution of chymotrypsin. J. Mol. Biol. 128:49-79, 1979. Matthews, B.W., Rossmann, M.G. Comparison of protein structures. Methods Enzymol. 115:397-420, 1985. Taylor, W.R., Orengo, C.A. Protein structure alignment. J . Mol. Biol. 208:l-22, 1989. Alexandrov, N.N., Takahashi, K., Go, N. Common spatial arrangements of backbone fragments in homologous and non-homologous proteins. J. Mol. Biol. 225:5-9, 1992. Vriend, G., Sander, C. Detection of common three-dimen- sional substructures in proteins. Proteins 1152-58, 1991. Artymiuk, P.J., Grindley, H.M., Park, J .E. , Rice, D.W., Willet, P. Three-dimensional structural resemblance be- tween leucine aminopeptidase and carboxypeptidase A re-

Page 14: Surface motifs by a computer vision technique: Searches, detection, and implications for protein–ligand recognition

SURFACE MOTIFS BY A COMPUTER VISION TECHNIQUE 291

vealed by graph-theoretical techniques. FEBS Lett. 303: 48-52, 1992.

13. Lesk, A.M. Detection of 3D patterns of atoms in chemical structures: Selection of interatomic distance screens. J. Mol. Graphics 4:12-20, 1986.

14. Brint, A.T., Daview, H.M., Mitchel, E.M., Willet, P. Rapid geometric searching in protein structures. J . Mol. Graph- ics 7:48-53, 1989.

15. Lamdan, Y., Schwartz, J.T., Wolfson, H.J. On recognition of 3-D objects from 2-D images. Proc. of IEEE Int. Conf. Robotics and Automation, Philadelphia, PA, April 1988, pp. 1407-1413.

16. Fischer, D., Bachar, O., Nussinov, R., Wolfson, H.J. An efficient automated computer vision based technique for detection of three dimensional structural motifs in pro- teins. J. Biomolec. Struct. Dyn. 9(4):769-789, 1992.

17. Bachar, O., Fischer, D., Nussinov, R., Wolfson, H.J. A com- puter vision based technique for 3-D sequence independent structural comparison of proteins. Protein Eng., in press 1993.

18. Fischer, D., Wolfson, H.H., Nussinov, R. Spatial, sequence- order-independent structural comparison of u/p proteins: Evolutionary implications. 1993. Submitted.

19. Richards, F.M. Areas, volumes, packing and protein struc- ture. Annu. Rev. Biophys. Bioeng. 6:151-176, 1977.

20. Shoichet, B.K., Kuntz, I.D. Protein docking and comple- mentarity. J . Mol. Biol. 221:327-346, 1991.

21. Shoichet, B.K., Bodian, D.L., Kuntz, I.D. Molecular dock- ing using shape descriptors. J . Comp. Chem. 13:380-397, 1992.

22. Lesk, A.M., Chothia, C. How different amino acids deter- mined similar protein structures: The structure and evo- lutionary dynamics of the globins. J. Mol. Biol. 136:225- 270, 1980.

23. Bernstein, F.C., et al. The Protein Data Bank: A computer- based archival file for macromolecular structures. J . Mol. Biol. 112535442, 1977.

24. Ohlsson, I., Nordstrom, B., Branden, C.I. Structural and functional similarities within the coenzyme binding do- mains of dehydrogenases. J . Mol. Biol. 89:339-354, 1974.

25. Rossmann, M.G., Argos, P. Chemical and biological evolu- tion of a nucleotide binding protein. Nature (London) 250: 194-199, 1974.

26. Lamdan, Y., Wolfson, H.J. Geometric Hashing: A general and efficient model-based recognition scheme. Proc. IEEE Int. Conf. Computer Vision, Tampa, Florida, December 1988, pp 238-249.

27. Lamdan, Y., Schwartz, J.T., Wolfson, H.J. Afine invariant model-based object recognition. IEEE Trans. Robotics Au- tomation 6(5):578-589, 1990.

28. Nussinov, R., Wolfson, H.J. Efficient detection of three- dimensional motifs in biological macromolecules by com- puter vision techniques. Proc. Natl. Acad. Sci. U.S.A. 88: 10495-10499, 1991.

29. Schwartz, J.T., Sharir, M. Identification of partially ob- scured objects in two dimensions by matching of noisy ‘characteristic curves’. Int. J . Robotics Res. 6(2):29-44, 1987.

APPENDIX: THE GEOMETRIC HASHING PARADIGM

The Geometric Hashing paradigm for model based object recognition was introduced by Lamdan et al. 15,2627 Efficient algorithms were developed for recognition of rigid objects both in 2-D and in 3-D. They are especially suitable for detection of a partial match between a target and objects belonging to a large database. The adaptation of the technique for 3-D molecule matching has been presented.l7sZ8

The technique consists of two major stages-pre- processing and recognition. In the preprocessing stage the model is redundantly represented in a transformation invariant fashion and memorized in

a hash-table. The representation of the preprocess- ing allows efficient recognition of the objects in the target. This recognition is not affected by the trans- formation of the objects and by their partial occlu- sion (equivalent to a partial match of molecules). Below we sketch our technique for partial matching of molecules in 3-D, which may undergo a rigid mo- tion (rotation and translation).

Preprocessing The model molecule is preprocessed separately.

For’ a given molecule, one encodes the coordinates of the atoms, with respect to a reference set based on a pair of atoms. Specifially, pick an ordered pair of atoms (which is subsequently nicknamed “basis”), and for each atom of the molecule compute the dis- tance from this atom to each of the basis atoms. Use the lengths of the sides of the resulting triangle as an address to the hash-table and store in the appro- priate entry the basis atoms and the third atom. This procedure is done for all the molecule atoms (per basis), and for all possible bases (pairs of at- oms). In our implementation we have introduced minimal and maximal distance constraints on the lengths of the triangle sides, so not all triangles have been considered in the preprocessing stage. Note that the representation of an atom in a given basis is not unique, since the triangle sides define a circle in space, but this ambiguity can be easily re- moved in subsequent verification stages. The pre- processing is done off-line.

Recognition In the on-line recognition stage we are faced with

a target molecule, and want to detect substructures of this molecule which match substructures of the model molecule modulo rotation and translation. The following procedure is repeated for each pair of target molecule atoms. Pick a pair of atoms on the target molecules (basis) and consider the triangles that it creates with the other molecule atoms (given that, as in the preprocessing, the triangle complies with the side distance constraints). The lengths of the triangle sides define an address to the hash-ta- ble, and we count a match for each model basis which appears in that hash-table entry. If a partic- ular model basis scores a large number of matches under the current target basis, it implies a consid- erable overall match between a substructure of the model with our target molecule, when the appropri- ate bases are transformed to each other. One can take all the atoms that have participated in that match, eliminate the inconsistent entries which were generated due to the nonuniqueness of the third triangle point location in space, and compute a more accurate spatial transformation using a least squares fitting procedure.29 Note that due to the re- dundant and transformation invariant encoding of

Page 15: Surface motifs by a computer vision technique: Searches, detection, and implications for protein–ligand recognition

292 D. FISCHER ET AL.

each atom into the hash-table we are able to detect number of atoms in the target (usually larger than partially matching structures regardless of their ro- the model). Nevertheless, because of the distance tation and translation. constraints described above, the actual running

The complexity of this step is Oh3), where n is the time is below Oh3) .