A geometry-based suite of molecular docking processes



J. Mol. Biol. (1995) 248, 459-477

A Geometry-based Suite of Molecular Docking Processes

Daniel Fischer 1,2, Shuo Liang Lin 3, Haim L. Wolfson 1 and Ruth Nussinov 2,3*

1 Computer Science Department, School of Mathematical Sciences, Tel Aviv University, Tel Aviv 69978, Israel

2 Sackler Institute of Molecular Medicine, Tel Aviv University, Tel Aviv 69978, Israel

3 Laboratory of Mathematical Biology, NCI-FCRF, Bldg 469, Rm 151, Frederick, MD 21702, U.S.A.

*Corresponding author

We have developed a geometry-based suite of processes for molecular docking. The suite consists of a molecular surface representation, a docking algorithm, and a surface inter-penetration and contact filter. The surface representation is composed of a sparse set of critical points (with their associated normals) positioned at the face centers of the molecular surface, providing a concise yet representative set. The docking algorithm is based on the Geometric Hashing technique, which indexes the critical points with their normals in a transformation-invariant fashion, preserving the multi-element geometric constraints. The inter-penetration and surface contact filter features a three-layer scoring system, through which docked models with high contact area and low clashes are funneled. This suite of processes enables a pipelined operation of molecular docking with high efficacy. Accurate and fast docking has been achieved with a rich collection of complexes and unbound molecules, including protein-protein and protein-small molecule associations. An energy evaluation routine assesses the intermolecular interactions of the funneled models obtained from the docking of the bound molecules by pairwise van der Waals and Coulombic potentials. Applications of this routine demonstrate the goodness of the high-scoring, geometrically docked conformations of the bound crystal complexes.

Keywords: molecular docking; docking of bound and unbound complexes; protein-protein docking; protein-drug docking; molecular surface matching

Present address: D. Fischer, Laboratory of Mathematical Biology, NCI-FCRF, Bldg 469, Rm 151, Frederick, MD 21702, U.S.A.

0022-2836/95/170459-19 $08.00/0 © 1995 Academic Press Limited

Introduction

Geometric complementarity is a central issue in biomolecular recognition. Tightly matched surfaces between bound molecules close an appreciable area of interface to the medium, gaining stability for the complex from hydrophobic effects. The complementary geometry reflects the effect of van der Waals interactions, which are very sharp at short distances. A tight, fairly large interface constitutes a necessary condition of a stable complex, upon which a screening of possible conformations would rapidly converge to those feasible for further physical, chemical and biological examinations. Investigating molecular recognition in such a relayed mode can be far more efficient than ab initio methods for larger molecular systems. Docking methods have emerged during the last few years that are able to reproduce near-native conformations on the basis of geometrical complementarity (Fischer et al., 1993; Lin et al., 1994; Norel et al., 1994a,b; Connolly, 1986; Cherfils et al., 1991; Jiang & Kim, 1991; Shoichet & Kuntz, 1991; Wang, 1991; Bacon & Moult, 1992; Katchalski-Katzir et al., 1992; Walls & Sternberg, 1992).

Geometric docking is exceedingly complex, because computational costs increase exponentially with the degrees of freedom of the molecular system. With hundreds to thousands of atoms to move, the number of possible conformations is astronomical. Any practical docking method has to apply serious constraints to the system. The rigid-body approximation, which freezes all degrees of freedom except three translations and three rotations, is currently the choice of most general docking methods. For proteins, it has been argued that the approximation is justified by the similarity of the crystallographic structures between the bound and the unbound proteins (Janin & Chothia, 1990; Cherfils & Janin, 1993), although cases exist where more substantial conformational changes have been observed, which may need a different methodology. Even when molecules are treated as rigid bodies, a fair search along each of the six degrees of freedom can require hundreds of steps, so the sample size easily reaches billions. By focusing on specific spots, the difficulty of the problem can be reduced. However, this requires sufficient knowledge of the binding.
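To make the scale of an exhaustive rigid-body search concrete, the following Python sketch counts the poses on a uniform grid. It assumes, purely for illustration, the same number of steps along each of the six degrees of freedom; the step counts are hypothetical and not taken from the paper.

```python
# Rough size of an exhaustive rigid-body search grid.
# Assumption (not from the paper): every degree of freedom is sampled
# with the same number of steps.

def grid_size(steps_per_dof: int, dof: int = 6) -> int:
    """Number of poses on a uniform grid over `dof` degrees of freedom."""
    return steps_per_dof ** dof

for steps in (32, 100, 300):
    print(f"{steps:>3} steps per axis -> {grid_size(steps):.2e} poses")
```

Even 32 steps per axis already exceeds a billion poses, and a few hundred steps per axis pushes the count several orders of magnitude higher.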

Recently we have developed a geometrically based suite of processes for molecular docking. This approach fulfils the following goals:

(1) No predefinition of the binding site. The sole information required for carrying out the docking is the atomic coordinates of the molecules. Binding-site information is not a prerequisite. However, if this information is available, it can be naturally incorporated, resulting in a significant increase in performance.

(2) Docking of receptors and ligands of variable sizes. Docking is successfully carried out for both small and large molecule ligands onto large (protein) receptors.

(3) High speed; completes in minutes on a workstation. The technique is fast, with matching times for small molecules (e.g. methotrexate and HIV-1 protease inhibitors) under ten minutes and for large molecules (e.g. trypsin inhibitors and HIV-1 protease subunits) under 60 minutes.

(4) Low root-mean-square deviations (r.m.s.d.) of the correct solution. The quality of the solutions is very high. In the experiments performed, comparison of the "best" docked conformation with the native complexed conformation resulted in a very close fit. The all-atom r.m.s.d. values obtained for most complexes are under 1 Å, with only one of the immunoglobulins for which it is of the order of 2 Å.

(5) The number of potential solutions is small. The number of potential, geometrically correct, solutions is manageable, before any further filtering or optimization is applied. The number varies from several tens for the smaller molecules to several thousands for the larger ones.

(6) Solution filtering is available by a surface contact scoring function. The score is based on the receptor-ligand contact surfaces and body clashes. The scoring ranks the "correct" solutions high among all geometrically compatible solutions, allowing the funneling of these solutions for further processing.

(7) The correct solutions are low-energy conformations. The docked conformations obtained in the test cases of known complexes are analyzed energetically by the van der Waals and electrostatic interactions between the receptor and the ligand. The native conformations possess a low, negative energy. Solutions with energies close to the native ones are obtained.

The combination of these advantages allows us to pick the few geometrically best binding conformations for further detailed studies. We are able to achieve this level of performance owing to two major ingredients: a surface representation that describes the surface both precisely and economically, and a docking scheme that incorporates sophisticated Computer Vision techniques. The surface representation is a sparse set of points and their associated normals. The points are positioned at critical locations of a molecular surface (Lin et al., 1994). We have also adapted a new docking algorithm based on the Geometric Hashing technique developed originally for Computer Vision applications. We have realized that there is a remarkable similarity in the type of problems faced in Computer Vision and in molecular recognition (Nussinov & Wolfson, 1991; Bachar et al., 1993; Norel et al., 1994b; Fischer et al., 1994). Employing these Computer Vision techniques for molecular shape matching with our surface representation, we are able to perform a high-speed search to achieve complete and accurate docking. In addition, a surface contact scoring filter has been developed (Norel et al., 1994b). These three components complete a suite of geometric docking processes, which forms a pipeline of operations to extract plausible binding conformations.
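The idea of indexing oriented surface points in a transformation-invariant way can be illustrated with a toy sketch. The Python code below is not the authors' implementation: the signature (one distance and three angles per point pair), the bin widths `d_bin` and `a_bin`, and all function names are assumptions chosen for illustration. It hashes every ordered pair of receptor points by a quantized signature that does not change under rotation and translation, then looks up ligand pairs with their normals flipped, since complementary surfaces carry roughly antiparallel normals.

```python
import math
from collections import defaultdict
from itertools import permutations

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg(a):
    return tuple(-x for x in a)

def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    c = dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))
    return math.acos(max(-1.0, min(1.0, c)))

def pair_key(p1, n1, p2, n2, d_bin=0.5, a_bin=0.2):
    """Quantized signature of two distinct oriented points.
    Distances and angles are unchanged by rotation and translation."""
    v = sub(p2, p1)
    feats = (math.sqrt(dot(v, v)) / d_bin,
             angle(n1, v) / a_bin,
             angle(n2, v) / a_bin,
             angle(n1, n2) / a_bin)
    return tuple(int(f) for f in feats)

def build_table(receptor):
    """Index every ordered pair of receptor points by its signature."""
    table = defaultdict(list)
    for (i, (p1, n1)), (j, (p2, n2)) in permutations(enumerate(receptor), 2):
        table[pair_key(p1, n1, p2, n2)].append((i, j))
    return table

def candidate_matches(table, ligand):
    """Look up ligand pairs with flipped normals (hypothetical
    complementarity test); returns (ligand pair, receptor pair) hits."""
    hits = []
    for (i, (q1, m1)), (j, (q2, m2)) in permutations(enumerate(ligand), 2):
        key = pair_key(q1, neg(m1), q2, neg(m2))
        for rec_pair in table.get(key, []):
            hits.append(((i, j), rec_pair))
    return hits
```

Each hit is only a seed: in a full pipeline it would be used to propose a rigid transformation that is then verified against the remaining points. Hard bin edges are a known weakness of this simple quantization, because near-boundary values can fall into neighboring bins.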

Below we outline our methodology and its rationale. We illustrate the results obtained for the 19 bound and three unbound complexes that we have docked. These complexes span different receptor and ligand classes and sizes. In all cases accurate results were achieved in short times. This method is a rigid-body approach, and as such does not take into account molecular flexibility. Ways to partially account for flexibility within the framework of the method presented here will be outlined.

Results

We present results of docking experiments on 19 complexes and three unbound ligand and receptor pairs. We first present the results of our docking experiments on the 19 test case complexes, where the full surfaces of both the receptor and the ligand are used. Next we present the results of docking where the receptor surface input is restricted to the active site of the same complexes. Subsequently we assess the quality of the solutions obtained by the docking of the full surfaces in the test complexes using an energy evaluation routine. Lastly we present the results of docking three unbound ligand-receptor pairs.

The docking of receptor-ligand complexes

Table 1 lists the molecules we have included in this study and the number of their surface atoms. The ligand and receptor in each complex are first separated, and no information about the original, native docking orientation is taken into account in the docking procedure. The structures of the docked complexes have been determined crystallographically or by modeling. The complexes used in the docking calculations have been divided into four groups, according to the sizes of both the ligands and the receptors. The first group includes complexes having ligands with a small number of surface atoms (at most 115, with the majority far fewer than that). The second group includes large receptors (with the number of surface atoms between 477 and 761), and ligands having a larger number of surface atoms than in the first group (166 to 264). The third group contains medium-sized receptors and ligands (with the number of surface atoms ranging between 393 and 469, and between 398 and 419, respectively). The fourth group contains the immunoglobulin-lysozyme complexes.

Table 1

The complexes and their number of surface atoms

Complex   Receptor                 Ligand                        Surface atoms        Face centers
                                                                 Receptor  Ligand     Receptor  Ligand
1cpk      Protein kinase           Protein kinase inhibitor      939       115        5960      712
3dfr      Dihydrofolate reductase  Methotrexate                  576       32         3691      214
4mbn      Metmyoglobin             Heme                          479       44         3104      284
4phv      HIV-1 protease           Inhibitor                     639       44         3985      284
2igf      IgG1 Fab fragment        Myohemerythrin 69-87          652       50         4113      301

4cpa      Carboxypeptidase Aα      Carboxypeptidase A inhib.     761       166        4798      1054
1tgs      Trypsinogen              Pancreatic trypsin inhibitor  594       231        3778      1552
1cho      α-Chymotrypsin           Ovomucoid 3rd domain          620       226        3868      1409
2ptc      β-Trypsin                Pancreatic trypsin inhibitor  579       232        3680      1471
1tec      Thermitase               Eglin-C                       654       250        3991      1574
4sgb      Serine protease B        Potato inhibitor PCI-1        477       214        2952      1317
2sec      Subtilisin Carlsberg     Eglin-C                       616       264        3843      1666
4tpi      Trypsinogen              Pancreatic trypsin inhibitor  601       233        3839      1479

2mhb      Hemoglobin β-subunit     Hemoglobin α-subunit          469       419        2969      2665
4hvp      HIV-1 protease chain B   HIV-1 protease chain A        393       386        2539      2544
4phv2 a   HIV-1 protease chain B   HIV-1 protease chain A        397       398        2564      2532

1fdl      IgG1 Fab fragment        Lysozyme                      646       421        4004      2634
2hfl      IgG1 Fab fragment        Lysozyme                      613       408        4039      2604
3hfm      IgG1 Fab fragment        Lysozyme                      621       410        3874      2598

The PDB code (Bernstein et al., 1977) of each complex is listed in the leftmost column, followed by the description of the receptor and the ligand. The surface atoms were identified during the generation of the surface critical points. The numbers of unpruned critical points generated (face centers) are listed in the last 2 columns. Table 2A lists the actual number of pruned critical points used for docking.

a 4phv2 is the same complex as 4phv.

Table 2 lists the results obtained in the docking of the 19 complexes. Table 2A displays the results achieved in the docking of the receptor surface with the complete ligand surface. For the first 16 complexes, the entire molecular surfaces were considered, with absolutely no additional biochemical information taken into account. For the latter three, i.e. the immunoglobulin-lysozyme complexes, only the epitope regions of the immunoglobulins were employed, with the entire surfaces of the lysozymes. The first column specifies the Protein Data Bank (PDB; Bernstein et al., 1977) code name of the complex. The next three columns note the best r.m.s.d. obtained in the docking runs, and the rotation and translation which yield it. This r.m.s.d. is computed by comparing the ligand in its native docking orientation with the proposed ligand orientation obtained from the program.

The expected r.m.s.d., in the next column, is that obtained when comparing the ligand in its native docking orientation with the orientation of the ligand obtained after applying the "ideal" transformation one could obtain using our method with the given critical points and normals. This transformation is computed as follows. First, using the native complex, we compute all critical cap-pit point pairs that are less than 2.0 Å apart and whose normals are less than one radian from antiparallel (see Appendix II). Next, we find the transformation that gives the least-squares distance when superimposing the caps onto the pits, according to the pairs found. This provides our "best" expected transformation and its r.m.s.d. Note that this r.m.s.d. is a measure of the accuracy of the critical points used for docking, and is used only as a comparative measure of the performance of the method. It cannot be known a priori in a real-life unbound docking experiment. The overall numbers of solutions for each of the complexes, obtained from the docking routine, are listed in the sixth column, and the matching times (in minutes) are noted in the seventh.

Inspection of these results reveals their high quality. The r.m.s.d. values obtained for all complexes are very low. Indeed, for all non-immunoglobulin-lysozyme complexes, the best r.m.s.d. achieved is less than 1 Å, and for the latter it is less than 2 Å (1.68 Å). Comparison of these r.m.s.d. values with the expected values demonstrates them to be quite close, suggesting that we have reached nearly the optimal docking potentially feasible with our critical cap-pit point surface representation. The number of solutions obtained is manageable and, as seen in Table 2B, this number is considerably reduced by our surface contact scoring routine (see below). These results were obtained in very short matching times, all under 60 minutes on a Silicon Graphics workstation. The last three columns in Table 2A note the number of ligand caps and receptor pits actually used in the matching, and the minimum number of receptor-ligand point pairs required by the docking algorithm to register a match. Note that we have applied slightly different pruning thresholds for some of the examples (see Table 2A). We discuss this variability and how our pruning criteria can be standardized in the Discussion.
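The pair-selection step behind the expected r.m.s.d. can be sketched as below. The thresholds (2.0 Å and one radian from antiparallel) come from the text; the data layout (each critical point as a `(position, normal)` tuple) and the function names are assumptions made for this illustration. The least-squares superposition itself is not reproduced here; `pair_rmsd` only reports the deviation over the selected pairs as they stand.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def angle_from_antiparallel(n_cap, n_pit):
    """Angle (rad) between the cap normal and the reversed pit normal;
    0 means the two normals are exactly antiparallel."""
    c = -dot(n_cap, n_pit) / math.sqrt(dot(n_cap, n_cap) * dot(n_pit, n_pit))
    return math.acos(max(-1.0, min(1.0, c)))

def native_pairs(caps, pits, max_dist=2.0, max_angle=1.0):
    """Index pairs (cap, pit) that are closer than max_dist (in Å) and whose
    normals are within max_angle (in rad) of antiparallel."""
    pairs = []
    for i, (p_cap, n_cap) in enumerate(caps):
        for j, (p_pit, n_pit) in enumerate(pits):
            if (math.dist(p_cap, p_pit) < max_dist
                    and angle_from_antiparallel(n_cap, n_pit) < max_angle):
                pairs.append((i, j))
    return pairs

def pair_rmsd(caps, pits, pairs):
    """Root-mean-square distance over the selected cap-pit pairs."""
    sq = sum(math.dist(caps[i][0], pits[j][0]) ** 2 for i, j in pairs)
    return math.sqrt(sq / len(pairs))
```

A rigid transformation minimizing this quantity over the selected pairs would then give the "expected" solution described above.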

Table 2

Results of the full surface docking for the bound molecules

A. Results from the docking algorithm: full receptor surface versus full ligand surface

          Best solution                Expected   No. of    Match   No.      No.       Minimum
Complex   r.m.s.d.  Rot.     Trans.    r.m.s.d.   matches   time    ligand   receptor  no.
          (Å)       (rad)    (Å)       (Å)                  (min)   caps     pits      pairs
1cpk      0.51      0.06     1.10      0.34       17,342    19.6    119.     544       10
3dfr      0.49      0.04     1.63      0.35       10,299    10.5    43.      596.      10
4mbn      0.39      0.07     0.60      0.27       37,498    7.6     59.      215       10
4phv      0.63      0.13     3.63      0.45       9595      6.6     51.      403       10
2igf a    0.72      0.08     3.04      0.63       82,154    48.7    85.      620(c)    10

4cpa      0.82      0.08     1.98      0.72       63,066    16.3    148(a)   436       7
1tgs      0.58      0.05     1.61      0.47       26,644    7.9     144      366       7
1cho      0.43      0.02     0.19      0.71       114,508   27.3    167(b)   449(c)    7
2ptc      0.44      0.05     3.16      0.47       63,408    16.4    173(b)   364       7
1tec      0.69      0.04     0.86      0.93       242,325   52.0    214(a)   396       7
4sgb      0.59      0.07     1.58      0.54       76,467    15.1    163(b)   296       7
2sec      0.57      0.02     0.26      0.81       229,804   44.3    185(b)   479(c)    7
4tpi      0.36      0.03     2.65      0.38       31,621    9.3     139      382       7

2mhb      0.79      0.04     0.14      0.42       4284      4.6     249      273       10
4hvp      0.92      0.07     1.52      0.54       5761      4.0     227      238       10
4phv2     0.67      0.03     0.76      0.60       4982      4.4     230      247       10

1fdl b    1.68      0.07     1.59      1.15       3340      9.8     439.     53        13
2hfl b    0.59      0.04     2.47      0.77       26,598    18.7    437.     59        13
3hfm b    1.39      0.04     3.20      1.11       56,995    40.6    438.     79        13

B. Results from the geometric scoring procedure

                    Score   Best solution     PDB    Expected sol.     Best rank
Complex   No. of    time    r.m.s.d.  Rank    rank   r.m.s.d.  Rank    r.m.s.d.  Rank
          sols      (min)   (Å)                      (Å)               (Å)
1cpk      37        0.7     0.51      7       7      0.34      7       1.20      1
3dfr      1140      0.3     0.49      22      294    0.35      293     0.65      7
4mbn      358       0.9     0.39      2       15     0.27      9       0.48      1
4phv c    6504      0.4     0.63      3082    4845   0.45      1745    1.25      327
2igf      2668      2.6     0.72      19      1      0.63      14      1.03      1

4cpa      3249      6.3     0.82      281     201    0.72      45      1.51      147
1tgs      237       3.1     0.58      2       2      0.47      2       0.72      1
1cho      1721      10.9    0.43      60      40     0.71      36      0.80      6
2ptc      1219      8.2     0.44      27      20     0.47      10      1.15      3
1tec      2154      29.9    0.69      134     20     0.93      16      0.69      134
4sgb      827       7.3     0.59      10      10     0.54      10      1.09      6
2sec      2620      37.8    0.57      20      8      0.81      8       1.06      8
4tpi      739       4.0     0.36      20      2      0.38      2       0.91      2

2mhb      32        1.1     0.79      1       1      0.42      1       0.79      1
4hvp      4         0.7     0.92      1       1      0.54      1       0.92      1
4phv2     24        0.7     0.67      4       2      0.60      3       0.75      1

1fdl      285       2.2     1.69      28      44     1.15      33      1.97      22
2hfl      661       5.6     0.59      105     89     0.77      22      2.01      13
3hfm      1483      12.2    1.39      1146    30     1.11      36      1.97      30

C. Results from the docking and scoring algorithm: receptor active site surface versus full ligand surface

          Best                Match             Best rank         Interface
Complex   r.m.s.d.  No. of    time    No. of    r.m.s.d.  Rank    receptor
          (Å)       matches   (min)   sols      (Å)               pits
1cpk      0.51      1124      2.2     37        1.20      1       74
3dfr      0.49      867       0.7     202       0.65      3       60
4mbn      0.39      1963      0.5     360       0.48      1       15
4phv d    0.63      615       0.7     736       1.25      128     49
2igf e    0.72      3856      2.5     149       0.75      1       d41

4cpa      0.82      4931      2.1     730       1.51      28      49
1tgs      0.58      2220      1.0     48        0.72      1       63
1cho      0.43      15,232    4.2     389       0.80      4       69
2ptc      0.44      5748      2.0     147       1.15      4       55
1tec      0.69      16,057    4.9     223       0.69      7       62
4sgb      0.59      17,060    4.2     645       1.09      2       68
2sec      0.57      13,555    4.6     376       1.06      3       71
4tpi      0.36      6473      1.7     202       0.91      2       62

2mhb      0.79      1053      1.1     17        0.79      1       58
4hvp      0.92      327       1.1     2         0.92      1       101
4phv2     0.67      579       1.2     25        0.75      1       100

1fdl f    1.69      364       1.6     67        1.97      8       93
2hfl f    0.59      1138      2.1     101       2.01      1       100
3hfm f    1.39      15,439    9.7     313       1.97      6       124

A, The results obtained in the docking of the full receptor with the full ligand surfaces (except for the 3 immunoglobulin examples, see below). The 1st column lists the PDB code names. The best solution obtained by our program (in the sense of having the lowest r.m.s.d. with regard to the native orientation) is described in the next 3 columns: its r.m.s.d., rotation and translation are shown. The 5th column lists the r.m.s.d. of the expected solutions. The expected solutions are computed from the pruned critical points of the ligands and of the receptors in their original positions (see the text). The total number of (seed) matches found by the Geometric Hashing docking algorithm for each complex is given in the 6th column, along with the matching times (in minutes, on a Silicon Graphics workstation). The matching times do not include the times required for the scoring, which are given in B. The numbers of critical points describing the surfaces are given in the next 2 columns. The ligand critical points (caps) have been pruned as follows. For caps, we used the default pruning criteria, except where indicated otherwise. A stop (.) after the number indicates that all the caps were used (no pruning); this holds for small ligands and for the immunoglobulins; (a) indicates that all caps representing areas less than 1% of the probe ball are rejected; (b) indicates that all caps with areas less than 2% of the probe ball are dismissed. The receptor critical points (pits) have been pruned as follows. All pits were pruned with default values (i.e. those having area less than 1% or greater than 10% of the probe ball area are rejected), except for those marked with (c). There, the corresponding values are 0.5% and 12.0%. For the pits, a stop indicates that all the fused pits were used (i.e. no area pruning). No further pruning was applied. The minimal number of matched pairs required to consider a seed match is noted in the last column.

B, The number of solutions left after the surface contact scoring routine (those whose score is at least 2/3 of the maximum score obtained) is given in the 2nd column. The scoring time (in minutes) is given in the 3rd column. The lowest r.m.s.d. solution obtained by the matching algorithm is given in the 4th column. These solutions are the same as those listed in A. The ranks they are given by the scoring algorithm are shown in the 5th column. It should be noted that since we do not cluster the results by their transformations, many solutions above the best one essentially represent the same docked conformation. The ranks given by the scoring routine to the native complexes are listed in the 6th column. The 7th and 8th columns list the r.m.s.d. and rank of the expected solutions. These are computed from the critical points of the ligands and of the receptors in their original native positions (same as column 5 in A; see the text). The rankings of the PDB complexes and of the "expected solutions" are given in order to be able to gauge the performance of our matching algorithm. The r.m.s.d. and the rank of the first solution having an r.m.s.d. below 1.5 Å are shown in the last 2 columns (for the immunoglobulins, the first solution under 2.0 Å is shown). The "best" solution, and the highest ranking one having an r.m.s.d. less than 1.5 Å (or 2.0 Å), represent similar docking conformations, with a small variation.

C, The results obtained in the docking of the surface of the ligand with the surface of the active site of the receptor. The best solutions obtained here are the same as those obtained in the docking of the full receptor surfaces. The differences are, as expected, in the number of seed matches obtained by the Geometric Hashing algorithm, the number of potential solutions, the matching times, and in the ranking. Here the match time includes both the time of matching and of scoring (in minutes). It is noteworthy that the differences in times between C and A indicate a roughly linear relation with the size of the receptor. Solutions whose score is less than 2/3 of the maximum score obtained are rejected. The number of solutions remaining after the scoring routine is given in the 5th column. The r.m.s.d. and the rank of the first solution having an r.m.s.d. less than 1.5 Å (2.0 Å for the immunoglobulins) are listed in the 6th and 7th columns. As is the case for the full receptor surface, there are solutions with slightly larger r.m.s.d. than the best one, which rank higher. The receptor surface points used in the matching have been picked from the full sets by including all points at most 5 Å from any point in the interface. The interface is defined as described for A. The whole ligand surface was used, except for the 3 immunoglobulins as noted below. All the examples were run using the same thresholds and parameters as those used in A and B.

a For 2igf, "pits" were used for the ligand, and "caps" for the receptor.

b For the immunoglobulin-lysozyme complexes, only the epitope of the immunoglobulin was used. The epitope was defined as all the critical points at most 5.0 Å from any point in the interface. The interface is defined as all the pairs of points (ligand critical point, receptor critical point) for which the distance between the ligand critical point and the receptor critical point in the native orientation is at most 4.0 Å.

c For 4phv, due to the large number of "high contact" solutions obtained by our scoring procedure, we retain all solutions with rank at least as high as that given by the PDB orientation.

d For 4phv, due to the large number of high contact solutions obtained by our scoring procedure, we retain all solutions with rank at least as high as that given by the PDB orientation.

e For 2igf, pits were used for the ligand and caps for the receptor (as in A).

f For the immunoglobulin-lysozyme complexes, as only the epitope of the immunoglobulins was used previously, here we show the results of comparing the immunoglobulin epitopes with the lysozyme ones (see A).

Figure 1. A ribbon representation of the protein kinase complex (1cpk). The red ribbon is the receptor. In various colors, the inhibitor in its crystallographic orientation and the 12 top ranking models obtained in the docking experiment are shown. Note that the native orientation of the inhibitor is indistinguishable from the models. This Figure demonstrates that a considerable number of similar orientations can be generated (see the text).

Table 2B displays the results obtained by the next routine in our geometric suite of docking processes, which evaluates the geometric goodness of the docked solutions. This scoring routine rejects solutions having large overlaps between the receptor and the ligand, and credits those giving rise to larger molecular interfaces. The results are a large reduction in the number of potential docked conformations, and a ranking of the remaining solutions reflecting both the degree of benign interface coupling and the unfavorable body penetration. The second column lists the number of solutions that have not been rejected by the inter-penetration and surface contact scoring procedure. The CPU minutes required for filtering the solutions produced in the docking runs, whose numbers are noted in Table 2A, are listed in the third column. In the next columns, the Table displays the solution that achieved the best r.m.s.d. (obtained by comparing the ligand in the proposed orientation with the native ligand orientation, also shown in column 2 of Table 2A), and its ranking, based on the surface contact scoring it has accumulated. The ranking of the complex in its native orientation, and of the solution obtained by matching the critical cap-pit point pairs generated for the complex (i.e. the expected solution, also shown in Table 2A), are given in the next columns, along with the r.m.s.d. of the latter solution.
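The filtering and ranking step can be illustrated with a short sketch. The Table 2 legend states that solutions scoring below 2/3 of the maximum score are rejected; everything else here (the dictionary of precomputed scores, the function names, the assumption that scores are positive) is hypothetical, and the contact scoring function itself is not reproduced.

```python
def filter_and_rank(scores, keep_fraction=2 / 3):
    """Keep solutions scoring at least keep_fraction of the best score and
    return their ids ordered from highest to lowest score.
    `scores` maps a solution id to a precomputed, positive contact score."""
    if not scores:
        return []
    cutoff = keep_fraction * max(scores.values())
    kept = [(sid, s) for sid, s in scores.items() if s >= cutoff]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [sid for sid, _ in kept]

def rank_of(ranking, sid):
    """1-based rank of a solution within the filtered list."""
    return ranking.index(sid) + 1
```

For example, with scores of 90, 75, 61 and 59, the cutoff is about 60, so the last solution is rejected and the remaining three are ranked in that order.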

Inspection of the rankings of the docked solutions obtaining the best r.m.s.d. for each of the receptor-ligand pairs indicates that while they are high, they do not top the list. One reason for this is that there are ranges of solutions having similar conformations and differing very slightly in their transformations. Indeed, small variations between the rotations and translations of the solutions reflect, to a large extent, insignificant changes in the docked conformations. For example, clustering solutions having very similar transformations for the 1fdl test case, the "correct" solution, i.e. that resembling the crystallized complex, would have ranked 7th rather than 22nd (see also Figure 1). However, in our current implementation, no clustering of similar transformations is applied, because the precise thresholds of the rotations and translations that could conceivably be labeled as having the same conformation are still unclear. Another possible reason is that some of the high ranking solutions may indicate alternative binding sites. In the last two columns in Table 2B we show the highest ranking solution with an r.m.s.d. less than 1.5 Å. As can be seen, very high ranking is achieved. There are three exceptions to the above: 4phv, 4cpa and 1tec. The highest ranking of a solution with an r.m.s.d. less than 1.5 Å is 327 in the first case, 147 in the second and 134 in the third. While these rankings are not high, considering the total number of solutions we are left with (listed in the second column), they can still be considered reasonable. Note that this contact scoring routine tends to assign higher rankings to native-like solutions achieving closer contact than the rank assigned to the native orientation.

The combination of the r.m.s.d. obtained, the number of solutions, the ranking of the "correct" solution, and the CPU times required to achieve it is favorable. This is particularly striking when we consider the fact that absolutely no information about the location of the binding sites was included. The quality of the models obtained is illustrated in Figures 1 to 3. Figure 1 demonstrates the results of the docking of the protein kinase complex (1cpk). The crystallographic complex is shown along with the 12 top-ranking models obtained in the docking experiment. Note that the native orientation of the inhibitor is indistinguishable from the models. This Figure also demonstrates the point about the clustering of similar orientations discussed above. Figure 2 displays the native HIV-1 protease inhibitor scaffold with the model obtained from the best r.m.s.d. docking solution; the two superimpose excellently with only slight deviations. Figure 3 shows the superposition of the pancreatic trypsin inhibitor (4tpi) in its native orientation with the best solution obtained in the docking experiment.

Docking with the active site of the receptor

Above, we have presented the results obtained in the docking of the entire surfaces of both the receptors and the ligands, except for the three immunoglobulin-lysozyme complexes, where only the receptor epitopes were used. Table 2C displays


the results obtained when only the surfaces of the active sites of the receptors (and the entire surfaces of the ligands) are considered in the docking. (The immunoglobulin-lysozyme complexes are the exception: here, only the binding sites of both molecules were employed in the generation of Table 2C.) We have chosen to perform these calculations as well, as it is sometimes the case that the active site of the receptor for a particular ligand is known. As expected, with only the binding site regions, our docking algorithm performs very well. The matching times listed in Table 2C are exceedingly short, reflecting a much smaller number of solutions. The ranking of the correct solution is slightly higher. The r.m.s.d. values of the best solutions are the same as those obtained in the runs with the entire surfaces (Table 2A and B).

Energy evaluation

To gain an insight into the results of modeling, the geometrically acceptable docked conformations obtained from docking the full surfaces of the bound receptor and the ligand molecules (see above) undergo energy evaluation. Note that we apply this evaluation only for the crystallographic complexes that were used as test cases. This was carried out to assess the quality of the docking solutions in the bound molecules. This energy evaluation cannot be applied to the results of docking two molecules whose structures have been elucidated separately and that may require conformational changes upon docking. In the latter, "unbound" case, inter-molecular penetrations may occur. In solution, these would be alleviated by conformational changes of the receptor and ligand upon their association, to attain favorable interactions. The energy evaluation developed and applied here calculates the pairwise van der Waals and Coulombic interactions between the receptor atoms and ligand atoms, according to the following formulae:

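$$E = E_v + E_c$$

where $E_v$ and $E_c$ are determined by:

$$E_v = \sum_{i,j} \left( \frac{A}{R_{ij}^{12}} - \frac{B}{R_{ij}^{6}} \right)$$

$$E_c = \sum_{i,j} \frac{Q_i Q_j}{\epsilon R_{ij}}$$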

In the above equations, $R_{ij}$ is the distance between atom i of the receptor and atom j of the ligand. The van der Waals coefficients A and B and the atomic charges $Q_i$ and $Q_j$ are obtained from the parameterization of CHARMM, while $\epsilon$ is the dielectric coefficient, which we chose to be $4R_{ij}$. The interactions were turned off when $R_{ij}$ > 10 Å. To avoid numerical overflow, the energies were linearized when $R_{ij}$ < 35% of the sum of the van der Waals radii of the two atoms. Note that at this distance the pairwise van der Waals energy is about 30,000 times the magnitude of its minimum. Before energy evaluation, the transformation of each solution obtained from the previous geometric processes was applied to the ligand (for the test complexes only). If the energy accumulated to more than 1000 kcal/mol, the accumulation stopped and the routine reported the accumulated energy.
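The evaluation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the routine used in this work: the per-atom fields (`q`, `eps`, `rmin_half`), the combination rules used to obtain A and B, the conversion constant 332.06 and the tangent-line form of the linearization are all assumptions, since the text specifies only the functional form, the dielectric, the 10 Å cutoff, the 35% threshold and the 1000 kcal/mol cap.

```python
import math

COULOMB_K = 332.06       # kcal*A/(mol*e^2); assumed unit-conversion constant
CUTOFF = 10.0            # A; interactions are switched off beyond this distance
LINEAR_FRACTION = 0.35   # linearize below 35% of the summed van der Waals radii
ENERGY_CAP = 1000.0      # kcal/mol; stop accumulating above this value


def vdw(r, a, b):
    """Return the 12-6 term A/r^12 - B/r^6 and its derivative with respect to r."""
    return a / r**12 - b / r**6, -12.0 * a / r**13 + 6.0 * b / r**7


def coulomb(r, qi, qj):
    """Return the Coulomb term with dielectric 4r, K*qi*qj/(4 r^2), and its derivative."""
    e = COULOMB_K * qi * qj / (4.0 * r * r)
    return e, -2.0 * e / r


def linearized(term, r, r0, *args):
    """Evaluate term at r; below r0 use the tangent line at r0 (assumed linearization)."""
    if r >= r0:
        return term(r, *args)[0]
    e0, slope = term(r0, *args)
    return e0 + slope * (r - r0)


def interaction_energy(receptor, ligand):
    """Sum pairwise energies between receptor and ligand atoms.

    Each atom is a dict with keys xyz, q, eps and rmin_half (hypothetical layout).
    Returns (E_v, E_c, capped), where capped is True if accumulation stopped early.
    """
    e_v = e_c = 0.0
    for ri in receptor:
        for lj in ligand:
            r = math.dist(ri["xyz"], lj["xyz"])
            if r > CUTOFF:
                continue
            rmin = ri["rmin_half"] + lj["rmin_half"]   # sum of vdW radii
            eps = math.sqrt(ri["eps"] * lj["eps"])     # assumed combination rule
            a, b = eps * rmin**12, 2.0 * eps * rmin**6
            r0 = LINEAR_FRACTION * rmin
            e_v += linearized(vdw, r, r0, a, b)
            e_c += linearized(coulomb, r, r0, ri["q"], lj["q"])
            if e_v + e_c > ENERGY_CAP:
                return e_v, e_c, True
    return e_v, e_c, False
```

The early return mirrors the reported behavior of stopping once the accumulated energy exceeds 1000 kcal/mol, so a heavily clashing solution is rejected without visiting every atom pair.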

For the bound molecules, ranking the solutions by their interaction energies demonstrates the goodness of the geometrically based methodology. All 19 known complexes are analyzed by their intermolecular interaction energies. For each complex, all the available conformations, including the native, the expected and the docking ones, are ranked according to the negativity of the energies. Table 3 summarizes the evaluation. It lists the energies and ranking of the native conformations (PDB), as well as the highest

Figure 2. The HIV-1 protease inhibitor (blue) is superimposed with the best-r.m.s.d. docking model (clay). The widest dimensions face the viewer.


Figure 3. The pancreatic trypsin inhibitor (4tpi, shown in black) is superimposed with the best-r.m.s.d. docking model (gray). For clarity only the backbones are shown.

ranking near-native solutions produced by the docking algorithm (best(E); we consider a solution as near-native if its r.m.s.d. ≤ 2.0 Å for all ligand atoms against the native conformation). Table 3 also shows the energies and ranking of the expected and the lowest r.m.s.d. docking conformations (best(r.m.s.d.)), which have been listed in Table 2A. Note that a lowest r.m.s.d. docking conformation may have a higher energy than the one listed in the best(E) column. The energy and ranking of the native conformations are shown to assess the accuracy of the energy function. The rankings of the expected solutions are included to show the accuracy of our surface representation. Lastly, the best(E) and best(r.m.s.d.) columns demonstrate the quality of the docking orientations produced by our method. All the 19 native conformations possess negative interaction energies. Sixteen of them rank at the top, while two (1tgs and 2mhb) rank second and one (1cho) ranks 12th. All of the latter three show small energy deviations and atomic r.m.s.d. against the lowest-energy conformations.
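The near-native criterion and the best(E) selection can be illustrated with a short Python sketch. It assumes that every solution is supplied as an (energy, ligand coordinates) pair with atoms in the same order as the native ligand and already expressed in the receptor frame, so no superposition is performed; the function names are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def best_near_native(solutions, native, cutoff=2.0):
    """Return (rank, energy, rmsd) of the lowest-energy near-native solution.

    solutions: list of (energy, coords); rank 1 is the most negative energy.
    Returns None if no solution lies within the r.m.s.d. cutoff (in A).
    """
    ranked = sorted(solutions, key=lambda s: s[0])
    for rank, (energy, coords) in enumerate(ranked, start=1):
        deviation = rmsd(coords, native)
        if deviation <= cutoff:
            return rank, energy, deviation
    return None
```

In this sketch the rank is taken among the supplied solutions only; including the native and expected conformations in the list would reproduce the kind of ranking reported in Table 3.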

Table 3 shows that most of the best near-native docking conformations rank among the top few, while a few rank up to hundreds (the complex 4hvp is an exception, for which there are only five conformations, and the best is high in energy). The rankings of the models in Table 3 are generally comparable with those shown in Table 2A, possibly indicating a similarity between our simple surface contact score and this energy evaluation. The ranking of the conformations selected by the lowest docking r.m.s.d., also shown in Table 3, is quite interesting. Half of them rank within the top few, while some rank in the hundreds. These conformations are no doubt close to the native ones, judging by their very low r.m.s.d. (Table 2A). Their diversified energies and ranking indicate that around the native conformation, a slight displacement is sufficient to move the molecules into clashes. We note that the 12-6 van der Waals potential can go up to hundreds of kcal/mol through an atomic clash of a fraction of an ångström. Another factor affecting the ranking is the appearance of docking conformations that are far away from the native one and are subject to less severe clashes. These relatively comfortable conformations can be alternative binding states that either occur in low population in reality, or can serve as a reference for rational drug design or protein engineering.

In conclusion, the energy evaluation reveals how the molecules are coupled in the native complexes and in the models: the native conformations are found to possess the lowest energies; the docking generally meets the potential provided by the data structure, resulting in near-native models; the displacement of the atoms in these models is reflected in their energies against the native ones; such energies are not linear with respect to the r.m.s.d. values; alternative binding conformations can be identified by their low energies, while these energies appear never as low as the native ones. The energy evaluation routine enables one to assess the quality of the models obtained for the test complexes. However, since it is very sensitive to molecular surface penetrations, it is not applicable to cases where the receptor and ligand are crystallized separately, or when they are in solution.

Docking of unbound receptors and ligands

Inevitably, docking of "bound" receptor-ligand pairs, with the goal of an accurate reconstruction of the crystal complexes, constitutes merely a test case for a docking methodology. A "real-life", longer-range goal of every docking approach is being able to satisfactorily dock two molecules whose structures have been elucidated separately. The problem already becomes difficult if one considers crystallized receptors and ligands, and progressively gets much more so if one attempts to dock molecules whose structures have been solved by other, e.g. spectroscopic, means, or by predictive ones. We have decided to approach this problem in stages, treating first the crystallographically solved, unbound receptors and ligands. Owing to surface flexibility, rigid matching of the two unbound molecules is expected to result in molecular penetration. In


solution such penetration can be easily accommodated by molecular flexibility, particularly by groups of atoms found on molecular surfaces. This suggests that the docking requirements should not be as restrictive in the error thresholds that are allowed. Allowing larger angular variations between the receptors and ligands, relaxing the distance constraints and being more liberal in the scoring of the solutions are a necessity. Indeed, superimposing the crystal unbound on the bound molecules already registers numerous atomic clashes, as we demonstrate below. This being the case, we report partial results of applying our docking routines to three unbound examples. Specifically, we have generated the surfaces, carried out the docking and applied the surface contact scoring routine. The energy evaluation, based on van der Waals and Coulombic interactions, is not applied here.

There exist several examples of unbound receptors and unbound ligands that undergo minimal conformational changes upon docking. A method that performs well in the test case complexes will perform well in such examples, as no difficulty is added to the problem when the unbound molecules do not undergo significant conformational changes upon docking. Thus, instead of listing the results of such dockings, we chose to show our method's ability to predict a docking orientation in the more difficult cases, i.e. using unbound molecules that cannot dock in their unbound conformation and that require a significant level of conformational changes. For the cases we have examined, good results have been obtained. However, as expected, the number of solutions, the CPU times and the r.m.s.d. obtained are all larger than for the bound cases demonstrated above.

Table 3

Here, we have carried out docking for three unbound cases. The goodness of the docked solutions is gauged by superimposing the docked (originally unbound) molecules on similar molecules that have been crystallized together in complexes. These complexes are used here as reference molecules. The two corresponding molecules are superimposed by matching their Cα atoms. Table 4A lists the receptors and ligands we have examined, along with the r.m.s.d. resulting from the superimposition of the unbound on the reference molecules. This information is required for the assessment of the r.m.s.d. of the docked solutions, computed as the difference between the proposed orientation and the reference orientation of the ligand.

Table 4B displays the results obtained in the docking of the entire receptor and ligand surfaces of the trypsin and of the chymotrypsin and their respective inhibitors, and of the surface of the active site of the immunoglobulin with the full surface of the lysozyme. Inspection of this Table reveals that the unbound molecules have been docked satisfactorily. In all three cases, the r.m.s.d. deviations from the crystal, bound complexes are less than 1.5 Å. The expected r.m.s.d. obtained from the critical point matching pairs of the unbound, when superimposed on the bound, are of the same magnitude, reflecting the different conformations of the surfaces of the unbound versus the bound molecules at their interfaces. As expected, owing to the more liberal thresholds in the constraints, the numbers of

The energy evaluation on the native complexes and selected docking solutions

          PDB                             Expected  Best(E) (with r.m.s.d. ≤ 2.0 Å) r.m.s.d.      Best(r.m.s.d.)
Complex   E        E_v      E_c     Rank  rank      E        E_v      E_c     Rank  all    int.   rank

1cpk      -122.7   -81.0    -41.6   1     4         -112.5   -70.9    -41.7   2     0.54   0.47   3
3dfr      -50.6    -32.8    -17.8   1     68        -0.3     15.0     -15.2   62    0.53   0.53   133
4mbn      -51.1    -46.6    -4.5    1     2         -13.2    -7.2     -6.0    3     0.40   0.41   3
4phv      -71.8    -70.2    -1.6    1     2         -43.9    -41.0    -2.9    3     0.76   0.76   4
2igf      -71.6    -42.7    -28.9   1     6         -70.1    -39.9    -30.2   2     1.40   1.39   3

4cpa      -45.0    -42.3    -2.7    1     9         81.8     90.9     -9.1    244   1.77   1.12   757
1tgs      -104.8   -83.8    -21.0   2     1         17.6     40.0     -22.4   5     0.73   0.59   21
1cho      -73.5    -51.8    -21.7   12    3         -83.0    -68.2    -14.8   1     0.93   0.50   16
2ptc      -96.3    -75.2    -21.1   1     8         -75.6    -57.6    -18.0   2     0.44   0.45   2
1tec      -81.7    -71.3    -10.4   1     2         107.6    127.5    -19.9   80    0.67   0.85   80
4sgb      -66.5    -62.4    -4.2    1     3         -53.8    -51.2    -2.6    4     0.59   0.53   4
2sec      -83.0    -71.6    -11.4   1     4         -79.2    -66.8    -12.4   2     0.67   0.47   3
4tpi      -99.1    -78.4    -20.7   1     2         -81.5    -60.0    -21.5   3     0.82   0.45   6

2mhb      -89.2    -77.1    -12.1   2     1         -87.6    -68.7    -18.9   3     1.41   0.82   7
4hvp      -172.0   -143.7   -28.4   1     3         >1000    >1000    -9.6    4     1.03   1.06   5
4phv2     -159.7   -113.4   -46.3   1     14        -144.9   -91.9    -52.9   2     0.75   0.57   18

1fdl      -51.5    -54.1    2.6     1     2         868.8    875.1    -6.3    20    1.82   1.37   23
2hfl      -39.2    -18.4    -20.8   1     6         -26.6    13.5     -40.1   2     0.84   1.08   146
3hfm      -51.0    -38.6    -12.4   1     670       206.5    223.8    -17.3   40    2.00   1.20   500

The van der Waals energy (E_v), Coulombic energy (E_c) and their sum (E) (in kcal/mol), as well as the ranking on the energy scale, are listed for the native complexes in the left half of the Table. The ranking for the "expected" solutions is shown next. The list on the right half of the Table is for the highest ranking solutions (best(E), best on energy scale) which are near-native. The ranking of the solution having the lowest r.m.s.d., best(r.m.s.d.), is also listed (see Table 2A). CPU time spent on evaluating each solution was less than 1 second for all cases. For the smaller ligands, the time ranged from 0.02 to 0.35 second. For the larger ligands, the time ranged from 0.25 to 0.75 second. See Results for a discussion.


Table 4

Docking of unbound molecules

A. Data

Unbound   Ref.      r.m.s.d. versus  No. of Cα  Unbound    Ref.       r.m.s.d. versus  No. of Cα
ligand    ligand    ref. (Å)         matched    receptor   receptor   ref. (Å)         matched

2ovo      1choi     0.70             49         5cha       1choe      0.36             230
4pti^a    2ptci     0.40             56         2ptn       2ptce      0.33             223
1lyz      2hfly     0.57             128        2hfl       2hflvh     0.00             426

B. Results from the docking algorithm: full receptor surface versus full ligand surface

                    Best solution                           Expected      No. of match  Time    Ligand  Receptor
Ligand    Receptor  r.m.s.d. (Å)  Rot. (rad)  Trans. (Å)    r.m.s.d. (Å)  seeds         (min)   caps    pits

2ovo      5cha      1.32          0.14        2.60          1.04          119,907       29.7    238     360
4pti      2ptn      1.01          0.06        1.85          1.39          235,510       46.5    232     377
1lyz      2hfl^b    1.28          0.07        3.21          1.65          172,294       55.8    452     62

C. Results from the scoring procedure

                    No. of   Time    Best solution         PDB      Expected sol.         Best rank
Ligand    Receptor  sols     (min)   r.m.s.d. (Å)  Rank    rank     r.m.s.d. (Å)  Rank    r.m.s.d. (Å)  Rank

2ovo      5cha      8490     16.9    1.32          214     54,641   1.04          3439    1.32          214
4pti      2ptn      27,481   54.1    1.01          6013    52,577   1.39          41,468  1.04          4271
1lyz      2hfl      8957     38.6    1.28          2436    2435     1.65          762     1.97          1497

A. The unbound molecules used in the docking and their reference, bound, receptor-ligand complexes. The r.m.s.d. of the unbound molecules with their corresponding reference molecules is obtained after superimposing the Cα atoms of the unbound ligand (receptor) on those of the complexed reference ligand (receptor). The number of Cα atoms matched is the number of Cα atoms that were used to compute the best superposition.

B. The results of the docking of the surfaces of the unbound receptors with the unbound ligands. The PDB names of the unbound ligands and receptors are listed in the first two columns. The columns titled Best solution list the lowest r.m.s.d. solutions obtained by the matching algorithm. The r.m.s.d. is measured by the difference of the positions of the unbound ligand solution with the position of the unbound ligand after superimposing it on its reference ligand. The r.m.s.d., rotation and translation of the optimal solutions are shown. The expected r.m.s.d. is computed from the critical points of the unbound ligand and receptor when brought into the docking orientation of the reference complex. Such docking orientation was computed separately by superimposing the unbound ligand onto its reference ligand and the unbound receptor onto its reference receptor. The number of seed matches produced by the Geometric Hashing matching algorithm, the matching times (in minutes, without the time of scoring, which is shown in C) and the number of ligand critical points (caps) and receptor critical points (pits) are listed. There is no change in the parameters used in each unbound example as compared to those used in Table 2. The only difference is that more overlap (molecular penetration) was allowed by the contact scoring routine.

C. The PDB code names of the unbound ligands and receptors are listed. The next columns list the numbers of solutions left after the surface contact scoring, the scoring times (in minutes), and the best r.m.s.d. obtained along with its ranking. The rankings of the unbound molecules in the reference orientations of the first two examples are much lower (worse), owing to the numerous clashes (penetrations) between the receptors and the ligands. The r.m.s.d. and rank of the expected solution are computed from the critical points of the ligand and the receptor in their docked position according to their reference complex. Shown are the ranks that such solutions would achieve by the scoring algorithm. The r.m.s.d. and the rank of the first solution having an r.m.s.d. below 2.0 Å are shown.

a In 4pti, parts of the following side-chains were truncated before docking: Arg1, Lys15, Arg17, Lys26 and Lys46.

b For this immunoglobulin-lysozyme example, only the epitope of the immunoglobulin was used, as in the immunoglobulin examples of Table 2. The epitope was computed after superimposing the lysozyme 1lyz onto chain Y of 2hfl.

solutions (see also Table 4C) for the three cases are higher, and these are reflected in the larger matching CPU times that are required in the docking. Nevertheless, these times are still relatively short, less than one hour for both the trypsin (47 minutes) and the chymotrypsin (30 minutes). The docking of the active site of the immunoglobulin with the entire lysozyme surface required 56 minutes. Table 4C shows the results of these dockings after the application of the surface contact scoring routine. The rankings of the lowest r.m.s.d. solutions are clearly not as high as those achieved for the bound cases. It is noteworthy that the rankings of the unbound molecules in the reference orientation (PDB rank) matchings and of the expected ones are much lower for trypsin and chymotrypsin with their inhibitors, owing to the molecular penetrations produced by the superimpositions. Indeed, as can be seen from this Table, the intermolecular overlap of these solutions is so large that such docked conformations are unacceptable. This demonstrates the problem inherent in docking a receptor and a ligand whose structures have been determined when they are not in contact with each other and that induce conformational changes upon docking. Note that the three examples explored here were chosen specifically because they are known to undergo significant conformational changes upon docking.


Docking was also carried out for the unbound molecules using only the active sites of the receptors, with the entire ligand surface for the two proteases and the lysozyme binding site in the third, immunoglobulin example (results not shown). The quality of the best solutions was identical with that obtained using the full surface. However, the number of orientations obtained and the ranks achieved by the contact scoring routine are from one to two orders of magnitude smaller. Correspondingly, shorter times are achieved (from 3 to 11 minutes only). For example, in the 1lyz-2hfl experiment, only 2635 solutions were obtained, the best solution ranked 84 and the run took only 3.8 minutes.

Discussion

We have described a suite of geometric docking processes that we have assembled, and their performance. The first of these is the construction of a surface representation by sparse critical points. We have demonstrated that at the interface of the complexes, the points supplied by this representation are remarkably compatible in both their location and the orientation of their associated normals. In a wide range of point densities, from initially about six points per surface atom to about 0.5 after pruning, they have been demonstrated to be capable of supporting quality matches between the two surfaces. The economy of the data structure directly lowers the burden of computational complexity. More importantly, the quality of the representation enables surface complementarity to be embodied by an abundant number of superimposed points and aligned inter-point vectors and surface normals at the interface. Therefore, a search for geometrically complementary surface areas can be realized by looking for a collection of compatibly arranged points and aligned vectors. This realization effectively confines the sampling of the conformational space to the very neighborhoods where complementarity can possibly exist. In contrast, often-adopted approaches such as Monte Carlo and exhaustive sampling search the space indiscriminately, leaving a trace of idle steps. Normally, the conformational space is severely underpopulated. Being able to locate the target areas in advance presents a great advantage. Our representation provides docking algorithms with such an advantage.

The second ingredient of our approach is the Geometric Hashing, Computer Vision based algorithm. This technique is especially geared to matching of three-dimensional, unconnected point sets. It uses a transformation invariant representation of the surface descriptors. The redundant representation of the points in many reference frames ensures the detection of matches of subsets of the points. The efficient organization of the ligand's geometric information in a table allows for fast detection of matching points. The rationale behind the success of this approach is that one can avoid grid searches of the entire conformational space. Any match between the ligand and the receptor has to involve at least two critical points describing the molecular surface. Thus, there is at least one matching reference frame in both molecules. It suffices to compare the coordinates of the other critical points in these reference frames. Our docking algorithm specifically exploits the surface information by including both distance and angular constraints within the matching procedure.
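The table-and-vote idea described above can be sketched in Python as follows. This is a simplified illustration, not the implementation used in this work: the bin size, the choice of the pair midpoint as origin and all function names are assumptions, and the sketch matches like against like, whereas docking pairs receptor pits with ligand caps whose normals point in opposite directions and applies additional distance and angular constraints.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

BIN = 1.0  # hypothetical bin size (A) for quantizing invariant coordinates


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(v):
    n = math.sqrt(dot(v, v))
    return None if n < 1e-9 else tuple(x / n for x in v)


def frame(p1, n1, p2, n2):
    """Orthonormal frame from two points and the mean of their normals."""
    x = unit(sub(p2, p1))
    if x is None:
        return None
    mean_n = tuple((a + b) / 2.0 for a, b in zip(n1, n2))
    # Remove the component of the mean normal along x, then normalize.
    z = unit(sub(mean_n, tuple(dot(mean_n, x) * c for c in x)))
    if z is None:
        return None  # degenerate: mean normal is zero or parallel to x
    origin = tuple((a + b) / 2.0 for a, b in zip(p1, p2))
    return origin, (x, cross(z, x), z)


def key(point, fr):
    """Quantized coordinates of a point in a frame (rigid-motion invariant)."""
    origin, axes = fr
    d = sub(point, origin)
    return tuple(round(dot(d, axis) / BIN) for axis in axes)


def build_table(points, normals):
    """Preprocessing: store every ligand point in every ligand reference frame."""
    table = defaultdict(list)
    for i, j in permutations(range(len(points)), 2):
        fr = frame(points[i], normals[i], points[j], normals[j])
        if fr is None:
            continue
        for k, q in enumerate(points):
            if k not in (i, j):
                table[key(q, fr)].append((i, j))
    return table


def vote(table, points, normals, i, j):
    """Recognition: votes for ligand frames given receptor frame (i, j)."""
    votes = Counter()
    fr = frame(points[i], normals[i], points[j], normals[j])
    if fr is None:
        return votes
    for k, q in enumerate(points):
        if k not in (i, j):
            votes.update(table.get(key(q, fr), ()))
    return votes
```

A ligand frame that collects many votes for a given receptor frame identifies a candidate rigid transformation, namely the one that superimposes the two frames. Because the stored coordinates are invariant under rotation and translation, the table is built once per ligand and reused for every receptor frame.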

A "brute force" search on the three rotational and three translational parameters between the ligand points and the receptor points would require $L^3R^3$ steps, where L is the number of ligand points and R is the number of receptor points. The independent preprocessing of the ligand surface information and its indexed storage in a table allows considerable reduction of the search time. The time required is proportional to the number of reference frames considered during recognition, which is $O(R^2)$. (Reference frames are constructed using two independent surface points and a vector. The latter is defined as the mean of the normals of the two points.) As the table is accessed for each receptor surface-point on each reference frame, the number of accesses is $O(R^3)$. However, because of the vicinity and the reference frame construction constraints (see Appendix II), the actual number of accesses is much smaller, leading to a very efficient search procedure.
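To give a feeling for these orders of magnitude, the following Python snippet evaluates the three counts for the numbers of critical points listed for the 2ovo-5cha example in Table 4B (238 ligand caps and 360 receptor pits). The figures are upper bounds that ignore constant factors and the pruning constraints mentioned above; the function name is hypothetical.

```python
def search_counts(n_ligand, n_receptor):
    """Upper bounds on search steps for the brute-force and hashing schemes."""
    return {
        "brute_force": n_ligand**3 * n_receptor**3,   # L^3 R^3 steps
        "reference_frames": n_receptor**2,            # O(R^2) frames
        "table_accesses": n_receptor**3,              # O(R^3) accesses
    }


counts = search_counts(238, 360)
for name, value in counts.items():
    print(f"{name}: {value:.3e}")
```

With these inputs the brute-force bound exceeds the bound on table accesses by a factor of $L^3$, i.e. about 1.3 × 10^7.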

The critical importance of geometry in the binding of two molecules is evident also in the successful ranking achieved by the surface contact based scoring routine. While this routine does not calculate the buried surface area between the two molecules, it counts the atoms that are in contact with each other. It allows some surface atom penetration, and penalizes for deeper, interior atom penetration. In most of the examples, this procedure succeeds in ranking the correct solution at the very top. However, the routine tends to favor closer than native-contact solutions.
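The behavior of such a contact-based score can be illustrated by the following simplified Python sketch. It is not the scoring routine used in this work: the distance thresholds, the penalty weight and the surface/interior flag on receptor atoms are hypothetical choices made only to show how contacts are rewarded, shallow surface penetration is tolerated and interior penetration is penalized.

```python
import math

CONTACT_MAX = 4.5        # A; hypothetical upper distance for a contact
CLASH_MAX = 2.5          # A; hypothetical distance below which atoms penetrate
INTERIOR_PENALTY = 10    # hypothetical penalty per interior-atom penetration


def contact_score(ligand_atoms, receptor_atoms):
    """Score a docked pose.

    ligand_atoms: list of (x, y, z) coordinates.
    receptor_atoms: list of ((x, y, z), is_surface) pairs.
    Each ligand atom in contact with the receptor adds 1; penetration of a
    surface atom is tolerated, penetration of an interior atom is penalized.
    """
    score = 0
    for lig_xyz in ligand_atoms:
        in_contact = False
        for rec_xyz, is_surface in receptor_atoms:
            d = math.dist(lig_xyz, rec_xyz)
            if d < CLASH_MAX:
                if not is_surface:
                    score -= INTERIOR_PENALTY
            elif d <= CONTACT_MAX:
                in_contact = True
        if in_contact:
            score += 1
    return score
```

Because the score grows with the number of contacting atoms, poses that pack more tightly than the native one can outrank it, which is consistent with the tendency noted above.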

We have used the energy evaluation routine to examine the outcome of previous processes on the test complexes only, to gain an insight into what kind of intermolecular interactions the docked solutions exhibit. For this purpose, the routine uses the traditional pairwise 12-6 Lennard-Jones type potential for the van der Waals interactions and the Coulombic function for the electrostatic interactions. For the bound complexes, this evaluation consistently obtained the lowest energies for the native conformations even without taking into account all the contributions to the real binding free energy (e.g. the hydrophobic effect). Although the native interface must have been optimized to a certain degree during crystallographic structure refinement, it is still possible that low-energy conformations exist in different complexes of the two component molecules; these conformations may not be observable in nature due to other unfavorable factors such as entropy, but can be suggested by docking. Since our docking has sampled the entire molecular surfaces, it appears that at the microscopic level, the molecules have their binding sites in the best


complementarity of geometrical shape and of charge arrangement, even before taking into account the macroscopic factors such as conformational and environmental entropies.

As is the case for the surface contact routine, the energy evaluation consistently found low r.m.s.d. solutions among a few tens of the lowest-energy conformations. The energies of these solutions are usually negative and close to those of the native ones, indicating that they are virtually the same complexing conformations. There are also low-r.m.s.d. solutions that have high energies, caused by costly van der Waals interactions. There are also low-energy docking solutions that are high in r.m.s.d. These solutions may indicate potential alternative binding sites. We should note that this energy routine does not purport to assess the binding free energy; it serves only as an assessment of the goodness of the models proposed for the known complexes.

The advantages outlined above clearly indicate the potential of our suite of docking processes. In particular, the time requirements and the accuracy of the proposed solutions are noteworthy. Here, we have demonstrated that our method works satisfactorily for the 19 test complexes we have examined. This is the first required step in any docking methodology. However, some further improvements are needed before this suite can be applied readily to scanning databases of drugs or potential inhibitors. First, robust, objective and consistent pruning criteria of the critical points describing the surface are needed. Although the different pruning thresholds vary very little, a completely automated tool is highly desirable. In the examples shown here, we could have assigned a single, unique set of thresholds. However, when run under the same pruning criteria, while similar (or better) r.m.s.d. values have been achieved, larger numbers of solutions and longer CPU times resulted as well. In the derivation of robust pruning criteria, it may be reasonable to allow for some differences in the thresholds depending on the nature of the ligands, e.g. different critical point resolutions can be expected when docking small drugs or large peptides. Second, an efficient filtering procedure is always required. This applies to any docking methodology, due to the large number of solutions that are generated. The surface contact scoring procedure outlined here filters solutions by their contact area between ligand and receptor. Despite its effectiveness in test cases, when scanning large databases such a scoring filter might not be adequate, as larger ligands may achieve higher scores. Normalization of the score by the ligand surface area may resolve such a problem.

In this work, docking conformations covering whole molecular surfaces were computed and filtered with a high level of efficiency. Conformations very close to the native ones were found among top-ranked solutions. When docking the components of known complexes, how close a solution is to the native conformation can be gauged by the r.m.s.d. values. The ranking system we present here shows consistency with the r.m.s.d. gauge. Its effectiveness in identifying native-like solutions is therefore verified with the naturally occurring complexes and is expected to be predictive with such complexes. However, if molecules change shape during binding, the effectiveness of the ranking system may be compromised, depending on the extent of such changes. Although the similarity between complexed and isolated proteins has been argued (Janin & Chothia, 1990; Cherfils & Janin, 1993), a range of flexibilities must exist. This is particularly true for surface atoms. Here, we have shown three examples of docking unbound molecules that require significant conformational changes upon binding. Although solutions close to the correct orientation were achieved, we have had only modest success in ranking them at the top of the list.

We have investigated two approaches dealing with more flexible systems. The first approach is completely within the rigid body doctrine and the framework presented here. It deals with moderate surface flexibility. In the surface representation, we will retain fewer points and perform certain smoothing, in order to remove more surface details. Alternative ways to construct the reference frames for the matching procedure will be explored. Here, two critical points and the average of their normals serve as the basis for the reference frame construction. Owing to surface flexibility, this choice may prove less robust when the ligand and the receptor have been crystallized separately, are in solution, or are modeled structures. A different choice is to build a reference frame on three critical points. Although this increases the number of reference frames built on each molecule, smaller subsets of the critical points might suffice. The geometric scoring system will tolerate greater penetration, while compensating by rewarding the information provided by other surface features. The energy evaluation will adopt a "softened" potential, allowing assessment as well as filtering of the solutions. These developments will be presented separately.

The second approach is to include additional degrees of freedom. Complete atomic freedom is out of the question; it is far too expensive. In that respect, we should note that molecular dynamics can allow all atoms to move; however, within the computing time affordable nowadays it can explore only very few distinct conformations, and is thus better used as a stage of optimization of the solutions predicted by our fast method. Recently, we have implemented a robotics based algorithm that allows motions on hinges (Sandak et al., 1994). While the likely hinge sites in one of the molecules are predefined, the rotational parameters between the parts connected by the hinge are not. Our current implementation is for a single hinge; however, it can be extended straightforwardly to multiple hinges. Initial results in this direction are promising.

The efficiency of the surface representation and geometric matching methodology presented here is encouraging. For the bound cases, the energy assessment shows the validity of the high-scoring geometrical solutions. Improvements along the lines discussed above are projected to allow extension to unbound cases, including NMR and computed structures, and a large-scale application to the scanning of databases of drugs and potential inhibitors. Such work is in progress.

Methods

Our method consists of three major steps: (1) surface representation construction; (2) geometrical matching; and (3) inter-penetration and surface contact scoring.

The molecular surface representation

Our surface representation consists of a sparse set of critical points nicknamed "caps", "pits" and "belts". These critical points are the face centers abstracted from the convex, concave and saddle areas of the molecular surface (Lee & Richards, 1971; Richards, 1977; Connolly, 1983). They occupy positions key to the shape of the surface and are uniquely and accurately defined. Details of the representation have been described (Lin et al., 1994). In the following, we provide a summary. We will use the terms "face center" and "critical point" interchangeably when there is no confusion, although the former refers more strictly to the initial set of critical points, which may be reduced to other sets by the pruning processes described in Appendix I.

The face centers are computed upon the faces composing the molecular surface. By virtually rolling a probe ball over the van der Waals surface of a molecule, the molecular surface is created as a mosaic of three types of faces: convex, concave and toroidal. The centroid of each face is determined, and projected onto the surface (in a direction normal to the face) to yield the face center. The face centers, together with their surface normals at the spot, comprise the set of critical points. The points were also given a size, which was the area of the faces they covered. The set is divided into subsets labeled as caps, pits and belts, each consisting of points originating from the convex, concave and toroidal faces, respectively. Because the faces mount on the surface of the exposed atoms and the dents and seams among them, the face centers cover the strategic locations of the molecular surface.

In the present implementation, we calculate the centroids by numerical integration over dense surface dots, which are obtained by a procedure based on Connolly's MS-DOT program (for further information, see Lin et al., 1994). In this study, we use a dot density of 10 dots/Å². This density introduces an uncertainty estimated to be about 0.1 Å in the location of the face centers and about 4° in the orientation of their normals (Lin et al., 1994). Atomic van der Waals radii were obtained from the extended-atom module of the CHARMM parameterization (Brooks et al., 1983). The rolling ball was given a radius of 1.8 Å in order to mimic an average organic atom.

With no exceptions, every molecule we have examined has about six face centers per surface atom. At this density of the surface representation, most of the subatomic details are removed. Geometrical details at the fine subatomic level are less important, as the spherical approximations become less accurate. As we have found, there is still room for further reduction from the initial density to obtain an even more succinct data structure. The key concern in designing pruning schemes is to preserve surface shape at the atomic level. We have employed three kinds of pruning operations: (1) using subsets of the face centers according to their original face types; (2) fusing the points that are close to each other at subatomic distances; and (3) removing the points that cover an area too small or too large.

The first operation has the effect of thinning the interlacing points. The second operation smooths out more subatomic details. The third operation removes the points that are less descriptive or potentially misrepresentative of the surface shape (see Appendix I for details and for the default pruning thresholds).

In docking, we would not anticipate the same number and the same identity of the points to be paired for all receptor-ligand pairs. For example, the docking simulation may require a larger number of superimposable critical point pairs or better angular values of the normals. Thus, for some of the examples, the pruning thresholds would adopt slightly different values, as will be presented below.

An example of the results after the default pruning operations is illustrated in Figure 4. The molecules are the HIV-1 protease and its inhibitor in the complex 4phv (Bone et al., 1991). The Figure demonstrates that after pruning to a density of 0.5 critical point per surface atom for both molecular surfaces, there is still a substantial number of critical points superimposed with their normals aligned at the interface.

The matching algorithm

The basic outline of our Geometric Hashing matching algorithm has been described (Nussinov & Wolfson, 1991; Bachar et al., 1993; Norel et al., 1994b; Fischer et al., 1994). Here we sketch its principles and the distinct features of the current application, which exploits both the critical points and their associated normal information.

Figure 5 presents a flow-chart of the algorithm. There are two stages in the algorithm: the first is the preprocessing, the second is the recognition. The critical feature of this Computer Vision based methodology is its rotation and translation invariant representation of the coordinates of the critical points in many different reference frames (also denoted coordinate frame, reference set or r.f. in Figure 5). This rotation and translation invariant representation allows direct matching, avoiding the exceedingly time-consuming steps involved in full conformational, grid-like searches. In previous implementations, we have used distances between triplets of points as invariants (Bachar et al., 1993; Norel et al., 1994a) or three points to define a reference frame (Cα atoms in Nussinov & Wolfson, 1991 and Fischer et al., 1994). Here, we build a Cartesian reference frame for each pair of (critical) points, and a direction (the mean of their normals). Such a reference frame can be defined unambiguously (see Appendix II). The coordinates of other critical points, within a given radius, are represented in these coordinate frames.

In the preprocessing stage, typically the smaller molecule (i.e. the ligand) is considered. For each reference frame in the ligand, the coordinates of the other critical points are computed and stored in a table for use during the recognition step. Since the main purpose of this table structure is to speed up the recognition stage, it is organized to allow direct access during recognition. Specifically, the address of each table location is the critical point coordinates, and the information stored at this location is the point identity and the reference frame in which these coordinates have been obtained. Notice that each critical point will be stored in many reference frames based on pairs of critical points in its vicinity. This redundant representation allows efficient handling of a partial matching situation (since the active site is a priori unknown, we expect only a partial surface fit) and precludes docking failure due to local mismatch. The above-mentioned preprocessing of the ligand was applied only to critical points labeled as caps, while the receptor has been represented by pits.

In the recognition stage, the other molecule is considered, typically the receptor. A similar calculation is carried out on the receptor critical points (pits) and their normals. For each reference frame in the receptor, the coordinates of the other receptor critical points are computed. For each such set of point coordinates, the table is accessed at the address defined by them in order to find matching ligand caps that had close enough coordinates in their reference frames. A match ("vote") is registered for a ligand reference frame if the coordinates are within the allowed error threshold distance, and if additional "goodness"-of-match criteria are satisfied. These include acceptable matching of the directions of the corresponding receptor-ligand normals. In Appendix II, we describe these criteria and the constraints imposed on the matching. If a certain ligand reference frame scored K votes, this serves as evidence that a superimposition of this frame with the current receptor reference frame will result in the alignment of at least K receptor/ligand critical points with their associated normals. Thus one can compute the three-dimensional rotation and translation that results in a best least-squares fit for the frames and all the K critical point pairs that contributed to the vote of the ligand reference frame. This transformation can be further improved by several straightforward iterations (see Appendix II). The resulting matching critical point pairs and the corresponding transformation are referred to as a "seed match". Subsequently, each seed match is verified using the inter-penetration and surface contact filter (see below) and optimized using the procedure described in Appendix II.
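The table organization and the voting step described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the bin width and the record layout are assumptions made for illustration, the invariant coordinates are taken as already computed in each reference frame, and the additional goodness-of-match criteria of Appendix II are omitted.

```python
from collections import defaultdict
from itertools import product

BIN = 1.5  # assumed bin width (Å); must be >= the matching tolerance used in vote()


def key(coord, bin_width=BIN):
    """Quantize invariant coordinates into an integer table address."""
    return tuple(int(c // bin_width) for c in coord)


def build_table(ligand_frames):
    """Preprocessing. ligand_frames: {frame_id: [(point_id, (a, b, c)), ...]}.

    Each record stores the point identity and the frame in which the
    coordinates were obtained, addressed by the (quantized) coordinates.
    """
    table = defaultdict(list)
    for frame_id, points in ligand_frames.items():
        for point_id, coord in points:
            table[key(coord)].append((frame_id, point_id, coord))
    return table


def vote(table, receptor_points, eps=1.5):
    """Recognition for ONE receptor frame.

    receptor_points: [(point_id, (x, y, z)), ...] in that frame.
    Returns {ligand_frame_id: [(receptor_point_id, ligand_point_id), ...]}.
    """
    votes = defaultdict(list)
    for r_id, (x, y, z) in receptor_points:
        kx, ky, kz = key((x, y, z))
        # Scan the 27 neighbouring bins so no record within eps is missed.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for frame_id, l_id, (a, b, c) in table.get((kx + dx, ky + dy, kz + dz), ()):
                if abs(a - x) <= eps and abs(b - y) <= eps and abs(c - z) <= eps:
                    votes[frame_id].append((r_id, l_id))
    return votes
```

Ligand frames whose vote list is long enough would then be passed on to the least-squares fitting step; the stored point pairs are exactly the correspondences that step needs.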

The surface contact scoring function

Each candidate solution computed in the previous stage is evaluated and ranked according to the "contact area" between the docked molecules (Norel et al., 1994b). This procedure is used iteratively in the transformation improvement stage described in Appendix II. Specifically, the receptor is placed onto a grid of 1 Å × 1 Å × 1 Å voxels, and the grid voxels within the receptor are divided into three layers as follows. First, the atomic centers of the receptor are projected onto the grid. For each non-surface (interior) atom center, the voxels within the van der Waals radius of this atom plus 1 Å are labeled as interior voxels. For each surface atom center, the voxels within the van der Waals radius of the atom are labeled as intermediate voxels. In case of a conflict, the intermediate label overrides the previous one. Finally, the MS dots (Connolly, 1983) of the receptor surface, sampled at a density of 5 dots/Å², are projected onto the grid, and each voxel where such a dot falls is labeled as a surface voxel. Again, the latest labeling overrides the previous one. Note that this receptor preprocessing is done only once and the same structure holds for each subsequent verification.
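The three-layer labeling can be sketched as follows. This is an illustrative reading of the description, not the routine used in this work: the sparse dictionary grid, the function names, and the rule that a voxel counts as "within" a radius when its center does are all assumptions.

```python
import math
from itertools import product

INTERIOR, INTERMEDIATE, SURFACE = 1, 2, 3


def voxel(p, spacing=1.0):
    """Index of the grid voxel containing point p."""
    return tuple(int(math.floor(c / spacing)) for c in p)


def voxels_within(center, radius, spacing=1.0):
    """Yield indices of voxels whose centers lie within `radius` of `center` (assumed rule)."""
    lo = voxel([c - radius for c in center], spacing)
    hi = voxel([c + radius for c in center], spacing)
    for idx in product(*(range(l, h + 1) for l, h in zip(lo, hi))):
        voxel_center = [(i + 0.5) * spacing for i in idx]
        if math.dist(voxel_center, center) <= radius:
            yield idx


def label_receptor(interior_atoms, surface_atoms, surface_dots, spacing=1.0):
    """atoms: [(xyz, vdw_radius), ...]; surface_dots: [xyz, ...]. Returns {voxel: label}."""
    grid = {}
    for xyz, radius in interior_atoms:
        for v in voxels_within(xyz, radius + 1.0, spacing):  # vdW radius plus 1 Å
            grid[v] = INTERIOR
    for xyz, radius in surface_atoms:
        for v in voxels_within(xyz, radius, spacing):
            grid[v] = INTERMEDIATE  # overrides an earlier interior label
    for dot in surface_dots:
        grid[voxel(dot, spacing)] = SURFACE  # latest label overrides the previous one
    return grid
```

Because later passes simply overwrite earlier labels, the order of the three loops encodes the override rules stated in the text.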

Figure 4. The scaffolds of the HIV-1 protease (brown) and an inhibitor (cyan) are shown in (a). (b) The pits dotting the protease surface (brown stars) and their normals (red lines), as well as the normals of the caps of the inhibitor (green lines). Both the caps and the pits have been pruned to a density near 0.5 points/surface atom. The fusing is operated only on the pits with a 1.5 Å threshold. For the area-pruning, caps smaller than 5% of the probe ball area are pruned, while pits smaller than 1% and larger than 10% of the probe ball area are pruned. The normals are shown for the points at the interface only, pointing outward for the inhibitor caps and inward for the protease pits. The caps and pits that can find a counterpart within 2 Å are regarded as at the interface. The binding tunnel of the protease faces the viewer. (c) The skeleton of the inhibitor, colored according to atom types (C, white; N, blue; O, red). (d) The dotted surface of the inhibitor, where highlighted face centers can be spotted on each of the convex (gray), concave (red) and saddle-shaped (blue) faces.

[Figure 5 flow-chart. PREPROCESSING: ligand critical points (caps) → choose a ligand r.f. (l1, l2) → for all other points l3, compute their transformation invariant coordinates (a, b, c) and add the record [r.f. (l1, l2), l3] to the hash table (HT) at (a, b, c) → next r.f. RECOGNITION: receptor critical points (pits) → choose a receptor r.f. → for all other points, compute their transformation invariant coordinates (x, y, z), consult the HT at (x, y, z) and, for each r.f. (l1, l2) found, tally a vote → consult votes: for each r.f. (l1, l2) with a high number of votes, find the best least-squares match, verify the match and output → clear votes.]

Figure 5. A flow-chart of our Geometric Hashing docking algorithm. The flow-chart outlines the 2 major stages in our matching technique: the preprocessing and the recognition. The preprocessing is carried out on the critical points (caps) describing the smaller (ligand) molecules. Reference frames (r.f.) are built on all pairs of critical points that are to be used in the matching (after pruning) and their normals. The coordinates of all other caps within some radius are calculated in these reference frames. This results in a redundant, transformation invariant representation of the ligand caps. The points and their reference frames are stored in a table using the coordinates of the points as keys (i.e. addresses). During the recognition stage, pairs of receptor pits and their surface normals are similarly employed in the construction of the reference frames. Other pits within some radius are represented in these reference frames. The coordinates of these pits constitute keys to access the table and carry out the matching. Similar coordinates of a ligand cap and a receptor pit register a "vote" for the appropriate ligand reference frame. The number of votes is subsequently counted. Reference frames that have accumulated a large enough number of such votes are kept, and the least-squares match of the point pairs is calculated. The next step in the process, not shown here, involves filtering the solutions by their surface contact (and discarding solutions involving penetration).

Each candidate transformation is evaluated both for ligand-receptor inter-penetration and for their surface contact score. To discard hypotheses resulting in inter-penetration, the ligand atomic centers are transformed onto the receptor grid. If some of them fall into interior voxels, the hypothesis is discarded due to penetration. One may consider discarding at this stage hypotheses where a certain number of ligand atomic centers fall into intermediate voxels. The hypotheses that have not been discarded are evaluated and ranked according to their surface contact. To evaluate this contact, the ligand MS surface dots are mapped onto the receptor grid and the following score is computed:

$$\text{score} = s - f(j) - 10i$$

where s, j and i are the numbers of ligand MS dots falling into surface, intermediate and interior voxels, respectively. i is required to be less than a specified threshold (between 0 and 2, depending on the iteration); otherwise the hypothesis is discarded. f(j) is defined as follows:

$$
f(j) = \begin{cases}
6j + 0.5\,\bigl(1 + (j - 6)^2\bigr) & \text{if } j > 6,\\
6j & \text{if } 2 \le j \le 6,\\
5 & \text{if } j = 1,\\
-1 & \text{if } j = 0.
\end{cases}
$$

The rationale of this score is to favor surface contact and to penalize (minor) penetrations. It indicates the extent of the contact interface between the molecules, with some allowance for penetrations due to error. After evaluation of the proposed hypotheses, we keep only those with scores above two-thirds of the highest-scoring solution.
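The scoring rule and the two-thirds cut-off can be written down directly. The sketch below is only an illustration: the function names are hypothetical, the default interior threshold of 2 is one value from the stated range, and the branch for j > 6 follows our reading of the printed formula.

```python
def f(j):
    """Penalty for j ligand surface dots falling into intermediate voxels."""
    if j == 0:
        return -1.0          # no intermediate contact: small bonus
    if j == 1:
        return 5.0
    if j <= 6:
        return 6.0 * j
    return 6.0 * j + 0.5 * (1 + (j - 6) ** 2)  # assumed reading of the j > 6 branch


def contact_score(s, j, i, max_interior=2):
    """score = s - f(j) - 10 i, or None when the hypothesis is discarded (i too large)."""
    if i >= max_interior:
        return None
    return s - f(j) - 10 * i


def keep_top(scores, fraction=2 / 3):
    """Keep only scores above `fraction` of the highest-scoring solution."""
    best = max(scores)
    return [x for x in scores if x > fraction * best]
```

For instance, a hypothesis with 100 surface dots, 3 intermediate dots and 1 interior dot would receive 100 - 18 - 10 = 72 under this reading.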

Acknowledgements

We thank Raquel Norel for the code of the surface contact scoring routine used in this work. We thank Drs David Covell, Robert Jernigan and, in particular, Jacob Maizel, for helpful discussions, encouragement and interest. We thank the personnel at the Frederick Cancer Research and Development Center for their assistance. The research of R.N. has been sponsored by the National Cancer Institute, DHHS, under Contract no. 1-CO-74102 with Program Resources, Inc. The research of H.L.W. has been supported, in part, by a grant from the Israel Science Foundation administered by the Israel Academy of Sciences. The research of R.N. in Israel has been supported in part by grant no. 91-00219 from the US-Israel Binational Science Foundation (BSF), and by a grant from the Israel Science Foundation administered by the Israel Academy of Sciences. Figure 4 was generated with Insight II (Biosym, San Diego, CA, USA). Figure 2 was generated with Quanta (MSI, St. Louis, MO, USA). This work formed part of the PhD thesis of D.F., Tel Aviv University. The contents of this publication do not necessarily reflect the views or policies of the DHHS, nor does mention of trade names, commercial products, or organization imply endorsement by the U.S. Government.

References

Bachar, O., Fischer, D., Nussinov, R. & Wolfson, H. J. (1993). A Computer Vision based technique for 3-D sequence independent structural comparison of proteins. Protein Eng. 6, 279-288.

Bacon, D. J. & Moult, J. (1992). Docking by least-squares fitting of molecular surface patterns. J. Mol. Biol. 225, 849-858.

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535-542.

Bone, R., Vacca, J. P., Anderson, P. S. & Holloway, M. K. (1991). X-ray crystal structure of the HIV protease complex with L-700,417, an inhibitor with pseudo C2 symmetry. J. Amer. Chem. Soc. 113, 9382-9384.

Brooks, B. R., Bruccoleri, R. E., Olafson, B. D., States, D. J., Swaminathan, S. & Karplus, M. (1983). CHARMM: a program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem. 4, 187-217.

Cherfils, J. & Janin, J. (1993). Protein docking algorithms: simulating molecular recognition. Curr. Opin. Struct. Biol. 3, 265-269.

Cherfils, J., Duquerroy, S. & Janin, J. (1991). Protein-protein recognition analyzed by docking simulations. Proteins: Struct. Funct. Genet. 11, 271-280.

Connolly, M. L. (1983). Analytical molecular surface calculation. J. Appl. Crystallogr. 16, 548-558.

Connolly, M. L. (1986). Shape complementarity at the hemoglobin α1β1 subunit interface. Biopolymers, 25, 1229-1247.

Fischer, D., Norel, R., Nussinov, R. & Wolfson, H. J. (1993). 3-D docking of protein molecules. In Combinatorial Pattern Matching, Lecture Notes in Computer Science 684, pp. 20-34, Springer Verlag, New York.

Fischer, D., Wolfson, H. J., Lin, S. L. & Nussinov, R. (1994). 3-D, sequence-order independent structural comparison of a serine protease against the crystallographic database reveals active site similarities: potential implications to evolution and to protein folding. Protein Sci. 3, 769-778.

Janin, J. & Chothia, C. (1990). The structure of protein-protein recognition sites. J. Biol. Chem. 265, 16027-16030.

Jiang, F. & Kim, S.-H. (1991). "Soft docking": matching of molecular surface cubes. J. Mol. Biol. 219, 79-102.

Katchalski-Katzir, E., Shariv, I., Eisenstein, M., Friesem, A. A., Aflalo, C. & Vakser, I. A. (1992). Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc. Natl Acad. Sci., U.S.A. 89, 2195-2199.

Lee, B. & Richards, F. M. (1971). Solvent accessibility of groups in proteins. J. Mol. Biol. 55, 379-400.

Lin, S. L., Nussinov, R., Fischer, D. & Wolfson, H. J. (1994). Molecular surface representation by sparse critical points. Proteins: Struct. Funct. Genet. 18, 94-101.

Norel, R., Fischer, D., Wolfson, H. J. & Nussinov, R. (1994a). Molecular surface recognition by a Computer Vision based technique. Protein Eng. 7, 39-46.

Norel, R., Lin, S. L., Wolfson, H. J. & Nussinov, R. (1994b). Shape complementarity at protein-protein interfaces. Biopolymers, 39, 933-940.

Nussinov, R. & Wolfson, H. J. (1991). Efficient detection of three-dimensional structural motifs in biological macromolecules by Computer Vision techniques. Proc. Natl Acad. Sci., U.S.A. 88, 10495-10499.

Pellegrini, M. & Doniach, S. (1993). Computer simulation of antibody binding specificity. Proteins: Struct. Funct. Genet. 15, 436-444.

Richards, F. M. (1977). Areas, volumes, packing and protein structure. Annu. Rev. Biophys. Bioeng. 6, 151-176.

Sandak, B., Nussinov, R. & Wolfson, H. J. (1995). 3-D flexible docking of molecules. CABIOS, in the Press.

Shoichet, B. K. & Kuntz, I. D. (1991). Protein docking and complementarity. J. Mol. Biol. 221, 79-102.

Walls, P. H. & Sternberg, J. E. (1992). New algorithm to model protein-protein recognition based on surface complementarity: applications to antibody-antigen docking. J. Mol. Biol. 228, 227-297.

Wang, H. (1991). Grid-search molecular accessible surface algorithm for solving the protein docking problem. J. Comp. Chem. 12, 746-750.

Appendix I

Pruning of the Critical Points

As described in the Methods section of the main text, we have analyzed three kinds of pruning operations: (1) using subsets of the face centers according to their original face types; (2) fusing the points that are close to each other at subatomic distances; and (3) removing the points that cover an area too small or too large.

In the first pruning operation, only caps were retained for the ligands, and pits for the receptors. This pruning reduced the number of points to about 1.0 to 2.0 points per surface atom. Although different pairings of points contribute to interface complementarity as well, we decided to focus on the cap-pit combination for our first crop of docking experiments, because the convex and concave faces they represented made a more predictable coupling. In the second pruning operation, pits closer than 1.5 Å apart were fused together into a single point. The 1.5 Å threshold is close to the radius of the smallest non-hydrogen organic atoms; therefore, only the pits clustered at subatomic closeness were combined. Caps were not subject to fusing because they were rarely in subatomic closeness. In the third pruning operation, we dismissed the caps whose size was less than ~5% of the probe ball surface area, and pits whose size was less than ~1% or greater than ~10% of the probe ball area. The dismissed caps were expected to be located at the bottom of dented areas of the surface, which were not much more than two atoms wide; this small area was either a shallow dent contributing negligibly to surface complementarity, or the bottom of a shaft along which bigger caps were preserved. For the pits, 10% of the probe ball area was near what could be covered by the biggest single pit, so that pits bigger than that were likely to result from too wide-spread fusing and should be discarded. The 1% lower bound removed small pits regarded as less constructive to surface complementarity.

The initial face centers of a molecule comprise approximately one cap, two pits and three belts per surface atom. Analysis of the pruning operations demonstrates that the first stage retains approximately one cap per surface atom for the ligand, and two pits per surface atom for the receptor. At the second stage, both are about 0.5 per surface atom. The number of pairs at the interface is in the upper tens to a couple of hundred at the first stage, while at the second stage it is from as few as five to a couple of dozen. Judging by the pair distance and normal angles at the two stages, there is no quality deterioration: they change little and in both directions. Using the pairs to compute a transformation to obtain the r.m.s.d. between ligand atoms and their transformed counterparts, we noted that at the first stage, with about 100 to 200 pair constraints, the r.m.s.d. is below 0.9 Å for all the 19 complexes, most below 0.5 Å. At the second stage, when pair constraints are much fewer, the r.m.s.d. of the interface atoms is below 0.9 Å while the all-atom r.m.s.d. is below 1.5 Å. These numbers clearly show that the pruning operations we employed can substantially reduce the size of the data structure, while maintaining a high quality for successful molecular recognition.
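The fusing and area-pruning operations can be illustrated with a short sketch. The thresholds are those quoted above; everything else is an assumption made for illustration, in particular the greedy clustering rule, the summing of areas on fusing, and the omission of the normals, none of which is specified in this appendix.

```python
import math

PROBE_RADIUS = 1.8                              # Å, rolling-ball radius from Methods
PROBE_AREA = 4 * math.pi * PROBE_RADIUS ** 2    # about 40.7 Å^2


def fuse(points, threshold=1.5):
    """Greedy fusing (assumed scheme): a point joins the first cluster whose
    running centroid is closer than `threshold`; areas are summed.

    points: [((x, y, z), area), ...]
    """
    clusters = []  # each entry: [sum_x, sum_y, sum_z, count, total_area]
    for (x, y, z), area in points:
        for c in clusters:
            centroid = (c[0] / c[3], c[1] / c[3], c[2] / c[3])
            if math.dist(centroid, (x, y, z)) < threshold:
                c[0] += x; c[1] += y; c[2] += z; c[3] += 1; c[4] += area
                break
        else:
            clusters.append([x, y, z, 1, area])
    return [((c[0] / c[3], c[1] / c[3], c[2] / c[3]), c[4]) for c in clusters]


def prune_by_area(points, min_frac, max_frac=None):
    """Keep points whose area, as a fraction of the probe ball area, lies in range."""
    kept = []
    for xyz, area in points:
        frac = area / PROBE_AREA
        if frac < min_frac or (max_frac is not None and frac > max_frac):
            continue
        kept.append((xyz, area))
    return kept
```

Under these assumptions, receptor pits would be processed as `prune_by_area(fuse(pits), 0.01, 0.10)` and ligand caps as `prune_by_area(caps, 0.05)`, so that an over-fused pit grows past the 10% bound and is discarded.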

Appendix II

Details of the Matching Algorithm

Here, we describe the additional criteria and constraints required during the matching algorithm (see Methods and Figure 5 in the main text).

Reference frame definition

A Cartesian reference frame can be defined unambiguously by two points, a and b, and a direction, d. To define such a coordinate frame, we choose two critical points a, b. We denote their respective normals as a_n and b_n. The direction d of the reference frame is chosen as the average of a_n and b_n. Not all pairs of critical points are used to build coordinate frames. The following distance and directional constraints are applied to a pair of points (either caps or pits) a, b (and their normals a_n and b_n) in order to qualify.
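One way to realize such a frame is sketched below. The text states only that two points and a direction suffice; the particular convention used here (origin at the midpoint of ab, x-axis along ab, z-axis along the part of d orthogonal to ab) is an assumption for illustration, as are the function names.

```python
import math


def sub(u, v):
    return tuple(p - q for p, q in zip(u, v))


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def unit(u):
    n = math.sqrt(dot(u, u))
    return tuple(p / n for p in u)


def reference_frame(a, b, a_n, b_n):
    """Build (origin, (x, y, z)) from points a, b and their normals a_n, b_n."""
    origin = tuple((p + q) / 2 for p, q in zip(a, b))
    x = unit(sub(b, a))
    d = tuple((p + q) / 2 for p, q in zip(a_n, b_n))   # average of the two normals
    # Remove the component of d along x; this fails if d is parallel to ab.
    z = unit(sub(d, tuple(dot(d, x) * c for c in x)))
    y = cross(z, x)                                     # right-handed: x × y = z
    return origin, (x, y, z)


def to_frame(frame, p):
    """Coordinates of point p in the given frame (rotation and translation invariant)."""
    origin, axes = frame
    rel = sub(p, origin)
    return tuple(dot(rel, axis) for axis in axes)
```

Applying the same rigid motion to a, b, their normals and a third point c leaves `to_frame(reference_frame(...), c)` unchanged, which is the invariance the hashing relies on.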

Distance constraints

The distance between points a and b should be within a given range:

$$d_{\min} < \mathrm{dist}(a, b) < d_{\max}$$

Here, d_min = 2.5 Å and d_max = 9.0 Å for the non-immunoglobulins. For the immunoglobulins, we used d_min = 6.0 Å and d_max = 13.0 Å.

Directional constraints

The orientation of the normals of a and b should satisfy the following.

Angle between normals

The angle formed between a_n and b_n should be below a given threshold:

$$\arccos(a_n \cdot b_n) < 1.2\ \text{radians}$$

Torsion angle

The torsion angle formed between a_n, a, b and b_n should be smaller than a given threshold:

$$\mathrm{torsion}(a_n, a, b, b_n) < 1.4\ \text{radians}$$

Angle between each normal and the vector ab

Let x be the unit vector in the direction of ab. Then the normals should satisfy:

$$0.87\ \text{radians} < \arccos(a_n \cdot x),\ \arccos(b_n \cdot x) < 2.4\ \text{radians}$$

and

$$|\arccos(a_n \cdot x) - \arccos(b_n \cdot x)| < 1.2\ \text{radians}$$
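Taken together, the distance and directional constraints amount to a simple predicate on a pair of critical points. The sketch below is illustrative only; the function names are hypothetical, and the torsion is computed as the unsigned angle between the two normals after projection onto the plane perpendicular to ab, which is an assumed definition.

```python
import math


def sub(u, v):
    return tuple(p - q for p, q in zip(u, v))


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def angle(u, v):
    """Angle between two vectors, in radians."""
    c = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, c)))


def torsion(a_n, a, b, b_n):
    """Unsigned torsion about ab (assumed definition): angle between the
    projections of a_n and b_n onto the plane perpendicular to ab."""
    x = sub(b, a)
    xx = dot(x, x)
    pa = sub(a_n, tuple(dot(a_n, x) / xx * c for c in x))
    pb = sub(b_n, tuple(dot(b_n, x) / xx * c for c in x))
    return angle(pa, pb)


def qualifies(a, b, a_n, b_n, d_min=2.5, d_max=9.0):
    """True if the pair (a, b) may be used to build a reference frame."""
    ab = sub(b, a)
    if not d_min < norm(ab) < d_max:
        return False
    if angle(a_n, b_n) >= 1.2:
        return False
    ta, tb = angle(a_n, ab), angle(b_n, ab)
    if not (0.87 < ta < 2.4 and 0.87 < tb < 2.4):
        return False
    if abs(ta - tb) >= 1.2:
        return False
    # Checked last: the bounds on ta and tb guarantee non-zero projections.
    return torsion(a_n, a, b, b_n) < 1.4
```

The defaults correspond to the non-immunoglobulin values; for the immunoglobulins one would call `qualifies(..., d_min=6.0, d_max=13.0)`.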

Vicinity Constraints

Given a reference frame r.f.(a, b), built on the critical points a and b (and the average of their normals), we can represent any point in three dimensions by its coordinates in r.f.(a, b). For each r.f.(a, b), we compute the coordinates in r.f.(a, b) of only those points c for which:

$$\mathrm{dist}(a, c),\ \mathrm{dist}(b, c) < d_{\mathrm{vic}}$$

Here, we used d_vic = 15.0 Å for the non-immunoglobulins. For the immunoglobulins, we used d_vic = 18.0 Å.

Voting Constraints

Given a receptor reference frame based on a pair of points (r1, r2), denoted r.f.(r1, r2), the coordinates of all the receptor critical points satisfying the above-mentioned constraints are computed in this frame. Let us denote the coordinates of a current receptor critical point r3 as (x, y, z). The table built in the preprocessing stage is accessed to extract all the records stored at addresses of the form (a, b, c), where (a, b, c) is in the range (x ± ε, y ± ε, z ± ε); here, ε = 1.5 (see Figure 5 of the main text). Each of these hash table triplets (a, b, c) represents the coordinates of a ligand point in a given ligand reference frame. Consider a record stored at an address with key (a, b, c). This record was computed for a ligand point l3 in a ligand reference frame r.f.(l1, l2). This record contributes one vote to the match between the current receptor reference frame r.f.(r1, r2) and the ligand reference frame r.f.(l1, l2), since the coordinates of r3 in r.f.(r1, r2) are similar to those of l3 in r.f.(l1, l2). Before a vote is tallied for such a match, the following conditions must be met.

Distance constraint

The lengths of each of the sides of the triangle r1, r2, r3 should be similar to the corresponding lengths of the sides of the triangle l1, l2, l3. Here, we allow a length difference of 2 Å:

$$|\mathrm{dist}(r_i, r_j) - \mathrm{dist}(l_i, l_j)| < 2\ \text{Å}, \quad \text{for } 1 \le i, j \le 3$$

Angular constraints

The following conditions must be satisfied.

Pairwise normal angles

Denote the normal associated with point c by n_c. Then, the angle between any two of n_l1, n_l2 and n_l3 should be similar to the corresponding angle between any two of n_r1, n_r2 and n_r3:

$$|\arccos(n_{l_i} \cdot n_{l_j}) - \arccos(n_{r_i} \cdot n_{r_j})| < 0.8\ \text{radian}, \quad \text{for } 1 \le i, j \le 3$$

Angle between normals and the reference frame axes

Let x_l be the unit vector in the direction of l1l2, and x_r be the unit vector in the direction of r1r2. The angle between n_l1 (and n_l2) and x_l should be similar to the angle between n_r1 (and n_r2) and x_r:

$$|\arccos(n_{l_i} \cdot x_l) - \arccos(n_{r_i} \cdot x_r)| < 0.8\ \text{radian}, \quad \text{for } i = 1, 2$$

Torsion angles

The torsion angles formed by any two of r1, r2 and r3 (and their corresponding normals) should be similar to the corresponding torsion angles formed by any two of l1, l2 and l3:

$$|\mathrm{torsion}(n_{l_i}, l_i, l_j, n_{l_j}) - \mathrm{torsion}(n_{r_i}, r_i, r_j, n_{r_j})| < 0.8\ \text{radian}, \quad \text{for } i, j = 1, 2, 3$$

Obtaining A Match

After the voting procedure is completed for a receptor reference frame r.f.(r1, r2), the ligand reference frame vote counters are inspected to find all ligand r.f.(l1, l2) with at least MinVotes votes. MinVotes varies in our examples from 7 to 13 (see Table 3 of the main text). For each such r.f.(l1, l2) scoring a large number of votes, we compute the match. Since, along with each vote, we store the pair of matching points (r3, l3), all such matching point pairs and the pairs (r1, l1), (r2, l2) are used to compute the transformation achieving the best least-squares fit after superposition.

Verifying the Match

Given a list of matching ligand-receptor critical points ("seed match") and the transformation that best superimposes these points, we verify and "optimize" the match as follows. (1) The solution is checked by applying the surface contact scoring procedure described in Methods in the main text. If the score is below a given threshold, the verification stops. Otherwise, (2) each ligand critical point belonging to the seed match is transformed and both the distance to its matching receptor point and the angle between their normals are required to be within given thresholds. Here, we used a maximum allowed distance of about 2.0 Å and a maximum difference in the orientation of the normals of 1.0 radian. If a pair of matching points does not satisfy these conditions, it is removed from the seed match. If at least one pair was removed, the transformation is re-computed with the remaining pairs, and the surface contact scoring function is applied again. If the score is below a slightly tighter threshold, the verification stops. Otherwise, (3) "extend" the match to include additional matching pairs (ligand point, receptor point). This is done by the following iterative process.

(1) Transform the ligand according to the current transformation.

(2) For each transformed ligand point, find (using a grid) the receptor points that are at most 2.0 Å away and whose normals deviate by an angle of at most 1.0 radian. A receptor point satisfying these conditions is paired with the ligand point and added to the seed match.

(3) Compute the new transformation.

(4) Apply the surface contact scoring function. If the score is below a given threshold, end the verification.

(5) Otherwise, go to step (1).

Two iterations of the above procedure are applied. The scoring thresholds are increased from iteration to iteration. If the whole verification process is executed (i.e. the scoring function returns acceptable scores each time), up to four transformations are obtained: the initial one (i.e. that of the original seed match), the one after removal of "bad pairs" (if such were found), and two from the above iterative procedure.
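The removal of "bad pairs" in step (2) of the verification can be sketched as a filter over the seed match. This is an illustration under stated assumptions: the transformation is represented as a 3 × 3 rotation matrix (tuple of rows) plus a translation, the normals are assumed to be stored so that well-matched ligand and receptor normals are parallel, and the function names are hypothetical.

```python
import math


def rotate(rot, v):
    """Apply a 3x3 rotation matrix (tuple of rows) to vector v."""
    return tuple(sum(r * c for r, c in zip(row, v)) for row in rot)


def transform(rot, trans, p):
    """Apply rotation followed by translation to point p."""
    return tuple(c + t for c, t in zip(rotate(rot, p), trans))


def angle(u, v):
    """Angle between two vectors, in radians."""
    c = sum(p * q for p, q in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
    return math.acos(max(-1.0, min(1.0, c)))


def filter_pairs(pairs, rot, trans, max_dist=2.0, max_angle=1.0):
    """Keep the seed-match pairs that survive the distance and normal checks.

    pairs: [((lig_xyz, lig_normal), (rec_xyz, rec_normal)), ...]
    """
    kept = []
    for (lp, ln), (rp, rn) in pairs:
        moved_point = transform(rot, trans, lp)
        moved_normal = rotate(rot, ln)      # normals rotate but do not translate
        if math.dist(moved_point, rp) <= max_dist and angle(moved_normal, rn) <= max_angle:
            kept.append(((lp, ln), (rp, rn)))
    return kept
```

If the returned list is shorter than the input, the transformation would be re-computed from the surviving pairs and re-scored, as described in step (2).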

Edited by R. Huber

(Received 30 August 1994; accepted 31 January 1995)