Prolog representation of molecular structures and pattern recognition


Journal of Molecular Structure (Theochem), 282 (1993) 175-185
0166-1280/93/$06.00 © 1993 - Elsevier Science Publishers B.V., Amsterdam


Edgardo Garcia *, Luis Miguel Reyes

Lab. de Quimica Computational, Dep. de Quimica, Universidade de Brasilia, 70910, Brasilia DF, Brazil

(Received 1 November 1991)

Abstract

Research on the representation of chemical structures in computers has been performed since the 1970s, with two central goals: information manipulation in chemical databases and computer-assisted molecular synthesis. This work is concerned with the latter subject. We show the need for a general, flexible and efficient form of representation based on weighted graphs. Some algorithms for structural pattern recognition, covering functional groups, reaction sites, rings, isomorphism and topological symmetry, have been developed. These algorithms were implemented in Prolog, a declarative language, chosen because of its facilities for symbolic computation and its natural backtracking capability. With the use of this representation we have created a working environment that is simple and effective for the automatic manipulation of chemical structures. This framework also allows easy transformations between different representations, such as connectivity matrices, line notation or topological codes based on path counts, among others, allowing exploration of the particular advantages of each.

Introduction

To be able to manipulate chemical structures by computers we must first find a way of representing them. The choice of representation has a strong impact on the flexibility and ease of implementation of the algorithms used. This choice usually depends on the purpose to which it is directed. The use of strings (alphanumeric sequences) as names of chemical structures is preferred in chemical databases. The Wiswesser line notation [1] is an example of this kind of representation, in which a set of naming rules is used to obtain a unique name for a structure. This notation allows the direct comparison of structures by comparing their respective names.

Methods that rely on unique names are not well suited for the automatic generation and manipulation of chemical structures that is necessary in synthesis and reaction design programs. For this application the graph representation of molecules is preferred. Connectivity matrices and ordinary matrix operations are commonly employed to represent and manipulate graphs.

* Corresponding author - permanent address: Edgardo Garcia, SQN 316 bloco A ap. 501, Brasilia DF, 70775, Brazil.

Most algorithms for structure manipulation and characterization described in the literature are written in imperative languages like Fortran. Those languages are not designed to tackle symbolic problems and for that reason the program codes are long and difficult to follow. In this article we show how Prolog, a declarative language, can be used to describe molecular structures and how the notation used facilitates the implementation of some algorithms.

Graph representation in Prolog lists

Graphs have been used by many researchers [4-7] as a model to represent chemical structures and reactions. Weighted graphs can describe molecules, radicals, ions and stereochemical structures. Graph "nodes" (vertices) correspond to atoms and "edges" to classical chemical bonds (single, double, ...). We use "self-incident edges" (loops) to represent ions, radicals and chiral centers. Stereochemical relationships between two atoms (trans, cis, ...) are described by "special edges" that connect the atoms involved.

[Figure: drawings of graphs A to E]

Graph A: [b(1,br,1,c,2), b(2,c,2,c,3), b(1,c,3,cl,4), b(trans,br,1,cl,4)]
Graph B: [b(2,c,1,c,2), b(1,c,2,c,3), b(1,c,2,o,4), b(-,o,4,o,4)]
Graph C: [b(1,c,1,c,2), b(1,c,2,c,3), b(1,c,2,c,4), b(+,c,2,c,2)]
Graph D: [b(...,c,1,o,2), b(*,o,2,o,2)]
Graph E: [b(1,c,1,h,...), b(1,c,1,oh,2), b(1,c,1,br,...), b(1,c,1,c,3), b(s,c,1,c,1)]

Fig. 1. Examples of graphs and Prolog lists representations for some molecules.

Graphs can be directly represented in Prolog as a list of bonds. A Prolog "term" [8-11] with five arguments describes a bond and its type as

    b(T, Ai, Li, Aj, Lj)

where
T = 1, 2, 3, trans, cis, r, s, etc. (type of bond);
Ai, Aj = c, o, cl, n, br, oh, etc. (type of atom or group);
Li, Lj = 1, 2, 3, ..., n (arbitrarily assigned atom labels).

Examples of graphs and Prolog lists representations for some molecules are shown in Fig. 1. (All graphs used in Figs. 1-7 are defined in the program listed in the Appendix.)

For brevity, we have omitted the hydrogen atoms bonded to carbon atoms. The term lists in Fig. 1 are "instantiated": their arguments are constants, expressed in Prolog by numbers, special symbols (*, -, +, etc.) or names beginning with a lower-case letter. Upper-case letters, or names that start with one, denote variables. Arguments given as variables are "uninstantiated". The elements of a list are separated by commas and enclosed in square brackets.

The use of graphs with some or all of their arguments uninstantiated allows great flexibility in the pattern recognition algorithms, as explained in the following sections.
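As a small illustration of how instantiated and uninstantiated arguments interact, the sketch below stores graph A of Fig. 1 as a fact and queries it with a partly uninstantiated bond term. The names graph/2 and bonds_of/3 are illustrative assumptions, not part of the program in the Appendix, and a standard Prolog such as SWI-Prolog is assumed.

```prolog
% Graph A of Fig. 1 as an instantiated bond list (hydrogens on carbon omitted).
% graph/2 and bonds_of/3 are hypothetical helper names used only in this sketch.
graph(a, [b(1, br, 1, c, 2),
          b(2, c, 2, c, 3),
          b(1, c, 3, cl, 4),
          b(trans, br, 1, cl, 4)]).

% bonds_of(+Name, +Atom, -Types): types of all bonds whose first atom is Atom.
% The bond pattern is partly uninstantiated: only the atom symbol is fixed.
bonds_of(Name, Atom, Types) :-
    graph(Name, G),
    findall(T, member(b(T, Atom, _, _, _), G), Types).
```

For example, the query `?- graph(a, G), member(b(2, X, I, Y, J), G).` finds the double bond and binds X = c, I = 2, Y = c, J = 3, while `?- bonds_of(a, br, Ts).` gives Ts = [1, trans].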

Isomorphism and substructures

Whatever the representation chosen, we face the problems of substructures and isomorphism, i.e. finding a way to compare structures. Molecules that differ only by their labels, keeping the same topology and stereochemistry, are called isomorphic.

If we take as an example the molecules F and G of Fig. 2, we can see that these could be labeled in many different ways. In fact, there are n! distinct ways of labeling an n-atom molecule. Although molecules F and G are structurally identical (isomorphic), we cannot simply compare their lists directly because the lists are different. Given two structures we can check the isomorphism between them, or find substructures, by generating all the n! permutations of the labels of one structure and comparing each of them with the other structure. This procedure becomes impractical as the size of the molecules grows, owing to its factorial behavior.

[Figure: drawings of graphs F and G]

Graph F: [b(1,c,1,c,2), b(1,c,2,c,3), b(2,c,3,o,4), b(1,c,3,o,5), b(1,o,5,c,6)]
Graph G: [b(1,c,1,o,2), b(1,o,2,c,3), b(2,c,3,o,4), b(1,c,3,c,5), b(1,c,5,c,6)]

Fig. 2. Identical molecules with labels randomly assigned. There are 6! = 720 different ways of labeling the molecule.

The matter of isomorphism and substructures can be handled in two general ways.

(1) Applying a set of naming rules that guarantees a unique name for a molecule, no matter how it is labeled.

The IUPAC name is an example of this procedure, but it does not always guarantee a unique name and it is not useful for automation because substructures (functional groups, rings, etc.) must be recognized before the naming rules can be applied. Although much more efficient than the IUPAC method, the Wiswesser notation uses a long set of naming rules and also requires previous substructure recognition.

(2) Utilizing an algorithm that detects isomorphism, relying solely on the connectivity of the molecule.

Discarding the need for previous pattern recognition is very important in automatic manipulation, because pattern detection is a special case of checking for isomorphism. In other words, once the algorithm for isomorphism has been obtained, we can directly derive another algorithm for substructure checking.

Our choice is for the second option, not only because it adapts better to graph representation, but also because the detection of substructures must be totally automatic and rely only on structural information and not on a set of rules.

We now have the problem of finding a way of checking isomorphism and substructures for general applications. Many solutions have been presented in the literature. Most of these are based on obtaining unique and unambiguous names (canonical representations) from molecular topology [12-14], usually represented by connectivity matrices. Molecular codes are obtained by generating all the paths of a graph, and they depend only on the graph's topology. These codes can be used not only for isomorphism, but also for substructures [15] and in structural similarity measures with many applications in quantitative structure-activity relationships [16].

Our approach does not require a canonical representation because it is based on direct structure comparison between the graphs. We can think of the mechanism as a way of topologically overlaying both structures. This overlaying can be complete when treating isomorphic structures, or partial in the case of a substructure. The following program EQUAL_GRAPH is an adaptation of a procedure to verify set equality described in the specialized literature [9].

    equal_graph([], []).
    equal_graph(A, [b(T, A1, L1, A2, L2) | Rest]) :-
        (delete(b(T, A1, L1, A2, L2), A, C) ; delete(b(T, A2, L2, A1, L1), A, C)),
        equal_graph(C, Rest).

The predicate DELETE is defined in the Appendix. In the third line of the program, DELETE is used twice to account for the non-directional character of the chemical bond. For instance, a C1-O2 bond is identical to an O2-C1 bond.

The predicate EQUAL_GRAPH(M1, M2) compares each bond of a molecule M1 with the bonds of the other molecule M2. If the bonds are equal it withdraws them from both lists (M1, M2) and continues the comparing process. If at the end of the process no bonds are left (which is expressed by an empty list []) then the predicate EQUAL_GRAPH was successful (M1 and M2 are isomorphic). If not, M1 and M2 are not isomorphic and the predicate fails.
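The bond-by-bond comparison described above can be tried out with the following minimal sketch, assuming SWI-Prolog. The helper take/3 is a stand-in for the paper's DELETE, renamed here as an assumption to avoid confusion with delete/3 of library(lists), which has a different argument order; graph_f/1 is a hypothetical fact holding graph F of Fig. 2.

```prolog
% take(?X, +List, -Rest): remove one occurrence of X from List.
% Stand-in for the paper's DELETE (hypothetical name).
take(X, [X|T], T).
take(X, [Y|T], [Y|T1]) :- take(X, T, T1).

% equal_graph(+G1, ?G2): every bond of G2 matches, in either direction,
% a distinct bond of G1, and no bond of G1 is left over.
equal_graph([], []).
equal_graph(A, [b(T, A1, L1, A2, L2)|Rest]) :-
    (   take(b(T, A1, L1, A2, L2), A, C)
    ;   take(b(T, A2, L2, A1, L1), A, C)
    ),
    equal_graph(C, Rest).

% Graph F of Fig. 2, fully instantiated.
graph_f([b(1,c,1,c,2), b(1,c,2,c,3), b(2,c,3,o,4), b(1,c,3,o,5), b(1,o,5,c,6)]).
```

Comparing F with graph G written with variable labels, `?- graph_f(F), UG = [b(1,c,X1,o,X2), b(1,o,X2,c,X3), b(2,c,X3,o,X4), b(1,c,X3,c,X5), b(1,c,X5,c,X6)], equal_graph(F, UG).`, succeeds after some backtracking and binds X1 = 6, X2 = 5, X3 = 3, X4 = 4, X5 = 2, X6 = 1, which is the relabeling that overlays G on F.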

Modifying the first line of the program so that it overlays a substructure on a larger structure results in the following program for checking substructures.

    substr(_, []).
    substr(A, [b(T, A1, L1, A2, L2) | Rest]) :-
        (delete(b(T, A1, L1, A2, L2), A, C) ; delete(b(T, A2, L2, A1, L1), A, C)),
        substr(C, Rest).

The predicate substr(M, SubM) will be successful if "SubM" is a substructure of "M", i.e. if "SubM" can be reduced to an empty list.

The method used is that of checking each bond through recursion and backtracking. Backtracking is inherent to Prolog so it does not need to be programmed, resulting in short and simple program codes. Methods for improving the predicates EQUAL_GRAPH and SUBSTR, and their use in handling chemical databases, have already been demonstrated [17]. However, these predicates, as listed above and in Ref. 17, have serious limitations that make them fail in many situations. The flaws are due to the way Prolog does variable instantiation, i.e. the way constants are attributed to variables. For the programs to work correctly, it is necessary that at least one of the two input structures has its labels uninstantiated. From now on, we call a graph with uninstantiated labels an "uninstantiated graph". If both graphs in the procedure are uninstantiated, we may arrive at a wrong answer, for example that the structures are isomorphic when in fact they are not. This behavior is observed in the pair of structures of Fig. 3.

Although we can ascribe only one constant to a variable, nothing prevents ascribing this same constant to two (or more) variables. Two variables having distinct names does not imply that they must have distinct values. This means that two or more variables can "fuse" into one; in other words they can become equal, resulting in the "fusion" of nodes (atoms). This is exactly what occurs in the isomorphism test between cubane and the Möbius cubane (graphs H and I of Fig. 3). The node I1 fuses with I2, and J1 fuses with J2 (I1 = I2, J1 = J2), making the structures appear isomorphic. Similarly, in graphs J and K the fusion of nodes I2 and I3, and of nodes J2 and J3, leads to a false isomorphism.
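Variable fusion can be reproduced on a much smaller pair than the graphs of Fig. 3. The sketch below, assuming SWI-Prolog, uses the carbon skeletons of butane and isobutane, which are our own illustrative choice and not an example from the paper; take/3 is again a hypothetical stand-in for DELETE.

```prolog
take(X, [X|T], T).
take(X, [Y|T], [Y|T1]) :- take(X, T, T1).

equal_graph([], []).
equal_graph(A, [b(T, A1, L1, A2, L2)|Rest]) :-
    (   take(b(T, A1, L1, A2, L2), A, C)
    ;   take(b(T, A2, L2, A1, L1), A, C)
    ),
    equal_graph(C, Rest).

% Carbon skeleton of butane: a chain of four atoms (instantiated labels).
butane([b(1,c,1,c,2), b(1,c,2,c,3), b(1,c,3,c,4)]).

% The same chain with variable labels.
butane_uninst([b(1,c,_A,c,B), b(1,c,B,c,C), b(1,c,C,c,_D)]).

% Carbon skeleton of isobutane: three atoms bonded to a central atom W.
isobutane_uninst([b(1,c,W,c,_X), b(1,c,W,c,_Y), b(1,c,W,c,_Z)]).
```

With both graphs uninstantiated, `?- butane_uninst(Bu), isobutane_uninst(Iso), equal_graph(Bu, Iso).` succeeds, because the first three chain atoms are unified with each other and with the central atom W: a false isomorphism. With the chain instantiated, `?- butane(Bu), isobutane_uninst(Iso), equal_graph(Bu, Iso).` fails, as it should.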

A simple way of avoiding this is to use one structure instantiated and the other not. In the case of the substructure procedure, the larger structure must be instantiated. This works perfectly well for connected graphs, i.e. graphs in which all nodes are bonded to a common structure. Nevertheless, when we use a non-connected uninstantiated graph such as structure M of Fig. 4 and compare it with an instantiated one such as graph L, variable fusion can still occur, resulting once again in false isomorphism. When using EQUAL_GRAPH(L, M) we obtain an affirmative answer due to the fusion of nodes L4 and L2, which are both instantiated with node 2.

[Figure: drawings of graphs H, I, J and K]

Fig. 3. Uninstantiated connected graphs.

This problem is easily solved by checking the "cardinality" of each graph, i.e. the number of nodes of each structure. Two structures can only be isomorphic if they have the same cardinality. This modification is included in the following program.

    isomorphic(A, B) :-
        labels(A, LA), labels(B, LB),
        length(LA, N), length(LB, N),
        equal_graph(A, B).

The predicate LABELS is defined in the Appendix and LENGTH is a built-in Prolog predicate. In the program's second and third lines, the cardinality is checked before the isomorphism is tested with EQUAL_GRAPH.
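The effect of the cardinality check on graphs L and M can be sketched as follows, assuming SWI-Prolog. Here labels/2 is written with sort/2, which removes only identical elements and therefore keeps distinct variables apart; this is our own formulation and not the Appendix version. The bond lists follow our reading of the partly garbled Appendix listing, and take/3 is a hypothetical stand-in for DELETE.

```prolog
take(X, [X|T], T).
take(X, [Y|T], [Y|T1]) :- take(X, T, T1).

equal_graph([], []).
equal_graph(A, [b(T, A1, L1, A2, L2)|Rest]) :-
    (   take(b(T, A1, L1, A2, L2), A, C)
    ;   take(b(T, A2, L2, A1, L1), A, C)
    ),
    equal_graph(C, Rest).

% labels(+Graph, -Labels): sorted list of the distinct node labels.
% sort/2 drops duplicates by identity, so two different variables both stay.
labels(Graph, Labels) :-
    raw_labels(Graph, Raw),
    sort(Raw, Labels).

raw_labels([], []).
raw_labels([b(_, _, L1, _, L2)|R], [L1, L2|LR]) :- raw_labels(R, LR).

% isomorphic(+A, ?B): same number of nodes, then bond-by-bond overlay.
isomorphic(A, B) :-
    labels(A, LA), labels(B, LB),
    length(LA, N), length(LB, N),
    equal_graph(A, B).

% Graph L (instantiated, 4 nodes) and graph M (uninstantiated, 5 nodes).
graph_l([b(1,c,1,c,2), b(1,c,2,c,3), b(1,c,2,oh,4)]).
graph_m([b(1,c,_L1,c,L2), b(1,c,L2,c,_L3), b(1,c,_L4,oh,_L5)]).
```

With these definitions `?- graph_l(L), graph_m(M), equal_graph(L, M).` succeeds through fusion, whereas `?- graph_l(L), graph_m(M), isomorphic(L, M).` fails because the node counts are 4 and 5.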

In the case of substructures, a similar problem occurs with graphs N and O of Fig. 4. Although O is a substructure of N, the fusion of nodes L5 and L2 in node 3 results in graph P, which is a wrong answer.

[Figure: drawings of graphs L, M, N, O and P]

Fig. 4. Instantiated (L, N, P) and uninstantiated (M, O) graphs. Graphs M and L are considered isomorphic by program EQUAL_GRAPH due to the instantiation of nodes L4 and L2 with node 2. The wrong instantiation of graph O in graph N (as done by program SUBSTR) results in graph P, which is not isomorphic with graph O.

The problem is solved by checking the graph's cardinality before and after using SUBSTR. An uninstantiated graph SG will only be a subgraph of an instantiated graph G if no fusion between variables occurs, i.e. if the cardinality is maintained during the instantiation. The following program includes the cardinality check.

    substructure(G, SG) :-
        labels(SG, L1), length(L1, N),
        substr(G, SG),
        labels(SG, L2), length(L2, N).
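A runnable sketch of this check, assuming SWI-Prolog, is given below. The molecules (ethanol and 1-propanol skeletons, with oh treated as a group node) and the disconnected pattern are our own illustrative choices; take/3 and the sort-based labels/2 are assumptions that stand in for the Appendix predicates DELETE and LABELS.

```prolog
take(X, [X|T], T).
take(X, [Y|T], [Y|T1]) :- take(X, T, T1).

% substr(+G, ?SG): every bond of SG matches a distinct bond of G.
substr(_, []).
substr(A, [b(T, A1, L1, A2, L2)|Rest]) :-
    (   take(b(T, A1, L1, A2, L2), A, C)
    ;   take(b(T, A2, L2, A1, L1), A, C)
    ),
    substr(C, Rest).

labels(Graph, Labels) :- raw_labels(Graph, Raw), sort(Raw, Labels).

raw_labels([], []).
raw_labels([b(_, _, L1, _, L2)|R], [L1, L2|LR]) :- raw_labels(R, LR).

% substructure(+G, ?SG): as substr/2, but the number of distinct nodes of SG
% must be the same before and after instantiation (no fused variables).
substructure(G, SG) :-
    labels(SG, L1), length(L1, N),
    substr(G, SG),
    labels(SG, L2), length(L2, N).

ethanol([b(1,c,1,c,2), b(1,c,2,oh,3)]).
propanol([b(1,c,1,c,2), b(1,c,2,c,3), b(1,c,3,oh,4)]).
```

The disconnected pattern `[b(1,c,A,c,B), b(1,c,C,oh,D)]` asks for a C-C bond and a separate C-OH bond, four nodes in all. On ethanol, substr/2 accepts it by fusing B and C, while substructure/2 rejects it; on 1-propanol, substructure/2 succeeds with A = 1, B = 2, C = 3, D = 4.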

In Figs. 1-4 we used instantiated and uninstantiated graphs. We saw that in the isomorphism and substructure predicates, one of the graphs must be instantiated and the other not. On many occasions it is useful to compare structures with constant labels, but impractical to define each molecule twice - one instantiated and the other uninstantiated. In order to define all graphs in an instantiated form, a graph uninstantiator named UNINST_GRAPH was created. It is listed in the Appendix. This predicate transforms an instantiated structure into one with variable labels, maintaining the topology intact.

For example, to check the isomorphism between graphs F and G of Fig. 2, we must uninstantiate one of them before we use the isomorphism checker. Utilizing uninst_graph(G, UG) we obtain the form UG of graph G, whose list is shown below.

    UG = [b(1, c, X1, o, X2), b(1, o, X2, c, X3),
          b(2, c, X3, o, X4), b(1, c, X3, c, X5),
          b(1, c, X5, c, X6)]

We can now apply the predicate isomorphic(F, UG), which gives "yes" as an answer: the isomorphism between F and G is confirmed.
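The Appendix version of UNINST_GRAPH relies on string conversion predicates of the Prolog system used by the authors. As a hedged alternative for a standard Prolog such as SWI-Prolog, the sketch below obtains the same result by threading a label-to-variable map through the bond list; the helper names uninst/3 and label_var/4 are our own assumptions.

```prolog
% uninst_graph(+G, -UG): UG is G with every distinct label replaced by a
% fresh variable; equal labels share the same variable.
uninst_graph(G, UG) :- uninst(G, [], UG).

% uninst(+Bonds, +Map, -UBonds): Map is a list of Label-Var pairs seen so far.
uninst([], _, []).
uninst([b(T, A1, L1, A2, L2)|R], Map0, [b(T, A1, V1, A2, V2)|UR]) :-
    label_var(L1, Map0, V1, Map1),
    label_var(L2, Map1, V2, Map2),
    uninst(R, Map2, UR).

% label_var(+Label, +Map0, -Var, -Map): reuse the variable already assigned
% to Label, or extend the map with a new pair.
label_var(L, Map, V, Map) :- memberchk(L-V, Map), !.
label_var(L, Map, V, [L-V|Map]).

% Graph G of Fig. 2.
graph_g([b(1,c,1,o,2), b(1,o,2,c,3), b(2,c,3,o,4), b(1,c,3,c,5), b(1,c,5,c,6)]).
```

The query `?- graph_g(G), uninst_graph(G, UG).` returns a list of the same shape as UG above, with six distinct variables in place of the labels 1 to 6. The sketch assumes that the labels of the input graph are constants.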

Applications: pattern recognition

The empirical rules used by chemists to develop synthetic routes, and to estimate relative reactivity and other molecular properties, are usually dependent on structural patterns, such as functional groups, rings, aromaticity, symmetry, reactive sites, and others. We show next how the previous programs can be utilized in the development of units for structural characteristics perception (perceptrons).

Functional groups and reaction sites

The substructure procedure can be directly used to detect functional groups in molecules. For instance, to determine whether there is a carbonyl group in molecule F of Fig. 2, we need only

    Carbonyl = [b(2, c, L1, o, L2)],
    substructure(F, Carbonyl).

To detect an ester group in the same molecule

    Ester = [b(1, c, L1, c, L2), b(2, c, L2, o, L3),
             b(1, c, L2, o, L4), b(1, o, L4, c, L5)],
    substructure(F, Ester).

We can put variables in the place of some atomic symbols and thus represent more generic structures

    Carbonyl = [b(1, R1, L1, c, L2), b(2, c, L2, o, L3),
                b(1, c, L2, R2, L4)],
    substructure(F, Carbonyl).

When the group defined as Carbonyl is overlaid, the variables R1 and R2 are instantiated with the corresponding atomic symbols of molecule F. Depending on the symbols attributed to these variables, we can tell whether the molecule F is a ketone, an ester, an aldehyde, an amide, or any other group that contains the carbonyl subgroup. If necessary, a further analysis of the atoms that are directly bonded to R1 and R2 will provide better information about the molecule's functional group. It is not difficult to construct functional group perceptrons using a substructure algorithm and a set of rules.
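A functional group perceptron of this kind might look like the sketch below, assuming SWI-Prolog. The rule table class/2 and the predicate carbonyl_class/2 are hypothetical and deliberately crude: they look only at the two atoms attached to the carbonyl carbon, whereas a real perceptron would need the further analysis mentioned above. The helpers take/3 and the sort-based labels/2 are stand-ins for the Appendix predicates.

```prolog
take(X, [X|T], T).
take(X, [Y|T], [Y|T1]) :- take(X, T, T1).

substr(_, []).
substr(A, [b(T, A1, L1, A2, L2)|Rest]) :-
    (   take(b(T, A1, L1, A2, L2), A, C)
    ;   take(b(T, A2, L2, A1, L1), A, C)
    ),
    substr(C, Rest).

labels(Graph, Labels) :- raw_labels(Graph, Raw), sort(Raw, Labels).

raw_labels([], []).
raw_labels([b(_, _, L1, _, L2)|R], [L1, L2|LR]) :- raw_labels(R, LR).

substructure(G, SG) :-
    labels(SG, L1), length(L1, N),
    substr(G, SG),
    labels(SG, L2), length(L2, N).

% carbonyl_class(+G, -Class): overlay the generic carbonyl pattern and
% classify by the two substituent symbols (sorted, so order does not matter).
carbonyl_class(G, Class) :-
    Carbonyl = [b(1, R1, _, c, L2), b(2, c, L2, o, _), b(1, c, L2, R2, _)],
    substructure(G, Carbonyl),
    msort([R1, R2], Pair),
    class(Pair, Class).

% Hypothetical rule table keyed on the sorted substituent symbols.
class([c, c], ketone).
class([c, o], ester).
class([c, oh], carboxylic_acid).
class([c, n], amide).

graph_f([b(1,c,1,c,2), b(1,c,2,c,3), b(2,c,3,o,4), b(1,c,3,o,5), b(1,o,5,c,6)]).
acetone([b(1,c,1,c,2), b(2,c,2,o,3), b(1,c,2,c,4)]).
```

For graph F of Fig. 2, `?- graph_f(F), once(carbonyl_class(F, C)).` gives C = ester, and for the acetone skeleton the same query gives C = ketone.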

The disconnected structures also work well with the substructure program, enabling reaction sites to be recognized. Reaction sites are topological patterns that can be used to represent specific reactions. These patterns correspond to the net structural bond change that occurs in the reaction. They represent the substructures in the reactant and product graphs that had their bonds modified. Figure 5 shows a Diels-Alder reaction, in which the reactants, represented by a disconnected graph Q, have as a reactive site the pattern S. For this Diels-Alder reaction to occur in a given graph, this graph must have the substructure S. In the retrosynthetic case, the pattern T (retron) must be recognized in the product molecule, as shown in graph R.

[Figure: drawings of graphs Q, R, S and T]

Fig. 5. Graphs of a Diels-Alder reaction (Q and R) and their respective reactive sites (S and T).

This kind of pattern representation allows us, through the variation of atoms and bonds, to handle reactions with different levels of generality. This is very useful in classifying chemical reactions and in the automatic search for synthetic routes.

Rings

Totally uninstantiated graphs, i.e. atoms, bonds and labels expressed by variables, can be used to detect purely topological patterns (such as rings) in which the type of atom and its bonds do not have immediate significance.

There are many algorithms for the recognition of rings in molecules [18-20]. Two general approaches to this problem are normally found:

(1) searching for all the rings of a graph;
(2) searching for synthetically important rings or for rings with no more than a specific number of atoms.

One way of obtaining all the rings is generating all the paths of a graph and verifying which ones are cyclic. However, in many cases we only want to know whether a molecule possesses a specific ring, or what rings with up to n atoms exist in the molecule.

The first case is directly solved by the substructure predicate. For instance, to know whether there is a ring of four carbon atoms united by single bonds in any molecule G, it is enough to utilize

    Ring = [b(1, c, L1, c, L2), b(1, c, L2, c, L3),
            b(1, c, L3, c, L4), b(1, c, L4, c, L1)],
    substructure(G, Ring).

If we want to look for any four-membered ring without considering the types of chemical bonds, we use

    Ring = [b(T1, A1, L1, A2, L2),
            b(T2, A2, L2, A3, L3),
            b(T3, A3, L3, A4, L4),
            b(T4, A4, L4, A1, L1)],
    substructure(G, Ring).

In the last example, the types of bonds and atoms were replaced by variables representing only the ring topology.
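The totally uninstantiated template can be exercised with the sketch below, assuming SWI-Prolog. The test molecules (a cyclobutanol skeleton and cyclopentane) and the names ring4/1 and has_ring4/1 are our own illustrative assumptions; take/3 and the sort-based labels/2 stand in for the Appendix predicates.

```prolog
take(X, [X|T], T).
take(X, [Y|T], [Y|T1]) :- take(X, T, T1).

substr(_, []).
substr(A, [b(T, A1, L1, A2, L2)|Rest]) :-
    (   take(b(T, A1, L1, A2, L2), A, C)
    ;   take(b(T, A2, L2, A1, L1), A, C)
    ),
    substr(C, Rest).

labels(Graph, Labels) :- raw_labels(Graph, Raw), sort(Raw, Labels).

raw_labels([], []).
raw_labels([b(_, _, L1, _, L2)|R], [L1, L2|LR]) :- raw_labels(R, LR).

substructure(G, SG) :-
    labels(SG, L1), length(L1, N),
    substr(G, SG),
    labels(SG, L2), length(L2, N).

% ring4(-Ring): four-membered ring template; bond types, atom symbols and
% labels are all variables, so only the topology is fixed.
ring4([b(_, A1, L1, A2, L2), b(_, A2, L2, A3, L3),
       b(_, A3, L3, A4, L4), b(_, A4, L4, A1, L1)]).

% has_ring4(+G): G contains at least one four-membered ring.
has_ring4(G) :- ring4(Ring), once(substructure(G, Ring)).

cyclobutanol([b(1,c,1,c,2), b(1,c,2,c,3), b(1,c,3,c,4), b(1,c,4,c,1), b(1,c,1,oh,5)]).
cyclopentane([b(1,c,1,c,2), b(1,c,2,c,3), b(1,c,3,c,4), b(1,c,4,c,5), b(1,c,5,c,1)]).
```

Here `?- cyclobutanol(G), has_ring4(G).` succeeds and `?- cyclopentane(G), has_ring4(G).` fails, since a five-membered cycle contains no closed sequence of four distinct bonds.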

With the above examples, we can imagine a method of recognizing rings that utilizes templates of all possible topologies. These templates would be defined in a database of totally uninstantiated rings. Then we would only need to compare each template with the structure of interest through the use of substructure. This method would have some disadvantages, such as the difficulty of describing a ring database with many templates, and the fact that any ring that is not described by our database will not be identified.

We developed a program GRAPH_RINGS that utilizes a set of basic uninstantiated templates and applies substructure several times, saving in memory the molecule's rings in the form of ordered lists of labels. A set of four basic templates is defined in the Appendix as ring_pattern(N, RP), where N is the number of atoms in the ring and RP is the totally uninstantiated list with the ring's topology. To find rings with more than seven atoms, we merely define the additional templates. This program has low efficiency because, for each ring of n atoms, it performs n redundant instantiations that, although not saved in memory, are unnecessary. However, its relative inefficiency is balanced by its great simplicity and flexibility.

Independently of the method used for finding the cycles in a graph, it is convenient to reduce the graph to an "inner graph" that contains only the ring nodes and ring-connector nodes. We utilize a trimming algorithm PRUNE_GRAPH before the ring detector for that purpose. These programs are shown below.

    prune_graph(G, PG) :-
        cut_loops(G, G1),
        cut_stereo(G1, G2),
        cut_ext_nodes(G2, PG).

    graph_rings(G, _RingList) :-
        prune_graph(G, G1),
        labels(G1, L), length(L, NL),
        ring_pattern(Size, Ring), Size =< NL,
        substructure(G1, Ring),
        labels(Ring, RL),
        not ring(RL),
        assertz(ring(RL)),
        fail.
    graph_rings(_G, RingList) :-
        findall(X, ring(X), RingList),
        abolish(ring/1), !.

findall, assertz and abolish are built-in predicates. The others are defined in the Appendix.

The algorithm PRUNE_GRAPH(G, PG) takes out all the special bonds using CUT_LOOPS and CUT_STEREO. The predicate CUT_EXT_NODES uses recursion to take out every atom that is bonded to only one neighbor (terminal nodes) until only atoms that belong to rings or atoms that connect rings remain (PG).
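The trimming of terminal nodes can be sketched as follows for instantiated graphs, assuming SWI-Prolog. The code follows the shape of the Appendix predicates CUT_EXT_NODES and BOND, with take/3 as a hypothetical stand-in for DELETE and `\+` in place of `not`; the example molecule is our own.

```prolog
take(X, [X|T], T).
take(X, [Y|T], [Y|T1]) :- take(X, T, T1).

% bond(?B, +G): bond B occurs in G, read in either direction.
bond(b(T, A1, L1, A2, L2), G) :- member(b(T, A1, L1, A2, L2), G).
bond(b(T, A1, L1, A2, L2), G) :- member(b(T, A2, L2, A1, L1), G).

% cut_ext_nodes(+G, -Inner): repeatedly remove a bond one of whose ends has
% no other bond in the graph, until no such bond is left.
cut_ext_nodes(G, Inner) :-
    take(b(_, _, L1, _, L2), G, R),
    (   \+ bond(b(_, _, L1, _, _), R)
    ;   \+ bond(b(_, _, L2, _, _), R)
    ),
    !,
    cut_ext_nodes(R, Inner).
cut_ext_nodes(G, G).

% Three-membered ring (1-2-3) carrying a two-atom side chain (3-4-5).
example([b(1,c,1,c,2), b(1,c,2,c,3), b(1,c,3,c,1), b(1,c,3,c,4), b(1,c,4,c,5)]).
```

The query `?- example(G), cut_ext_nodes(G, Inner).` removes the side chain from its free end inwards and leaves Inner = [b(1,c,1,c,2), b(1,c,2,c,3), b(1,c,3,c,1)]. An acyclic graph is reduced to the empty list.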

In Fig. 6 we give an example of two molecules, their internal structures and the number of detected rings.

Structural symmetry

The symmetry of a molecule can be analyzed if we obtain the topological classes to which its atoms belong. Two atoms are said to be structurally equivalent if they are of the same type and also have the same complete topological and sterical environment. In other words, two atoms are equivalent if they have the same neighbors, the same neighbors' neighbors, and so on.

[Figure: drawings of graphs U, V, W and X]

Fig. 6. After applying the program PRUNE_GRAPH on molecules U and W, their inner structures, V and X, are obtained. The GRAPH_RINGS program gives the following rings. Molecule V has one ring with five atoms, four rings with six atoms and one ring with seven atoms. Molecule X has one ring with four atoms, two rings with five atoms and one ring with six atoms.

The detection of equivalence is not always a simple task. In the Möbius cubane (graph I, Fig. 3) all atoms are topologically equivalent, which is not obvious from an inspection of the figure.

Attempts to classify atoms by approximate algorithms have been made. However, because these attempts only consider the local neighborhood, they cannot guarantee the total equivalence of atoms [21].

Obtained by considering all the paths of a structure, the atomic codes do consider the complete neighborhood. We can separate all the topological classes of a molecule by comparing its atomic codes. These codes are unique. Therefore, two atoms are equivalent if they have identical codes. One drawback of this method is that all the paths must be generated. A greater problem, however, is that the atomic codes only represent the molecular topology, thus making it difficult to include information about the types of atoms, the bonds and the stereochemistry. Some programs include information about the types of bonds through the use of graphs with multiple connections [22]. However, these do not include the self-incident bonds that we use in the representation of ionic and radical molecules. The same problem occurs with special bonds like those used in the representation of stereochemical relations. Stereochemical information can be transformed into a normal non-weighted graph [23], but this complicates the description of a structure and enlarges the number of paths.

The predicate ISOMORPHIC can be successfully used to check the structural equivalence between atoms. If two atoms are equivalent, each of them must “see” the rest of the molecule in the same way. The program EQUI_NODES does this task by “erasing” each one of the two atoms from the molecule and then checking whether the remainder in each case is isomorphic.

    equi_nodes(L1, L2, G) :-
        node_sphere(L1, G, S1),
        node_sphere(L2, G, S2),
        uninst_graph(S2, US2),
        isomorphic(S1, US2), !,
        del_node(L1, G, C1),
        del_node(L2, G, C2),
        uninst_graph(C2, UC2),
        isomorphic(C1, UC2), !.

The auxiliary predicates are defined in the Appendix.

The first lines of the program verify the isomorphism between the two α spheres (coordination sphere up to the first neighbors) of the atoms with labels L1 and L2 in the instantiated molecule G. Only if the α spheres are equal is the isomorphism of the rest of the molecule verified.
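The whole equivalence test can be assembled into one runnable sketch, assuming SWI-Prolog. The auxiliary predicates below are our own reformulations of the Appendix predicates for a standard Prolog (take/3 for DELETE, a sort-based labels/2, a map-based uninst_graph/2), so details may differ from the authors' program; the propane skeleton is an illustrative example of ours.

```prolog
take(X, [X|T], T).
take(X, [Y|T], [Y|T1]) :- take(X, T, T1).

equal_graph([], []).
equal_graph(A, [b(T, A1, L1, A2, L2)|Rest]) :-
    (   take(b(T, A1, L1, A2, L2), A, C)
    ;   take(b(T, A2, L2, A1, L1), A, C)
    ),
    equal_graph(C, Rest).

labels(Graph, Labels) :- raw_labels(Graph, Raw), sort(Raw, Labels).
raw_labels([], []).
raw_labels([b(_, _, L1, _, L2)|R], [L1, L2|LR]) :- raw_labels(R, LR).

isomorphic(A, B) :-
    labels(A, LA), labels(B, LB),
    length(LA, N), length(LB, N),
    equal_graph(A, B).

% uninst_graph(+G, -UG): replace each distinct label by a fresh variable.
uninst_graph(G, UG) :- uninst(G, [], UG).
uninst([], _, []).
uninst([b(T, A1, L1, A2, L2)|R], M0, [b(T, A1, V1, A2, V2)|UR]) :-
    label_var(L1, M0, V1, M1), label_var(L2, M1, V2, M2), uninst(R, M2, UR).
label_var(L, Map, V, Map) :- memberchk(L-V, Map), !.
label_var(L, Map, V, [L-V|Map]).

bond(b(T, A1, L1, A2, L2), G) :- member(b(T, A1, L1, A2, L2), G).
bond(b(T, A1, L1, A2, L2), G) :- member(b(T, A2, L2, A1, L1), G).

% node_sphere(+X, +G, -S): all bonds of G that start or end at node X,
% each written with X as the first atom.
node_sphere(X, G, S) :- findall(B, (B = b(_, _, X, _, _), bond(B, G)), S).

% del_node(+X, +G, -C): G without the bonds that involve node X.
del_node(X, G, C) :-
    findall(B, (member(B, G), B \= b(_, _, X, _, _), B \= b(_, _, _, _, X)), C).

equi_nodes(L1, L2, G) :-
    node_sphere(L1, G, S1), node_sphere(L2, G, S2),
    uninst_graph(S2, US2), isomorphic(S1, US2), !,
    del_node(L1, G, C1), del_node(L2, G, C2),
    uninst_graph(C2, UC2), isomorphic(C1, UC2), !.

propane([b(1,c,1,c,2), b(1,c,2,c,3)]).
```

For the propane skeleton, `?- propane(G), equi_nodes(1, 3, G).` succeeds (the two end carbons are equivalent), while `?- propane(G), equi_nodes(1, 2, G).` fails already at the comparison of the α spheres.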

The program EQUI_NODES can easily recognize cases like that shown in Fig. 7, where the atoms labeled 2 and 6 are topologically equivalent, but not structurally equivalent, owing to the differences in the steric centers 3, 4 and 5. The same reasoning applies to the bromine atoms 9 and 10, and to carbon atoms 3 and 5, which are topologically equivalent.

[Figure: drawing of graph Y]

Fig. 7. Molecule with steric centers. Atoms 2 and 6 are not structurally equivalent owing to steric differences. The same is valid for atom pairs 9-10 and 3-5.

Conclusion

We have presented here a flexible method of representing chemical structures in the symbolic language Prolog through term lists. We have shown how specific procedures for the manipulation of these lists can be adapted to allow their direct application to chemical structures. By the use of this representation it has proved possible to make algorithms to treat - in a simple and elegant way - problems related to pattern recognition, such as isomorphism, substructures, ring detection and symmetry. The algorithms were implemented in the Arity Prolog interpreter, rated at 3.4 KLips (thousand logical inferences per second), running on a 386/AT compatible personal computer. In the development of the programs, we have placed more emphasis on simplicity than on efficiency. The isomorphism and substructure predicates are the slowest, as can be seen with highly symmetric structures with many rings, such as cubane and steroids in general. The processing time can be considerably reduced by improving these algorithms and by using faster interpreters or a Prolog compiler.

References

1 J.J. Volmer, J. Chem. Educ., 60(3) (1983) 192-196.
2 J. Dugundji and I. Ugi, Top. Curr. Chem., 39 (1973) 19.
3 I. Ugi, J. Bauer, J. Brandt, J. Friedrich, J. Gasteiger, C. Jochum and W. Schubert, Angew. Chem. Int. Ed. Engl., 18 (1979) 111-123.
4 P.J. Hansen and P.C. Jurs, J. Chem. Educ., 65(7) (1988) 574-580.
5 P.C. Jurs, Computer Software Applications in Chemistry, Wiley, New York, 1986, Chapter 10.
6 J. Koca, M. Kratochvil, V. Kvasnicka, L. Matyska and J. Pospichal, Synthon model of organic chemistry and synthesis design, Lecture Notes in Chemistry, Vol. 51, Springer, Berlin, 1989.
7 A.T. Balaban, J. Chem. Inf. Comput. Sci., 25 (1985) 334-343.
8 For an introduction to Prolog see Chapters 1 and 2 of Refs. 9, 10 and 11. Programs for list manipulation can be found in Chapters 3 and 9 of Ref. 9, Chapters 3 and 7 of Ref. 10, and Chapters 3 and 4 of Ref. 11.
9 H. Coelho and J.C. Cotta, Prolog by Example: How to Learn, Teach and Use it, Springer, Berlin, 1988, p. 36.
10 W.F. Clocksin and C.S. Mellish, Programming in Prolog, 3rd edn., Springer, Berlin, 1987.
11 I. Bratko, Prolog Programming for Artificial Intelligence, International Computer Science Series, Addison-Wesley, 1986.
12 H.L. Morgan, J. Chem. Doc., 5 (1965) 107-113.
13 J.B. Hendrickson and A.G. Toczko, J. Chem. Inf. Comput. Sci., 23 (1983) 171-177.
14 W. Bremser, Anal. Chim. Acta, 103 (1978) 355-365.
15 M. Randic, J. Chem. Inf. Comput. Sci., 18 (1978) 101-107.
16 C. Wilkins, M. Randic, S.M. Schuster, R.S. Markin, S. Steiner and L. Dorgan, Anal. Chim. Acta, 133 (1981) 637-645.
17 J.L. Armstrong and D.B. Hibbert, J. Chem. Inf. Comput. Sci., 29 (1989) 51-60.
18 E.J. Corey and G.A. Peterson, J. Am. Chem. Soc., 94(2) (1972) 460-465.
19 W.T. Wipke and T.M. Dyott, J. Chem. Inf. Comput. Sci., 15 (1975) 140-147.
20 B.L. Roos-Kozel and W.L. Jorgensen, J. Chem. Inf. Comput. Sci., 21 (1981) 101-111.
21 R.E. Carhart, J. Chem. Inf. Comput. Sci., 18 (1978) 108-110.
22 M. Randic, G.M. Brissey, R.B. Spencer and C.L. Wilkins, Comput. Chem., 4 (1980) 27-43.
23 T. Akutsu, J. Chem. Inf. Comput. Sci., 31 (1991) 414-417.


Appendix

All complementary programs are defined in the following computer listing. It also includes all graphs expressed as Prolog lists.

    %======= General procedures =======

    delete(H, [H|T], T).
    delete(H, [X|T], [X|T1]) :- delete(H, T, T1).

    del_all(X, G, C) :- findall(Y, (member(Y, G), Y \= X), C).

    append([], L, L).
    append([A|B], L, [A|C]) :- append(B, L, C).

    member(X, [X|_]).
    member(X, [_|Y]) :- member(X, Y).

    %//// Graph uninstantiator
    %     I: 'G'  instantiated
    %     O: 'UG' uninstantiated

    uninst_graph(G, UG) :-
        findall(S, (member(X, G), atom_elem(X, S)), Atom_list),
        string_term(UGS, Atom_list),
        string_term(UGS, UG).

    atom_elem(b(Ty, A1, L1, A2, L2), S) :-
        string_term(S1, L1), string_term(S2, L2),
        concat($X$, S1, NS1), concat($X$, S2, NS2),
        atom_string(V1, NS1), atom_string(V2, NS2),
        S = b(Ty, A1, V1, A2, V2).

    %//// Bond membership

    bond(b(Ty, A1, L1, A2, L2), [b(Ty, A1, L1, A2, L2)|G]).
    bond(b(Ty, A2, L2, A1, L1), [b(Ty, A1, L1, A2, L2)|G]).
    bond(b(Ty, A1, L1, A2, L2), [_|G]) :- bond(b(Ty, A1, L1, A2, L2), G).

    %//// Remove all terminal nodes from 'G'

    cut_ext_nodes(G, InternalG) :-
        delete(b(_, _, L1, _, L2), G, R),
        ( not bond(b(_, _, L1, _, _), R) ; not bond(b(_, _, L2, _, _), R) ),
        cut_ext_nodes(R, InternalG), !.
    cut_ext_nodes(InternalG, InternalG).

    %//// Remove all self-incident edges from 'G'

    cut_loops(G, GB) :-
        delete(b(_, _, L1, _, L1), G, R),
        cut_loops(R, GB), !.
    cut_loops(GB, GB).

    cut_stereo(G, GB) :-
        delete(b(T, _, _, _, _), G, R),
        ( T = trans ; T = cis ),
        cut_stereo(R, GB), !.
    cut_stereo(GB, GB).

    %//// Ordered list of labels from 'Graph'

    labels(Graph, L) :- labels1(Graph, LL), setof(X, member(X, LL), L).

    labels1([], []) :- !.
    labels1([b(_, _, L1, _, L2)|R], [L1, L2|LR]) :- labels1(R, LR).

    %//// Delete all bonds from a node

    del_node(X, G, C) :-
        del_all(b(_, _, X, _, _), G, T),
        del_all(b(_, _, _, _, X), T, C).

    %//// Alfa sphere of a node

    node_sphere(X, G, S) :- B = b(_, _, X, _, _), findall(B, bond(B, G), S).

    ... b(1,c,J3,c,J6)]).

    graph(l, [b(1,c,1,c,2), b(1,c,2,c,3), b(1,c,2,oh,4)]).

    graph(m, [b(1,c,L1,c,L2), b(1,c,L2,c,L3), b(1,c,L4,oh,L5)]).

    graph(n, [b(1,br,1,c,2), b(1,c,2,c,3), b(1,c,3,o,7), b(1,o,7,c,6),
              b(1,c,6,c,5), b(1,c,5,c,4), b(1,c,4,c,3)]).

    graph(o, [b(1,br,L4,c,L3), b(1,c,L2,c,L3), b(1,c,L2,c,L1),
              b(1,o,L6,c,L5)]).