Evolution on Random Structures Christian M. Reidys Simon M. Fraser SFI WORKING PAPER: 1996-11-082 SFI Working Papers contain accounts of scientific work of the author(s) and do not necessarily represent the views of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared in print. Except for papers by our external faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, or funded by an SFI grant. ©NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure timely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the author(s). It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may be reposted only with the explicit permission of the copyright holder. www.santafe.edu SANTA FE INSTITUTE


Evolution on Random Structures

By

Christian M. Reidys (a,b) and Simon M. Fraser (b)

(a) Los Alamos National Laboratory, TSA-DO/SA, New Mexico

(b) Santa Fe Institute, Hyde Park Rd., Santa Fe, NM, USA

Mailing address: (b) Santa Fe Institute, Hyde Park Rd., Santa Fe, NM, USA

E-Mail: {duck, smfr}@santafe.edu


Abstract. In this paper we investigate the relation between the structure and dynamics of molecules, using a level of coarse graining at which we consider a molecular structure as a "random structure". A random structure consists of (i) a "random" contact graph and (ii) a family of relations imposed on its adjacent vertices. The vertex set of the contact graph is simply the set of all indices of a sequence, and its edges are obtained by picking secondary and tertiary bonds (from the set of all possible bonds) in two randomization procedures. The corresponding relations associated with the edges are viewed as secondary base pairing rules and tertiary interaction rules respectively. Mappings of sequences into random structures are constructed. Here, the set of all sequences that map into a particular random structure is modeled as a random graph in the sequence space, the so called neutral network. We analyze the graph structure of the contact graphs of random structures and their union, and show how their graph theoretic properties influence the dynamics of sequences mapping into them. In particular, we see a phase transition (in the limit of long sequences) in the union graph, which is manifested in the emergence of a giant component. The critical parameter for this phase transition is the fraction of tertiary interactions in the molecule. A replication-deletion experiment reveals that this dramatic change in molecular structure has significant effects on the dynamics of the optimization process. This results in a non-linear relation between the fraction of tertiary interactions in the biomolecules and the times taken for a population of sequences to find a high-fitness target structure. These results have important implications for evolutionary optimization in biopolymers, in particular the evolution of viruses.

Key words: random structure, sequence-structure mapping, random graph, connectivity, giant component, hitting time, optimization

1 Introduction

The folding of molecular sequences into their spatial structures is of central interest in biophysics. Its properties are important for the understanding of evolutionary optimization of biopolymers and the theory of molecular evolution. Obviously, the term "structure" can reflect different levels of coarse graining. In biophysics "structure" is defined in terms of some physical conditions, for example minimum free energy or kinetic parameters. "Structure" can also be defined as the set of all affine coordinates of the atoms of a molecule, or it can consist of a list of all pairs of coordinates of the sequence that are paired by means of chemical bonds. In this paper we will consider "structure" as such a correlation scheme. In particular we will not assume that this scheme has to fulfill constraints that might arise from an embedding in three dimensional space. A structure will consist of a graph and a multi-set of relations imposed on the extremities of its edges. Its vertices correspond to the indices of the nucleotides and its edges correspond to the bonds.


Intensive research has been done in computing the minimum free energy structure of a particular RNA sequence […], focusing on the question of how to determine the most likely structure for a given sequence. Schuster and coworkers […] provided an important change of perspective by concentrating on the mapping from sequences into RNA secondary structures, in the context of molecular evolutionary biology. These mappings have been used to induce fitness landscapes in which evolutionary optimization has been studied. In those landscapes, for example, the lengths of neutral walks, i.e. paths where subsequent sequences have Hamming distance one and all sequences map into a fixed secondary structure, have been analyzed […]. It has been observed that in the case of AUGC-sequences one can walk almost to maximal Hamming distance without changing the structure.

In the last two years a mathematical model for the mapping into RNA secondary structures was developed […], which changed the perspective towards graph theoretic properties of the sequences that map into a fixed structure. These sequences have been shown to form neutral networks in sequence space (being neutral on the level of structures). Neutral networks themselves can be modeled as random graphs and exhibit threshold values for their density and connectivity properties. For biologically-realistic parameters, the theory implies the existence of dense and connected networks that percolate sequence space and are thereby well suited to searching for new structures. Simultaneously, exhaustive mappings using biophysical folding algorithms have been performed […] in order to compare the random graph model with biophysical maps. In the case of GC-sequences of length … this work verified the existence of neutral networks which decomposed into a few (mostly … or …) components. Furthermore, random graph maps have been studied […], which have similar structure. Thus, there are robust features of mappings into RNA secondary structures […], which depend more on the combinatorial properties of the molecules than on particular physical parameters (for example, the strength of certain base pairs).

So far all results have been obtained for a special class of structures, namely RNA secondary structures obeying Watson-Crick base-pairing rules. However, the above results imply […] that molecular structures can be studied from a rather new point of view: they are correlation schemes that induce landscapes that can be searched effectively by point mutations. In this context, some combinatorial features of bonds in biomolecules are of central importance. In a biomolecule there are, to a first approximation, two classes of chemical bonds, namely secondary and tertiary ones. Secondary bonds impose a uniqueness constraint on the involved nucleotides: no base can have more than one secondary interaction. In addition, for secondary bonds there exists a context independent base pairing rule which determines which pairs of nucleotides are able to establish bonds. Tertiary bonds are known to have context dependent rules and in general do not fulfill a uniqueness constraint as described above for secondary bonds. In order to generalize the RNA random graph model to mappings into 3D-RNA or proteins, random structures have been introduced […].

[Figure 1. Panels: "Contact graph", "Components", "Random structure realised by a sequence". Vertices are numbered 1 to 20.]

Figure 1: A contact graph, consisting of an ordered set of vertices (numbered), between which there can be either secondary (gray) or tertiary (thin black) edges, together with its set of components. On the right hand side, the bases of one compatible sequence are shown on the vertices, indicating that, in the random structure, certain relations associated with the edges have to be fulfilled (in this case Watson-Crick base-pairing rules).

A random structure consists of the following pieces of data: (i) a random contact graph with parameters $n$, $m$ and $c_2$, and (ii) a family of relations on adjacent vertices. Here, $m$ is the number of secondary bonds, and $c_2$ the fraction of nucleotides involved in tertiary interactions. The contact graph is a graph on the $n$ indices of the coordinates of a sequence, whose edge set is the union of the edge sets of two random graphs. The first is a 1-regular graph on $2m$ vertices, obtained by picking $m$ pairs of indices without replacement. The second is a random graph on $n$ vertices, obtained by selecting the remaining $\binom{n}{2} - m$ edges with independent probability $p = c_2/n$. Here, $c_2$ is the probability of a specific nucleotide being involved in a tertiary interaction. It has been shown in […] that in the limit of long sequences almost all vertices of the random contact graphs are contained in tree components of logarithmic size (relative to sequence length).

Random structures are thus robust with respect to point mutations. This robustness is a well known factor in molecular biology and has been discussed in the context of neutral evolution […]. For example, let us consider a point mutation in a nucleotide of a component. According to the relations (rules) associated with the edges (bonds), a high fraction of all nucleotides of this component has to be changed in order to stay compatible. The probability of the sequence remaining compatible with the structure decreases exponentially with the size of the component in which the mutation occurred. Hence contact graphs with small components are robust with respect to point mutations.

[Figure 2. Panels a), b), c): example components with compatible bases (A, U, G, C) shown on their vertices.]

Figure 2: This figure illustrates the notion of compatibility of a sequence with respect to a structure. For three types of components (single nucleotides a, nucleotides in secondary interactions b, and those in secondary and tertiary interactions c), compatible sets of vertices are shown.

One central question for the biomolecular optimization process is how well the set of structures is searched by point mutations. In this context, the structure of the union of two contact graphs (and of biomolecules in general) is of particular importance. For random structures, we can determine the structure of this graph and show its influence on the dynamics of the optimization process. Here we describe our basic experiment. Let $s_n$ be a random structure that has intermediate fitness and has a high fraction of a population of sequences on its neutral network. Now suppose that $s'_n$ is a target structure having a higher fitness. Then we would like to know how likely it is that sequences folding into $s_n$ will be mutated into sequences folding into the target structure. If we take the union of all bonds of the contact graphs of $s_n$ and $s'_n$ we obtain a new graph. It is exactly this union graph that allows us to describe how "close" the two structures come. Formally we could view the union graph as a new type of contact graph. We could then use it to determine sequences that fulfill the constraints imposed by both constituent random structures. Now, if on the one hand the union graph decomposes into small components, the arguments discussed above for single contact graphs apply: it is highly likely that "bi-compatible" sequences exist, which are capable of folding into both structures. On the other hand, if the union graph exhibits a large component of order $n$, it is unlikely that bi-compatible sequences will exist. In this component there will be many cycles which impose large numbers of constraints that can never be fulfilled simultaneously by a sequence. Therefore it will be difficult to switch between $s_n$ and the target structure $s'_n$ by point mutations. We will show how the dynamics of this search process change in relation to the $m$ and $c_2$ parameters.


The paper is structured as follows: first we introduce the concept of random structures and give some statistics on their graph theoretic properties. Second, we study the relation between the structure of the contact graph and the dynamics of evolution on these random structures.

2 Random structures

2.1 Terminology

A graph $X$ consists of a tuple $(vX, eX)$ and a map $o \times t\colon eX \to vX \times vX$. $vX$ is called the vertex set and $eX$ the edge set. An element $P \in X$ is called a vertex of $X$; an element $y \in X$ is called an edge. The vertex $o(y)$ is called the origin of $y$ and the vertex $t(y)$ is called the terminus of $y$; $o(y)$, $t(y)$ are called the extremities of the edge $y$. There is an obvious notion of $Y$ being a subgraph of $X$. We call a subgraph $Y$ induced if, for any $P, P' \in Y$ being extremities of an edge $y \in X$, it follows that $y \in Y$. A path in $X$ is a sequence $(Q_1, y_1, Q_2, y_2, \dots, y_n, Q_{n+1})$, where $Q_i \in X$, $y_i \in X$, $o(y_i) = Q_i$ and $t(y_i) = Q_{i+1}$. A path such that $Q_1 = Q_{n+1}$ is called a cycle. $X$ is called connected if any two vertices are vertices of a path of $X$. A connected graph without cycles is called a tree. Being connected is an equivalence relation in $X$, and the maximal connected subsets of vertices are called components of $X$. If we let $Y$ be a subgraph of $X$, then the closure of $Y$ in $X$, $\overline{Y}$, is the induced subgraph of all vertices of $X$ that are adjacent to some vertices of $Y$. A subgraph $Y \subset X$ is called dense if and only if $\overline{Y} = X$. Finally, a vertex $P$ is called isolated in $X$ if it is not an extremity of an edge $y \in X$.

We now introduce the contact graph of a random structure. For this purpose, let $c_1$ and $c_2$ be positive constants and $n$ be the sequence length. Suppose that $m(n)$ is a natural number such that $(2m(n)/n)_n$ is a monotonically increasing sequence with limit $\lim_{n \to \infty} 2m/n = c_1$. $m$ corresponds to the number of secondary bonds of the molecule. We will write a sequence $V \in Q^n_\alpha$ as $V = (P_1, \dots, P_n)$. $Q^n_\alpha$ is a generalized $n$-cube whose vertices are sequences of length $n$ over the alphabet $A$ of size $\alpha$, and in which two sequences are adjacent if they differ in exactly one coordinate.

Let now $X_1$ be a partial 1-factor graph on $2m$ indices, say, $\{i_1, \dots, i_{2m}\} \subset \{1, \dots, n\}$. $X_1$ is the contact graph formed by all secondary interactions. Next let $X_2$ be the random graph obtained by selecting all possible edges between the $n$ nucleotides, except the secondary edges, with probability $c_2/n$. Clearly, the graphs $X_1$ form a finite probability space by assigning to each 1-regular graph uniform probability. Analogously, the graphs $X_2$ form a finite probability space where a graph with $k$ edges has probability $p^k (1-p)^{\binom{n}{2} - m - k}$ with $p = c_2/n$ […]. The graphs $X_1$, $X_2$ induce the graph $X_1 \cup X_2$ whose vertex set is $\{1, \dots, n\}$ and whose edge set $e(X_1 \cup X_2)$ is the (disjoint) union of all $X_1$- and $X_2$-edges. $X_1 \cup X_2$ is called the contact graph. The graphs $X_1 \cup X_2$ form a probability space, which is referred to by its parameters $n$, $m$ and $c_2$.

A random structure, $s_n$, on $n$ nucleotides of a finite alphabet $A$ consists of the following pieces of data:

(i) a contact graph $X_1 \cup X_2$,

(ii) a family of symmetric relations $(R_1, R_y)_{y \in X_2}$, where $R_1, R_y \subset A \times A$.

Each $R_y$ is supposed to have the property that for all $a \in A$ there exists one $b \in A$ with $a R_y b$. The relation $R_1$ is motivated by the Watson-Crick base-pairing rules observed in RNA secondary structures. For $y \in X_2$ the relation $R_y$ corresponds to a specific (tertiary) interaction rule that might be context dependent.

A vertex (sequence) $V \in Q^n_\alpha$ is called compatible to $s_n$ if and only if

(i) for all bonds $y$ of the partial 1-factor graph $X_1$, its nucleotides indexed by the extremities $\{o(y), t(y)\}$ have the property $P_{o(y)} R_1 P_{t(y)}$ (note that since $R_1$ is symmetric we also have $P_{t(y)} R_1 P_{o(y)}$),

(ii) its nucleotides fulfill, for all tertiary bonds $y \in X_2$, $P_{o(y)} R_y P_{t(y)}$.

The set of compatible vertices with respect to the random structure $s_n$ is called $C[s_n]$. By construction there are $n - 2m$ vertices not incident to an $X_1$-edge, and there are asymptotically $(n - 2m) e^{-c_2}$ isolated vertices in $X_1 \cup X_2$.
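The construction of a contact graph and the compatibility test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: vertices are numbered from 0, the function names `random_contact_graph` and `is_compatible` are hypothetical, the secondary relation is taken to be the six Watson-Crick/wobble pairs, and tertiary bonds without an explicit relation fall back to the same rule purely for illustration.

```python
import random
from itertools import combinations

# Assumed secondary relation R_1: the six pairs AU, UA, GC, CG, GU, UG.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def random_contact_graph(n, m, c2, rng):
    """Sample secondary and tertiary edge sets on vertices 0..n-1."""
    # Secondary bonds: m disjoint pairs, i.e. a partial 1-factor on 2m indices.
    chosen = rng.sample(range(n), 2 * m)
    secondary = {frozenset(chosen[2 * i:2 * i + 2]) for i in range(m)}
    # Tertiary bonds: every remaining pair independently with p = c2 / n.
    p = c2 / n
    tertiary = {
        frozenset(pair)
        for pair in combinations(range(n), 2)
        if frozenset(pair) not in secondary and rng.random() < p
    }
    return secondary, tertiary


def is_compatible(seq, secondary, tertiary, r1=WATSON_CRICK, ry=None):
    """Check the relation on both extremities of every bond.

    ry maps a tertiary bond to its own relation; bonds missing from ry
    use r1 (an illustration choice, not taken from the paper).
    """
    ry = ry or {}
    for bond in secondary:
        i, j = sorted(bond)
        if (seq[i], seq[j]) not in r1:
            return False
    for bond in tertiary:
        i, j = sorted(bond)
        if (seq[i], seq[j]) not in ry.get(bond, r1):
            return False
    return True
```

For example, `random_contact_graph(20, 6, 0.1, random.Random(0))` returns six disjoint secondary pairs plus a small random set of tertiary pairs, and `is_compatible("GC", {frozenset({0, 1})}, set())` checks a single base pair.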

2.2 Statistics of random structures

We already discussed the relation between the contact graph and the robustness of the random structure with respect to point mutations. In this section we will analyze in detail the graph structure of (i) the contact graph and (ii) the union of two contact graphs. A number of asymptotic results on contact graphs and their union have been proven in […]. As is typical for random graph results, these hold in the limit of long sequences. Here we will verify the extent to which the results apply for sequences of finite length.

First, the expected number of paths of length $\ell$ in contact graphs is given by

\[ n\,(c_1 + c_2)^{\ell}. \tag{1} \]

This implies that the contact graphs decompose with probability 1 into small components. For this purpose we give some bounds for $c_1$, $c_2$ that are observed in RNA and protein structures: $\ldots \le c_1 \le \ldots$ and $\ldots \le c_2 \le \ldots$. Plugging this into eq. (1) we obtain immediately that most nucleotides of the contact graph $X_1 \cup X_2$ are contained in small components. We further compute the probability of having a mutation in a component of size $\ell$ in the contact graph; see Figure 3a.

[Figure 3. Panel a): x-axis "Component size l" (logarithmic, 1 to 100), y-axis "P(node in a component of size l)" (logarithmic, 0.001 to 1), curves for c2 = 0.05, 0.1, 0.2. Panel b): x-axis "Vertices in graph n" (1000 to 1,000,000), y-axis "Proportion of nodes in the largest component" (logarithmic, 0.0001 to 0.01), curves for c2 = 0.05 and 0.15 with fitted lines C ln(n)/n.]

Figure 3: a) The distribution of vertices in components of size $\ell$ in a single contact graph, for different values of $c_2$ (the fraction of tertiary bonds). b) The scaling of the largest component in a single contact graph with sequence length (number of vertices $n$) for $c_2 = 0.05$ and $0.15$. Also shown are fitted lines with $y = C \ln(n)/n$, with $C = \ldots$ and $\ldots$ respectively.

In addition we have also analyzed how the largest component in the contact graph scales with the sequence length (Figure 3b). According to […], the largest component is at most of the order $C \ln(n)$, with the constant $C$ and the admissible range of $c_2$ as specified there. The above data illustrate, however, that the largest component has in fact size $C \ln(n)$, and we have computed values of $C$ for two different $c_2$ parameters (legend to Figure 3). We complete our analysis of contact graphs by computing the distribution of the numbers of cycles in $X_1 \cup X_2$. The data clearly indicate that cycles are extremely rare events in contact graphs; see Figure 6a.
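Component statistics of this kind can be estimated numerically. The following is a minimal sketch under stated assumptions, not the authors' code: the function names `sample_edges`, `component_sizes` and `cycle_rank` are hypothetical, vertices are numbered from 0, and "number of cycles" is read as the number of independent cycles (edges minus vertices plus components), which is only one possible reading of the statistic.

```python
import random
from collections import Counter


def sample_edges(n, m, c2, rng):
    """Edge set of one contact graph: m disjoint pairs plus G(n, c2/n)-style edges."""
    idx = rng.sample(range(n), 2 * m)
    edges = {(min(a, b), max(a, b)) for a, b in zip(idx[::2], idx[1::2])}
    p = c2 / n
    for i in range(n):
        for j in range(i + 1, n):
            # Secondary pairs are skipped; all other pairs are tertiary candidates.
            if (i, j) not in edges and rng.random() < p:
                edges.add((i, j))
    return edges


def component_sizes(n, edges):
    """Return a Counter mapping each component representative to its size."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    return Counter(find(v) for v in range(n))


def cycle_rank(n, edges):
    """Number of independent cycles: |E| - |V| + number of components."""
    return len(edges) - n + len(component_sizes(n, edges))
```

With these helpers, `max(component_sizes(n, edges).values()) / n` gives the proportion of vertices in the largest component, and a histogram of `component_sizes(n, edges).values()` gives the component size distribution for one sampled graph.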

Next we turn to the structure of the union graph. Following […], there is a phase transition in $(X_1 \cup X_2) \cup (X'_1 \cup X'_2)$ concerning the emergence of a largest component of order $C'n$ with constant $C' > 0$. The latter transition is expressed in terms of $c_2$, the average number of tertiary interactions per nucleotide, and $c_1$, the fraction of secondary interactions. The exact result reads that for small values of $c_2$ the components of the union graph have sizes bounded by $C \ln(n)$ and, in the limit of large sequence length, for $c_1$, $c_2$ such that

�c���� c��c� � � � ���

there exists a unique large component of size $Kn$, $K > 0$, with probability tending to 1. We thus compute the fraction of nucleotides contained in the largest component of $(X_1 \cup X_2) \cup (X'_1 \cup X'_2)$ and the order of the size of the largest component; see Figure 4. We next consider the so called sequence of components of $(X_1 \cup X_2) \cup (X'_1 \cup X'_2)$. This is the multi-set of the sizes of its components ordered by size. For increasing $c_2$ the large component engulfs the other components, as shown in Figure 5. Moreover, since the largest component is of size $C'n$, it contains most cycles of the union graph. To illustrate this we computed the distribution of the number of cycles in $(X_1 \cup X_2) \cup (X'_1 \cup X'_2)$ for various $c_2$ parameters, the results of which are shown in Figure 6b.

[Figure 4. Panel a): x-axis c2 (0 to 0.25; inset 0 to 0.5), y-axis "Proportion of nodes in the largest component", curves for n = 10^2, 10^3, 10^4, 10^5, 10^6 (inset: n = 10^3). Panel b): x-axis "Vertices in graph n" (500 to 10000), y-axis "Proportion of nodes in the largest component" (logarithmic, 0.001 to 1), curves for c2 = 0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5.]

Figure 4: a) The size of the largest component in the union of two contact structures for $c_1 = \ldots$ as a function of $c_2$. Curves are shown for graphs with $10^2$ to $10^6$ nodes with $0 \le c_2 \le 0.25$ and, inset, for a graph of $10^3$ nodes with $c_2 \le 0.5$. b) The size of the largest component as a function of sequence length $n$. Curves for various values of $c_2$ are displayed.
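The emergence of a large component in the union graph can be explored with a small simulation. The sketch below makes several assumptions: the two contact graphs are sampled independently with the same parameters, $m$ is set to $\lfloor c_1 n / 2 \rfloor$, and the names `contact_edges`, `largest_component_fraction` and `union_fraction` are hypothetical. It does not evaluate the analytic threshold condition; it only measures the largest component of sampled graphs.

```python
import random


def contact_edges(n, m, c2, rng):
    """Edges of one contact graph as frozensets {i, j} on vertices 0..n-1."""
    idx = rng.sample(range(n), 2 * m)
    edges = {frozenset(pair) for pair in zip(idx[::2], idx[1::2])}
    p = c2 / n
    for i in range(n):
        for j in range(i + 1, n):
            e = frozenset((i, j))
            if e not in edges and rng.random() < p:
                edges.add(e)
    return edges


def largest_component_fraction(n, edges):
    """Fraction of the n vertices lying in the largest connected component."""
    adj = {v: [] for v in range(n)}
    for e in edges:
        a, b = tuple(e)
        adj[a].append(b)
        adj[b].append(a)
    seen, best = set(), 0
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first traversal of one component
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best / n


def union_fraction(n, c1, c2, rng):
    """Largest-component fraction in the union of two sampled contact graphs."""
    m = int(c1 * n / 2)  # assumption: c1 = 2m / n
    union = contact_edges(n, m, c2, rng) | contact_edges(n, m, c2, rng)
    return largest_component_fraction(n, union)
```

Averaging `union_fraction(n, c1, c2, rng)` over many samples for a grid of `c2` values gives a curve of the kind described for the union graph; the quadratic pair loop keeps this sketch practical only for moderate `n`.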

2.3 Mapping in random structures

In order to study the dynamics of the evolutionary optimization process, we have to establish a mapping between sequences and random structures […]. If we fix a random structure $s_n$, the preimage is necessarily contained in $C[s_n]$, the set of compatible sequences. As we have shown in the previous section, the underlying contact graph has almost all vertices in tree components of at most logarithmic size. This contact graph induces a partition of the indices $\{1, \dots, n\}$ into its components (Figure 1). Accordingly, we regroup the indices of the nucleotides of a compatible sequence into the components of the contact graph. Formally we can now consider each multi-set $(P_{i_1}, \dots, P_{i_k})$, consisting of nucleotides whose indices belong to a component of the contact graph, to be an element of a new alphabet, $A_k$. Accordingly we can rewrite a compatible sequence as $(A_{i_1}, \dots, A_{i_h})$ ($h$ being the number of components of the contact graph).

[Figure 5. x-axis "Component size l" (0 to 700), y-axis "P(node in a component of size l)" (logarithmic, 0.0001 to 1), curves for c2 = 0.05, 0.15, 0.25.]

Figure 5: The sequence of components in the union of two contact structures for $c_1 = \ldots$, with $c_2 = 0.05, 0.15, 0.25$. Curves are shown for graphs with … nodes; data are mean probabilities from … randomly constructed graphs.

To illustrate this let us consider an RNA secondary structure with respect to the biophysical alphabet A, U, G, C. The latter has a contact graph whose edges are exactly the paired positions $\{[i_1, i_k], \dots, [i_m, i_j]\}$. These are also all nontrivial components of the contact graph. The Watson-Crick base pairing rule, AU, UA, GC, CG, GU, UG, corresponds to the induced alphabet. Accordingly, a compatible sequence $(P_1, \dots, P_n)$ can be rewritten as $(P_{i_1}, \dots, P_{i_k}, [P_{j_1}, P_{j_2}], \dots, [P_{j_r}, P_{j_{r+1}}])$, where each pair $[P_{j_k}, P_{j_{k+1}}]$ fulfills the Watson-Crick base pairing rules. The set of compatible sequences is the vertex set of

\[ Q^{\,n-2\ell}_{4} \times Q^{\,\ell}_{6}, \]

where $\ell$ is the number of base pairs.
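As a small worked check of this product structure, one can count compatible sequences by brute force. The sketch assumes the biophysical alphabet AUGC with the six allowed pairs listed above, so that each unpaired position contributes 4 choices and each base pair 6; the function names `count_compatible` and `predicted_count` are hypothetical, and positions are numbered from 0.

```python
from itertools import product

ALPHABET = "AUGC"
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def count_compatible(n, pairs):
    """Brute-force count of length-n sequences satisfying every base pair."""
    return sum(
        all(seq[i] + seq[j] in PAIRS for i, j in pairs)
        for seq in product(ALPHABET, repeat=n)
    )


def predicted_count(n, pairs):
    """4 choices per unpaired position, 6 per pair (pairs must be disjoint)."""
    return 4 ** (n - 2 * len(pairs)) * 6 ** len(pairs)


if __name__ == "__main__":
    structure = [(0, 5), (1, 4)]  # two nested base pairs in a sequence of length 6
    print(count_compatible(6, structure), predicted_count(6, structure))
```

The enumeration grows as $4^n$, so it is only meant for very short sequences; its purpose is to confirm the counting argument, not to replace it.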

[Figure 6. Panels: a) "Single contact graph" (x-axis "Number of cycles l", 0 to 2; curves for c2 = 0.05, 0.1, 0.15, 0.2), b) "Union of two contact graphs" (x-axis "Number of cycles l", 0 to 50; curves for c2 = 0.05, 0.1, 0.15, 0.2, 0.25). y-axis: "Probability of a structure having l cycles" (logarithmic, 0.0001 to 1.0).]

Figure 6: The probabilities of obtaining structures with cycles, in a) a single contact graph, and b) the union of two contact graphs, for different values of $c_2$.

In general the set of compatible sequences is the vertex set of

\[ \prod_{i=1}^{h} Q^{n_i}_{\alpha_i}, \qquad \text{where } \sum_i i \cdot n_i = n, \]

$\alpha_i = |A_i|$, $h$ is the number of components, and $n_i$ the length of the $i$-th component of the contact graph. Next we construct the preimage of the random structure $s_n$. It will be a random induced graph obtained by selecting the vertices in each factor $Q^{n_i}_{\alpha_i}$ with independent probability $\lambda_i$. Note that "vertex" here corresponds to a multi-set $(P_{i_1}, \dots, P_{i_k})$ consisting of nucleotides whose indices belong to a component of the contact graph of $s_n$. In this sense "vertex" can be viewed as a certain segment of the sequence. $\lambda_i$ ($i$ being the index of a component) can be interpreted as the stability of the random structure with respect to a mutation that has (i) occurred in the $i$-th component and that has (ii) led to a compatible sequence. To summarize, the preimage of a random structure is obtained by selecting certain segments of sequences in $Q^n_\alpha$ at random. For this process the mathematical structure of randomly induced subgraphs of generalized $n$-cubes is of particular relevance. It has been shown […] that $\lambda^* = 1 - \sqrt[\alpha-1]{\alpha^{-1}}$ is a threshold value for density,

\[ \lim_{n \to \infty} \mu_n\{\Gamma_n \text{ is dense in } Q^n_\alpha\} = \begin{cases} 1 & \text{for } \lambda > \lambda^* \\ 0 & \text{for } \lambda < \lambda^* \end{cases} \]

and also a threshold value for the connectivity of random induced subgraphs of $Q^n_\alpha$,

\[ \lim_{n \to \infty} \mu_n\{\Gamma_n \text{ is connected}\} = \begin{cases} 1 & \text{for } \lambda > 1 - \sqrt[\alpha-1]{\alpha^{-1}} \\ 0 & \text{for } \lambda < 1 - \sqrt[\alpha-1]{\alpha^{-1}}. \end{cases} \]

Further results on random induced subgraphs can be found in […]. The above results suggest that the preimage of a random structure consists of large connected subgraphs in sequence space. Finally, in order to define a complete mapping into random structures by the above procedure we only have to iterate the above process. We obtain mappings $f\colon Q^n_\alpha \to \{s_n\}$ by constructing the corresponding preimages as random graphs as follows: we fix a mapping $r\colon \{s_n\} \to \mathbb{N}$ that ranks the structures (so that $j < i$ fixes the order of $r(s_j)$ and $r(s_i)$), set the preimage of the first structure in this order to its full neutral network $\Gamma_n$, and, for every subsequent structure $s_i$, set

\[ f_r^{-1}(s_i) = \Gamma_n[s_i] \setminus \bigcup_{j<i} \bigl( \Gamma_n[s_i] \cap \Gamma_n[s_j] \bigr). \]
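The iterative construction of the preimages can be sketched as follows. This is an illustrative reading, not the authors' code: the neutral networks are assumed to be given as plain Python sets of sequences, listed in rank order (index 0 first), and the function name `ranked_preimages` is hypothetical. Each structure keeps only those sequences of its network that no earlier structure in the ranking has already claimed.

```python
def ranked_preimages(neutral_sets):
    """Resolve overlaps between neutral networks by rank.

    neutral_sets: list of sets, ordered by rank (index 0 has priority).
    Returns a list of pairwise disjoint sets, one per structure.
    """
    assigned = set()   # union of all networks processed so far
    preimages = []
    for gamma in neutral_sets:
        # Remove every sequence that also lies in an earlier network.
        preimages.append(gamma - assigned)
        assigned |= gamma
    return preimages


if __name__ == "__main__":
    networks = [{"GC", "AU"}, {"AU", "GU"}, {"GU", "UA", "GC"}]
    for rank, pre in enumerate(ranked_preimages(networks)):
        print(rank, sorted(pre))
```

Subtracting the running union is equivalent to subtracting the pairwise intersections with all earlier networks, which is the form used in the text.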

3 Optimization on random structures

In this section we will study the time evolution of finite populations $\mathcal{V}$ of sequences which are replicated nucleotide by nucleotide with some independent error probability $p$. The replication event itself is a point process, more precisely a birth-death process in which sequences are chosen for replication with respect to their fitness, and randomly for deletion. Our main objective is to determine the time for a population to find the neutral network of a given target structure, the so-called hitting time, depending on the fraction of nucleotides involved in tertiary interactions.

Let $\varphi\colon Q^n_\alpha \to \{s_n\}$ be a fixed sequence to structure map, and let $f\colon \{s_n\} \to \mathbb{R}$ be a mapping assigning a fitness value to each structure. We obtain, taking the composition of our maps, a fitness landscape on the sequences:

\[ Q^n_\alpha \to \{s_n\} \to \mathbb{R}, \qquad V \mapsto \varphi(V) = s_n \mapsto f(s_n). \]

3.1 The replication-deletion process

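A population $\mathcal{V}$, of size $N$, is a (finite) multi-set of sequences $[V_i \mid i \le N]$, where $\{V_i \mid i \le N\} \subset Q^n_\alpha$ and $N < \infty$. The theory of point processes provides a powerful tool by identifying $[V_i \mid i \le N]$ with an integer valued measure $\nu$ on $Q^n_\alpha$:

\[ \mathcal{V} = [V_i \mid i \le N] \mapsto \nu = \sum_{i=1}^{N} g_{V_i}, \qquad \text{where } g_{V_i}(V) = \begin{cases} 1 & \text{for } V = V_i \\ 0 & \text{otherwise.} \end{cases} \]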

We call the set of sequences where $\nu$ is nonzero the support of $\nu$. Clearly, the restriction of $\nu$ to a subgraph $Y \subset Q^n_\alpha$ corresponds to considering subpopulations on the vertices of $Y$. Note that $\nu(f^{-1}(s))$ is the number of elements of $\mathcal{V}$ contained in $f^{-1}(s)$, the set of sequences mapped into $s$.

The time evolution of $\nu$ is then obtained by a mapping from $[V_i \mid i \le N]$ to the family $[V'_i \mid i \le N]$ as follows: we select an ordered pair $(V_l, V_k)$ where $V_l, V_k \in \{V_i \mid i \le N\}$. For this purpose let $\mathrm{res}_s\,\nu$ be the restriction of $\nu$ to all sequences that are mapped into $s$. Clearly the subpopulation that corresponds to $\mathrm{res}_s\,\nu$ consists of sequences all having fitness $f(s)$. Accordingly the average fitness of $\nu$ reads

\[ \bar{f}_\nu = \sum_s \nu(f^{-1}(s))\, f(s). \]


Now, the first coordinate $V_l$ of the above ordered pair is chosen with probability $f(s_{V_l})/\bar{f}_\nu$ among the elements of $\mathcal{V}$. The second coordinate of the above pair is selected with uniform probability on $[V_i \ne V_l \mid i \le N]$, i.e. $1/(N-1)$. We select those pairs of sequences at equidistant time steps, and for a population of size $N$ we refer to a generation as $N$ such time steps.

Next we map the first sequence, $V_l = (x_1, \dots, x_n)$, into the sequence $V' = (x_1', \dots, x_n')$. This is performed by assigning to each coordinate $x_i$ an $x_i' \neq x_i$ with probability $p$, where all $x_i' \neq x_i$ are equally distributed, and leaving the coordinate fixed otherwise. This random mapping $V_l \mapsto V'$ is called replication. Finally, we delete the second coordinate of the pair $(V_l, V_k)$, that is $V_k$, and have a mapping $(V_l, V_k) \mapsto (V_l, V')$. Thereby we obtain a new family by substituting $V_k$ by $V'$. This process is referred to as the replication-deletion process.
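One elementary step of the replication-deletion process can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the code used for the experiments: the function names, the representation of sequences as strings, and the use of Python's `random` module are choices made here, and `fitness` stands for the composed fitness landscape on sequences.

```python
import random


def replicate(seq, p, alphabet, rng):
    """Copy seq letter by letter; each letter is replaced with probability p
    by a different letter drawn uniformly from the rest of the alphabet."""
    copy = []
    for x in seq:
        if rng.random() < p:
            copy.append(rng.choice([a for a in alphabet if a != x]))
        else:
            copy.append(x)
    return "".join(copy)


def replication_deletion_step(pop, fitness, p, alphabet, rng):
    """Perform one elementary step in place and return the indices (l, k).

    l is drawn with probability proportional to fitness (at least one member
    must have positive fitness); k is drawn uniformly from the other N - 1
    members, so each is chosen with probability 1/(N - 1).
    """
    weights = [fitness(v) for v in pop]
    l = rng.choices(range(len(pop)), weights=weights, k=1)[0]
    k = rng.choice([i for i in range(len(pop)) if i != l])
    pop[k] = replicate(pop[l], p, alphabet, rng)  # the mutant copy replaces V_k
    return l, k
```

With this sketch, one generation corresponds to calling `replication_deletion_step` N times on a population of size N.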

��� Hitting times

Let us describe our basic experiment. First, for a particular value of $c_2$, we select two random structures $(s_n^{(c_2)}, s_n'^{(c_2)})$ whose contact graphs �X� �X�� X� ��X � �� are cycle-free and have underlying $c_1$ = ����. This biologically realistic value is taken from data on long RNA sequences, in which about ��% of the nucleotides are in Watson-Crick base pairs ���� ���. We have chosen $c_2$ = �� ���� ���� ����� ���� �����. The mapping $f: \{s_n\} \to \mathbb{R}$ is defined as follows. We assign to $s_n^{(c_2)}$ the fitness ��� and to $s_n'^{(c_2)}$ the fitness �����, independent of $c_2$; all the other structures have fitness ��.

Second, we construct the neutral networks with respect to $s_n^{(c_2)}$ and $s_n'^{(c_2)}$. For any sequence that is compatible with both structures, $s_n^{(c_2)}$ and $s_n'^{(c_2)}$, we first randomly select one of the ordered pairs $(s_n, s_n')$, $(s_n', s_n)$. Then we use successive experiments with an indicator random variable $X$, where �X � �� � � (biased coin flips), to determine into which structure a bi-compatible sequence is mapped. In case of $X$ = � in the first coin flip we map the sequence to the first coordinate of the selected pair and are done. In case of $X$ = � we flip again and, for $X$ = �, we map the sequence to the second coordinate; otherwise we assign a fitness value of � to the sequence. Accordingly, we obtain two neutral networks, � � �n�s�c��n ���� � �n�s�n �c���, which induce a random landscape with three fitness values

f��P�� � � � � Pn�� �

���

� for �P�� � � � � Pn� �� �� � ������ for �P�� � � � � Pn� � ����� otherwise
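The assignment of a bi-compatible sequence by successive biased coin flips can be sketched as follows. This is an illustrative Python sketch only: it assumes that a success of the indicator variable means "map to this structure", that the two orderings of the pair are equally likely, and that the bias is a free parameter, here called `lam` (a hypothetical name). The return value `None` stands for a sequence that is mapped to neither structure and therefore receives the lowest fitness value.

```python
import random


def assign_bicompatible(s_a, s_b, lam, rng):
    """Decide which of two structures a bi-compatible sequence folds into.

    s_a, s_b: the two structures the sequence is compatible with.
    lam:      assumed success probability of each biased coin flip.
    Returns s_a, s_b, or None (neither structure).
    """
    # Randomly select one of the ordered pairs (s_a, s_b), (s_b, s_a).
    first, second = (s_a, s_b) if rng.random() < 0.5 else (s_b, s_a)
    if rng.random() < lam:   # first flip succeeds: first coordinate
        return first
    if rng.random() < lam:   # second flip succeeds: second coordinate
        return second
    return None              # both flips fail: lowest fitness class


rng = random.Random(0)
example = assign_bicompatible("s", "s_prime", 0.5, rng)
```

Under these assumptions a bi-compatible sequence ends up in either structure with the same probability, and in neither with probability (1 - lam) squared.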

Third, we initialize the replication-deletion process, as described in section ���, by generating $N$ sequences at random to be on the neutral network of $s_n^{(c_2)}$. We fix the single digit error probability $p$ such that $pn$ = � and study the time evolution of this population. The experiments are run until the neutral network of the target structure is found, or a predetermined number of generations have elapsed (which is high enough that the target structure is found in the majority of runs).

Figure �: a) The mean transition times for populations moving from the intermediate to the high-fitness structure for various $c_2$ values; n = �� population size ���. Data show mean of ���� runs, which were terminated after ���� ��� generations. In all runs the target structure was found, except in ��% of the runs with $c_2$ = ����. Bars are standard errors. b) The distributions of hitting times for the same runs. [Axes: a) $c_2$, 0 to 0.25, against mean transition time (generations), 0 to 2500; b) hitting time (generations), 100 to 10000, against proportion of runs, 0.0001 to 1, for $c_2$ = 0, 0.1, 0.15, 0.25.]

The results of these replication-deletion experiments, shown in Figures � and �, clearly indicate the dramatic effect of increasing $c_2$ values on the ability of searching populations to find the target structure.
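The stopping rule of the basic experiment, running until the target network is found or a preset number of generations has elapsed, can be sketched as below. This is a hedged Python illustration, not the original code: `step` and `on_target` are hypothetical callables supplied by the caller, and the population is checked only once per generation, which is a simplification made here.

```python
def hitting_time(pop, on_target, step, max_generations):
    """Return the first generation at which some member of pop lies on the
    target neutral network, or None if the run is cut off.

    pop:             mutable list of sequences (size N stays fixed).
    on_target:       predicate, True if a sequence folds into the target.
    step:            performs one elementary replication-deletion step on pop.
    max_generations: cut-off after which the run counts as not completed.
    """
    if any(on_target(v) for v in pop):
        return 0
    for generation in range(1, max_generations + 1):
        for _ in range(len(pop)):      # one generation = N elementary steps
            step(pop)
        if any(on_target(v) for v in pop):
            return generation
    return None
```

Returning `None` for truncated runs keeps them distinguishable from completed ones, so that completion rates and means over runs can be reported separately.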

� Discussion

In this paper we have shown that even on the level of random structures there exists a significant relation between structure and dynamics. We have considered molecules constructed at random according to very simple rules, which allow for evolutionary optimization. In particular, we have shown a dramatic change in the optimization performance in relation to the number of chemical bonds in the molecule.

In some sense random structures are generalizations of biomolecular secondary structures, which are formally planar knot-free graphs together with rules associated with their bonds. Knot-freeness, here, requires that edges in the planar graph do not intersect, i.e. for any two edges $(i, k)$, $(r, l)$ we have either $i < r$ and $k > l$, or $r < i$ and $l > k$. Random structures, as secondary structures �������, induce neutral networks in sequence space, i.e. extended, mostly connected subgraphs consisting of all sequences that are mapped into the random structure. Those networks consist of a few components, whose size depends on the fraction of neutral point mutants (with respect to the structure). They are stable under point mutations and hence allow for neutral evolution ����

Figure �: For sequences of length ��� and for various $c_2$ values, the graph shows the mean transition times for populations of ���� sequences moving from the intermediate to the high-fitness structure. Runs were terminated after ��� ��� generations, the lowest completion rate being at $c_2$ = ����, where the target structure was found in only ����% of the runs. Means include data from non-complete runs. Bars are standard errors; figures above the columns are the total number of runs. [Axes: $c_2$, 0.05 to 0.25, against mean hitting time (generations × 10^-3), 0 to 200; column labels: 10055 25 21 23.]

However, random structures differ from secondary structures in two important regards. First, they may include tertiary interactions, and secondly, they need not satisfy such a knot-freeness condition. Although knot-freeness implies a number of inductive relations on secondary structures of n nucleotides �����, it complicates any probabilistic model of structures because of additional conditional probabilities. Such a probabilistic model allows the graph structure of the union of two contact graphs (obtained by superimposing the bonds of two random structures) to be determined, and this contains essential information on how close the corresponding neutral networks come in sequence space.

We have shown in the first section that the random graph results of �����, proven in the limit of infinite sequences, already apply for relatively short sequence lengths. Contact graphs of random structures and secondary structures are similar in that they decompose into components of small sizes. However, the properties of the union of two contact graphs of random structures depend heavily on the fraction of tertiary interactions. For a $c_1$ = ��� the critical fraction of tertiary interactions for the emergence of a giant component in the union graph of two random structures has $c_2$ = ���� as an upper bound (eq. ����). For $c_1$ = � a lower bound on the critical fraction would be ������. The existence of this dramatic change (Figure a), which is a phase transition in the limit of long sequences, does not depend on $c_1$ as long as $c_1$ � �. Known 3D-structures (for example t-RNA) have values of $c_2$ which are well below this critical threshold, with about ��% nucleotides involved in tertiary interactions.

Figure �: The mean hitting times for populations of �� sequences (n � �) under two different experimental regimes. The solid bars show runs as described in section ����, in which populations may find the target structure via alternative, low-fitness structures. The open bars show runs in which the searching population was restricted to the neutral network of the intermediate structure (see text). Bars are standard errors of the means of ��� replicate runs. [Axes: $c_2$, 0 to 0.25, against mean hitting time (10^3 generations), 0 to 250.]

We presented data in section �� which show how the changing structure of the union graph, with increasing $c_2$ values, influences the ability of a population of sequences to find a target structure. The data show a highly non-linear response in hitting times with increasing $c_2$. For example, for n = ��� a change from � to ��% of tertiary bonds increases the hitting time ��-fold, from ���� to ����� generations (Figure �). Because of computational limitations, experiments could only be performed for relatively short sequences. However, the emergence of the giant component in the union graph becomes more sudden for longer sequences, and we thus expect our results to become even more significant for higher values of n.


In this context it is interesting to consider an experiment in which searching populations were restricted to the network of the intermediate structure. More explicitly, sequences mutating into low-fitness structures were replaced by sequences generated at random to be compatible with the structure of intermediate fitness. Results of such an experiment for n = � are shown in Figure �. Here we observe significantly longer hitting times, but the non-linear relation with $c_2$ still holds. Relating these two experiments allows insight into the searching of structure space by evolving sequences. In general, neutral mutations are exactly the driving force for evolutionary search: they allow a population to explore sequence space by moving on the extended neutral network. The significant difference in the hitting times of the basic experiment and that shown in Figure � indicates that structure space is searched largely by the mutants that "fall off" the network of the intermediate structure. However, for low $c_2$ parameters, it is possible for a population to switch directly between neutral networks. In fact, such transitions were observed in those experiments where no element of the population was able to leave the network of the intermediate fitness structure. That is, sequences of intermediate fitness mutated directly into the sequences realizing the target structure. This extreme case is approached in our standard experiment by vastly increasing the fitnesses of the intermediate and target structures �������

In summary, we have verified the theoretical results of ���� regarding the properties of single contact graphs and their union for sequences of finite length. Our experimental results show how the fraction of nucleotides in tertiary interactions dramatically alters the structure of the union of two contact graphs, and the effect of this changing structure on the ability of populations of sequences to move from the neutral network of one structure to that of another. More generally, we show that a coarse-graining of molecular structure at the level of contact graphs is a useful approach, and enables the prediction of biologically important properties of biomolecules, such as the frequency of tertiary interactions.

Acknowledgments. We want to thank Christopher L. Barrett for stimulating discussions, and Rob Farber and Bill Reynolds for generous donation of CPU time. SF is funded by DARPA under grant ONR N��������������.

References

[1] Bollobás, B. Random Graphs. Academic Press, New York, ����.

[2] Fontana, W. Ein Computermodell der Evolutionären Optimierung. PhD thesis, University of Vienna, ����.

[3] Fontana, W., and P. Schuster. A computer model of evolutionary optimization. Biophysical Chemistry, ��� ������ �����

[4] Grüner, W., R. Giegerich, D. Strothmann, C.M. Reidys, J. Weber, I.L. Hofacker, P.F. Stadler, and P. Schuster. Analysis of RNA sequence structure maps by exhaustive enumeration I. Neutral networks. Monatshefte fuer Chemie, ���� ����� �����

[5] Grüner, W., R. Giegerich, D. Strothmann, C.M. Reidys, J. Weber, I.L. Hofacker, P.F. Stadler, and P. Schuster. Analysis of RNA sequence structure maps by exhaustive enumeration II. Structures of neutral networks and shape space covering. Monatshefte fuer Chemie, ���� ������ �����

[6] Hofacker, I.L., W. Fontana, P.F. Stadler, L.S. Bonhoeffer, M. Tacker, and P. Schuster. Vienna RNA Package. pub/RNA/ViennaRNA�����, ftp.itc.univie.ac.at. (Public Domain Software).

[7] Hofacker, I.L. A Statistical Characterisation of the Sequence to Structure Mapping in RNA. PhD thesis, University of Vienna, ����

[8] Hofacker, I.L., W. Fontana, P.F. Stadler, S. Bonhoeffer, M. Tacker, and P. Schuster. Fast folding and comparison of RNA secondary structures. Monatshefte f. Chemie, �� ���� �������� ����

[9] Kimura, M. The Neutral Theory of Molecular Evolution. Cambridge Univ. Press, Cambridge, UK, ����

[10] Kopp, S., C.M. Reidys, and P.K. Schuster. Exploration of artificial landscapes based on random graphs. In F. Schweitzer, editor, Self-Organization of Complex Structures: From Individual to Collective Dynamics. Gordon and Breach, London, UK, ����.

[11] Reidys, C.M. Neutral Networks of RNA Secondary Structures. PhD thesis, Maths faculty, Friedrich Schiller Universität, Jena, September ����.

[12] Reidys, C.M. Mapping in random-structures. SIAM Journal of Discrete Mathematics and Optimization, ����, submitted.

[13] Reidys, C.M. Random induced subgraphs of generalized n-cubes, part I. Advances in Applied Mathematics, ����, submitted.

[14] Reidys, C.M., P.F. Stadler, and P. Schuster. Generic properties of combinatory maps and neutral networks of RNA secondary structures. Bull. Math. Biol., ����, in press.

[15] Schuster, P., W. Fontana, P.F. Stadler, and I.L. Hofacker. From sequences to shapes and back: a case study in RNA secondary structures. Proc. Roy. Soc., �� �����������

[16] Tacker, M., P.F. Stadler, E.G. Bornberg-Bauer, I.L. Hofacker, and P. Schuster. Algorithm independence properties and RNA secondary structure predictions. European Biophysics Journal, submitted, ����.

[17] Waterman, M.S. Combinatorics of RNA hairpins and cloverleaves. Studies Appl. Math., �� ������ �����

[18] Zuker, M. mfold�����. pub/mfold.tar.Z, nrcbsa.bio.nrc.ca. (Public Domain Software).