Download - Probabilistic Code for DNA Recognition by Proteins of the ... · Panayiotis V. Benos 1, Alan S. Lapedes2 and Gary D. Stormo * 1Department of Genetics School of Medicine Washington

Transcript
  • Probabilistic Code for DNA Recognition by Proteins ofthe EGR Family

    Panayiotis V. Benos1, Alan S. Lapedes2 and Gary D. Stormo1*

    1Department of GeneticsSchool of MedicineWashington UniversityCampus Box 8232, St. LouisMO 63110, USA

    2Theoretical Division, LosAlamos National LaboratoriesLos Alamos, NM 87545, USA

    A recognition code for protein–DNA interactions would allow for the pre-diction of binding sites based on protein sequence, and the identificationof binding proteins for specific DNA targets. Crystallographic studies ofprotein–DNA complexes showed that a simple, deterministic recognitioncode does not exist. Here, we present a probabilistic recognition code(P-code) that assigns energies to all possible base-pair–amino acid inter-actions for the early growth response factor (EGR) family of zinc-fingertranscription factors. The specific energy values are determined by amaximum likelihood method using examples from in vitro randomisationexperiments (namely, SELEX and phage display) reported in theliterature. The accuracy of the model is tested in several ways, includingthe ability to predict in vivo binding sites of EGR proteins and other non-EGR zinc-finger proteins, and the correlation between predicted andmeasured binding affinities of various EGR proteins to several differentDNA sites. We also show that this model improves significantly upon theprediction capabilities of previous qualitative and quantitative models.The probabilistic code we develop uses information about the interactingpositions between the protein and DNA, but we show that such infor-mation is not necessary, although it reduces the number of parameters tobe determined. We also employ the assumption that the total bindingenergy is the sum of the energies of the individual contacts, but wedescribe how that assumption can be relaxed at the cost of additionalparameters.

    q 2002 Elsevier Science Ltd. All rights reserved

    Keywords: DNA–protein interactions; recognition code; DNA-bindingspecificity; zinc-finger proteins*Corresponding author

    Introduction

    Unravelling the rules that govern the recognitionof the target DNA sequences by transcriptionfactors is one of the great challenges in compu-tational biology today. The regulation of theexpression of any gene in a cell is initiated by thespecific binding of one or more transcriptionfactors in its promoter region. Revealing the mech-anisms of this process would constitute a key steptowards understanding a cell’s regulation and itsresponse to various factors. This can lead to tools

    for engineering gene regulation, with many poten-tial therapeutic applications.

    Research in this field was initiated by Seemanet al. over 25 years ago,1 at a time when the termcomputational biology sounded quite exotic (if notself-contradictory) to most people. The analysis ofthe structure of the amino and nucleic acidresidues led them to postulate that specificity canbe achieved through a network of hydrogen bondsthat can be formed between amino acid residuesand bases. They concluded that two or morebonds per base–amino acid pair are required forefficient discrimination.

    That study raised the hopes that a simple, deter-ministic model (or “recognition code”) might existin nature that will adequately explain the protein–DNA interactions.2 It took 12 more years ofresearch and a handful of protein–DNA co-crystalstructures to realise that, in the course of evolution,nature employs a variety of strategies to achieveprotein–DNA recognition. On the basis of the

    0022-2836/02/$ - see front matter q 2002 Elsevier Science Ltd. All rights reserved

    Present address: P.V. Benos, Department of HumanGenetics, Center for Computational Biology andBioinformatics and Cancer Institute, University ofPittsburgh, PA 15261, USA.

    E-mail addresses of the corresponding authors:[email protected]; [email protected]

    Abbreviations used: EGR, early growth response.

    doi:10.1016/S0022-2836(02)00917-8 available online at http://www.idealibrary.com onBw

    J. Mol. Biol. (2002) 323, 701–727

  • differences between the 3D structures of varioustranscription factors, Matthews claimed in 1988that there is “no code for (protein–DNA)recognition”,3 although he made it clear that hewas referring to a simple, deterministic code. Inthe years that followed this publication, theexistence of a recognition code became a highlydebatable issue. We know now that there are clearpreferences for base–amino acid contacts.4 – 9 Thus,although the concept of a universal, deterministicrecognition code has largely been abandoned, atwo-way “probabilistic code” (P-code) constitutesa promising approach to this problem.

    There have been a number of attempts tomathematically model protein–DNA interactions,which have met with variable success.7,10 – 14 All ofthese methods are based on empirical observationsor semi-arbitrarily determined “scores” to evaluatethe DNA specificity of proteins. The currentlyavailable models can be classified into two maincategories. We call a model qualitative if it usesbinary values (e.g. 1 or 0) to describe the inter-actions between bases and amino acid residues.The qualitative models usually consist of a look-up table that associates amino acid residues incertain positions in the protein (“contacting”amino acid positions) with particular bases in cer-tain positions in the DNA target.15,16 By contrast, aquantitative model provides a measure for thebinding affinity for any given protein–DNA pair.The quantitative models usually consist of aweight matrix that assigns a score to the base–amino acid contacts. The score can be position-dependent (as used by Suzuki et al.10) or position-independent (as used by Mandel-Gutfreund &Margalit7,13 and Kono & Sarai12). A qualitativemodel can be viewed as a degenerate quantitativemodel, where each of the weights has a value setto 1 (“permissible”) or 0 (“non-permissible”). Aquantitative model can be transformed intoqualitative by setting a score threshold that willseparate the permissible from the non-permissiblecontacts.

    One feature common to all existing models isthat they consider the individual contacts to beindependent from each other and hence have anadditive contribution to the total binding energy/affinity score. It is known that the additivityassumption is not altogether correct. In fact, theviolation of this assumption in certain situationsconstitutes one of the main arguments against arecognition code. But, for prediction purposes, theadditivity assumption needs to hold only for high-affinity protein–DNA pairs.17 In a few caseswhere it has been examined closely, additivitybetween positions in the binding sites does nothold exactly,18,19 but the data in these studies showthat it provides a good approximation to the truebinding energies.20

    Here, we present a P-code for modelling theprotein–DNA interactions, which can be dividedinto two parts. First, we present a theoreticalframework of how the protein–DNA specificity

    can be described probabilistically. On the basis ofthis framework, which derives from the statisticalmechanics theory, we develop an algorithm thatcan estimate the base–amino acid energetic poten-tials from in vitro randomisation experimentaldata (SELEX and/or phage display, see below).We implemented this algorithm into a programcalled Statistical Algorithm for Modelling Inter-action Energies (SAMIE). The description of thealgorithm can be found in Materials and Methods.We use the early growth response (EGR) proteinfamily as a test case for our algorithm. SAMIE istrained on published data from in vitro selectionexperiments and specifies the position-specificenergetic potentials of the base–amino acid con-tacts. The program determines the values of theparameters that maximise the probability ofobtaining the observed data. For the training, weuse knowledge of structural details of the pro-tein–DNA interface. However, we describe how amodel could be developed without using suchknowledge at the expense of additional parametersthat needed to be determined. Also, for the train-ing, we invoke the assumption of additivity, butwe describe how more complex models withoutthat assumption can be used, again at the expenseof additional parameters, which would requiremore data to determine.

    Secondly, we evaluate the accuracy of thederived model in several ways. In order to choosethe model that best describes the protein–DNAinteractions of the EGR family, we do several train-ings using different partitions of the dataset(SELEX data only, phage display data only orboth) and employing different contacting schemes.The resulting models are compared in terms of pre-diction efficiency on their own training sets inorder to select the one that will be used for sub-sequent analysis/evaluation of the algorithm. Theevaluation is performed in various ways, includinga comparison of predicted relative binding con-stants to those measured for several proteins. Wecompare it to previous models, both qualitativeand quantitative, and demonstrate its improvedaccuracy.

    The EGR protein family

    There are a number of protein families that haveprovided the basis for previous studies of protein–DNA interactions. Among them, the EGR factorfamily is probably the most extensively studiedand therefore we use it here as a test case for ouralgorithm, SAMIE.

    Members of this family were initially identifiedin mammals;21 – 23 recently they were discovered ina variety of other species, including Xenopus laevis24

    and zebrafish.25 The EGR proteins contain threezinc-finger regions, a domain common to manyeukaryotic transcription factors. They belong tothe Cys2His2 subfamily of zinc-fingers, whichderives its name from the residues that are usedfor the coordination of the zinc ion. Cys2His2 is a

    702 Probabilistic Protein-DNA “Recognition Code”

  • eukaryotic motif very common to manytranscription factors, 2719 members in Pfam v7.2.26

    The structure of the three zinc-fingers of themouse protein Zif268 (or EGR1) bound to its con-sensus DNA sequence was initially solvedcrystallographically at 2.1 Å resolution27 and sub-sequently refined to 1.6 Å resolution.28 The crystalstructure showed that each of the three fingerscontacts its DNA target in a modular, antiparallelfashion. Initially, it was believed that each fingerrecognises three bases, but the refinement of thestructure showed that a fourth base is contacted aswell. In each finger there are four amino acidpositions involved in contacts with four DNA pos-itions. For some DNA positions, only one base iscontacted, whereas in others both the forward andthe complementary base are contacted by aminoacid residues in adjacent fingers (overlapping basepositions). In particular, the amino acid residues atpositions 21, þ3 and þ6 in each finger (withrespect to the beginning of the a-helix) contact thebases at positions 30, middle and 50, respectively;27the amino acid residue at position þ2 of the helixcontacts the complementary strand on the fourthposition (overlapping base).28 This contactingscheme is illustrated in Figure 1. We note thatsome of the EGR variants have been shown todeviate from this pattern of contacts.29

    Data from selection experiments

    Protein–DNA interaction data can generally bedivided into three categories. One refers to pairsof sequences (protein and DNA) that areknown to bind to each other with some affinity.Typically, these examples are the result ofexperiments that aimed to measure directly thebinding affinity of the protein (or its mutants) toparticular DNA targets. This category includes thewild-type protein sequences bound to their in vivotargets.

    The remaining two categories refer to dataderived from in vitro selection experiments, namelySELEX and phage display. In a SELEX experiment,a protein of known sequence is used to selectDNA targets from a pool of randomisedoligonucleotides.30 This procedure usually yields

    more than one DNA target. Although one wouldassume that all selected targets exhibit high affinitytowards the protein, the highest affinity targetmight not be selected at all (for purely stochasticreasons).

    In a phage display experiment, the reverserandomisation/selection procedure applies.31 Arecombinant DNA library is constructed thatconsists of variants of the cDNA sequence of aknown DNA-binding protein. The nucleotides thatcode for certain amino acid positions arerandomised (usually, these are the positions thatare known to contact the DNA or they areassumed to do so). Upon expression, the poly-peptides are displayed on the outer coat of thephages. Thus, proteins can be selected that bindwith high affinity to a certain (fixed) DNA target.As in the case of SELEX, multiple proteins areusually selected for any given DNA target and theprotein that exhibits the highest affinity towardsthe particular DNA sequence might not be amongthose selected.

    Theory

    The P-code model for protein–DNA recognition

    In this section, we present the theoretical frame-work of the protein–DNA recognition uponwhich our method is based.

    Thermodynamic aspects of the protein–DNA recognition

    The recognition of specific DNA sequences bythe corresponding transcription factors can be acomplicated, multi-step process.32 Nonetheless, ifwe assume that a protein comes into contact withvarious DNA sequences via diffusion, then inequilibrium we would expect that the time that itinteracts with each of them will be inverselyproportional to their dissociation constants KD.

    Consider a DNA-binding protein, A, that has Ntotpossible targets. For example, if the DNA target forthis protein is L bases long, then there will beNtot ¼ 4L possible targets. Each of these targets willhave a relative frequency Pn(kN ) among all possible

    Figure 1. Representation of thebinding of the EGR protein to itsDNA target. According to crystallo-graphic studies, each of the threezinc-finger domains of the EGRprotein contacts four bases in anantiparallel fashion. There is onebase overlap in the target sequencebetween any two adjacent fingers.The numbering of the amino acidresidues is with respect to thebeginning of the a-helix. Amino

    acid residues 21, þ3 and þ6 contact bases at positions 3, 2 and 1, respectively, whereas amino acid residue þ2contacts the complementary base at position 4 (overlapping base).

    Probabilistic Protein-DNA “Recognition Code” 703

  • binding sites, for example in the genome. Assum-ing equilibrium, the conditional probability thatthis protein will be bound to a particular DNAtarget, kN, is given by the following equation:

    PðkNlAÞ ¼ PnðkNÞe2HðkN;AÞ=X

    k0Pnðk0NÞe2Hðk0N;AÞ

    !

    ð1Þwhere H(kN,A) is the binding energy of the inter-action and the sum in the denominator is thepartition function over all possible DNA targetsequences, k0N (k

    0 ¼ 1,…,Ntot). This formula derivesfrom the Boltzmann distribution of the statisticalmechanics theory33 and it relates the energy valueswith the binding probabilities.

    As evident from equation (1), the DNAspecificity of a given protein is determined by therelative energy, and the absolute energy values arenot important. (This assumes that the protein isalways bound to DNA, which is essentially true invivo unless it is prohibited from binding by somemechanism, such as binding to an inhibitoryprotein.) One can subtract any constant from all ofthe energy values and the probability of bindingwill be unchanged. This formula follows the con-vention that lower energy values correspond tostronger binding (or longer dissociation times).Sometimes, the value of zero is assigned to thelowest-energy state and all other states have posi-tive values. In this study, we allow the energyvalues to take both positive and negative values;more negative values correspond to stronger bind-ing (higher affinity). We arbitrarily assign anenergy of zero to the average affinity, so that thosesequences that bind better than average havenegative energy and those that bind worse arepositive.

    Methods for modelling the protein–DNA interactions

    In theory, one could determine the bindingenergy for a protein of interest to all possible targetsequences. In this case, equation (1) would be anaccurate description of the distribution of theprotein on the various sequences at equilibrium.Note, however, that this would not constitute amodel for the protein–DNA interaction, butsimply a look-up table that stores the measuredvalues. And of course, for a complete represen-tation of a given protein family, one should deter-mine such a table for all possible proteinsequences (by which we mean all possible variantsat the residues involved in the sequence-specificDNA binding). For example, a typical DNA-binding protein has binding sites of about tenbase-pairs. There are over 106 possible sequencesfor a 10 bp site. If the protein uses ten amino acidresidues to recognise its DNA target, then thenumber of all possible proteins (with respect tothese positions) is 2010, which is greater than 1013.

    A complete look-up table for this protein familywould contain the affinity values of each of theseproteins to each of these DNA targets (.1019

    values). Having a mathematical model for theinteraction, some kind of code, allows one topredict the interaction energy for all possible com-binations based on measurements for only afraction of them.

    In the simplest model, only certain combinationsof amino acid residues and base-pairs have favour-able affinity and those combinations are usedexclusively for the protein–DNA contacts. Thiswas the type of code sought by Seeman et al.,1 butonly a few crystal structures of protein–DNA com-plexes were sufficient to determine that such asimple code does not exist.3 Recent qualitativemodels for the EGR family are similar, but theyallow for multiple amino acid residues to interactfavourably with each base-pair and multiple base-pairs to interact favourably with each aminoacid.14 – 16,31,34 Thus they are simple, degeneratecodes in which the amino acid and base-pair com-binations are categorised as either permissible ornot. Such codes have been useful in modelling cer-tain protein–DNA interactions, and even indesigning proteins with high affinity to specificsequences. However, by design they cannot predictthe affinities of different sequence combinationsand, as we show below, about 40% of the inter-actions obtained in SELEX and phage displayexperiments are not accounted for by thesemodels.

    The next more complicated models are quanti-tative, with a score associated with each base-pair–amino acid combination, and additive. Thatis, the base-pair–amino acid contacts are con-sidered to be energetically independent and there-fore, the total energy of the interaction is merelythe sum of energies over all contacts. The numberof parameters in the additive models is pro-portional to the number of contacts, not thenumber of sequence combinations. Each contacthas (at most) 80 parameters, for all possible combi-nations of base-pairs with amino acid residues.Using the example above, the number of par-ameters one needs to estimate is 800 for the 10 bpDNA target (assuming that each of the ten aminoacid residues contacts one base only). This is a sig-nificantly smaller number than the list of .1019

    possible combinations. However, the effortrequired to determine 800 energy values experi-mentally is still large.

    Representation of the model

    Figure 2 illustrates the model for protein–DNAinteraction using a single zinc-finger as anexample. In Figure 2(a), the four positions of thebinding site are listed along the left side of theTable, with each of the possible bases for eachposition shown. The “positions” in the protein arelisted across the top of the Table. These are onlythe protein residues that are directly involved in

    704 Probabilistic Protein-DNA “Recognition Code”

  • contacting the bases in the binding sites, as deter-mined from crystallography, and are positions 21,þ2, þ3 and þ6 relative the a-helix of thezinc-finger. Each position can be occupied by anyof the 20 amino acid residues, but only a few ofthem are listed to save space. The elements of theTable are energies of interactions between specificamino acid residues and base-pairs, and it allowsthe calculation of interaction energy for any proteinsequence with any DNA sequence, using the addi-tivity approximation. For a specific protein bindingto a specific DNA, the interaction energy is simplythe sum of the contacts for those particularsequences. For example, if the protein sequence isTVWY (at positions 21, þ2, þ3 and þ6 in thezinc-finger, respectively), then only the columnscorresponding to those amino acid residues, indi-cated in the Figure, are relevant to the energycalculation. Likewise, if the binding site containsthe sequence ACAG, then only the rows corre-sponding to that sequence are relevant to the bind-ing energy. So the total binding energy might bethe sum of the 16 values where those rows andcolumns intersect. However, for this protein weknow that the amino acid residue at position 21interacts only with the 30 DNA position, theresidue at position þ3 interacts with the middleposition, etc., so that only the intersectionsshown with filled circles contribute to thebinding energy. If we did not know whichamino acid residues interacted with which DNApositions, we would have to include all of the

    possible interactions in calculating the energy (allof the circles), but presumably with enough train-ing data the interacting positions would becomeclear as the only significant contributions tospecificity.

    The interaction energies are denoted by Tijab,

    where i and j are the positions in the amino acidand nucleotide sequences, respectively; a and bare the specific amino acid residues and bases thatoccur at those positions; and C is the “connectivitymatrix” that indicates which positions in the pro-tein contact which positions in the DNA (Figure2(b)). Matrix element Cij is “1” for those contactingamino acid and base positions and “0” elsewhere.As such, only the positions in contact contribute tothe final energy. Together the T and C matricesallow the calculation of binding energy for anyzinc-finger protein with any four-long DNAsequence. To get the specific interaction between aparticular protein and a particular DNA, one justsums the elements of T that correspond to thoseinteractions, as shown in the Figure. In theequation (Figure 2(c)), N is a unary vector that con-tains 1 for the base that occurs at each position, and0 elsewhere. Likewise, A is a unary vector for theamino acid sequence. It contains a 1 for the aminoacid at each position and 0 elsewhere. Multiplyingthem with the T matrix serves to select out onlythose elements that correspond to the interactingpositions, as shown in the Figure, the sum ofwhich constitute the predicted binding energy ofthat interacting pair. Additional details of the

    Figure 2. A representation of theenergy calculation. SAMIE exploitsin vitro randomisation experimentsand calculates the weight matrix Tthat maximises the likelihood ofthe training set according toequation (1) or (3). (a) A graphicalrepresentation of such a weightmatrix for one zinc-finger of theEGR protein family. Each finger ofthe EGR proteins uses four aminoacid residue positions (numbered21, þ2, þ3 and þ6 from the begin-ning of the a-helix) to recognise a4 bp DNA target. Each small sub-matrix of matrix T consists of 80values, which correspond to singlebase–amino acid energetic poten-tials (20 amino acid residues £ 4bases). For calculating the predictedenergy of the interaction of a givenprotein sequence of this family (e.g.TVWY) to a given DNA sequence

    (e.g. ACAG), we encode the two sequences in two unary vectors, A and N, as we describe in Materials and Methods.When multiplied with matrix T, the two unary vectors, A and N, select the appropriate columns and rows, respectively,that define the base–amino acid energetic potentials of the interaction. Under the additivity rule, the total energy ofthese individual predicted energetic potentials is summed (circles). (b) Additional structural details of the interactioncan be imposed by terms of the connectivity matrix, C. In this matrix, the positions that are known to interact havethe value of 1 (shaded boxes) and all the others are set to 0. Multiplication of this matrix with the rest depicts onlythe relevant energetic potentials for the final sum (filled circles). The mathematical representation of this procedure(see equation (2)) is shown in (c).

    Probabilistic Protein-DNA “Recognition Code” 705

  • encoding are provided in Materials and Methods.Note that for a specific protein it is possible todetermine the “weight matrix”35 that predicts thebinding energy to all possible sites for that proteinby just combining the rows that correspond to thatprotein sequence, as shown in Figure 9 in Materialsand Methods. That allows one to easily calculatethe probability of the protein binding to anyparticular sequence. The same thing was done togenerate a weight matrix that described the energyof any particular DNA sequence to all possibleprotein sequences (not shown), which is used inthe calculation of probabilities for the phagedisplay data.

    Using data from SELEX randomisationexperiments

    Instead of measuring the relative energy of theinteraction of a systematic set of protein and DNAcombinations designed to obtain all of thedesired parameters, one could perform SELEX ran-domisation experiments with many (but not all) ofthe proteins. Each of the proteins is allowed toselect its preferred target(s) from a pool ofrandomised oligonucleotides. Of course, due tothe stochastic nature of the selection process, thetarget with the lowest binding energy (i.e. highestaffinity) might not be among those that are finallyselected. Nevertheless, we would expect that thetargets that will be selected would bind to theprotein with high affinity. Moreover, we canassume that the stronger that a base–amino acidcontact is, the higher the probability that it will beselected.

    The probability that a particular nucleotidetarget kN would be selected by the protein A willbe given by the same formula (equation (1)). Thistime, Pn(kN ) is the relative frequency of the selectedtarget kN in the oligonucleotide pool; the denomi-nator is, again, the partition function, now calcu-lated over all DNA target sequences present in theoligonucleotide pool. H(kN, A ) is the total bindingenergy potential of the DNA target to the proteinA, which assuming additivity over all contacts,can be calculated from the formula:

    HðkN;AÞ ¼Xijab

    CijAai T

    abij N

    bj ð2Þ

    where the variables are as described above (seealso Figure 2).

    Using data from phage displayrandomisation experiments

    In the previous section, we focused on theproblem of “DNA recognition” by a particularprotein. Similar arguments and formulae can bestated and written for the reverse problem: i.e.“protein recognition” by a given DNA target.Phage display randomisation experiments can beperformed, in which a fixed DNA target will be

    used to select some protein sequences that bind toit with high affinity, from a pool of randomisedproteins. In this case, the probability that a given(“fixed”) DNA sequence will select a particularprotein is given by:

    PðkAlNÞ ¼ PaðkAÞe2HðN;kAÞ=X

    k0Paðk0AÞe2HðN;k0AÞ

    !

    ð3Þ

    where now the sum in the partition function is cal-culated over all possible protein variants. Therandomised amino acid positions are usuallylimited to those assumed to make direct contactswith the bases.

    SAMIE: maximising the probability of the observedinteraction data

    Using the statistical mechanics theory justdescribed, the problem of modelling the protein–DNA interactions for a given protein family con-sists of estimating a number of parameters that cor-respond to position-specific energetic potentials ofsingle contacts. By assuming that the interactionsare additive, the problem is simplified significantly,because each contact will require 20 £ 4 ¼ 80 par-ameters to be modelled. For example, the model-ling of a single zinc-finger of the EGR familybound to a tetranucleotide target would consist ofspecifying the values of a weight matrix like thatpresented in Figure 3 (framed submatrix). If wedid not know the exact pattern of contacts and wehad to allow for any combination between thefour bases and the four amino acid residues, thenwe would have to use the “all-to-all” model andthe number of parameters would be16 £ 4 £ 20 ¼ 1280 (i.e. all the 16 submatriceswithin the framed area in Figure 3). Restricting themodel solely to the contacts known from the crys-tal structure28 reduces this number to 320 (i.e.“one-to-one” model, shown as shaded areas in theframed submatrix in Figure 3). The “many-to-one”model includes two additional contacts fromamino acid residues in the adjacent fingers to theoverlapping bases (i.e. the external shaded areasin Figure 3), but the number of parameters remains320. This is because the two additional contacts areidentical with those included in the “one-to-one”model (dotted arrows, Figure 3), so they can belinked during training.

    The algorithm we developed, SAMIE, estimatesthese parameters from SELEX and/or phage dis-play data. Essentially, SAMIE calculates the matrixT that maximises the probability (or log-probability) of observing the data. To this extent, itcan be viewed as a maximum likelihood estimatorof the interaction energy parameters. A detaileddescription of the algorithm is provided inMaterials and Methods.

    706 Probabilistic Protein-DNA “Recognition Code”

  • Results

    Training datasets and models calculatedby SAMIE

    In a preliminary report, we used solely SELEXdata from the EGR protein family to trainSAMIE.36 The training vectors in that set were con-structed according to the “one-to-one” model ofinteractions, consisting of the four “contacting”amino acid residues and their corresponding(four) target bases (see Figure 3).27,28

    In the present study, we use an expanded datasetof the same protein family. All data are collectedfrom the literature and stored in a database.36

    They are organised into six datasets that differ inthe type of data they contain (SELEX, phage dis-play or both) as well as the model of interactionsthey are built upon one-to-one or many-to-one(see Figure 3). Both models consider only the con-tacts that have been observed in the co-crystalstructure (Figure 3, shaded submatrices). Themany-to-one model consists of a superset of theone-to-one. For each finger modelled (HR), it con-tains two additional contacts: the amino acidresidue at position þ2 of the following finger(HRþ1) that contacts base 1 of the DNA target andamino acid residue þ6 of the preceding finger(HR21) that contacts base 4 (overlapping base). Thetwo additional contacts are linked to their “identi-cal” ones (see Figure 3, dotted lines) duringtraining.

    The purpose of the training on multiple sets is todetermine how different are the models derivedfrom SELEX data only or phage display data onlywhen compared to the models derived from thecombined datasets. Also, we would like to seewhether there is any difference in the models

    resulting from training on the one-to-one or many-to-many modes of interactions.

    The three datasets that are made according to theone-to-one model are named SELEX_4, PHAGE_4,and COMBINED_4, whereas those that are madeaccording to the many-to-one model are namedSELEX_6, PHAGE_6 and COMBINED_6. Thedetailed description of the construction of the data-sets is presented in Materials and Methods. SAMIEwas trained on each of these six datasets, resultingin six different weight matrices (or models) withthe base–amino acid energetic potentials for theEGR family. We call these matrices SAMIE_S4,SAMIE_P4, SAMIE_C4, SAMIE_S6, SAMIE_P6,and SAMIE_C6. The letters S, P and C are indica-tive of the type of data in the training set (i.e.SELEX, phage display and composite datasets,respectively) and the numbers of the model ofinteraction that was used; 4 for the one-to-one; 6for the many-to-one.

    In the following, we use the evaluation measuressuccess rate and specificity index that we define inMaterials and Methods and some others whereappropriate, in order to evaluate the ability ofSAMIE to describe protein–DNA interactions infive ways. (a) First, by predicting the data of itsown training set, we show that there is no internalinconsistency in the model. (b) Second, by predict-ing known in vivo EGR binding sites, SAMIE istested on its potentials as a “genomic scanner” forthe EGR proteins. (c) Third, predicting known invivo binding sites of other transcription factorsindicates the extent to which SAMIE constitutes agood representation for other Cys2His2 zinc-fingerproteins. (d) Fourth, the correlation betweenpredicted energy values with measured affinitiestests how well the frequency-derived weightsof SAMIE correspond to real energy potentials.It shows to what extent potential “docking

    Figure 3. Models of interaction.This weight matrix represents theparameters that SAMIE needs toestimate for the modelling of asingle finger target site (EGR pro-tein family). Each small sub-matrixconsists of 4 £ 20 energy valuesthat relate the amino acid residuesof a particular contacting positionin the protein with the bases in aDNA target position. The framedarea constitutes the full model ofthe four amino acid residues of asingle finger contacting four basesin the DNA target. The shaded sub-matrices correspond to the contactsobserved in the co-crystal structureof the EGR protein.27,28 The four

    shaded submatrices in the framed area (320 parameters in total) constitute the one-to-one model of interactions thatwe refer to in the text. The many-to-one model of interactions includes the two external shaded submatrices. Thesesubmatrices (i.e. contacts from the adjacent fingers) contain 80 parameters each (20 amino acid residues £ 4 bases).However, due to the fact that the additional contacts are (stereochemically) identical with others included in themain matrix (connected by broken-line arrows in the Figure), the number of parameters to be estimated remains4 £ 4 £ 20 ¼ 320 in this model too.

    Probabilistic Protein-DNA “Recognition Code” 707

  • rearrangements” might affect SAMIE’s predictions.(e) Finally, we compare SAMIE with the otherexisting qualitative and quantitative models toindicate the strengths and weaknesses of each one.

    Evaluation on the training sets

    There are three reasons that we are interested inevaluating SAMIE on its training sets (self-test).First, we would like to check whether thealgorithm is self-consistent. An algorithm that isself-consistent should be able to predict quite accu-rately, at least its own training set. The secondreason is that we would like to measure howmuch the various training sets and modes of inter-actions are affecting the predictive power ofSAMIE. Finally, we would like to select the modelthat performs best for further analysis (i.e. predic-tion of in vivo binding sites, prediction of bindingenergies, etc.).

    We use the parameters of each of these sixweight matrices to evaluate SAMIE on the corre-sponding SELEX and phage display trainingvectors. For each of the fixed sequences of thevectors in the evaluation set, all possible random-ised sequences were ranked according to theirprobabilities of selection, as they are calculated bySAMIE (equations (1) and (3); see also Figure 2),using the corresponding weight matrix. For theSELEX-derived vectors, an equal reference prob-ability of 0.25 was assigned for all bases; whereasfor the phage display vectors, the particular experi-ment-specific randomisation scheme was adopted.This is important, because some phage displayrandomisation schemes, such as VNN and VNS†,exclude certain amino acid residues from the selec-tion procedure. Others, like NNK and NNS, simplyalter the amino acid frequencies.

    The results of the self-tests are presented in Table1 and they show that all SAMIE’s models are self-consistent (SR0.1 values of 0.854–0.929). In general,training according to the one-to-one model of inter-actions seems to give slightly better predictions onphage display data, whereas the many-to-onemodel of interactions predicts the SELEX databetter. The training on the combined data setsresulted in predictions of the SELEX and thephage display datasets almost as accurate as thetraining on these sets individually. On average,the model of interactions we used for the trainingon combined datasets did not affect much the pre-diction capabilities of SAMIE (SR0.1 values of 0.907and 0.902 for the SAMIE_C4 and SAMIE_C6,respectively).

    The results show that there is no internal consist-ency in the method. Since the two SAMIE modelsthat are trained on composite sets (SAMIE_C4 andSAMIE_C6) perform about the same overall, we

    chose to use SAMIE_C6 for the subsequentanalysis, because we feel it is the more complete.The weight matrices of the four principal contactsof this model are presented in Table 2.

    Evaluation on known in vivo binding sites

    The database used to train SAMIE containsresults from in vitro selection experiments involv-ing the proteins of the EGR family, and it containsEGR protein–DNA binding pairs known to occurnaturally in vivo. The latter were not used in train-ing SAMIE. It is therefore of interest to see howwell SAMIE evaluates the known, in vivo bindingsites for the EGR family. Furthermore, we reportresults on how well SAMIE ranks known, in vivobinding sites for three yeast proteins that are nothomologues of EGR (i.e. MIG1, MIG2 and ADR1),but that contain two Cys2His2 domains each. Simi-lar analysis is performed with the Drosophila Cys2-His2 protein Tramtrack.

    For these tests, we use SAMIE_C6, which istrained on the combined dataset according to themany-to-one model of interactions. All rankingsare based on the probabilities as they are calculatedby SAMIE (equation (1)). For the prediction of thenatural binding sites of the yeast proteins, weused the GC content of the organism to define theprior probabilities, Pn, in equation (1). This reflectsa “protein’s view” of the genome, where thedifferent binding sites “compete” for that protein.Similar results were obtained when an equal prob-ability of 0.25 was used for all bases.

    Evaluation on known in vivo EGR-binding sites

    Apart from the SELEX and phage displayexamples, our database contains 24 in vivo DNA

    Table 1. Evaluation of SAMIE trained on various data-sets and two models of interactions: evaluation ofSAMIE trained according to the (A) one-to-one model ofinteractions, and (B) many-to-one model of interactions

    SAMIE model

    Evaluationset

    No. ofexamples SAMIE_S4 SAMIE_P4 SAMIE_C4

    A.SELEX 4 96 83 (0.865) n/a 82 (0.854)PHAGE 4 311 n/a 288 (0.926) 287 (0.923)Total 407 n/a n/a 369 (0.907)

    B.SAMIE_S6 SAMIE_P6 SAMIE_C6

    SELEX 6 99 92 (0.929) n/a 89 (0.899)PHAGE 6 279 n/a 256 (0.918) 252 (0.903)Total 378 n/a n/a 341 (0.902)

    The names of the evaluation sets are indicative of the type ofdata they contain (i.e. SELEX data only, phage display dataonly or both sets combined). The number of examples containedin each evaluation set is shown in the second column. The num-ber of the examples that the various SAMIE models ranked inthe top 10% of all the predictions is reported, together with theSR0.1 (number in parentheses).

    † Nucleotide representation according to IUPAC: V: Gor C or A; K: G or T; S, G or C; N: A or C, or G or T; thetriplet refers to the codon.

    708 Probabilistic Protein-DNA “Recognition Code”

  • Table 2. The energy matrix as it was calculated by SAMIE when trained on the COMBINED_6 dataset

    Finger position ¼ 21; base position ¼ 3 Finger position ¼ þ2; base position ¼ 4 Finger position ¼ þ3; base position ¼ 2 Finger position ¼ þ6; base position ¼ 1A C G T A C G T A C G T A C G T

    A 1.29 0.92 1.72 6.42 0.25 20.87 1.32 0.58 0.54 2.12 2.27 21.14 20.33 1.38 20.13 0.72C 5.21 4.71 6.29 4.51 3.86 20.93 6.92 4.94 4.16 1.89 6.27 20.08 4.68 5.07 5.11 5.39D 1.93 22.35 0.33 20.68 2.27 0.42 20.44 20.29 5.82 21.32 7.20 6.32 1.75 0.64 1.17 20.21E 0.34 21.06 20.05 20.98 6.34 21.72 1.88 0.29 1.30 0.43 1.98 1.16 0.09 1.68 1.26 20.08F 5.21 4.71 1.33 4.51 3.86 5.55 1.36 4.94 4.16 6.92 6.27 5.13 4.68 0.42 5.11 5.39G 2.45 0.44 1.07 0.47 0.23 21.58 0.39 20.55 0.27 1.43 2.27 20.51 2.62 2.62 1.61 1.54H 1.41 20.10 1.42 20.85 20.38 20.99 0.20 21.78 0.28 2.64 21.31 6.72 1.73 6.10 20.08 0.33I 1.55 1.58 1.16 1.77 1.05 5.55 8.45 5.90 5.68 2.58 7.23 1.23 1.81 1.77 5.78 6.46K 0.33 20.37 21.04 21.14 6.59 20.94 2.70 6.47 5.54 7.47 0.77 1.04 0.87 6.99 21.01 21.38L 1.18 0.19 7.89 20.46 7.05 6.63 3.50 1.14 6.93 1.54 8.16 0.89 1.09 2.44 0.79 7.94M 6.52 4.99 6.55 21.55 6.33 5.55 2.41 0.28 0.26 7.32 2.06 20.29 6.91 5.71 5.48 6.27N 0.04 0.73 0.06 21.33 1.14 21.72 0.75 1.21 23.73 0.86 20.005 20.80 20.81 0.48 21.00 1.75P 7.33 5.89 7.36 0.75 7.01 0.17 2.27 6.58 6.20 8.16 2.85 1.73 7.70 2.18 0.21 0.74Q 22.50 6.32 0.72 20.52 0.20 21.32 1.78 0.30 20.26 1.34 0.77 0.35 20.92 6.10 20.08 0.33R 1.01 3.26 21.36 0.46 1.66 21.05 2.11 7.37 6.59 3.54 2.56 2.14 0.38 0.55 23.33 1.29S 1.06 0.73 0.61 20.57 2.29 21.45 0.28 20.91 6.06 1.15 1.69 21.65 20.21 0.03 20.40 20.39T 0.27 1.72 0.59 21.33 20.18 21.52 1.29 1.23 7.50 20.04 3.26 0.26 21.19 0.20 20.63 20.46V 7.59 6.70 1.57 0.92 7.01 0.17 1.76 0.57 8.01 0.79 8.52 0.80 1.13 1.98 0.46 0.78W 2.21 4.71 6.29 4.51 3.86 20.52 0.67 20.71 4.16 6.92 6.27 20.08 4.68 5.07 5.11 20.27Y 5.21 4.71 20.05 4.51 4.95 21.25 1.43 5.31 4.16 6.92 1.27 5.13 21.37 5.85 5.23 5.61

    The COMBINED_6 dataset consists of both SELEX and phage display data, according to the many-to-one model of interactions. The energy values presented here have been normalised so thatthe average binding constant for each contact is 1.0. This normalisation does not affect the calculation of probabilities, as described in the text. The normalised matrix shows that the energeticpotential of a base–amino acid contact varies, depending on its position in the DNA target and the protein.

  • targets (10-mers) of the EGR proteins. These dataare not included in the training sets, partly becausethe choice of prior probabilities for both the basesand the amino acid residues would be arbitrary.Furthermore, SAMIE was trained on the 4 bp(sub)targets and not on the complete sequences.Nevertheless, parts of these sequences had beenrecovered in SELEX or phage display experiments.The 24 naturally occurring binding sites consist of31 different tetranucleotides, 14 of which areincluded in the COMBINED_6 training set.

    SAMIE is able to rank 21 out of the twenty-four10 bp natural sites at positions 1–2000 (i.e. in thetop 0.2% of all possible 10 bp targets). The remain-ing three are sites that have been found in thepromoter regions of the human basic fibroblastgrowth factor37 and the human interleukin 2gene,38 and they rank at positions 11,766, 51,436and 62,580 (or in the top 6% of all possible targets).It is of some interest that at least one of these threesites is not unambiguously a target for the EGRgenes. In the case of the site found in the promoterof the human basic fibroblast growth factor,37 therewere two additional putative target sites in theproximity of the genomic region, and theseadditional sites were ranked by SAMIE_C6 atpositions 258 and 661, respectively.

    Evaluation on known in vivo MIG-binding sites

    In yeast, there is no EGR homologue. However,yeast has a number of proteins that contain theCys2His2 zinc-finger domain. Transcription factorsMIG1, MIG2 and ADR1 are three known examples.All three of them contain two fingers that are quitesimilar to those of the EGR. However, there is noglobal similarity between the yeast proteins andthe EGR, or between the MIG proteins and ADR1.On the contrary, there are some major differences:(1) the three yeast proteins contain one zinc-fingerdomain less than the EGR; (2) the domains arelocated in the amino terminus of the yeast proteins,whereas EGR has them at the carboxy terminus;(3) the ADR1 is twice the size of the others and itis known to require an additional 20 aminoacid residues to recognise its DNA targetefficiently.39

    The MIG transcription factors are known toregulate a number of yeast genes. One of their pri-mary targets is SUC2-A and it has been shownthat the MIG proteins bind to its promoter regionwith high affinity. The target site for the SUC2-Apromoter is believed to be 50-GCGGGGA-30.40

    MIG1 and MIG2 are identical with respect to theamino acid residues at the “contacting” positionsof the two fingers. Assuming that the two MIGproteins, similar to EGR, bind the DNA in an anti-parallel fashion, we used SAMIE_C6 to predicttheir composite 7 bp long target sequence. Theweighted LOGO of their possible target sites wascalculated by program ENOLOGOS (P.V.B. et al.,unpublished results) and is presented in Figure 4.ENOLOGOS uses the algorithm presented by

    Schneider et al.,41 but each sequence is weightedaccording to the predicted probability ofbinding.

    Compared to the binding site found in the pro-moter of the SUC2-A gene, SAMIE predicts cor-rectly all but the last nucleotide position (where itpredicts C or T instead of A). The 6 bp (sub)se-quence 50-GCGGGG-30 of the natural site wasSAMIE’s topmost prediction (out of a total of 4096possible sites). Lutfiyya et al. reported the resultsof SELEX experiments that they performed usingthese proteins.40 These results agree with SAMIEthat A is anti-selected at the last position and theyshow that MIG proteins prefer G in this position,with T or C being their second preference. Thisobservation raises the point that other factors(besides the affinity) might play an important rolein the in vivo selection of the binding sites. Thetranscription factors might recognise genomicsequences that do not exhibit the highest affinity,although one would not expect them to differ toomuch from the highest-affinity ones. The SELEXresults show that the second position is the leastconserved; an observation that agrees with predic-tions made by SAMIE. In the same paper, Lutfiyyaet al. reported the binding sites for three othergenes that are regulated by MIG1 and/or MIG2.The measured binding affinity of MIG proteins tothese sites is not as high as to that of SUC2-A. Allthe three sites are predicted to be in the top 5% ofSAMIE’s list of targets.

    Evaluation on the known in vivo ADR1-binding site

    We repeated the above analysis for the bindingsite of the ADR1 protein, but the results were notas good as with the MIG proteins. In particular,although the assumed binding site in the regulat-ory region is 50-TTGGAGA-30, our consensussequence matches only four of these seven

    Figure 4. The weighted LOGO of the yeast zinc-fingerproteins MIG1 and MIG2. This LOGO plots all possiblenucleotide targets, weighted by the probability of inter-action, as it is calculated by SAMIE. The naturallyfound binding site with the highest affinity to the MIGproteins is 50-GCGGGGA-30.40

    710 Probabilistic Protein-DNA “Recognition Code”

  • nucleotides (i.e. positions 2–4 and 6). Yet, the natu-ral binding site ranked at position 1151 or in thetop 7% of all possible 16,384 heptanucleotides.NMR studies on ADR1 (free and bound to specificDNA) indicate that this transcription factor utilisesonly amino acid positions 21, þ3 and þ6 of finger1 and amino acid position 21 of finger 2 to targetthe 50-GGAG-30 subsite. According to SAMIE_C6,this is the second preferred subsite of this protein(among 256 possible; the first has G instead of Aat the third position).

    The fact that SAMIE is not able to predict thecomplete ADR1-binding site effectively may reflectthe strong deviation from the EGR pattern of con-tacts. In fact, it is known that a region amino-terminal to the first finger is essential for thebinding of the protein.42 Nonetheless, SAMIEpredicts correctly the 4 bp subsite for which thezinc-finger contacts are conserved. This observ-ation indicates that its underlying model (or recog-nition code) should be consistent and itspredictions could be useful even for more distantlyrelated zinc-finger proteins.

    Evaluation on the known in vivo Tramtrack-binding site

    Tramtrack is another member of the Cys2His2protein family and it is probably one of the beststudied genes in Drosophila. It regulates thedevelopmental gene fushi-tarazu. Like the MIGand the ADR1 proteins, it contains two zinc-fingers. Tramtrack provides an interesting testcase, mainly because its co-crystal structure hasbeen solved and we know that its contactingscheme is slightly different from that of the EGR.43

    In particular, the amino acid position þ6 of finger2 as well as the amino acid position 21 of finger 1do not participate in the binding. The base atposition 6, which is normally contacted by theamino acid residue at position 21 of finger 1, inthe case of Tramtrack is contacted by the aminoacid residue at position þ2. The protein in thecrystallised complex is bound to the sequence 50-AGGAT-30, which is contained in the promoterregion of the fushi-tarazu gene. The heptanucleotidesequence around this region is 50-AAGGATA-30.

    We used SAMIE_C6 to predict the 7 bp bindingsite of Tramtrack under an assumed pattern of con-tacts identical with that of the EGR. The resultingweighted consensus sequence is presented inFigure 5. The prediction agrees with the naturallyoccurring binding site (i.e. 50-AGGAT-30)43 and theagreement even extends to the base that isupstream of that in the genome. The prediction dis-agrees with the base downstream of the bindingsite in the genome (SAMIE_C6 predicts C or Tinstead of the naturally found A). But the crystalstructure shows that this last base does not interactwith any amino acid residue. Nonetheless, thenaturally found 7 bp site ranked at position 67 ofSAMIE’s list (or in the top 0.4%).

    Interestingly, Figure 5 depicts the subsequence50-AGGA-30 (in the middle) as the part with thestrongest signal. These are the exact base positionswhere the two fingers have an identical pattern ofcontacts to the EGR protein. The first base positiondoes not exhibit any strong preference, agreeingwith the observation that amino acid position þ6of finger 2 does not contact the base at this position(which, in fact, happens to be A in the natural site).Similarly, the crystal structure shows there is noamino acid residue contacting the last base (pre-dicted to be C or T by SAMIE; naturally found tobe A).

    All the above observations show that SAMIE canpredict the natural-binding site of a protein veryaccurately, even if it deviates slightly from itslearned pattern of contacts. It indicates that minordocking rearrangements, although they change theoverall pattern of contacts, can still allow for areasonable prediction of a binding site.

    Correlation with measured energy data

    One of SAMIE’s characteristics is that it associ-ates the frequencies of the observed contacts to thebinding energies of the corresponding interactions.If we assume that the weight matrix that is calcu-lated by SAMIE corresponds to the real energeticpotentials of the base–amino acid contacts†, thenwe would expect that the measured binding con-stants, KA, will be related to the energy values, H,

    Figure 5. The weighted LOGO of the Drosophila zinc-finger protein Tramtrack. This LOGO plots all possiblenucleotide targets weighted by the probability of inter-actions, as it is calculated by SAMIE_C6. Interestingly,TTKB recognises only the 50-AGGAT-30 subsequence invivo.43

    † There are two points worth noting, here. First,equations (1) and (3) lack the temperature parameter thatis present in the Boltzmann distribution: i.e. we assume aconstant temperature for all experiments. Second, theenergy values of each base–amino acid contact havebeen adjusted to an arbitrarily defined zero point; butthis does not alter the probabilities calculated from theseequations.

    Probabilistic Protein-DNA “Recognition Code” 711

  • according to the formula:

    KA / e2H ð4Þ

    Thus, if SAMIE’s values correspond to the realbinding energetic potentials, then the correlationcoefficient between the measured association con-stants and SAMIE’s predicted probabilities shouldbe high. There are a number of examples in theliterature where binding constants have beenmeasured for EGR-derived proteins and variousDNA targets, and we use them for furtherevaluation of SAMIE.

    We generally use the correlation coefficientstatistic to measure the agreement of the observedversus the predicted values. When the measureddata are limited and they consist mainly of high-affinity pairs, then the correlation of the measuredversus predicted energy is evaluated. When bothhigh-affinity and low-affinity pairs are included,though, the evaluation will be done on the corre-sponding probability (or, equivalently, KA) values.The reason we prefer to use the probabilities inthis case is that it is not very important for a pre-diction method to be accurate in the low-affinitystates, which have large energies and diminish thecorrelation coefficient unduly.

    Comparison with the data from Hamilton et al.44

    As a first test, we use the data reported byHamilton et al. on the wild-type EGR1 proteinbound to various DNA targets.44 Should ourmodel be consistent, one would expect that it willbe able to predict well at least the specificity of thewild-type protein to these targets. Hamilton et al.reported the ratios of the dissociation constants to15 DNA targets relative to the wild-type DNAtarget. We use these ratios to calculate the corre-sponding energy differences (see equation (4)).Then we calculate the predicted energy differencesfor the same targets, using the SAMIE_C6 weightmatrix (Table 2). We find that the correlation coeffi-cient between the two sets of values is R ¼ 0.75,P , 0.01. This represents quite a good fit to thedata, especially given the fact that accuratemeasurement of KD was not possible for some ofthe targets.44 The R value for the KA terms onEGR1 is 0.69.

    In the same paper, Hamilton et al. reported theratios of the dissociation constants for the WT1wild-type protein to 18 DNA targets. WT1 proteinbelongs to the Cys2His2 zinc-finger family but,compared to EGR, it has one additional finger.The sequence of the three last fingers of WT1 isvery well conserved compared to the three fingersof the EGR1 proteins. We repeated the aboveanalysis using the WT1 data and we found thatour predictions correlate strongly with them too(R ¼ 0.76, P , 0.001). The R value for the KA termson WT1 is 0.65.

    The strong correlation between SAMIE’s predic-tions and measured energy values for the two

    wild-type proteins and various targets is indicativeof the potential of our model. Indeed, it seems that,given sufficient training data, which are entirelyqualitative, SAMIE can determine a quantitativemodel, or P-code, that can predict the correspond-ing energy values fairly well.

    Comparison with the data from Segal et al.45

    We extend the analysis in predicting energyvalues for variants of the EGR family. Segal et al.reported the KD values for a number of suchvariants.45 For nine of these proteins, KD wasmeasured for two alternative DNA sites. In twocases, the selection data in the training set contra-dict the measured values directly or they are notsufficient to model the corresponding interactionsaccurately. In the case of the protein srsddlvrbound to gcg and gag targets, the measured KDvalues are 9 and 6, respectively. In other words,this protein binds the same or stronger to the lattertarget. According to the crystal structure, thismeans that the Asp at position þ3 of the fingercontacts the A in the middle position the same ormore strongly than a C at the same position. Ourcombined dataset that was used for the training ofSAMIE (i.e. COMBINED_6) contains 82 exampleswith Asp at position þ3 and a C is found in thesecond base position in all cases (Table 5). We donot know the source of this discrepancy betweenthe data from the randomisation experiments andthe measured values. It may be that the particularprotein variant deviates strongly from the patternof contacts observed in the crystal structure. Inany case, we excluded this example from furtheranalysis.

    In the second case (i.e. protein srsddlvr bound toggg and gtg), the relative KD value between the twotargets is greater than 233 (the reported KD valuesare 6 and .1400, respectively). That order ofpreference agrees with our dataset, where Lys atposition þ3 clearly favours G over T in the middleposition (14 G versus one T), but the magnitudecannot be estimated accurately from our currenttraining set. Thus, this example was excludedfrom further analysis.

    For the remaining 14 sequences (seven pairs)with reported KD values, the difference betweenthe experimentally measured relative energies andthe SAMIE’s predictions correlate very well(R ¼ 0.82, P , 0.025). The R value for the KA termson all sequences (not pairs) is 0.79.

    Comparison with the data from Miller & Pabo16

    More recently, Miller & Pabo used the wild-typeEGR protein and the mutant D20A to measure therelative binding affinity for the trinucleotidetargets GNG and GCN.16 D20A has an Asp to Alareplacement at position þ2 of finger 1 with respectto the wild-type protein. The amino acid replace-ment is expected to affect the overall bindingaffinity of the protein due to the contact of the

    712 Probabilistic Protein-DNA “Recognition Code”

  • replaced amino acid to the overlapping base (G).However, if the two proteins contact the DNA inthe same way (i.e. the pattern of contacts for theD20A is that observed in the crystal structure),then the replacement should not affect the speci-ficity of this protein to the DNA targets used inthe study. This is because, according to the crystalstructure, the nucleotide positions that vary arenot contacted by the amino acid at position þ2 offinger 1. Consistent with that, SAMIE_C6 predictsthat the Asp to Ala replacement at position þ2 ofthis finger should result in a sixfold increase ofthe dissociation constant for all seven studied tar-gets. However, the KD values reported in thatpaper for the two proteins are practically thesame. This can be explained if one assumes thatthe amino acid residue at position þ2 of finger 1does not contact the base at position 10 of theDNA target, thus not contributing to the bindingenergy at all. In fact, the refinement of the crystalstructure has shown this amino acid position to beat marginal distance from base position 10 withrespect to hydrogen bond contacting potentials.28

    Furthermore, their data show that protein D20Ahas a considerably low dissociation constanttowards DNA target GCT. This fact cannot be pre-dicted by a simple recognition code if one assumesa conserved pattern of contacts. The authors alsosolved the co-crystal structures of the D20A proteinbound to two different DNA targets (GCG andGCT). They found that, overall, the D20A main-tains the same contacting scheme, although theyfound a less ordered complex around bases 10and 11 (note that base 10 is the base that isexpected to be affected most by the Asp to Alareplacement). The authors conclude that their find-ings cannot be explained by a “simple recognitioncode”.

    We use equation (4) to predict the associationconstants for the two proteins and the seven DNAtargets, using SAMIE’s energy values (Table 2;Figure 2). Then we compare these predictionswith the observed values (i.e. association con-stants) and we find them to be highly correlated(R values of 0.93 and 0.81 for the wild-type andD20A protein, respectively; P , 0.01 in all cases).This might seem surprising; however, we notethat the published KD values for the two proteinsare also highly correlated (R ¼ 0.87, P , 0.01). Infact, apart from the GCT target, the two proteinsagree very well on their KD values (R ¼ 0.94,P , 0.005). The R value for the energy terms onD20A is 0.62; on Zif268 is 0.56, and on both is 0.58.

    Comparison with the data from Bulyk et al.19,46

    Determination of the dissociation constants ofvarious EGR-derived proteins against all possibletrinucleotide targets has been performed with theuse of microarray technology.46 In this study, thewild-type EGR protein and four variants withamino acid substitutions on the middle finger ofthe protein were used to bind to a microarray that

    contained all possible trinucleotide targets for themiddle finger. Assuming that the intensity of thesignals corresponds to the affinity of the inter-actions, the authors calculated the dissociation con-stants using these intensity values. We analysedthese data and compared their results withSAMIE’s predictions.

    In their first study,46 there are ten targetsreported for the wild-type protein, but only sevenof them exhibited high affinity (according to theauthors, the other three that were reported exhibitlower affinity and/or they were used as negativecontrols). SAMIE_C6 ranks the first six of theseven high affinity targets at position 8 or higher(out of the total 64 possible trinucleotides). The cor-relation coefficient between the energies predictedby SAMIE_C6 and the logarithms of the tenmeasured KD values is 0.61, P , 0.05; the conver-sion of the KD values to energies was done accord-ing to equation (4). This result also shows verygood correlation between experimental energymeasurements and SAMIE’s predictions for thehigh-affinity sites. Furthermore, if one comparesthe measured KA values for the entire set of 64triplets19 with the predictions from SAMIE, the cor-relation is 0.61. While not as high as some of theexamples described previously, this is still a verygood correlation between measured and predictedbinding constants over the entire range of highestto lowest-affinity sites.

    One of the four mutant proteins analysed in thatpaper is RGPD (the amino acid residues corre-spond to positions 21 to þ3 of the helix of thesecond finger). The authors reported six high-affinity DNA targets for this protein in their firststudy. The first five of them rank at positions 1–5according to SAMIE, although SAMIE’s rankingorder is different from that reported. The corre-lation coefficient, calculated like before over themeasured and predicted energy data is 0.73,P , 0.10. The LOGO of all binding sites weightedby SAMIE’s probability of interaction (Figure 6) isvery similar to the one presented in that paper.The correlation of the measured KA values to thosepredicted for all 64 triplets is 0.99. Notably, finger2 of the protein RGPD (i.e. the finger under study)differs from the corresponding finger of the wild-type protein in all “contacting” amino acid resi-dues, except position 21 of the helix (Arg). Thefact that SAMIE can predict the binding affinitiesof this mutant protein to the entire collection oflow and high-affinity sites extremely well demon-strates the potential of the probabilistic algorithmsfor modelling the protein–DNA interactions.

    The results of the analysis are similar for themutant protein REDV. The authors report the KDvalues for six targets,46 but only two of them reallyexhibit high affinity (GCG and GTG). SAMIE_C6ranks these two targets at the two topmostpositions of its list. Furthermore, we find thatSAMIE’s predicted energy values for the sixreported targets correlate very well with theexperimental values; R ¼ 0.80, P , 0.05. The six

    Probabilistic Protein-DNA “Recognition Code” 713

  • reported targets ranked in the top nine positions ofSAMIE’s list. We calculated the LOGO of allpredictions (weighted by SAMIE’s probabilities;Figure 6) and we find it to be very similar to thatpresented (weighted by relative affinity of thehigh-affinity targets; see Bulyk et al.46).Furthermore, the correlation between measuredand predicted KA values for all 64 triplets is0.73.

    The LRHN shows less specificity than the others,with 13 high-affinity sites. Even so, the predictedconsensus sequence matches that derived from theexperiments, TAT, and the overall correlationbetween predicted and measured KA values is 0.56.

    The protein KASN shows practically nospecificity, and no consensus binding sequenceemerged from the experiments.46 All of themeasured KD values were at least 83 times higherthan that of the wild-type protein to its preferredsite. Not surprisingly, SAMIE does not predict thebinding affinities well in this case, with a corre-lation of only 0.16 for all 64 triplets. This result isconsistent with the notion that the predictionsfrom SAMIE are best for those proteins with highspecificities, which includes all natural transcrip-tion factors.

    Comparison of SAMIE with other methods

    In this section, we compare SAMIE’s predictionaccuracy with that of other quantitative and quali-tative models in order to assess its strengths andits possible weaknesses. For the evaluation, weuse the SELEX and phage display dataset(s).

    Comparison with other quantitative models

    We compare SAMIE_C6 with the two currentlyavailable quantitative models: that presented bySuzuki and colleagues10,47 and that from Margalit’sgroup.7,13 A third approach, followed by Kono &Sarai,12 is very similar to that of Margalit’s,7,13 inthe sense that the scores for the base–amino acidcontacts are calculated to be proportional to thelogarithm of the frequencies in the training set.The training sets for the two groups consist of 53(Margalit’s) and 52 (Sarai’s) non-redundantco-crystal structures. The main difference is thatMargalit’s group is considering contacts betweenbases and amino acid side-chains only, whereasSarai’s model considers also the contacts to theDNA and protein backbone. However, whenexamining the additional data provided by Kono

    Figure 6. Weighted LOGOs of the DNA-binding sites of two mutated EGR fingers. The fingers correspond to proteinvariants RGPD and REDV presented by Bulyk et al.46 For the LOGOs, we plot all possible nucleotide targets, weightedby the probability of interaction as it is calculated using the SAMIE_C6 weight matrix. These two LOGOs are verysimilar to those presented by Bulyk et al.46 (and reconstructed here) that are weighted by relative binding affinity.

    714 Probabilistic Protein-DNA “Recognition Code”

  • & Sarai,12 we did not find any significant basepreference of the amino acid residues towardsDNA in them. Another difference between Sarai’sand Margalit’s models is in the normalisation ofthe base–amino acid frequencies. Kono & Saraiuse the amino acid frequencies in the training setas prior probabilities, whereas Margalit’s groupderives theirs from the SWISS-PROT database.Unlike Suzuki’s and Margalit’s groups, Kono &Sarai do not provide the final scoring matrix fortheir method. Thus, it is very difficult for us toevaluate their model and compare its predictionswith those made by SAMIE.

    For the comparison of SAMIE with the other twomethods, we use the SAMIE_C6 weight matrix andthe latest models published by these groups. We(re)form their data into weight matrices equivalentto SAMIE’s, according to the many-to-one modelof interactions. Then, for each vector in theSELEX_6 and PHAGE_6 datasets, we rank allpossible variable sequences with respect to theirscore towards the fixed sequence (see Figure 9).The score is calculated as the sum of scores of indi-vidual base–amino acid contacts. For SAMIE_C6,this score represents an estimate of the bindingenergy of the interaction, which constitutes one ofthe advantages of our method. The specificityindex and the success rate are calculated as before(see Materials and Methods). The only differenceis that in this case the corresponding rankings arebased on the sum of weights instead of the prob-ability values. This results in an underpredictionof the phage display data (e.g. SAMIE’s SR0.1value drops to 0.875 from 0.903; see Table 1). Thisis unavoidable, though, since the other two modelsdo not provide a method for calculating the bind-ing probabilities. The reason for the underpredic-tion is that ranking phage display data accordingto score treats all amino acid residues as equi-probable. This is not true, even if the randomis-ation scheme for each amino acid position in thephage display experiments was NNN, where Nstands for A or C or G or T and each N refers to acodon position. The results are presented in Table 3.

    In terms of success rate, Suzuki’s model is pre-dicting the SELEX data better than the other twomethods. It ranks correctly 88 vectors in the top10% of their list (SR0.1 ¼ 0.917). In comparison,SAMIE_C6 ranks correctly 86 vectors(SR0.1 ¼ 0.896) and Margalit’s model35(SR0.1 ¼ 0.365). However, on average, SAMIE_C6has a better specificity index (95.4 compared to91.2 of Suzuki’s model and 80.9 of Margalit’smodel). On predicting phage display data,SAMIE_C6 out-performs the other two methodswith SR0.1 value of 0.875 and SIavg of 96.3. Interest-ingly, although Suzuki’s model has a higher suc-cess rate than Margalit’s on phage display data,the latter has better average specificity indexvalue. Finally, on the overall performanceSAMIE_C6 is clearly better than the other two.Suzuki’s model has a better SR0.1 value thanMargalit’s (0.696 compared to 0.517), but they bothhave about the same average specificity (SI of 85.6for Suzuki’s model compared to 84.2 for Marga-lit’s).

    SAMIE_C6 and Suzuki’s model predict SELEXdata with similar high levels of accuracy. This isinteresting, since SAMIE is a data-driven statisticalmechanical approach, whereas Suzuki’s model isbased on the chemical and stereochemical proper-ties of the base–amino acid contacts. The fact thatthe two can predict well the SELEX data supportsthe idea that common principles might exist thatgovern these interactions; and, if so, we might beable to formulate them in a mathematical way.This is supported further by the fact that the sub-matrices of SAMIE_C6 and Suzuki’s model for thecontact between amino acid position 21 and baseposition 3 are correlated (R ¼ 0.56, P , 0.10)†. Thesame is true (although the R value is smaller) forthe weight submatrices that correspond to thecontact between amino acid position þ3 and baseposition 2 (R ¼ 0.41, P , 0.05). These are the onlytwo single contact positions (see Figure 1).

    Suzuki’s model for the EGR protein family pre-dicts that the submatrices of certain base–aminoacid contacts should be related. In particular, thesubmatrix for amino acid position 21 of the helixshould be identical with that of position þ6 (bothcontacts involve large amino acid residues only).According to SAMIE_C6, the weight matrices forthese contacts have the highest correlation coeffi-cient value (i.e. R ¼ 0.4, P , 0.005). Although theprobability value is low, the correlation coefficientis not high enough to consider these two contactsidentical. All coefficients between the submatricesof SAMIE_C6 that correspond to the modelled con-tacts show positive correlations (R values between0.18 and 0.4, and P , 0.05 and ,0.001). Thisreflects the fact that each amino acid has prefer-ences for only certain bases, and vice versa, but the

    Table 3. Prediction scores of the available quantitativemodels

    SAMIE_C6 SUZUKIMARG-ALIT_01

    Evaluation set SR0.1 SI SR0.1 SI SR0.1 SI

    SELEX 6 0.896 95.4 0.917 91.2 0.365 80.9PHAGE 6 0.875 96.3 0.620 83.7 0.570 85.3COMBINED 6 0.880 96.1 0.696 85.6 0.517 84.2

    Currently available quantitative models are compared interms of their ability to predict SELEX and phage display data.The models are: (a) SAMIE_C6, (b) the one presented by Suzukiand colleagues10,47 and (c) the one from Margalit’s group. For theevaluation of the accuracy of the predictions the success rateand the specificity index (mean value over all predictions) areused, as they are defined in Materials and Methods.

    † The correlation coefficient was calculated only on thebase–amino acid pairs that Suzuki’s model permitscontact.

    Probabilistic Protein-DNA “Recognition Code” 715

  • correlations are low because each position also hasdifferences in their contacts. Those similarities anddifferences are embodied in Suzuki’s chemical andstereochemical merit points and allow it to do afairly good job of predicting SELEX results. ButSAMIE has the advantage of optimising the par-ameters of the model on the basis of experimentaldata.

    We note that Margalit’s model, by design, doesnot take into consideration the position-variationof the base–amino acid contacts. It consists of asingle weight matrix, which is used to model anycontact, regardless of its position in the protein/DNA. We believe that this “averaging over all con-tacts” constitutes one of the limitations of thismodel and is responsible for the somewhat lowerperformance compared to the other two.

    Comparison with the qualitative model(s)

    We implemented the qualitative model for theEGR protein family, which can be viewed as a listof position-specific “permissible” base–amino acidcontacts. This model was initially presented byChoo & Klug15,31 and Pabo and colleagueslater.14,16,34 The models proposed by these twogroups, however, are different with respect to theset of observed/permissible contacts they include.In fact, the contacts included in both models areless than half of the total number. Giving the“benefit of the doubt” to the qualitative modelling,we used the composite set of permissible base–amino acid contacts (Table 4). This set is based onthe Tables presented by Choo & Klug15 and byMiller & Pabo,16 respectively. We tested this modelon our COMBINED_6 training set and we foundthat it can predict only 61.1% of all observedbase–amino acid combinations (i.e. 1228 out of thetotal of 2009). In other words, about 40% of theexperimental data included in our set are not cata-logued in either of the two qualitative models. We

    must emphasise, though, that the combinationscontained in the COMBINED_6 dataset are only“inferred” contacts; that is, if one assumes a pat-tern of contacts identical with that found in theEGR co-crystal structure.28 Of course, this is notthe case for all of them. For example, in somecases the amino acid residue might be too faraway to form a bond with the base. However, thefact remains that all of these combinations areobserved in selection experiments, yet the quali-tative models were not able to predict them.Finally, we found that the composite qualitativemodel is able to predict correctly only 195 of the820 single-finger examples (or 23.8%) in theCOMBINED_6 set. In this case, we consider a pre-diction “correct” if all observed base–amino acidcombinations are listed in the composite quali-tative model; each of the individual models doesworse than this. This shows that the majority ofinteracting fingers and sites obtained in in vitroselection experiments are not accounted for by thesimple, qualitative models. In contrast, the quanti-tative P-code model allows for all possible com-binations of fingers and sites, and the predictedrankings match the observed data quite welloverall (Table 3).

    Discussion

    The present study addresses the problem ofmodelling protein–DNA interactions. Our thesisis that a probabilistic “recognition code” (orP-code) can model such interactions accuratelyenough for prediction purposes. We develop aprobabilistic algorithm, SAMIE, that can “learn”the contact-specific, base–amino acid energeticpotentials from data derived from randomisationexperiments (SELEX and/or phage display).

    The underlying theoretical model of SAMIE isbased on statistical mechanics theory. According

    Table 4. The composite qualitative model

    Position in DNA

    1 2 3 4

    C b P C b P C b P C b P

    A Q N S Q AH

    C L D D STV

    G S R K H K R D ST F

    T S K A T Q N L S DT S T

    V

    The qualitative model can be viewed as a list of position-specific, permissible base–amino acid contacts. Two qualitative modelshave been proposed so far: one by Choo & Klug15,31 and another by Pabo and colleagues14,16,34 This Figure presents the permissiblecontacts that each group proposes. The contacts proposed by one of the groups only are in the columns marked C and P for Choo’sand Pabo’s models, respectively. The contacts that are present in both models are in the columns marked with b. Interestingly, lessthan half of the total number of valid contacts are proposed by both groups.

    716 Probabilistic Protein-DNA “Recognition Code”

  • to our model, the probability that a fixed com-ponent (protein or DNA) will select a givenvariable component (DNA or protein, respectively)from a pool of randomised molecules is defined bythe Boltzmann distribution (equations (1) and (3)).The energies constitute a weight matrix, whichassociates every base at a particular DNA targetposition with every amino acid at the correspond-ing “contacting” position(s) of the protein. Thevalues of the weights in this matrix (i.e. theparameters of the model) are estimated from thebase–amino acid frequencies observed inrandomisation experiments (SELEX and/or phagedisplay), following the steepest ascents method.SAMIE constitutes essentially a maximum likeli-hood approach for the estimation of the parameters(see equation (6)). The exact “contacting” aminoacid positions need not be known a priori, althoughknowledge of them helps to reduce the number ofparameters of the model. Here, we restrict thetraining to those contacts that have been observedin the co-crystal structures of the EGR protein.27,28

    We further make the assumption that the base–amino acid interactions are additive over all con-tacts. In other words, we assume that the contactsare independent and therefore contribute addi-tively to the total binding energy. We know thatthis assumption is not exactly valid,18,19 but evenin those studies the data show that additive modelscan be very good approximations.20 For practicalpurposes, it is usually sufficient for it to hold onlyfor the high-affinity states, where it tends to holdfairly well.17,20 Nevertheless, our model is fairlygeneral and it can represent both additive andnon-additive interactions adequately at the cost ofan increased number of parameters that needed tobe estimated.

    The performance of the model derived fromSAMIE was evaluated on several different datasets.

    Self-test

    The training on each of the six datasets resultedin a different SAMIE model. Each of these sixmodels is tested initially on predicting their train-ing datasets. All SAMIE models were able to pre-dict “correctly” 85% or more of the correspondingtraining vectors. In this context, correctly refers tothe success rate that we define in Materials andMethods. Practically, it means that in 85% of thetraining vectors, the randomised counterpart ofthe vector was ranked by SAMIE in the top 10%of all possible randomised targets. Of course, test-ing performance on the training data runs the riskof over-fitting the parameters and obtaining arti-ficially good results. We do not think that is aproblem, because the number of examples greatlyexceeds the number of parameters; in theCOMBINED_4 and the COMBINED_6 datasetsthere are 2009 and 3612 base–amino acid combi-nations, respectively, and the SAMIE_C4 andSAMIE_C6 models have only 320 parameters.Nevertheless, the most important performance

    evaluations are on data not included in the trainingset.

    The results of the self-test prove that our modelis internally consistent, and confirm our notionthat training on composite dataset(s) (i.e. SELEXand phage display vectors) results in a generallybetter model.

    Evaluation on naturally found binding sites

    We tested the ability to predict the naturallyfound binding sites of EGR as well as other zinc-finger proteins. Out of the twenty-four 10 bp longin vivo DNA sites present in our database for theEGR protein, SAMIE was able to rank 21 in thetop 0.2% of all potential binding sites. The remain-ing three putative target sites rank only in the top6%, but at least one of those is unconfirmed andthere are higher-ranked alternative sites in theproximity in the genome.

    We tested the prediction ability for proteins thatare not related to the mammalian EGR family,apart from the fact that they all contain the Cys2-His2 motif. We used three well-characterisedproteins, the yeast transcription factors MIG1/MIG2 and ADR1 and the Drosophila TTKB(Tramtrack). All of these proteins contain twozinc-fingers (compared to the three present in theEGR proteins) and in the case of ADR1 and TTKBa different pattern of contacts has been observedfor one of the two fingers (there are no structuraldata available for the MIG proteins). In the case ofthe yeast transcription factors MIG1 and MIG2(they are identical with respect to the “contactingamino acid residues”), SAMIE_C6 is able to predictcorrectly all but the last nucleotide of the 7 bp longDNA consensus target site. All of the knownnaturally occurred MIG target sites are ranked inthe top 5% of SAMIE’s list, and the rankings ofbases in each position correspond closely to thoseobtained in SELEX experiments.40 In the case ofADR1, SAMIE_C6 was able to predict correctlythe 4 bp subsite that is believed to be contacted byone of the two fingers, and the in vivo target siteranks in the top 1% of all possible sites. In the caseof TTKB, SAMIE_C6 predicts correctly all but thelast nucleotide of the 7 bp long target. Moreover, itdepicts correctly the 4 bp long subsite that isknown to be significant for the binding, and the invivo consensus sequences ranks in the top 0.4%.

    These examples show that SAMIE’s predictionscan be very useful for estimating the bindingspecificities of EGR proteins, and even for proteinsthat are not members of the EGR family but usethe Cys2His2 motif for DNA binding. There areseveral reasons why the predictions are not evenbetter than these results. First, even though thereare many examples of SELEX and phage-displaycombinations reported in the literature, more datashould provide more accurate estimates of theparameters. Second, we know the additivityassumption is not completely accurate and solimits our prediction ability. But the fact that we

    Probabilistic Protein-DNA “Recognition Code” 717

  • do fairly well indicates that the additive model is areasonable approximation. Third, we know that forsome proteins the contacting interactions vary,which will limit our accuracy. And in some cases,such as ADR1, additional amino acid residues out-side the zinc-finger contribute to binding speci-ficity. And fourth, the in vivo sites that we attemptto predict may be influenced by other constraints,such as overlapping sites for other proteins orcooperative interactions that modify the interactionof the EGR protein. But, despite these potentialproblems, the probabilistic model, and the SAMIEmethod of determining the parameters, appearquite promising for providing reasonably reliable,quantitative predictions of binding site speci-ficities, at least for EGR proteins and perhaps,with appropriate data, for other families as well.

    Correlation with measured binding energy data

    One of the advantages of the P-code model isthat it provides quantitative predictions for relativebinding energies. This allows us to predict theeffects of variations, in either the DNA or proteinsequence, on the interaction energy. We comparedSAMIE’s predictions of the binding energy changeswith experimentally determined values for DNAtargets bound by the EGR protein and its variants.In most cases, the SAMIE predictions are in closeagreement with the experimental values. Corre-lation coefficients (R ) between the measured KAterms and those predicted by the SAMIE_C6model are generally greater than 0.7, and some-times much higher. The cases with lower corre-lations are usually due to an overall lowspecificity of the protein to the DNA targets (non-specific binding).

    Comparison with other models

    Other currently available protein–DNA inter-action models include two† that are quantitativeand two that are qualitative. These terms refer tothe way that the models represent the base–aminoacid interactions. The quantitative models assign anumeric value (“weight”) to these interactions,whereas the qualitative models catalogue them aspermissible or non-permissible. The qualitativemodels can be considered as a degenerate form ofquantitative, where the weights of the interactionshave been adjusted to either 1 (permissible) or 0(non-permissible)‡. Any quantitative model can be

    transformed into qualitative, by setting a thresholdto separate the permissible from the non-permissible contacts.

    All currently available models assume energeticadditivity of the individual contacts. The mostimportant reasons are that: (a) in the majority ofthe cases the interactions are approximatelyadditive (especially in the high-affinity states);(b) data limitations make modelling of non-additive interactions impractical; (c) simplicity ofthe additive “codes”. However, we note that oneof the advantages of SAMIE is that its underlyingprinciples are quite general and thus it can accom-modate non-additive interactions at a cost, whichare more parameters to be estimated.

    The first quantitative model, developed bySuzuki’s group,10,47 is based on a set of chemicalrules for each base–amino acid pair (that are inde-pendent of the family modelled) and a set ofstereochemical rules derived from co-crystal struc-tures (protein family-specific). The combination ofthese sets of rules generates a score for any givenprotein–DNA pair. This score can then be used torank DNA targets according to their binding prob-abilities and predict possible binding sites. Thelimitations of this model are the requirement forthe crystal structure and the fact that the rules (orweights) have been assigned in a semi-arbitraryway. Comparison of SAMIE_C6 with this modelshowed that, in terms of success rate, Suzuki’smodel is slightly better in predicting SELEX data,and SAMIE_C6 was significantly better on phagedisplay data and the combined data. In terms ofspecificity index (an evaluation measure intro-duced by Suzuki et al.10) SAMIE_C6 was substan-tially better on all sets.

    The second quantitative model we comparedSAMIE with was that developed by Margalit’sgroup.7 For this model, the score of each base–amino acid pair is derived from the observedfrequency of this pair in a non-redundant set ofcrystal structures (training set). The score of theindividual contacts is calculated according to theformula:

    Sij ¼ ln½fij=ðfifjÞ� ð5Þ

    where fij is the pair frequency of amino acid i andbase j, fi is the frequency of amino acid i(i ¼ 1,…,20) in the SWISS-PROT database and fj isset to 0.25 for each base j ( j ¼ 1,…,4). The originalmodel was recently refined by including more con-tacts in the training set.13 We compare SAMIE_C6with the latter model because, in agreement withthe authors, we found that it performs better thantheir initial model. In all cases, SAMIE_C6 per-forms better than Margalit’s model. However, interms of specificity index, Margalit’s model isbetter than Suzuki’s in predicting the phagedisplay data.

    One disadvantage of Margalit’s method is that itassigns very high “penalties” (i.e. very lowweights) to certain base–amino acid pairs on the

    † The available quantitative models are, in fact, threebut two of them, that proposed by Margalit7,13 and thatdue to Kono & Sarai,12 are very similar to each other.

    ‡ The representation of permissible and non-permissible contacts with 1 and 0, respectively,corresponds to assigning probability values. In order tobe consistent with our previous notation (i.e. weightscorrespond to energy values), we can assign the value of0 (or any number) to the permissible contacts and thevalue of þ1 to the non-permissible contacts.

    718 Probabilistic Protein-DNA “Recognition Code”

  • grounds that they do not have contacting poten-tials. We believe that this might affect the predic-tive power of this model in some cases. Forexample, Gly with no hydrogen bond donor/acceptor potential has been assigned the mostnegative score. This prevents it from appearing inmany of the model’s predictions. Yet, by being asmall, non-polar amino acid, Gly can be toleratedin most cases. In our training set, there are 151examples with Gly