Colubri_JMolBiol_2006

This article was originally published in a journal published by Elsevier, and the attached copy is provided by Elsevier for the author’s benefit and for the benefit of the author’s institution, for non-commercial research and educational use including without limitation use in instruction at your institution, sending it to specific colleagues that you know, and providing a copy to your institution’s administrator. All other uses, reproduction and distribution, including without limitation commercial reprints, selling or licensing copies or access, or posting on open internet sites, your personal or institution’s website or repository, are prohibited. For exceptions, permission may be sought for such use through Elsevier’s permissions site at: http://www.elsevier.com/locate/permissionusematerial


  • Author's personal copy

    Minimalist Representations and the Importance of Nearest Neighbor Effects in Protein Folding Simulations

    Andrés Colubri1,2,3, Abhishek K. Jha1,2,5, Min-yi Shen4, Andrej Sali4, R. Stephen Berry1,5, Tobin R. Sosnick2,3 and Karl F. Freed1,5

    1Department of Chemistry, The University of Chicago, Chicago, IL 60637, USA
    2Institute for Biophysical Dynamics, The University of Chicago, Chicago, IL 60637, USA
    3Department of Biochemistry and Molecular Biology, The University of Chicago, Chicago, IL 60637, USA
    4Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, and California Institute for Quantitative Biomedical Research, University of California, San Francisco, San Francisco, CA 94143, USA
    5The James Franck Institute, The University of Chicago, Chicago, IL 60637, USA

    In order to investigate the level of representation required to simulate folding and predict structure, we test the ability of a variety of reduced representations to identify native states in decoy libraries and to recover the native structure given advance knowledge of the very broad native Ramachandran basin assignments. Simplifications include the removal of the entire side-chain or the retention of only the Cβ atoms. Scoring functions are derived from an all-atom statistical potential that distinguishes between atoms and different residue types. Structures are obtained by minimizing the scoring function with a computationally rapid simulated annealing algorithm. Results are compared for simulations in which backbone conformations are sampled from a Protein Data Bank-based backbone rotamer library generated by either ignoring or including a dependence on the identity and conformation of the neighboring residues. Only when the Cβ atoms and nearest neighbor effects are included do the lowest energy structures generally fall within 4 Å backbone root-mean-square deviation (RMSD) of the native structure, despite the initial configuration being highly expanded with an average RMSD ≥ 10 Å. The side-chains are reinserted into the Cβ models with minimal steric clash. Therefore, the detailed, all-atom information lost in descending to a Cβ-level representation is recaptured to a large measure using backbone dihedral angle sampling that includes nearest neighbor effects and an appropriate scoring function.

    © 2006 Elsevier Ltd. All rights reserved.

    *Corresponding authors

    Keywords: Ramachandran basin; structure prediction; protein structure; simulated annealing; statistical potential

    Introduction

    First-principles models of protein folding generally are preferred over statistical approaches because first-principles models provide a theoretical framework to explain the underlying mechanisms, whereas purely statistical approaches only function as computational black boxes. However, most first-principles approaches have the disadvantage either of being too complex to be computationally feasible for proteins with more than a few residues1 or too simplistic to be useful in situations where a more realistic representation is required.2 In addition, knowledge-based methods have improved substantially over the last several years and today are capable of predicting native structures,3–10 changes in stability upon mutation,11 disorder propensities,12 and binding affinities.13

    An interest in combining first-principles methods with statistical information has led us to construct a computational model that accommodates both a knowledge-based approach and a more fundamental methodology. Our present focus is on whether protein folding can be accurately depicted with a backbone representation that either lacks

    Abbreviations used: DOPE, discrete optimized protein energy-function; DOPER, DOPE reduced; PDB, Protein Data Bank; SA, simulated annealing; RMSD, root-mean-square deviation; RB, Ramachandran basin; MD, molecular dynamics; LD, Langevin dynamics.

    E-mail addresses of the corresponding authors:

    [email protected]; [email protected]

    doi:10.1016/j.jmb.2006.08.035 J. Mol. Biol. (2006) 363, 835–857

    0022-2836/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.


    any side-chain degrees of freedom or is represented by at most a Cβ atom. This simplified representation greatly diminishes the conformational search as it only involves the backbone dihedral angles, φ and ψ, with no consideration of side-chain rotamers.

    To make up for the loss of side-chain information, local interactions are incorporated by sampling using a backbone rotamer library, constructed from the Protein Data Bank (PDB), that tabulates dihedral angles for single residues and also for sequences of dimers or trimers according to their amino acid identities and Ramachandran basin (RB) assignments.14 Secondly, the tertiary interactions are treated using the scoring function, DOPE-Cβ, which has residue-dependent parameters for all the atoms in the backbone and for all the Cβ atoms. Structures are obtained by minimizing the scoring function with a computationally rapid simulated annealing (SA) algorithm using the PDB-based backbone rotamer sampling. These simplifications greatly reduce the search space and have been adopted in various fashions by other studies of protein folding.15–20

    Rather than conducting an extensive conformational search through the entire backbone conformational space, effectively an ab initio structure prediction, our investigation focuses on the much less formidable task of generating structures given advance knowledge of the very broad native RB assignments for each residue. In spite of this simplification, we address a variety of issues including how to properly incorporate all-atom information in a reduced description of proteins. In the first place, it is not clear that a suitable scoring function for a reduced model can encode detailed packing propensities arising from all-atom interactions.21–26 Our scoring function includes only terms involving at most the heavy atoms in the backbone and the Cβ atoms and is obtained from a previously described all-atom statistical potential, DOPE (discrete optimized protein energy-function).27,28 Heavy atoms are distinguished by the residue type to which they belong (e.g. a C atom of alanine interacts differently than a C atom of valine).

    Because our model omits certain fine-grained, atomistic details of the system, we must ensure that the SA algorithm samples realistic regions of conformational space. For example, it is possible that a transition that is acceptable in the reduced representation would be forbidden if all additional variables were explicitly treated because local interactions between neighboring atoms impose strong geometric constraints on the motions of the backbone. As a primary consequence of these local interactions, the RBs are observed as the five predominantly occupied regions in φ,ψ maps for each amino acid (Figure 1). Previous ab initio folding simulations29,30 have suggested that using RBs to constrain the conformational search in dihedral space would allow incorporating underlying atomistic information into the motions of a simplified model of a protein.

    Neighboring amino acids have also been shown to exert a substantial influence on the occupancies of the RBs.31–33 This consideration provides the motivation for constructing a rotamer library of allowed backbone conformations for monomers, dimers and trimers, where amino acid information is coupled with RB assignments to reduce the number of allowed backbone conformations when knowledge is available concerning the basin occupancies. Even more importantly, this library inherently satisfies the constraints arising from the short range, all-atom interactions between nearest neighbor residues, information that is lost in the reduced representations. In a very recent paper, Rose and co-workers take a similar approach with remarkable success, although utilizing a library of pentamer rotamer conformations with up to sevenfold smaller Ramachandran mesostates, or sub-basins, along with the specification of secondary structure assignments for each amino acid.34
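As a rough illustration of how such a neighbor-dependent library could be tabulated, the sketch below groups observed (φ, ψ) values by the amino acid identities and RB labels of every monomer, dimer and trimer window in a chain. The input record format, the basin label strings and the function name `build_library` are assumptions made for this example only; the construction of the actual library is described in reference 14.

```python
from collections import defaultdict


def build_library(chains, max_len=3):
    """Tabulate backbone dihedrals by (residue identities, basin labels).

    `chains` is a list of chains; each chain is a list of
    (amino_acid, basin, phi, psi) tuples. This record format is a
    hypothetical one chosen for the example.
    Returns {n: {(aa_tuple, basin_tuple): [angles, ...]}} where each
    `angles` entry is a tuple of n (phi, psi) pairs.
    """
    library = {n: defaultdict(list) for n in range(1, max_len + 1)}
    for chain in chains:
        for n in range(1, max_len + 1):
            # Slide a window of n consecutive residues along the chain.
            for start in range(len(chain) - n + 1):
                window = chain[start:start + n]
                key = (tuple(r[0] for r in window),
                       tuple(r[1] for r in window))
                angles = tuple((r[2], r[3]) for r in window)
                library[n][key].append(angles)
    return library


# Toy three-residue chain with made-up angles (degrees).
chain = [("ALA", "alphaR", -63.0, -43.0),
         ("VAL", "beta", -120.0, 130.0),
         ("GLY", "PPII", -75.0, 150.0)]
lib = build_library([chain])
```

Keying each window by both sequence and basin assignment means that a lookup for a residue with known neighbors only ever returns angle combinations that were observed together, which is the sense in which nearest neighbor constraints are carried by the library rather than by the scoring function.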

    Here we describe the performance of the folding simulations for more than 50 proteins using a reduced representation that retains the heavy atoms of the backbone and either removes the entire side-chain or retains only the Cβ atom, and that either neglects or includes nearest neighbor (NN) effects in the backbone sampling. The structures are generated with the residues being constrained to their native RBs during the entire SA minimization. Although such advance knowledge precludes this study from being an ab initio structure prediction, when a protein is constrained to broad RBs, the protein may adopt a huge number of non-native, highly extended conformations as illustrated in the Figures below. Thus, success with this simpler problem positions us to address our questions related to what level of representation is required to accurately generate native structures. In addition, the approach is useful for screening the foldability of designed sequences.

    Figure 1. Specification of the five Ramachandran basins: β (blue), poly-proline II, PPII (green), αR (red), αL (magenta) and ε (grey). The color intensity reflects the (φ,ψ) occupancy as calculated from all 4701 PDB structures.

    836 Minimalist Models and Neighbor Effects in Folding


    Reasonable success is defined by whether native structures can be obtained with decent accuracy (


    than their DOPER counterparts, which exhibit substantial variation among different amino acids. This difference is expected because the DOPER Cβ–Cβ interaction includes the sum of interactions of all the heavy atoms in the entire side-chain. As we pass from interactions between pairs of amino acids with small side-chains (serine–valine, glutamic acid–tyrosine) to pairs of bulkier residues (phenylalanine–tryptophan and lysine–tryptophan), the DOPER profiles increasingly differ from the explicit atom–atom DOPE interaction curves. It is also evident that the hydrophobicity of the amino acids is captured by DOPER, as exemplified by the significantly increased strength between the phenylalanine–tryptophan and lysine–tryptophan effective interactions. The dependence of the DOPE statistical potential on both the atom types and the amino acid identities is illustrated in Figure 2(c) for selected C–C interactions. This dependence on amino acid identity, even for the otherwise chemically identical atom type, assists the statistical potential in describing the influence of side-chain packing.

    Test with decoy sets

    To examine the loss of information incurred by use of the reduced models, we compare the energy computed using the all-atom DOPE statistical potential with those of the three other scoring functions for reduced models of the seven proteins included in the Park-Levitt four-state reduced decoy set,35 which can be downloaded from the Decoys 'R' Us website. In the resulting scatter plots (Figure 3), the DOPE-Cβ potential has the highest correlation coefficient, R ≈ 0.9, with the all-atom potential, although the DOPE-BB and DOPER potentials yield only slightly lower coefficients, R ≈ 0.85.

    We briefly describe the compilation of the three libraries of decoys used in this study. The Zhou decoy set includes 96 standard multiple decoy sets of proteins with known X-ray structure.26 The Baker decoy set includes over 75,000 members for 41 proteins whose structures have been determined with either X-ray or NMR experiments. This decoy set is generated using the Rosetta algorithm and is a subset of the structures used to test an all-atom scoring function.18 The study by Zhou et al. excludes from the original decoy sets those associated with proteins whose structures have been determined by NMR as well as decoy sets of globins and immunoglobulins. The decoy sets excluded by Zhou et al. comprise our third and final library, which is labeled as Others.

    To further examine the utility of the four scoring functions, we test and compare their ability to identify the native structure in three different libraries of decoys (Table 1). The success rate (the percent of native structures that are ranked by their energy scores as number one) is highest for DOPE, the all-atom potential, and decreases for reduced-model scoring functions (Table 1). In addition, the Top5 measure (the percent of native structures that are ranked by their energy score as one of the five structures with lowest energy) displays a similar trend. Finally, we consider the Z-score, another measure of the quality of these statistical potentials for dealing with different decoy sets. The Z-score is defined as:

    Z = \frac{\mathrm{Energy}_{\mathrm{Native}} - \langle \mathrm{Energy}_{\mathrm{Decoy}} \rangle_{\mathrm{library}}}{\sqrt{\langle \mathrm{Energy}_{\mathrm{Decoy}}^{2} \rangle_{\mathrm{library}} - \langle \mathrm{Energy}_{\mathrm{Decoy}} \rangle_{\mathrm{library}}^{2}}} \qquad (1)

    where the angular brackets denote the average over the library.36 Z-scores reflect the quality of both the scoring function and the decoy set (worse decoy sets result in better Z-scores). The performance of the energy functions is relatively poor for decoy sets that include NMR structures. Moreover, the reduced energy functions do not perform as well as the all-atom potential. Regardless, our overall goal is to develop an algorithm for generating native-like structures and not a statistical potential that can identify the native state from a decoy set using Z-scores. From this perspective, we require a suitable potential that can produce an accurate representation of the structure of the native state for any given sequence when used in conjunction with an adequate sampling protocol.
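The three measures used in this comparison (success rate, Top5 and Z-score) can be computed from a list of energies, as in the minimal sketch below. It assumes that the library averages in equation (1) run over the decoys only and that the decoy energies are not all identical; the function names and the toy energies are hypothetical and are not taken from the study.

```python
import math


def decoy_metrics(native_energy, decoy_energies):
    """Return (rank, z_score) of the native structure within one decoy library."""
    n = len(decoy_energies)
    mean = sum(decoy_energies) / n
    mean_sq = sum(e * e for e in decoy_energies) / n
    # Denominator of equation (1): standard deviation over the library.
    std = math.sqrt(mean_sq - mean ** 2)
    z_score = (native_energy - mean) / std
    # Rank 1 means no decoy has a lower energy than the native.
    rank = 1 + sum(1 for e in decoy_energies if e < native_energy)
    return rank, z_score


def summarize(libraries):
    """Aggregate over several libraries given as (native_energy, decoy_energies)."""
    results = [decoy_metrics(native, decoys) for native, decoys in libraries]
    n = len(results)
    success = 100.0 * sum(1 for rank, _ in results if rank == 1) / n
    top5 = 100.0 * sum(1 for rank, _ in results if rank <= 5) / n
    mean_z = sum(z for _, z in results) / n
    return success, top5, mean_z


# Toy energies in arbitrary units (made up for the example).
decoys = [-4.0, -2.0, 0.0, 2.0, 4.0]
rank_a, z_a = decoy_metrics(-10.0, decoys)   # native well below all decoys
rank_b, z_b = decoy_metrics(1.0, decoys)     # native buried among decoys
success, top5, mean_z = summarize([(-10.0, decoys), (1.0, decoys)])
```

With this sign convention a native structure that scores well below the decoy average gives a negative Z-score of large magnitude, while a native that is indistinguishable from the decoys gives a value near zero.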

    Intra-basin folding simulations

    As a first test of the models, we consider simulations that begin with a random assignment of dihedral angles for each amino acid residue within their native RBs, and proceed by minimizing the DOPE-Cβ score using a SA algorithm (Figure 4). The dihedral angles are constrained to remain within their native RBs during the entire annealing run. New conformations are generated by choosing a trimer, dimer, or monomer from the backbone rotamer library, subject to compatibility with the amino acid sequence and basin assignments. The algorithm first tries choosing dihedral angles from trimers, and defaults to dimers and then to monomers when configurations for the trimer or dimer are absent, respectively, in the rotamer libraries. After a fixed number of steps in which all moves are accepted, the annealing follows a cooling schedule that decreases the temperature until convergence is reached. In the simulations presented here, the number of steps at each temperature is 100.

    We consider 50 small globular proteins to test the intra-basin folding algorithm (Table 2). The 50-protein set is generated by combining 41 proteins used by Baker et al.36 with nine additional commonly studied proteins. This test set contains a heterogeneous sample of different protein topologies: α-helix bundles, α/β proteins, and β-only structures. A total of 100 separate trajectories are performed for each protein, with every trajectory starting from a

    http://dd.stanford.edu/


    Figure 3. Correlation between all-atom DOPE scores and those of the three other reduced representations. Energies are highly correlated between the all-atom DOPE statistical potential and the DOPE-Cβ, DOPE-BB only, and DOPER scoring functions for seven proteins taken from the Park & Levitt four-state reduced decoy set. The high correlation indicates that each of the reduced scoring functions captures a large majority of the information content of the all-atom based statistical potential DOPE from which they are derived. (R values in parentheses.)


    different random assignment of dihedral angles within the native basins. For each trajectory, each successive minimum is recorded until the convergence criterion is satisfied. Generally, about 50 such minima from each trajectory are recorded. Using a Pentium IV 2.8 GHz with 512 MB of RAM, each run takes an average of 500 s for a protein of 70 residues, when executing 100 annealing steps per temperature. The overall running time is roughly proportional to the number of annealing steps per temperature; hence, a simulation performed with 500 annealing steps per temperature is about five times slower.

    Because simulations using reduced models could produce structures for which the side groups clash significantly, tests are made of side-chain packing. After each SA run using the reduced representation, the side groups are introduced employing the SCWRL program version 3.0,37 without further backbone motions. SCWRL employs a simple energy function based on a backbone-dependent rotamer library and a piece-wise linear repulsive steric energy to remove atom clashes. A final scoring is made using the full heavy atom statistical potential DOPE to ensure that the final structure properly describes the protein packing.

    Comparisons of the RMSD for the lowest energy structure obtained before and after the introduction of the side groups, scored with the DOPE-Cβ potential and with the all-atom DOPE potential, respectively, indicate that the inclusion of side-chains generally provides only a modest improvement in the RMSD of the lowest energy structure (Figure 5; Table 2, compare columns 5 and 6). This test has the physical significance of demonstrating that our protein structure algorithm is indeed consistent with good side-chain packing. Because the computational cost is minimal for introducing side-chains with SCWRL and for scoring with the all-atom DOPE potential of a single structure after the SA process (which utilizes a model without side-chain rotamers), this procedure of adding side-chains is applied throughout our study when identifying lowest energy structures and calculating RMSD values.
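The intra-basin annealing loop described in this section can be sketched as follows. The trimer, then dimer, then monomer fallback and the initial phase in which all moves are accepted follow the text; the Metropolis acceptance rule, the geometric cooling schedule, all parameter values, the library layout and the toy energy function are assumptions introduced for illustration, and the toy energy merely stands in for the DOPE-Cβ score.

```python
import math
import random


def propose_fragment(library, sequence, basins, pos, rng):
    """Pick angles for a trimer at `pos`, falling back to dimer, then monomer."""
    for n in (3, 2, 1):
        if pos + n > len(sequence):
            continue
        key = (tuple(sequence[pos:pos + n]), tuple(basins[pos:pos + n]))
        entries = library.get(n, {}).get(key)
        if entries:
            return rng.choice(entries)
    return None  # no compatible library entry: leave the residue unchanged


def anneal(library, sequence, basins, angles, energy, rng, warmup=50,
           t_start=10.0, t_end=0.1, cooling=0.9, steps_per_t=100):
    """Minimize `energy` over library moves; returns (best_angles, best_energy)."""
    current = list(angles)
    e_cur = energy(current)
    best, e_best = list(current), e_cur

    def attempt(temperature):
        nonlocal current, e_cur, best, e_best
        pos = rng.randrange(len(sequence))
        frag = propose_fragment(library, sequence, basins, pos, rng)
        if frag is None:
            return
        trial = list(current)
        trial[pos:pos + len(frag)] = frag
        delta = energy(trial) - e_cur
        # temperature=None marks the warm-up phase where every move is accepted;
        # afterwards a Metropolis criterion is used (an assumption of this sketch).
        if (temperature is None or delta <= 0
                or rng.random() < math.exp(-delta / temperature)):
            current, e_cur = trial, e_cur + delta
            if e_cur < e_best:
                best, e_best = list(current), e_cur

    for _ in range(warmup):
        attempt(None)
    t = t_start
    while t > t_end:
        for _ in range(steps_per_t):
            attempt(t)
        t *= cooling  # geometric cooling (assumed schedule)
    return best, e_best


# Toy problem: two residues, monomer-only library, quadratic stand-in energy.
sequence, basins = ["ALA", "GLY"], ["alphaR", "beta"]
library = {1: {
    (("ALA",), ("alphaR",)): [((-60.0, -45.0),), ((-70.0, -35.0),)],
    (("GLY",), ("beta",)): [((-120.0, 130.0),), ((-140.0, 150.0),)],
}}
target = [(-60.0, -45.0), (-120.0, 130.0)]


def toy_energy(angles):
    return sum((p - tp) ** 2 + (s - ts) ** 2
               for (p, s), (tp, ts) in zip(angles, target))


start = [(-70.0, -35.0), (-140.0, 150.0)]
best, e_best = anneal(library, sequence, basins, start, toy_energy,
                      random.Random(0))
```

Because every proposed move is drawn from library entries that match the residue identities and basin labels, the chain stays inside its assigned basins by construction, so no separate basin check is needed in the acceptance step.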

    Intra-basin search is non-trivial

    Even when the energy minimization constrains each residue's backbone to remain within its very broad native RB, a huge number of conformations are still available. Initial configurations, generated from the rotamer library by a random assignment of dihedral angles in the native RBs, are highly unfolded with dimensions comparable to denatured proteins. Moreover, these initial conformations are assembled by piecing together trimers and dimers taken from the rotamer library, so that the conformations even satisfy local chemical and geometrical correlations. However, as demonstrated by Figure 6(a), the lowest RMSD of the initial configurations never falls below 5 Å, with the average around 10 Å or higher. It is also revealing to examine the relationship between the initial configuration and the final result of the minimization. Figure 6(b) presents representative scatter plots of the RMSD of the initial configuration versus the RMSD of the final configuration and of the initial against the final configurations' DOPE energy for all annealing runs. The scatter plots indicate that the outcome of the energy annealing is independent of the proximity of the initial state to the native structure.

    Figure 4 depicts typical initial 3D structures for 1UBQ. The native structure of this protein features an α-helix between residues 23 and 34, and the Figure shows that this helix is also partially present in the initial structures. This appearance of a helical portion arises because stretches of more than four consecutive residues in the αR RB are highly constrained to adopt a standard helical conformation. However, these pictures also display a complete lack of native long-range interactions in these initial structures. Nevertheless, the presence of partial helical structure in the initial conformations suggests that the SA minimization would be expected to fare better for α proteins. Thus, the 50

    Table 1. Success rates and Z-scores for different scoring functions and decoy sets

    Decoy (no. of proteins)   Measure         Scoring function
                                              DOPE (all-atom)   DOPE-Cβ       DOPER          DOPE-BB

    Zhou (96)                 Success (%)a    83                65            57             59
                              Top 5 (%)b      89                76            72             72
                              Z-scorec        3.8 ± 3.1         2.9 ± 2.7     2.6 ± 2.3      2.6 ± 2.5

    Baker (41)                Success (%)     27                20            0              12
                              Top 5 (%)       44                27            15             27.5
                              Z-score         1.5 ± 2.2         0.9 ± 2.3     0.7 ± 1.6      0.8 ± 2.2

    Others (171)              Success (%)     18                11            9              6
                              Top 5 (%)       37                20            16             14
                              Z-score         0.7 ± 2.1         0.4 ± 2.4     0.08 ± 2.03    0.97 ± 2.79

    All (308)                 Success (%)     39                29            23             23
                              Top 5 (%)       54                39            33             34
                              Z-score         1.8 ± 2.9         0.8 ± 2.9     0.9 ± 2.4      0.4 ± 3.1

    a Success is defined as the native structure having the lowest energy score.
    b Top5 refers to number of total cases when the native is one of the five structures with lowest energy.
    c Z-score values are given as the average and the standard deviation for the decoy sets.


    proteins studied also include β proteins whose initial structures are even more devoid of native, long-range structures.

    Many examples demonstrate the highly non-trivial character of the intra-basin search for generating a good approximation to the native structure (Figure 7). For example, among the five lowest energy structures for 1VII, a small α-helix bundle of 36 residues, the lowest energy conformation is almost the correct native structure except for the presence of an incorrect orientation for the C-terminal helix. This observation implies that the specification of the RBs does not uniquely determine the spatial arrangement of the secondary structural elements, even in very simple cases such as this one where the initial structures contain some helical portions.

    Generated structures

    The resulting all-atom structures are ranked according to their all-atom DOPE energy score, and the five conformations with lowest energies are selected for comparison with the native structure. From this set of five conformations, a structure with less than 4 Å of backbone RMSD is found in 44% of the cases, with no obvious correlation to the size and secondary structure topology (Table 2). The accuracy tends to be slightly better for proteins with only α-helices, perhaps because β structures usually involve more complex topologies and long-range interactions, but more likely, because α-helices are easily formed during the initial assignment of dihedral angles, thereby probably expediting the annealing search. Figure 7 presents

    Figure 4. Flow chart of the simulated annealing algorithm. 3D renderings of typical initial structures are shown at the top for 1UBQ. While these conformations display no native tertiary interactions, they feature the native helix in the correct position. The helix appears because long stretches of αR RBs almost uniquely determine α-helical conformations.


    3D renderings of the predicted (lowest energy) structures for a selection of proteins, along with the superimposed native fold, and the corresponding scatter plots of the DOPE energy and RMSD for the structures generated. Although the 50 target proteins are not explicitly excluded from the rotamer library, only 4, 6, 9, 12, and 16% of the pairs of dihedral angles from the 50 predicted structures (3035 pairs total) are found to lie within 1, 2, 3, 4, and 5° of the native dihedral angles, respectively. The lack of native angles in our structures indicates that the algorithm undergoes a meaningful search

    Table 2. Results for SA runs using DOPE-Cβ

    PDB code   Class, Nres   Source   Native lowest energy   Predict RMSD (all-atom)a   Predict RMSD (Cβ)b   Lowest RMSDc   RMSD

    5 Å). Hence, the accompanying free energy surface is quite rugged when a single folding trajectory is considered, even if the RMSD for multiple trajectories correlates well with the energy. Native local biases and collapse often are insufficient to uniquely define the native fold. In order to identify the native folds, clustering or other methods are required. In contrast, real proteins readily fold to the native state in a two-state manner,43,44 and hence, do not traverse the complicated landscapes observed in many simulations. This difference suggests that improvements in energy functions or better move sets are required for a more realistic description of the folding process.


    Figure 7 (legend on page 846)


    Future improvements

    The previous sections indicate three areas where the algorithm might be substantially improved: a richer rotamer library, a better cooling schedule, and a more detailed energy function. The rotamer library could be enriched by two different approaches. The first approach would be to include more structures in our training set, while the second would be to construct a library of trimer conformations using dimer information already available in the library. Another interesting possibility is to generate

    Table 3. Results for SA runs using DOPER1

    PDB code   Class, Nres   Source   Native lowest energy   Predict RMSD (all-atom)a   Predict RMSD (Cβ)b   Lowest RMSDc   RMSD