
Evolution of protein structural classes and protein sequence families

In-Geol Choi* and Sung-Hou Kim*†‡

*Physical Biosciences Division, Lawrence Berkeley National Laboratory, and †Department of Chemistry, University of California, Berkeley, CA 94720

Contributed by Sung-Hou Kim, July 30, 2006

In protein structure space, protein structures cluster into four elongated regions when mapped based solely on similarity among the 3D structures. These four regions correspond to the four major classes of present-day proteins defined by the contents of secondary structure types and their topological arrangement. Evolution of and restriction to these four classes suggest that, in most cases, the evolution of genes may have been constrained or selected to those genetic changes that result in structurally stable proteins occupying one of the four “allowed” regions of the protein structure space, “structural selection,” an important component of natural selection in gene evolution. Our studies on tracing the “common structural ancestor” for each protein sequence family of known structure suggest that: (i) recently emerged proteins belong mostly to three classes; (ii) the proteins that emerged earlier evolved to gain a new class; and (iii) the proteins that emerged earliest evolved to become the present-day proteins in the four major classes, with the fourth-class proteins becoming the most dominant population. Furthermore, our studies also show that not all present-day proteins evolved from one single set of proteins in the last common ancestral organism, but new common ancestral proteins were “born” at different evolutionary times, not traceable to one or two ancestral proteins: “the multiple birth model” for the evolution of protein sequence families.

protein fold classes | common structural ancestor | evolutionary age | protein structure universe

The protein universe (1), the totality of all proteins in all organisms on Earth, is vast. However, an estimate of the order of magnitude can be made (Table 1): Although the currently known genome sizes range from 10^6 to 10^11 DNA base pairs, the number of genes is estimated to range only from 10^3 to ~10^5 per organism (www.ncbi.nlm.nih.gov/genomes). Taking into account the estimated 13.6 million species of living organisms on Earth (2), which is very likely to be an underestimate, there are ~10^10 to 10^12 different proteins in all organisms from the three domains of life (Eukarya, Bacteria, and Archaea) on Earth. However, this vast number of proteins is predicted to consist of only ~10^5 sequence domain families (3), the members of each family having similar amino acid sequences (4). The sizes of the sequence domain families have a power law distribution (Fig. 1): Most families have a small number of members, but some have a very large number of members. We expect a similar distribution for sequence families. Most of these ~10^5 sequence families are estimated to belong to ~10^4 structural families (5–7), because some sequence families turn out to have the same 3D structural fold. Some protein structures consist of more than one domain and, at present, ~10^3 structures of fold domains are known (8).
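The order-of-magnitude estimate above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it assumes one protein per gene and treats every species as contributing distinct proteins, and the function and constant names are hypothetical.

```python
# Order-of-magnitude check of the protein-universe estimate (Table 1).
# Inputs are the rough exponents quoted in the text, not measured data.
# Assumptions: one protein per gene; proteins of different species are distinct.

GENES_PER_ORGANISM_EXP = (3, 5)  # 10^3 to ~10^5 genes per organism
SPECIES_EXP = 7                  # ~13.6 million species, rounded to 10^7


def protein_universe_exponents(genes_exp, species_exp):
    """Return (low, high) base-10 exponents of genes/organism x species."""
    low, high = genes_exp
    # Multiplying powers of ten adds their exponents.
    return low + species_exp, high + species_exp


low, high = protein_universe_exponents(GENES_PER_ORGANISM_EXP, SPECIES_EXP)
print(f"size of the protein universe: 10^{low} to 10^{high}")
```

With the exponents quoted in the text, this gives the 10^10 to 10^12 range listed in Table 1.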

There has been a long history of attempts to classify known protein structures based on subjective analysis of the secondary structure contents of proteins and their topological arrangements in the structures (9–12) and on objective analysis of 3D coordinates of Cα atoms in protein structures (1, 13, 14). These attempts resulted in, among others, two excellent databases of protein structure classification, CATH (14) and SCOP (11). A recent study, based solely on objective similarity among the 3D structures represented by Cα atoms and using a much larger structure database and multidimensional scaling, revealed that all of the known protein folds (15) and protein structures (16) cluster into four elongated regions in the very sparsely populated protein structure space (Fig. 2). Interestingly, these four groups correspond approximately to the four classes defined by Levitt and Chothia (9) and used in SCOP, the Structural Classification of Proteins (11).

The fact that most proteins are structured and that the protein structure space is very sparsely populated and restricted mostly to the four elongated regions suggests that mutations in genes encoding proteins have been constrained to those resulting in a structurally viable protein occupying one of the four allowed regions of the protein structure space: structural selection or “designability” (17, 18).

To obtain information on the evolution of these structural classes, we present a simple way of estimating the evolutionary ages of the common structural ancestor (CSA) of each protein sequence family of known fold. Assigning the age of the CSA of a protein family represented by each representative protein in the protein structure space (16) makes it possible to imbed the evolutionary information into the map of protein structure universe. We assign the age of the CSA of a protein family to be the same as the age of the most recent common ancestral organism that presumably contained the CSA of the family. Finally, we convert the map of the protein structure universe into the map of the ages of CSAs. Based on the analysis of these maps of protein structure universe and the evolutionary ages of the CSAs, we propose a model for the evolution of protein structural classes and a model for the evolution of protein sequence families.

We start with the following facts and assumptions:

(i) There is a key difference between the evolution of organisms vs. the evolution of proteins: the current model of evolution of organisms has the absolute requirement of reproduction of organisms and, thus, all present-day organisms ultimately come from one common ancestor organism. However, the evolution of proteins, therefore genes, does not need to follow the evolutionary path of organismic reproduction. Rather, the evolution of proteins is directly related to improved, unaltered, or diversified molecular functions, and the protein function is directly related to protein structure.

(ii) Protein structures are more conserved than sequences in evolution, thus most proteins in a given sequence family have similar or related molecular structures.

(iii) All information about protein structures is derived from the proteins of present-day organisms, and the protein universe of the present-day organisms represents a time-sliced view of all proteins at their various stages of evolution.

Table 1. The estimated orders of magnitude of the total numbers in various categories for all proteins in all organisms on Earth

Category                                 Estimated orders of magnitude
Genome size (base pairs)                 10^6 to 10^11
No. of genes in an organism              10^3 to 10^5
No. of living organisms on Earth         ~10^7
Size of the protein universe on Earth    ~10^10 to 10^12
No. of protein sequence families         ~10^5
No. of protein structural families       ~10^4
Protein fold of known structure          ~10^3

Author contributions: I.-G.C. and S.-H.K. designed research; I.-G.C. performed research; I.-G.C. and S.-H.K. analyzed data; and I.-G.C. and S.-H.K. wrote the paper.

The authors declare no conflict of interest.

Freely available online through the PNAS open access option.

Abbreviations: CSA, common structural ancestor; MRCA, most recent common ancestor.

‡To whom correspondence should be addressed. E-mail: [email protected].

© 2006 by The National Academy of Sciences of the USA

14056–14061 | PNAS | September 19, 2006 | vol. 103 | no. 38 | www.pnas.org/cgi/doi/10.1073/pnas.0606239103

Results

Evolutionary Age of CSAs. Mapping the protein structure universe revealed four major clusters of protein structures (1, 15, 16). An examination of the map suggested a hint of imbedded evolutionary time in the map. To estimate the “age” of a protein structure in the map, we define the term CSA: For a given protein structure, all its sequence homologues are searched from a sequence database, for example, from the Pfam database (3), and all of the organisms that contain the genes coding for the members of that sequence family are identified. We then find the most recent common ancestor (MRCA) node of these organisms in the phylogenetic tree of life constructed based on the small subunit rRNA gene as described in Materials and Methods. We make an assumption that the CSA of the protein and its family members was present in the MRCA organism (Fig. 3a), and that the age of the CSA is represented by the phylogenetic distance between the MRCA and the reference node in the tree. The procedure is shown schematically in Fig. 3b.
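The age assignment described above can be sketched on a toy tree. This is a minimal illustration, not the authors' code: the tree, its node names, and its branch lengths are invented, and the "reference node" is assumed here to be a present-day member of the family, so that the age is the branch length from the MRCA down to the present day.

```python
# Toy rooted tree stored as child -> (parent, branch_length).
# Node names and lengths are hypothetical; every leaf is 3.0 from the root.
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "C": ("n2", 2.0),
    "n1": ("n2", 1.0),
    "D": ("root", 3.0),
    "n2": ("root", 1.0),
}


def ancestors(tree, node):
    """Return the path from node up to the root, node included."""
    path = [node]
    while node in tree:
        node = tree[node][0]
        path.append(node)
    return path


def mrca(tree, taxa):
    """Most recent common ancestor node of the given taxa."""
    taxa = list(taxa)
    common = set(ancestors(tree, taxa[0]))
    for taxon in taxa[1:]:
        common &= set(ancestors(tree, taxon))
    # Walking up from one taxon, the first shared node is the MRCA.
    for node in ancestors(tree, taxa[0]):
        if node in common:
            return node
    raise ValueError("taxa do not share a root")


def branch_length_to(tree, ancestor, leaf):
    """Sum the branch lengths from leaf up to ancestor."""
    total = 0.0
    node = leaf
    while node != ancestor:
        parent, length = tree[node]
        total += length
        node = parent
    return total


def csa_age(tree, family_taxa):
    """Return (MRCA node, age) for the organisms that carry a family."""
    family_taxa = list(family_taxa)
    node = mrca(tree, family_taxa)
    # Assumption: age is measured from the MRCA to a present-day member.
    return node, branch_length_to(tree, node, family_taxa[0])


for family in (["A", "B"], ["A", "C"], ["A", "D"]):
    print(family, csa_age(TREE, family))
```

In this toy tree, a family found only in A and B gets a young CSA, whereas a family shared by A and D is traced to the root and gets the oldest age.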

Evolution of the Relative Abundance of the Protein Structural Classes. When each protein structure in the protein structure space (Fig. 2) is represented by the relative age of the CSA (Fig. 4a) of the protein family to which it belongs, we see a general trend: the proteins with young CSA age (blue) belong mostly to three classes (α, β, and α+β classes), those with middle age (green or yellow) belong to the same three classes plus α/β class, and, finally, the majority of the CSAs of old age (red) belong to α/β class. This observation suggests that recently born and still-evolving proteins belong to all-α or all-β class (as well as their random mixtures, α+β class), but the majority of the “mature” proteins belong to α/β class. The trends of the evolution of the protein structural classes are more easily visible in a distribution of structural classes across the evolutionary ages (Fig. 4b Upper) or the relative percent population of structural classes in a given evolutionary age (Fig. 4b Lower).

We also notice that the protein chain lengths correlate significantly (Spearman’s rank correlation coefficient r = 0.3098, P < 2 × 10^-16) with the ages of CSAs (Fig. 4c). Combining these observations with the assumption that the present-day proteins represent the entire spectrum of proteins at different stages of evolution from their respective CSAs, we propose a scenario for the evolution of protein structural classes: ancestral proteins of small short secondary structures primarily in three classes (α, β, and α+β classes) evolve to medium-sized proteins of four classes (α, β, α+β, and α/β classes) in roughly similar proportions, then to larger proteins with a preponderance in α/β class, as schematically shown in Fig. 5.
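For readers unfamiliar with the statistic quoted above, the following sketch computes Spearman's rank correlation coefficient in plain Python as the Pearson correlation of average ranks. The chain lengths and ages in the usage lines are made-up values for illustration only, and no P value is computed.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical chain lengths (residues) and relative CSA ages (0 to 1).
chain_lengths = [48, 120, 95, 310, 205, 150]
csa_ages = [0.10, 0.35, 0.40, 0.90, 0.55, 0.60]
print(round(spearman(chain_lengths, csa_ages), 3))
```

Because only ranks enter the calculation, the coefficient is insensitive to the units of chain length and to the scaling of the relative ages.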

Fig. 1. The family sizes of protein sequence domains in Pfam database (3) (release 16; 7,677 Pfam families) have a power law distribution. The Pfam families (x axis) were sorted by their family size. The number of members in a given Pfam family (y axis) was truncated at ~1,000 in this plot. The dotted line indicates the median family size (41) of all protein sequence families in Pfam.

Fig. 2. A global view of the protein structure space (16). The 1,898 nonredundant protein structures from Protein Data Bank are mapped in the 3D space to visualize the major feature of the map. The protein structure space is sparsely populated, and all of the proteins of known structures cluster mostly into four elongated regions, which correspond approximately to four SCOP classes (all-α, all-β, α+β, and α/β) of protein structures indicated by red, yellow, purple, and cyan spheres, respectively. The small proteins and multidomain protein classes are represented by green and black spheres, respectively. All structural class assignments were based on the SCOP classification. Three axes are drawn in to visualize high-population regions of all-α, all-β, and α/β class proteins, and the “origin” is represented by a large orange ball at the point where two of the axes meet.

Evolution of Protein Families: Multiple Birth Model. We have expanded our approach to estimate the evolutionary ages of all curated protein sequence families in Pfam. As was evident from the ages of the CSAs of the proteins with known structural folds, not all present-day proteins evolved from the proteins of the last common ancestor, but new CSAs can be traced to various points throughout evolutionary time. Combining this information again with the assumption that the protein universe of the present-day organisms represents a time-sliced view of all proteins at their various stages of evolution, we propose a possible scenario for the evolution of protein families as illustrated in Fig. 6. We hypothesize that, although all present-day organisms may have evolved from the last common ancestral organisms by organismic replication from an ancestor to a descendant organism, most of the present-day protein families did not evolve from ancestor proteins that existed in the last common ancestral organism (Fig. 6a), as expected for “the single birth model” of protein family evolution (19), but new CSAs were born throughout evolutionary time (Fig. 6b, the multiple birth model of protein evolution), and they evolved to the present-day proteins or died out.

Discussion

We emphasize that our studies are aimed at gaining a coarse-grained global view and overall trends associated with the evolution of the protein structure classes and sequence families. In our multistep processes projecting the evolutionary ages onto the protein structural space map, many details, such as the effect of horizontal gene transfer and the sampling of only those globular proteins for which the 3D structures are known, are “smoothed out” to extract the major trends of evolution of the protein structure classes and sequence families. For example, we remove those proteins that may have entered an organism through horizontal gene transfer by the jackknife test as described in Materials and Methods. Some of our conclusions are consistent with those of others. For example, the α/β class proteins as the most ancient proteins also have been suggested by a parsimonious scenario of fold occurrence in genomes (20), and birth, death, and diversification of genes have been described in ref. 21.

There are several questions raised by the features of the protein structural space and its evolutionary implications. Some of them are as follows:

(i) How is the gene for a new CSA born? Because the new CSA has no traceable single ancestral protein, we propose that the new gene for the CSA was constructed of multiple gene fragments, for example, by multiple recombination events mediated by phages, viruses, or other mechanisms.

(ii) Is the protein structure space constantly expanding or has it reached an equilibrium state? One possible argument for the equilibrium state is that a newly born protein evolves into gradually larger-sized proteins of improved, neutral, or diversified functions until it reaches an equilibrium, at which point destabilizing effects of the large size (of the protein, thus, its gene) outweigh the additional changes in function or diversity.

(iii) What is the implication of Fig. 4b, which reveals three evolutionary stages where the relative abundance of the four major protein structure classes changed their relative ranking? One possible implication is that there were three evolutionary periods when the Earth environment changed dramatically.

Fig. 3. Schematic diagram for building a phylogenetic tree representing all of the organisms that contain the proteins of known structures or their sequence homologues (a) and assigning the age of the CSA of a protein family (b). The MRCA organism of the organisms represented by the members of a protein family is traced in the tree (red solid line). We then assume that the CSA resided in the MRCA organism, and we assign the phylogenetic distance (the sum of black thick solid lines) from present day to MRCA as the age of the CSA.

Fig. 4. Evolution of the relative abundance of the protein structural classes. (a) The “age map” of CSAs. The color gradient, from blue (the youngest) to red (the oldest), represents the relative age of the CSAs of the protein families represented by each of the nonredundant protein structures and their sequence homologues. We used the average age of 22 nearest neighbors (the median of the number of statistically significant neighbors; DaliLite z score > 2) of each point to reveal the major trends by smoothing out the noise. The proteins near the origin are youngest, and those near the end of the α/β axis are the oldest. (b) Relative abundance of structural classes to which all of the protein families belong vs. the ages of their CSAs. (b Upper) The total number of CSAs in each structural class is normalized to 1, so that the population density of each fold class is plotted vs. relative evolutionary ages. (b Lower) The sum of CSAs in all four classes at a given age is normalized to 1, and the percentage population for four classes is plotted for that age. Both are alternative presentations of Fig. 4a. (c) The “chain length map” of the protein structure space, where each of the nonredundant protein structures is represented by its chain length: from blue for proteins of short chain length (<50 residues) to red for those of long chain length (>300 residues). The proteins near the origin are short, and those near the ends of the feature axes are large.


Materials and Methods

Construction of a Phylogenetic Tree Representing All of the Organisms That Contain the Proteins of Known Structures or Their Sequence Homologues. We used the 1,898 protein chains representing a nonredundant set of all of the known protein structures in Protein Data Bank (PDB) [PDB_select 25 data set (22) used by Hou et al. (16) for mapping the protein structure space] as a reference data set. For each chain, we identified the protein domain family in the Pfam database (Release 16.0) (3) to which it belongs and all of the organisms represented by the members of the family. To reconstruct the phylogenetic tree of organisms covering all members of the retrieved protein families, we combined the taxonomic sources of all members of the protein families and extracted nonredundant species (65,532 organisms).

To simplify the tree structure, we grouped the nonredundant organisms at a higher level of taxonomic classification. When the fourth level of taxonomic classification listed in the Pfam database is used for grouping, the final number of taxa is reduced to 468. Among them, the gene sequences of small subunit rRNA (16S rRNA of prokaryotes or 18S rRNA of eukaryotes) of 345 taxa were available in the European ribosomal RNA database (23) (Table 2, which is published as supporting information on the PNAS web site). For each of these 345 taxa, we chose the longest small subunit rRNA sequence (but not shorter than 1,200 bases). Using these prealigned rRNA sequences, a universal tree of life for the 345 taxa was constructed with the neighbor joining (NJ) and maximum likelihood (ML) methods in the PHYLIP package (24). For the NJ tree, we used 100 bootstrapped sequence replicates and obtained a consensus tree. Because the consensus tree does not produce branch lengths, the branch lengths of the consensus tree were recalculated by the maximum likelihood method while keeping the NJ tree topology. The ML tree was built under the assumption of constant rate with the F84 model of sequence evolution. For both tree-building procedures, we used a bacterial taxon (Aquifex pyrophilus) as an outgroup. Because both trees were topologically similar and the evolutionary ages of protein structural classes calculated by the method (see below) showed no disparity in terms of shape of distribution and overall trend (Fig. 7, which is published as supporting information on the PNAS web site), we selected the ML tree as a reference tree for the purpose of obtaining a global view of the phylogenetic relationship among the organisms at a higher taxonomic level.
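The taxon grouping and sequence selection described above can be sketched as follows. This is only an illustration of the selection rule (group at the fourth taxonomic level, keep the longest rRNA sequence of at least 1,200 bases per group); the record layout, lineage strings, and function names are assumptions, and the tree building itself with PHYLIP is not shown.

```python
MIN_LENGTH = 1200  # bases; shorter small subunit rRNA sequences are skipped
GROUP_LEVEL = 4    # group organisms at the fourth taxonomic level


def group_key(lineage, level=GROUP_LEVEL):
    """Truncate a lineage (broadest rank first) to its first `level` ranks."""
    return tuple(lineage[:level])


def pick_representatives(records, min_length=MIN_LENGTH):
    """Map each grouped taxon to its longest sufficiently long sequence.

    `records` is an iterable of (lineage, sequence) pairs, a hypothetical
    layout chosen for this sketch.
    """
    best = {}
    for lineage, sequence in records:
        if len(sequence) < min_length:
            continue
        key = group_key(lineage)
        if key not in best or len(sequence) > len(best[key]):
            best[key] = sequence
    return best


# Synthetic records: two organisms fall in the same fourth-level taxon.
records = [
    (["Bacteria", "Aquificae", "Aquificales", "Aquificaceae", "sp1"], "A" * 1300),
    (["Bacteria", "Aquificae", "Aquificales", "Aquificaceae", "sp2"], "C" * 1500),
    (["Archaea", "Euryarchaeota", "Halobacteria", "Halobacteriales", "sp3"], "G" * 900),
]
for taxon, seq in pick_representatives(records).items():
    print(taxon[-1], len(seq))
```

A taxon whose sequences are all shorter than the cutoff simply drops out, which mirrors how only a subset of the grouped taxa ends up in the tree.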

Fig. 5. Proposed scenario for the evolution of protein structural classes extrapolated from the age map of common structural ancestors. The age map (Fig. 4a) is a snapshot at present time of the global evolutionary process of protein structural classes. The observation that the age map is highly correlated with the chain length map (Fig. 4c) suggests an evolutionary history of structural classes in which ancestral proteins of small and short secondary structures primarily in three classes (α, β, and α+β classes) evolve to medium-sized proteins in four classes (α, β, α+β, and α/β classes) with the least amount in ��� class, then to larger proteins with a preponderance in α/β class.

Fig. 6. Model for the evolution of protein families: (a) Single birth model of protein families, where all of the present-day protein families evolved from the proteins that existed in the last common ancestral organism. Each colored circle represents a CSA of a protein sequence family of present day, whose evolutionary path is schematically shown by the tree of the same color. The large rectangular box surrounding the circles indicates the single birth event of the CSAs of all protein sequence families, some of which died out. (b) Multiple birth model of protein families, where the CSAs of the present-day protein families (represented by circles in squares) emerged at different points in the evolutionary time.

Estimating the Relative Age of the CSAs. We make the assumption that the CSA of a given protein sequence family appeared most recently in an organism at the MRCA node and the age of the CSA is represented by the branch length between the MRCA and a common reference node. To determine the MRCA node in the tree, we mapped all members of the Pfam family to which a given protein of known structure belongs on the tree. To remove the effect of horizontal gene transfer on estimating the CSA ages, we first tested the congruency of multiple MRCA nodes of the organisms represented by a protein family by using a jackknife operation: (i) each MRCA node is identified for all member organisms minus one, and the evolutionary age of the MRCA is estimated; (ii) the ages of the multiple MRCAs are examined, and those that are statistically outside of the monomodal distribution are removed; and (iii) the median value of the remaining ages is taken. The ages of CSAs were normalized to be in the range of 0 to 1. These relative ages are assigned to each of the 1,898 nonredundant protein structures in the protein structure space to visualize the major features of the distribution of the ages vs. protein structure classes (see below).
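The jackknife steps above can be outlined in code. This sketch makes several assumptions that the text does not specify: the MRCA age of a set of organisms is supplied by a caller-provided function (`age_of`, hypothetical), and "statistically outside of the monomodal distribution" is approximated by a cutoff of three median absolute deviations from the median, which is a stand-in criterion rather than the authors' test.

```python
from statistics import median


def jackknife_ages(taxa, age_of):
    """MRCA age for every leave-one-out subset of the member organisms."""
    taxa = list(taxa)
    return [age_of(taxa[:i] + taxa[i + 1:]) for i in range(len(taxa))]


def drop_outliers(ages, cutoff=3.0):
    """Keep ages within cutoff * MAD of the median (assumed criterion)."""
    center = median(ages)
    mad = median(abs(a - center) for a in ages)
    if mad == 0:
        # More than half of the ages coincide; keep only that value.
        return [a for a in ages if a == center]
    return [a for a in ages if abs(a - center) <= cutoff * mad]


def jackknife_csa_age(taxa, age_of):
    """Median of the leave-one-out MRCA ages after outlier removal."""
    return median(drop_outliers(jackknife_ages(taxa, age_of)))


def normalize(ages):
    """Rescale ages linearly to the range 0 to 1."""
    low, high = min(ages), max(ages)
    if high == low:
        return [0.0 for _ in ages]
    return [(a - low) / (high - low) for a in ages]


# Usage with made-up leave-one-out ages: one value lies far from the rest.
sample = [10, 12, 11, 9, 40]
kept = drop_outliers(sample)
print(kept, median(kept))
print(normalize([2, 4, 6]))
```

Other outlier criteria would fit the same skeleton; only `drop_outliers` would need to change.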

Mapping the Relative Ages of CSAs on the Protein Structure Space. Mapping of the relative ages of proteins of known structure is done in two stages: First, 1,898 nonredundant protein structures are positioned (mapped) in the protein structure space based on their all-to-all structural similarities as described in Hou et al. (15, 16) and briefly summarized below. As mentioned earlier, we used the PDB_select 25 data set, which contained 1,949 protein chains with <25% pairwise sequence identity. Of those, 51 chains were further removed because of low resolution or length requirements of the DaliLite (25) program that we used to calculate the similarity of protein structures. The remaining data set has 1,898 chains. The pairwise structural similarity for the 1,898 protein chains was measured by using the DaliLite program. The 1,898 × 1,898 similarity score matrix [s_ij] (where i = 1, ..., 1,898; j = 1, ..., 1,898) was converted to a dissimilarity matrix [d_ij], the “distance matrix,” by using

d_{ij} = \begin{cases} s_{99.95} - s_{ij}, & (s_{99.95} > s_{ij},\ i \neq j) \\ 0, & (i = j) \\ s_{99.95}, & (\text{otherwise}) \end{cases} \qquad [1]

where s_99.95 is the 99.95 percentile value of the maximum value among all off-diagonal s_ij’s (i.e., i ≠ j). The dissimilarity matrix then was subjected to the classical multidimensional scaling (MDS) procedure (26) to find the positional coordinates in a multidimensional (1,898 dimension) space of the protein structure universe. We used s_99.95 to prevent a few extremely large similarity scores from dominating the distribution feature of the structural space map. To capture and visualize the major features of the high dimensional space, we represent the protein structure space in three dimensions (Fig. 2) by using the three components with highest eigenvalues, which are substantially greater than the rest.
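A compact sketch of the two steps just described, the conversion in Eq. 1 and classical MDS, is given below using NumPy. It rests on stated assumptions: s_99.95 is taken to be the 99.95th percentile of the off-diagonal scores, the score matrix is assumed symmetric, and the classical (Torgerson) eigen-decomposition is used as a generic form of MDS, which may differ in detail from the procedure of ref. 26. Function names are hypothetical.

```python
import numpy as np


def to_dissimilarity(scores, percentile=99.95):
    """Convert a similarity matrix to dissimilarities following Eq. 1."""
    s = np.asarray(scores, dtype=float)
    n = s.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    # Assumed reading of s_99.95: percentile of the off-diagonal scores.
    cap = np.percentile(s[off_diag], percentile)
    d = np.full_like(s, cap)          # "otherwise" case of Eq. 1
    below = off_diag & (s < cap)      # s_99.95 > s_ij and i != j
    d[below] = cap - s[below]
    np.fill_diagonal(d, 0.0)          # i == j
    return d


def classical_mds(d, k=3):
    """Return (coordinates, eigenvalues) of the top-k MDS components."""
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering  # double centering
    eigvals, eigvecs = np.linalg.eigh(gram)         # ascending order
    top = np.argsort(eigvals)[::-1][:k]
    # Negative eigenvalues (non-Euclidean input) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale, eigvals[top]


# Usage on a tiny synthetic similarity matrix (values are made up).
scores = np.array([[0.0, 5.0, 1.0],
                   [5.0, 0.0, 2.0],
                   [1.0, 2.0, 0.0]])
dissim = to_dissimilarity(scores, percentile=50)
coords, eigvals = classical_mds(dissim, k=2)
print(dissim)
print(coords.shape)
```

Inspecting the returned eigenvalues is the natural way to check how many components stand out from the rest, as the text does when it keeps three.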

Second, we represent the relative age of each of the nonredundant protein structures by the relative age of the CSA of the sequence family to which that particular protein belongs. Then, we apply population averaging by replacing the age of each CSA by the average age of 22 nearest neighbors weighted on the distances in the map. The number of nearest neighbors was chosen as the median of the number of statistically significant score pairs (DaliLite z score > 2) of the 1,898 protein chains. The weighted population averaging serves to visualize the major trends of the “age map of CSAs” and to smooth out the “noise” due to factors such as horizontal transfer of genes, sparse sampling of protein families, and the members of each family (Fig. 4).
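The population averaging can be sketched as a distance-weighted nearest-neighbor mean. The text does not state the weighting function or whether a point counts among its own neighbors, so this illustration assumes inverse-distance weights and excludes the point itself; the function name and the small `eps` guard are likewise assumptions. At least two points are required.

```python
import numpy as np


def smooth_ages(coords, ages, k=22, eps=1e-9):
    """Replace each age by the weighted mean age of its k nearest neighbors."""
    coords = np.asarray(coords, dtype=float)
    ages = np.asarray(ages, dtype=float)
    k = min(k, len(ages) - 1)
    # All-to-all Euclidean distances in the map.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)              # exclude the point itself
    neighbors = np.argsort(dist, axis=1)[:, :k]
    neighbor_dist = np.take_along_axis(dist, neighbors, axis=1)
    weights = 1.0 / (neighbor_dist + eps)       # assumed inverse-distance weights
    return (weights * ages[neighbors]).sum(axis=1) / weights.sum(axis=1)


# Usage on random synthetic coordinates and ages.
rng = np.random.default_rng(0)
coords = rng.normal(size=(100, 3))
ages = rng.uniform(size=100)
smoothed = smooth_ages(coords, ages)
print(smoothed.shape)
```

Because each smoothed value is a weighted mean of observed ages, the result always stays within the range of the input ages.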

We thank Drs. Jingtong Hou, Gregory Sims, and Se-Ran Jun for our weekly discussions on the subjects of this work as well as other related subjects and Drs. David Eisenberg, Norman Pace, and Yun S. Song, whose expertise helped us to improve our thoughts, for valuable comments. This work has been supported by National Institutes of Health Grant GM62412.

1. Holm L, Sander C (1996) Science 273:595–603.
2. Hawksworth PM, Kalin-Arroyo MT (1995) Global Biodiversity Assessment (Cambridge Univ Press, Cambridge, UK).
3. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer ELL, et al. (2004) Nucleic Acids Res 32:D138–D141.
4. Dayhoff MO (1976) Fed Proc 35:2132–2138.
5. Wolf YI, Grishin NV, Koonin EV (2000) J Mol Biol 299:897–905.
6. Denton M, Marshall C (2001) Nature 410:417.
7. Coulson AF, Moult J (2002) Proteins 46:61–71.
8. Andreeva A, Howorth D, Brenner SE, Hubbard TJP, Chothia C, Murzin AG (2004) Nucleic Acids Res 32:D226–D229.
9. Levitt M, Chothia C (1976) Nature 261:552–558.
10. Richardson JS (1977) Nature 268:495–500.
11. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) J Mol Biol 247:536–540.
12. Chothia C, Hubbard T, Brenner S, Barns H, Murzin A (1997) Annu Rev Biophys Biomol Struct 26:597–627.
13. Michie AD, Orengo CA, Thornton JM (1996) J Mol Biol 262:168–185.
14. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM (1997) Structure (London) 5:1093–108.
15. Hou J, Sims GE, Zhang C, Kim SH (2003) Proc Natl Acad Sci USA 100:2386–2390.
16. Hou J, Jun SR, Zhang C, Kim SH (2005) Proc Natl Acad Sci USA 102:3651–3656.
17. Li H, Helling R, Tang C, Wingreen N (1996) Science 273:666–669.
18. Tiana G, Shakhnovich BE, Dokholyan NV, Shakhnovich EI (2004) Proc Natl Acad Sci USA 101:2846–2851.
19. Chothia C, Gough J, Vogel C, Teichmann SA (2003) Science 300:1701–1703.
20. Winstanley HF, Abeln S, Deane CM (2005) Bioinformatics 21(Suppl 1):i449–i458.
21. Koonin EV, Wolf YI, Karev GP (2002) Nature 420:218–223.
22. Hobohm U, Sander C (1994) Protein Sci 3:522–524.
23. Wuyts J, Perriere G, Van de Peer Y (2004) Nucleic Acids Res 32:D101–D103.
24. Felsenstein J (1989) Cladistics 5:164–166.
25. Holm L, Park J (2000) Bioinformatics 16:566–567.
26. Havel TF, Kuntz ID, Crippen GM (1983) J Theor Biol 104:359–381.
