ALIGNMENTS AND COMPARATIVE PROFILESOF PICORNAVIRUS GENERA

11
In: Molecular Biology of Picornaviruses. Eds: B. Semler & E. Wimmer, ASM Press, pp 149-155 (2002) ALIGNMENTS AND COMPARATIVE PROFILESOF PICORNAVIRUS GENERA Ann C. Palmenberg* and Jean-Yves Sgro Insti. for Molecular Virology , University of Wisconsin-Madison,, 1525 Linden Drive, Madison, WI 53706. Contact Information: Ann Palmenberg, Ph: 608-262-7519, FAX: 608-262-6690, E-mail: [email protected] Jean-Yves Sgro, Ph: 608-262-7464, FAX: 608-262-7414, E-mail: [email protected] *Corresponding Author (for galley proofs): A. Palmenberg, Institute for Molecular Virology, University of Wisconsin-Madison, 1525 Linden Drive, Madison, WI 53706 Background : Picornavirus sequence data have accumulated at an astonishing rate. Since the first 1060 bases at the 3’ terminus of PV-1 Mahoney genome were published in 1980 (10), the polio database alone has grown to include at least 12 complete genomes and more than 112,000 bases from >36 strains. A minimum of 150 other complete picornavirus genomes also have been sequenced, with >90 additional strains available as significant sequence fragments (>3000 b), in a tally that does not even begin to include several hundred thousand additional bases from a multitude of smaller segments generated in epidemiological or evolutionary studies. A rough estimate of 2,000,000 bases, would not undervalue the current dataset. International sequence agencies like GenBank (www.ncbi.nlm.nih.gov/Genbank) record the published picornavirus offerings under the broad sub-group classification of “viruses.” The annotated information may be searched, compared and retrieved with standard Internet tools. But if you don’t know what you’re looking for, or whether a favored strain is actually on deposit in GenBank, a better place to start is The Picornavirus Sequence Database (www.iah.bbsrc.ac.uk/virus/picornaviridae), a beautifully designed web page compiled by Nick Knowles at the Institute for Animal Health, Pirbright, UK. This comprehensive site catalogues all known family member into appropriate genera and species according to the latest taxonomic designations. GenBank accession numbers, published (and unpublished) references, fragment size and strain origin are registered for every sequence. Should you need to know, for example, that a 327 bp fragment from the HRV-72 1B gene (GenBank: Z47574) was available for comparison (9), The Picornavirus Home Page is the place to inquire. This outstanding resource, like a huge technical library, represents an invaluable research service contribution by Nick, and we are certainly indebted for his extraordinary efforts. Nevertheless, placing books on shelves is not the same as reading them. Companion literary tools such as dictionaries or thesauruses can also help make any text more informative, especially when the contents are written in a foreign language. Comparative sequence analysis, in some respects, is analogous to compiling a thesaurus for Nature’s language. Although various sequence segments may appear superficially to encode similar biological meanings, it is true in Nature, as in English, that there are very few exact synonyms. That is, few “words” in either language are ever observed to exactly substitute for each other in alternative contexts without changing their logic or specific meaning. Just as the etymology of words reflects the sum of their linguistic histories, the subtle peculiarities of each sequence are indicative of unique evolutionary histories. Sequence context and content can both contribute to phenotype. Genomics is a young field, and unfortunately, when it comes to the language of life, our current interpretive skills more closely resemble gauche tourists in Paris, rather than fluent natives. Any singular ribosome still translates the delicate nuances of the basic genetic code with more precision and finesse than our best bioinformaticists. Yet we continue to compare sequences because (hubris aside), it’s still one of the small number of available tools with any possibility of breaking down biological syntax

Transcript of ALIGNMENTS AND COMPARATIVE PROFILESOF PICORNAVIRUS GENERA

Page 1: ALIGNMENTS AND COMPARATIVE PROFILESOF PICORNAVIRUS GENERA

In: Molecular Biology of Picornaviruses. Eds: B. Semler & E. Wimmer, ASM Press, pp 149-155 (2002)

ALIGNMENTS AND COMPARATIVE PROFILESOF PICORNAVIRUS GENERA

Ann C. Palmenberg* and Jean-Yves Sgro

Insti. for Molecular Virology , University of Wisconsin-Madison,, 1525 Linden Drive, Madison, WI 53706. Contact Information: Ann Palmenberg, Ph: 608-262-7519, FAX: 608-262-6690, E-mail: [email protected] Jean-Yves Sgro, Ph: 608-262-7464, FAX: 608-262-7414, E-mail: [email protected] *Corresponding Author (for galley proofs): A. Palmenberg, Institute for Molecular Virology, University of Wisconsin-Madison, 1525 Linden Drive, Madison, WI 53706 Background : Picornavirus sequence data have accumulated at an astonishing rate. Since the first 1060 bases at the 3’ terminus of PV-1 Mahoney genome were published in 1980 (10), the polio database alone has grown to include at least 12 complete genomes and more than 112,000 bases from >36 strains. A minimum of 150 other complete picornavirus genomes also have been sequenced, with >90 additional strains available as significant sequence fragments (>3000 b), in a tally that does not even begin to include several hundred thousand additional bases from a multitude of smaller segments generated in epidemiological or evolutionary studies. A rough estimate of 2,000,000 bases, would not undervalue the current dataset. International sequence agencies like GenBank (www.ncbi.nlm.nih.gov/Genbank) record the published picornavirus offerings under the broad sub-group classification of “viruses.” The annotated information may be searched, compared and retrieved with standard Internet tools. But if you don’t know what you’re looking for, or whether a favored strain is actually on deposit in GenBank, a better place to start is The Picornavirus Sequence Database (www.iah.bbsrc.ac.uk/virus/picornaviridae), a beautifully designed web page compiled by Nick Knowles at the Institute for Animal Health, Pirbright, UK. This comprehensive site catalogues all known family member into appropriate genera and species according to the latest taxonomic designations. GenBank accession numbers, published (and unpublished) references, fragment size and strain origin are registered for every sequence. Should you need to know, for example, that a 327 bp fragment from the HRV-72 1B gene (GenBank: Z47574) was available for comparison (9), The Picornavirus Home Page is the place to inquire. This outstanding resource, like a huge technical library, represents an invaluable research service contribution by Nick, and we are certainly indebted for his extraordinary efforts.

Nevertheless, placing books on shelves is not the same as reading them. Companion literary tools such as dictionaries or thesauruses can also help make any text more informative, especially when the contents are written in a foreign language. Comparative sequence analysis, in some respects, is analogous to compiling a thesaurus for Nature’s language. Although various sequence segments may appear superficially to encode similar biological meanings, it is true in Nature, as in English, that there are very few exact synonyms. That is, few “words” in either language are ever observed to exactly substitute for each other in alternative contexts without changing their logic or specific meaning. Just as the etymology of words reflects the sum of their linguistic histories, the subtle peculiarities of each sequence are indicative of unique evolutionary histories. Sequence context and content can both contribute to phenotype. Genomics is a young field, and unfortunately, when it comes to the language of life, our current interpretive skills more closely resemble gauche tourists in Paris, rather than fluent natives. Any singular ribosome still translates the delicate nuances of the basic genetic code with more precision and finesse than our best bioinformaticists. Yet we continue to compare sequences because (hubris aside), it’s still one of the small number of available tools with any possibility of breaking down biological syntax

Page 2: ALIGNMENTS AND COMPARATIVE PROFILESOF PICORNAVIRUS GENERA

- - 2

into the underlying codes. Within such compilations, we vest the continual hope of teasing from shadowy precedents, those common elements of circumstance and function that perhaps foretell a significant phenotypic impact. Why does this gene convey persistence? How does this IRES attenuate the virus? What specific capsid changes might allow compromised immunity?

Since the authors’ earliest feeble pair wise comparisons between the 3’ UTRs of EMCV (3) and

PV1 (10), it’s been clear that properly formed picornavirus alignments can have some practical use in this regard (15). Good alignments are never trivial to generate, but given that one can learn to understand and respect the principles that underlie algorithmic reconstruction of evolution, reasonable datasets can be created by judicious application of computing power and logic. Another rather trickier problem lies in the definition of a suitable presentation format for effective public data mining. In 1989, with our (then) comparative sequences refined by capsid crystallographic data, we published an 11 page figure summarizing the relative orientations of 33 picornavirus P1-region sequences (16), an exercise we thought rather impressive at the time. With today’s much larger alignments, that density would translate into 150 published pages, just for the current protein sequences. Obviously, the printing logistics for huge alignments are hopelessly impractical. Instead, the updated picornavirus datasets cited in this chapter have been placed on a public web site (www.bocklabs.wisc.edu/picorna), and mirrored by the Picornavirus Home Page (above). This site also lists the strain abbreviations, gives accession numbers, sequence bibliographies and algorithms, as well as provides tables of base compositions, codon frequencies, genome RNA folds (17)and other comparative picornavirus data. Reference use of this information or its future updates should cite this chapter as the published source. How the Alignments were Formed: These days anyone with a hard drive and an Internet connection can align sequences. The commonly available CLUSTAL algorithm and similar pair wise distance methods give a reasonable fit for closely related sequences or for proteins that are obvious homologs of similar length. Very nice alignments for 3Cpro, 3Dpol, or within individual genera have been generated in this way (19). Genome connoisseurs however, are always looking for new methods to tweak the data, especially in sequence regions where good fits are harder to achieve or recognize. Our previous capsid alignments were based on superimposition of virion crystal structure hydrogen-bonding maps, then extended by multiple, reiterative pair wise comparisons to include similar related sequences (16). More recently, we used those alignments as founder data for profile hidden Markov model analyses (HMM) aimed at the optimal inclusion of newer, less-related sequences (4). Profiles are position-specific scoring matrices that sum the variability for each column in an existing alignment (7) creating chained likelihood tables (a Markov model) of substitution probabilities (4). New sequences are then fit snugly against the total framework of the profile. Using the HMMER program suite (Wisconsin Package Version 10.2, Genetics Computer Group, Madison, Wisc.), full-length picornavirus genome HMM-profiles (lengths of 8-9000 b) were calculated from the previous alignments (16). The profiles proved remarkably powerful at positioning new 5’ and 3’ UTRs and fitting them into the composites. Parallel polyprotein HMM-profiles (lengths of 2500-3000 aa) also helped ease new protein-coding sequences into logical, in-frame slots that were by definition, consistent and optimal with regard to all other members of a given genus. Back-translation of the protein alignments and linkage with the UTR fragments, gave genome-length RNA alignments that were in agreement with the best fits for the proteins and for the UTRs. The new files include (nearly) all available non-redundant sequence fragments of at least 2000 bases in length, as listed in The Picornavirus Sequence Database (January, 2001). The alignments were checked for consistency, and then refined mathematically and heuristically to (a) maximize the number of matched bases and encoded amino acids, (b) minimize the location and frequency of indels, and (c) emphasize the conservation of homologous features such as catalytic sites and proteolytic cleavage sites.

Page 3: ALIGNMENTS AND COMPARATIVE PROFILESOF PICORNAVIRUS GENERA

- - 3

Current Picornavirus Alignments: The current iterations now include genome-length alignments for the seven most populous

picornaviral genera (Enterovirus, Rhinovirus, Cardiovirus, Aphthovirus, Hepatovirus, Parechovirus, and Teschovirus), and extend over 173 different strains, providing formats for about 1,000,000 bases. The correlate polyprotein alignments, derived by translation of the aligned RNAs, include about 291,000 amino acids. Kobuvirus and Erbovirus data will be added when there are sufficient strains for comparison. The new datasets, clearly impractical in printed form, are Internet accessible at: www.bocklabs.wisc.edu/picorna. The web files are displayed as (edited) output from the GCG Wisconsin Package PRETTY program and should be downloaded with a monospace font to maintain alignment continuity. No consensus is shown, but upper case letters highlight alignment columns where at least 2/3 of the sequences conserve an identical residue. The exact plurality required for these votes varies by alignment, as described within each file. Alternative GCG file formats (*.msf or *.rsf) may be requested by e-mail ([email protected]). The web site also includes RNA and protein alignments for composite picornaviral genera, covering all genome regions with common evolutionary origins (e.g. P1, 2C, 3B, 3C and 3D). Similarity Plots: What then, do the alignments teach us about synonymous regions in picornaviruses? As avowed sequence addicts, we’ve spent countless hours staring at alignment columns and marveling at the complex variation versus conservation in the lexicon of viral proteins. Anyone who revels in this level of detail is guaranteed to find personal enlightenment in the web files. Admittedly however, even for the book of life, the repetitive text can become mind numbing after a while, and there is a certain truth to the adage, “One picture is worth a thousand words.” In this case, we chose graphs instead of pictures, but the point is similar (or is that identical?). After many false starts with different summary methods, it was ultimately decided that only full-length polyproteins should be allowed to vote in the graph-building process. This culled many sequences from the deeper, more redundant P1 regions and kept them from overpowering the P2+P3 regions, where invariably, there were fewer strains in each alignment. The chosen sequences (n = 3 to 33, depending upon the genus), were run through the GCG program, PLOTSIMILARITY, to capture an averaged identity score within a window (win =20) that slid across the alignment. Window averaging smoothes the plot contours and makes it easier to identify to regions with different identity profiles. Figure 1, A-G shows respectively, the relative amino acid conservation in the Enterovirus, Rhinovirus, Aphthovirus, Hepatovirus, Parechovirus, Cardiovirus, and Teschovirus polyproteins. The full-length sequences for individual species were then segregated further and independently recalculated to emphasize how they varied among themselves, while contributing to the contours of the parental genera. All plots for all species with at least two complete members are available on the web. For brevity, only some are shown here. The number of participating sequences (n), the averaged polyprotein identity (average), and the count of alignment columns where all residues were identical (identities), is shown relative to a polyprotein map, punctuated with appropriate cleavage sites. The protein distance scales are numbered according to alignment positions in the files for each genus. To state the obvious, conserved residues in the P2+P3 regions generally outnumber those in the P1 region for every alignment. However, the details in the distribution of conserved elements display a much more subtle and powerfully pattern. Not surprising, and without exception, there is a 1:1 correspondence between all mapped picornavirus antigenic sites and the observed P1 troughs of limited sequence identity. In the Rhinoviruses, for example, the first three segments with less than 50% identity correspond exactly to the EF-loop of 1B, the BC-loop of 1C, and the BC-loop of 1D. Better known as Nim-II, Nim-III, Nim-IA, respectively, these are the antigenic neutralization sites on HRV-14 (18;20), and presumably other HRVs as well. The shorter Nim-1B site in the DE-loop of 1D (alignment position ~720), and the COOH-end of 1D are hypervariable in all rhinos. The Enterovirus P1 troughs, as typified by poliovirus, again correspond to the Nim sites mapped on the surface of the PV-1 particle (14). Also as expected, the deepest

Page 4: ALIGNMENTS AND COMPARATIVE PROFILESOF PICORNAVIRUS GENERA

- - 4

trough in the Aphthovirus 1D protein (position ~950), is the antigenic “FMDV-loop” between the GH strands (1). Less dominant antigenic contributions from other FMDV P1 segments are consistent with this profile.

Relative to these genera, the P1 regions of Cardioviruses and Hepatoviruses are almost flat line.

The EMCV and HAV are both represented by multiple sequences (n = 9, n = 15 respectively) but each has only one serotype. Nevertheless, the little blips within the 1B and 1D (EMCV), and at the amino end of 1D (HAV), indeed denote known neutralization sites (2;5). The ThV, which share less antigenic cross-reactivity than the EMCV, have deeper troughs in the same locations, presumably from the residues that contribute to this diversity. The Teschoviruses and Parechoviruses have not yet been mapped for epitopes, but since the graphs mirror the sequence variability in the alignments, extrapolation from the better-known viruses clearly point to several probable antigenic sites. Tracings for the HPeV P1 region plot, actually superimpose quite with well with many elements of the HRV plot. In contrast, the Teschovirus plot suggests a prominent, immunogenic FMDV loop, as well as another very reactive site in 1B.

In several of these graphs, the P1-P2 boundary marks an apparently abrupt transition in sequence conservation. To illustrate with the Teschoviruses, the data from 30 complete polyproteins, representing 11 PTV serotypes (21) share an average identity of 82% in the P1 region, and 97% in the P2+P3 regions. The PV (83% and 96%), HEV-B (72% and 95%), FMDV (74% and 94%) and HPeV (81% and 92%) also have abrupt transitions. Again, it’s the individual sequence patterns, not the averages that should be of interest here. Surely, immunogenic pressures are responsible for the high mutation fixation rates in the epitope-encoding regions. That’s why there’s local variability in the loops. But the P1 segments between these troughs, representing internal beta sheets or helices have, in fact, identity values quite similar to the P2+P3 region averages. That is, the infrastructure of each genome as a whole, including internal portions of the capsid region, tends to fix mutations at a uniform rate across the species lineages. On the other hand, the non-structural regions aren't uniformly equivalent either. All the plots show the central segment of protein 2C with higher than average conservation levels. The sacrosanct region spans about 150 amino acids, with 97-99% averaged identity, and includes the NTP binding motif (Prosite # PDOC00017, [AG]-x(4)-G-K-[ST]), which probably marks the functional core of this protein. The lower than average variability here, again reflects a special selective pressure that works to moderate mutation fixation, with the goal of conserving some key structure or activity that is apparently vital for all picornaviruses.

If one looks closely, each genus shows additional regions with greater or lesser genetic stability, as

reflected by variability in the alignments. The FMDV 3A region and the EMCV 2A region are among the most variable in their respective species, while the polio, HEV-B and PTV 3D proteins are almost invariant. These plots and the alignments that formed them record the sum of the selective pressures that honed these segments during evolution. There is nothing new about the concept of different mutational fixation rates for different regions of a genome. The concept becomes important only when we want to accurately compare genomes or use their collective histories to retrace evolution. Family history, genus phylogeny, and relationships between and among species, should be followed by using the largest possible sequence fragments that are true homologs and have developed under similar selective pressures. Stated another way, long-term relationships are best fished out of the quiet end of the gene pool. Invariant regions like 2C lend themselves to comparisons outside of the family. Proteins 3B, 3C, 3D and the beta regions of P1 have obvious shared family ancestry with similar fixation rates and their relative mutations reasonably record the history of each genus. The chaotic mutational fixation in the epitope regions, like 1D, are better suited for following short term associations, like strain variation during outbreaks. The remaining genome regions like L, 2A, 3A proteins, and the 5’ and 3’ UTRs may be analogs in their functions, but they are not always genetic homologs. These regions are the best examples of loosely structured biological synonyms that can be quite useful in ordering species within genera, or strains

Page 5: ALIGNMENTS AND COMPARATIVE PROFILESOF PICORNAVIRUS GENERA

- - 5

within species, but their informational content is very sensitive to specific genome context, and is rarely directly comparable. Nucleotide Preferences: If the alignments and similarity plots tabulate the thesaurus of sequence synonyms, then base composition and codon frequency data are like dictionaries of regional dialects, reflecting nucleotide preferences and codon biases that are characteristic and diagnostic of each species. The tendency for Hepatoviruses or Rhinoviruses to have overabundant A+U sequences, while the Aphthoviruses are rich in G+C, for example, has been well-documented (16). While we stirred the picornavirus sequence pot for the new alignments, it seemed opportune to revisit these parameters along the lines of the updated family taxonomy. Base composition and codon frequency data were collected for 164 strains, representing 19 of the 20 (current) picornaviral species (11). Porcine enterovirus A (PEV-A, 1 kb) was excluded only because there were too few bases for reliable comparison. Within the other species, several redundant strain iterations were sometimes excluded to avoid over weighting by any particular serotype. The final dataset (919 kb total) included: Bovine enterovirus (BEV; 23 kb), Human enterovirus A (HEV-A; 18 kb), Human enterovirus B (HEV-B; 144 kb), Human enterovirus C (HEV-C; 16 kb), Human enterovirus D (HEV-D; 17 kb), Poliovirus (PV; 59 kb), Porcine enterovirus B (PEV-B; 7 kb), Human rhinovirus A (HRV-A; 58 kb), Human rhinovirus B (HRV-B; 10 kb), Encephalomyocarditis virus (EMCV; 77 kb), Theilovirus (ThV; 49 kb), Foot-and-mouth disease virus (FMDV; 161 kb), Equine rhinitis A virus (ERAV; 31 kb), Hepatitis A virus (HAV; 131 kb), Avian encephalomyelitis virus (AEV; 7 kb), Human parechovirus (HePV; 17 kb), Equine rhinitis B virus (ERBV; 9 kb), Porcine teschovirus (PTV; 77 kb), Aichi virus (AiV; 8 kb). The genome base compositions (omitting 5’ polyC tracts and 3’ polyA tails) were very consistent within each tabulated species. The standard deviation of averaged values for any given species (e.g. %A content among all FMDV) was never greater than 2%, and more commonly, was less than 0.5%. The same was true for codon frequencies, in that the codon distribution of each sequence was closely mirrored by other members of that species. Among species, however, there was a strong skew for particular base compositions and codon frequencies (Fig 2). At the far end of the scale, the AiV clearly had the most anomalous composition. Although admittedly, there is only one current complete sequence in this species (GenBank: AB010145), additional data from smaller 3CD-region fragments are consistent in that the C content (38%), nearly exceeds the combined purine content (19% A, 21% G) of the entire genome. Pyrimidines therefore, make up an astonishing 62% of the AiV base count. The bias extends through both UTRs (A:G:C:U of 19%:32%:24%:25%), but is most evident in the coding region, where more than half of the reading frame triplets end in C (xxA = 11%, xxG = 17%, xxC = 54%, xxU = 18%) To understand the magnitude of this extraordinary skew, it’s important to remember that five of the twenty amino acids (Glu, Lys, Met, Gln and Trp) representing 14% of the Aichi virus polyprotein composition can’t even be encrypted by xxC codons (e.g. Met = AUG, Lys = AAR, etc.), so the bias can only be exerted through encryption of the other fifteen amino acids. The triplet assignments in the standard genetic code are not random. The code is an evolutionary masterpiece that generally works to minimize the impact of mutational and error frequencies caused by polymerase and tRNA misincorporations. Sterically, the third codon position is the most difficult to read accurately by tRNA-mRNA pairings. Although this is the most degenerate position of the present code (perhaps originally a spacer base), more often than not, mispairings allow the same or similar amino acid to be inserted, regardless of specific nucleotide selection. The second codon position is the least error prone and probably most closely reflects the conservation of an original singlet code (4 bases, 4 amino acids), which discriminated functionality based on "inside" (hydrophobic: xCx or xUx) and "outside" (hydrophilic: xAx) amino acids. The first position of a codon tends to signify the relative size of the encoded amino acid R-chain (small: Axx or Gxx; large: Cxx or Uxx), with a special partiality towards

Page 6: ALIGNMENTS AND COMPARATIVE PROFILESOF PICORNAVIRUS GENERA

- - 6

Gxx in eucaryotes, as a possible ribosomal proof checking mechanism (12). Because of this codon balance, degeneracy, transitions or errors in the first or second codon positions are likely to cause only minimum coding damage (UUA6CUA = Leu) or at worst, permit conservative substitutions (UUA6GUA = Val). The average eucaryotic or procaryotic protein composition tends to distribute more or less equivalently among “inside,” “outside,” “large,” and “small” amino acids, so it is very unusual to find a significant nucleotide preferences in the first two codon positions (12;13). Among all other picornaviruses, except for AiV, this rule is generally observed, and the variance is only a few percent among other sequences, in their selection of first or second codon bases (see Fig 2). The average picornavirus ratios for Axx:Gxx:Cxx:Uxx are 30:31:19:20 (StDev of <4% for any base). For xAx:xGx:xCx:xUx the average ratios are 31:16:25:28 (StDev of <1.5% for any base). The AiV on the other hand, have much higher than average C compositions (Cxx = 28%, xCx = 34%) at all codon bases, signifying an extraordinary selective pressure on this genome’s content, that carries even into the first and second codon bases as well as the third base. Other picornaviruses also have different atypical skews, though none is as dramatic as the AiV. The HAVs have high A+U content (29% and 33%) that presents as a marked preference in third codon position (xxA = 28%, xxU = 42%, xxC = 10%), but is also influenced by an unusual under-representation of any codons with CG dinucleotides, such as GCG (Ala), UCG (Ser), ACG (Thr), CGX (Arg), and CCG (Pro). No codons with CG dinucleotides are used in any HAV sequence more than 2-3 times per reading frame, and certain codons, most notably CGG (Arg), GCG (Ala) and CCG (Pro), are excluded entirely from many genomes. The Rhinoviruses (HRV-A and HRV-B) have similar prejudices in their preference for high A+U content, but with the ratios reversed (A = 31%, U = 29%). Again these genomes discriminate against CGG (Arg), CCG (Pro), UCG (Ser), ACG (Thr) and GCG (Ala) codons, selecting nearly any alternative synonymous triplet at a 50-60 fold higher frequency than those with CG dinucleotides. The CGG (Arg), UCG (Thr), and CCG (Pro) codons are absent from most HRV sequences, and none is used more 20 times in total among the 21,000 codons surveyed for this genus. At the other end of the scale, the FMDV have high G+C contents and favor codons ending with these bases (xxG = 27%, xxC = 37%). Consequently, the FMDV shun a different cohort of triplets, like CUA (Leu), GUA (Val), UUA (Leu) and AUA (Ile), but are rich in CG-containing codons. Each species of Cardioviruses, Enteroviruses, Parechoviruses and Teschoviruses likewise can be seen to have individual characteristics that border on diagnostic, with regard to base and codon preferences. The ThV prefer U+C in the third codon position; the AEV prefer U but not C; the PV and EMCV are reasonably equivalent in their base distributions, etc. So why should we care about these sequence tendencies? Evolution leaves its fingerprints on every page in the book of life. The clustering of base compositions and codon preferences are telling us volumes about shared lineages among viral strains, just like gene maps, IRES types, and proteolytic processing schemes are recognized predicators of familial homology. The peculiar pressures or bottlenecks that shaped one genome, will have also shaped its brethren and progeny. It’s hard to imagine any biological situation that would favor 54% of the codons ending in C, but in truth, even stranger skews are common in other organisms. The sequence of Microplasma capricolum is 25% G+C, while Micrococcus leuteus is 75% G+C (12). Originally thought to be a logical adaptation to growth temperature, most evidence now suggests that genome G+C content is more closely related to phylogeny (6), and follows whatever twisted history was survived by the parental lineage. In the procaryotes, the presence of mutagens, suboptimal polymerase base preferences, defective repair mechanisms or other directional mutational pressures are known to vary considerably among phylogenetic lines, and each of these pressures can contribute to a different mutational fixation rate that manifests in characteristic composition preferences (12). However bizarre the outcome, we are formed by the trials and cultures that were survived by our parents.

Page 7: ALIGNMENTS AND COMPARATIVE PROFILESOF PICORNAVIRUS GENERA

- - 7

For eucaryotes, there is very little information about the specific pressures that might shape a given lineage. Certainly, the symmetrical methylation of CG dinucleotides has a unique structural significance in many higher DNA genomes, reducing the value and frequency of this sequence in coding regions (8). There would be no reason for a cell to support CG codons, if they weren’t present in their own mRNAs. Moreover, unlike procaryotes, where entire genomes usually skew as a unit, the chromosomes of vertebrates are complex mosaics of G+C rich regions and G+C poor regions. Eucaryotic replication occurs in multiple phases, and each phase can use different enzymes, with the G+C rich regions generally copied before the G+C poor regions, in every cell cycle (6). The repertoire of required tRNAs needed to translate expressed messages, must therefore follow these cycles, particularly in cells with slow division times, like neurons. In short, the cumulative environment of A+U and G+C pressures can vary widely among vertebrates and tissues according to host age and cell cycle timing. We should not be surprised that our viruses adapt to these circumstances, we should only regret that we don’t yet understand what these tendencies are telling us about their niches. Acknowledgments. Comparative sequence analyses of picornaviruses is supported by NIH grant AI-17331 to ACP. References:

1.Barnett, P.V., E.J. Ouldridge, D.J. Rowlands, F. Brown, and N.R. Parry. 1989. Neutralizing epitopes of type O foot-and-mouth disease virus. I. Identification and characterization of three functionally independent, conformational sites. J.Gen.Virol. 70:1483-1491.

2. Boege, U., S. Onodera, D.G. Scraba, G.D. Parks, and A.C. Palmenberg. 1991. Characterization of Mengo virus neutralization epitopes. Virology 181:1-13.

3. Drake, N.L., A.C. Palmenberg, A. Ghosh, D.R. Omilianowski, and P. Kaesberg. 1982. Identification of the polyprotein termination site on encephalomyocarditis viral RNA. J.Virol. 41:736-729.

4. Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14:755-763. 5. Emini, E.A., J.V. Hughes, D.S. Perlow, and J. Boger. 1985. Induction of hepatitis A virus-

neutralizing antibody by a virus-specific synthetic peptide. J.Virol 55:836-839. 6. Filipski, J. 1991. Evolution of DNA sequences. Contribution of mutational bias and selection to the

origin of chromosomal compartments. Adv. Mutagenesis Res. 2:1-54. 7. Gribskov, M. and S. Veretnik. 1996. Identification of sequence patterns with profile analysis.

Methods in Enzymology 266:198-212. 8. Herbert, A. and A. Rich. 1996. Topology and formation of left-handed Z-DNA. J.Biol.Chem.

271:11595-11598. 9. Horsnell, C., R.E. Gama, P.J. Hughes, and G. Stanway. 1995. Molecular relationships berween

21 human rhinovirus serotypes. J.Gen. Virology 76:2549-2555. 10. Khesin, Y.E., A.M. Amchenkova, N.E. Gulevich, and A.E. Narovlysanskii. 1978. Association

between group G chromosomes, especially chromosome G21, and susceptibility of human cells to coxsackie B viruses. Bulletin of Experimental Biology and Medicine. 86:1493-1496.

11. King, A.M.Q., F. Brown, P. Christian, T. Hovi, T. Hyypiä, N.J. Knowles, S.M. Lemon, P.D. Minor, A.C. Palmenberg, T. Skern, and G. Stanway. 2000. Picornaviridae, p. 657-678. In AnonymousVirus Taxonomy. Seventh Report of the International Committee for the Taxonomy of Viruses. Academic Press, New York.

12. Osawa, S. 1995. Evolution of the Genetic Code. Oxford University Press, New York. 13. Osawa, S., T.H. Jukes, K. Watanabe, and A. Muto. 1992. Recent evidence for evolution of the

genetic code. Microbiological Rev. 56:299-264. 14. Page, G.S., A.G. Mosser, J.M. Hogle, D.J. Filman, R.R. Rueckert, and M. Chow. 1988. Three-

dimensional structure of poliovirus serotype 1 neutralizing determinants. J.Virol. 62:1781-1794. 15. Palmenberg, A.C. 1987. Genome organization, translation and processing in picornaviruses, p. 1-

Page 8: ALIGNMENTS AND COMPARATIVE PROFILESOF PICORNAVIRUS GENERA

- - 8

15. In D.J. Rolands, B.W.J. Mahy, and M. Mayo (eds.), The Molecular Biology of Positive Strand RNA Viruses. Academic Press, London.

16. Palmenberg, A.C. 1989. Sequence alignments of picornaviral capsid proteins, p. 211-241. In B. Semler and E. Ehrenfeld (eds.), Molecular Aspects of Picornavirus Infection and Detection. ASM Publications, Washington,DC.

17. Palmenberg, A.C. and J.-Y. Sgro. 1997. Topological organization of picornaviral genomes: statistical prediction of RNA structural signals. Seminars in Virology 8:231-241.

18. Rossmann, M.G., E. Arnold, J.W. Erickson, E.A. Frankenberger, J.P. Griffith, H.-J. Hecht, J.E. Johnson, G. Kamer, M. Luo, A.G. Mosser, R.R. Rueckert, B. Sherry, and G. Vriend. 1985. The structure of a human common cold virus (Rhinovirus 14) and its functional relations to other picornaviruses. Nature 317:145-153.

19. Ryan, M.D. and Flint.M. 1997. Virus-encoded proteinases of the picornavirus super-group. J. Gen. Virology 78:699-723.

20. Sherry, B., A.G. Mosser, R.J. Colonno, and R.R. Rueckert. 1986. Use of monoclonal antibodies to identify four neutralization immunogens on a common cold picornavirus, human rhinovirus 14. J.Virol. 57:246-257.

21. Zell, R., M. Dauber, A. Krumbholz, A. Henke, E. Birch-Hirschfeld, A. Stelzner, D. Prager, and R. Wurm. 2001. Porcine teschoviruses comprise at least eleven distinct serotypes: molecular and evoutionary aspects. J.Virol. 75:1620-1631.

Figure Legends Figure 1. Polyprotein similarity plots. The protein alignments for picornaviral genera were culled to include only full-length sequences. The GCG program PLOTSIMILARITY calculated the average shared identity (window = 20) along the length of each alignment. The data for each genus and for separately calculated select species were plotted relative to a polyprotein map showing the cleavage site divisions and numbered according to the parental alignment columns. The dashed horizontal line and the “average” value on each plot refer to the overall identity shared among included sequences (calculated with window = 1). The “identities” value gives the percent of alignment columns where all (n) sequences were the same. This value (occasionally) includes some locations with shared gaps, in addition to locations with identical amino acids. The Parechovirus plot is partial (1ABC only), because a full-length sequence of Ljungan virus, the type member of the LV species (GenBank: AF020541), and the only non-HPeV strain in this genus, was not available. Figure 2. Codon and genome base compositions. Base compositions for 19 picornavirus species were summed separately then averaged for (a) the whole genomes, (b) 1st codon base, (c) 2nd codon base, and (d) 3rd codon base. Each concentric ring represents an individual species, subdivided according to the averaged observed frequency (percent) of A:G:C:U. The data were sorted in ascending order by A content in the 3rd codon position, and maintain the same order in each plot: (inside ring) AiV, ThV, FMDV, ERBV, BEV, AEV, ERAV, HEV-A, EMCV, PTV, HEV-B, PV, PEV-B, HEV-C, HAV, HPeV, HEV-D, HRV-B, HRV-A (outside ring).

Page 9: ALIGNMENTS AND COMPARATIVE PROFILESOF PICORNAVIRUS GENERA

5000 1000 1500 2000

1AB 1C 1D 2A 2B 2C 3AB 3C 3D

100

50

100

50

Sequence Identity for Aligned Enterovirus Polyproteins

polio (n=8)average 93.4%identities 86.7%

HEVB (n=10)average 91.1%identities 79.0%

Enteroviruses (n=33)average 69.8%identities 30.0%

L 1AB 1C 1D 2AB 2C 3ABBB 3C 3D

5000 1000 1500 2000

Sequence Identity for Aligned Aphthovirus Polyproteins

A B

C

5000 1000 1500 2000

5000 1000 1500 2000

1AB 1C 1D 2A 2B 2C 3AB 3C 3D

5000 1000 1500 2000

Sequence Identity for Aligned Rhinovirus Polyproteins

Rhinovirus A (n=7)average 82.6%identities 66.0%

Rhinoviruses (n=8)average 75.0%identities 43.7%

5000 1000 1500 2000100

50

100

50FMDV (n=8)average 89.5%identities 74.1%

100

50

Aphthoviruses(n=10)average 70.5%identities 29.9%

100

50

100

50

HEVA+C (n=4)average 66.3%identities 51.4%

100

50

Page 10: ALIGNMENTS AND COMPARATIVE PROFILESOF PICORNAVIRUS GENERA

Sequence Identity for Aligned Hepatovirus Polyproteins

Parechoviruses (n=3)(1ABC region only)average 88.4%identities 80.7%

1AB 1C 1D 2A 2B 2C 3AB 3C 3D

5000 1000 1500 2000

HAV (n=15)average 97.9%identities 90.1%

Hepatoviruses (n=16)average 90.7%identities 38.2%

100

50

10

Sequence Identity for Aligned Parechovirus Polyproteins

Sequence Identity for Aligned Teschovirus Polyproteins

Teschoviruses (n=30)average 91.6%identities 74.7%

D

E

F

G

5000 1000 1500 2000

5000 1000 1500 2000

5000 1000 1500 2000

L 1AB 1C 1D 2AB 2C 3AB 3C 3D

5000 1000 1500 2000

100

50

Sequence Identity for Aligned Cardiovirus Polyproteins

EMCV (n=9)average 97.2%identities 92.0%

100

50

L 1AB 1C 1D 2A 2B 2C 3AB 3C 3D

5000 1000 1500 2000

ThV (n=4)average 96.3%identities 93.5%

Cardioviruses (n=13)average 77.6%identities 52.0%

5000 1000 1500 2000

100

50

100

50

100

50

100

50

HPeV (n=2)average 89.8%identities 89.8%

1AB 1C 1D 2A 2B 2C 3AB 3C 3D

5000 1000 1500 2000

Page 11: ALIGNMENTS AND COMPARATIVE PROFILESOF PICORNAVIRUS GENERA