When ELFs are ORFs, but don't act like them



Jeffrey Lawrence

Pittsburgh Bacteriophage Institutes and Dept of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260, USA

When they are very small, open reading frames (ORFs) are among the most difficult features of a newly sequenced genome to annotate. Although it has been suggested that the degree of conservation of these sequences among closely related genomes might assist this process, there are some classes of ORF that will defy identification because little or none of the protein sequence is under selection.

In a commentary on the difficulties of proper annotation of small open reading frames (termed ELFs, for ‘evil little fellows’) in bacterial genomes, Ochman [1] suggested using the differential rates of evolution between nonsynonymous sites (Ka, where alteration changes the encoded amino acid) and synonymous sites (Ks, where alteration leaves the encoded protein unchanged) to assist in identification of genes in bacterial genomes. Because nonsynonymous sites are under stronger selection than synonymous sites, the ratio of Ka/Ks can provide a barometer for the likelihood that a particular region of DNA is under selection as a protein-coding region. Ochman argues that if the Ka/Ks ratio approaches 1.0, it is unlikely that the region encodes a protein, observing that 10 of 14 ORFs in the Escherichia coli genome showing Ka/Ks > 1 are denoted ‘hypothetical’, as are 90% of ORFs < 300 bp in length. This method can clearly be of use when the genome sequences of two or more closely related organisms are available for analysis, increasing the reliability of annotation of small ORFs.

However, two caveats to this approach should be raised. Both stem from the fact that although a low Ka/Ks ratio can be strong evidence that a proposed ORF is protein coding, a high Ka/Ks ratio does not necessarily mean that an ORF is not protein coding.

First, the threshold of Ka/Ks useful for separating ORFs from ELFs is not necessarily 1.0, and varies with the divergence between the genomes being compared. For example, the average Ka/Ks for genes shared between Escherichia coli and Salmonella enterica is 0.07, and few genes (especially those > 400 bp long) have Ka/Ks > 0.25 (Fig. 1a); therefore, a threshold of 1.0 is far too conservative. To identify which ELFs encode proteins, Ochman used a threshold of two standard deviations (2σ) above the mean (μ), or 0.389. Because the distribution of Ka/Ks is not Gaussian, this threshold is too conservative in this case. However, if one compares two more closely related genomes (e.g. S. enterica serovars Typhimurium and Typhi) the threshold must be set much higher (Fig. 1b); here, the μ + 2σ threshold appears too liberal. The broader distribution of Ka/Ks in this comparison reflects both the smaller number of substitutions available to infer rates of evolution, and the lack of time available for natural selection to remove deleterious mutations. Therefore, more flexible guidelines must be followed.
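The dependence of the cutoff on the shape of the distribution can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the Ka/Ks values are invented (they are not the data behind Fig. 1), the function names are hypothetical, and the empirical quantile is shown only as one possible alternative to μ + 2σ, not as a method proposed in this letter or by Ochman [1]. The classification rule encodes the asymmetry described above: a low ratio supports coding, whereas a high ratio is merely inconclusive.

```python
import math
import statistics


def mean_plus_2sd(ratios):
    """Threshold of the form mu + 2*sigma (population standard deviation)."""
    return statistics.fmean(ratios) + 2 * statistics.pstdev(ratios)


def empirical_quantile(ratios, q):
    """Nearest-rank quantile: smallest value with at least a fraction q
    of the data at or below it."""
    ordered = sorted(ratios)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def classify(ratio, threshold):
    """A low ratio supports protein coding; a high ratio is only inconclusive."""
    return "likely protein coding" if ratio <= threshold else "inconclusive"


# Hypothetical, right-skewed Ka/Ks values (NOT data from the letter):
# most genes have low ratios, and a few form a long upper tail.
hypothetical_ratios = (
    [0.02] * 40 + [0.05] * 30 + [0.10] * 20 + [0.30] * 6 + [0.90] * 3 + [1.50]
)

gaussian_style = mean_plus_2sd(hypothetical_ratios)
quantile_style = empirical_quantile(hypothetical_ratios, 0.95)

print(f"mu + 2*sigma threshold: {gaussian_style:.3f}")
print(f"95th percentile:        {quantile_style:.3f}")
for r in (0.05, 0.40, 0.90):
    print(r, classify(r, quantile_style))
```

With a skewed distribution the two cutoffs need not agree, which is one reason a single fixed rule is unlikely to suit every pair of genomes.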

Fig. 1. The distribution of Ka/Ks values in comparisons between Salmonella enterica serovar Typhimurium and Escherichia coli (a) and between S. enterica serovars Typhimurium and Typhi (b). Black bars, genes > 400 nucleotides; white bars, all genes. Green arrows indicate the Ka/Ks values for the four leader peptides shown; red arrows, the mean (μ); blue arrows, the threshold suggested by Ochman [1].

[Figure: histograms of number of genes (y-axis) against Ka/Ks ratio (x-axis). Panel (a): μ = 0.073, threshold = 0.389. Panel (b): μ = 0.16, threshold = 0.85. Leader peptide genes marked: leuL, trpL, thrL and pheL.]

Corresponding author: Jeffrey Lawrence ([email protected]).

Update TRENDS in Genetics Vol.19 No.3 March 2003 131

http://tigs.trends.com

Second, this approach is not useful in the identification of many small ORFs that do not evolve as one would expect protein-coding regions to evolve. That is, this method will erroneously dismiss small ORFs that are indeed translated, but where the entire primary sequence of the protein is not the information that is under selection. In many cases, the number of amino acids under selection could be few or none, which can lead to the appearance of a high Ka/Ks ratio, even though there is selection for expression of the small protein sequence. For example, leader peptides are small proteins that have critical roles in gene expression through translational control [2]. Here, pausing of the ribosome during translation of a leader protein may allow for the formation of an anti-terminator RNA structure, thereby allowing transcription of the downstream genes in an operon. If the ribosome does not pause, a rho-independent terminator will form and the operon is not expressed. Leader peptides in the Escherichia coli genome control the thr, leu, trp and phe operons, and their Ka/Ks ratios are far greater than expected for protein-coding genes (three of them are even greater than Ochman’s conservative μ + 2σ threshold, Fig. 1a), which would seem to indicate that these small ORFs do not encode proteins. In reality, only amino acids responsible for ribosome pausing when a charged tRNA becomes depleted (e.g. two tryptophan codons in the trpL gene) might be under selection for amino acid identity. The additional residues of the leader peptide are not under selection for the function of the protein (although they can participate in mRNA secondary structure formation). Similar leader peptides are found upstream of the tnaA tryptophanase gene [3,4] and chloramphenicol resistance genes [5] and operate by different methods. Similar small upstream ORFs occur in eukaryotic genomes (e.g. the 25-codon arginine-sensing peptide cotranscribed with the Saccharomyces cerevisiae CPA1 gene [6]). But the end result is the same: the short length, poor conservation and often unusual composition of these leader peptides can easily lead one to dismiss the coding potential of their genes.
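A back-of-the-envelope sketch shows why a translated leader peptide can still look unconstrained. It rests on loud simplifying assumptions that are not part of this letter: every codon contributes equally to Ka, constrained codons evolve at a low ratio, all other codons evolve neutrally (ratio 1.0), and the peptide length (14 codons) and the constrained ratio (0.05) are hypothetical. Only the cutoff of 0.389 and the genome average of about 0.07 come from the text above.

```python
def expected_ratio(n_codons, n_constrained, omega_constrained, omega_free=1.0):
    """Average Ka/Ks over a peptide in which only some codons are constrained.

    Crude mixture model (an assumption, not the letter's method): each codon
    is weighted equally, constrained codons evolve at omega_constrained,
    and the remaining codons evolve at omega_free.
    """
    if n_codons <= 0 or not 0 <= n_constrained <= n_codons:
        raise ValueError("need 0 <= n_constrained <= n_codons and n_codons > 0")
    n_free = n_codons - n_constrained
    return (n_constrained * omega_constrained + n_free * omega_free) / n_codons


OCHMAN_THRESHOLD = 0.389  # mu + 2*sigma cutoff quoted in the text

# Hypothetical leader peptide: 14 codons, only 2 under selection.
leader = expected_ratio(n_codons=14, n_constrained=2, omega_constrained=0.05)

# Hypothetical ordinary gene: every codon constrained at the genome average.
typical = expected_ratio(n_codons=300, n_constrained=300, omega_constrained=0.07)

print(f"leader peptide: {leader:.3f}, above cutoff: {leader > OCHMAN_THRESHOLD}")
print(f"ordinary gene:  {typical:.3f}, above cutoff: {typical > OCHMAN_THRESHOLD}")
```

Under these assumptions the leader peptide's ratio sits far above the cutoff even though it is translated and functionally important, which is the failure mode the paragraph describes.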

In some cases, the majority of a peptide could be disposable, also leading to high Ka/Ks ratios. Although small disposable portions are commonly seen among leader sequences for secreted proteins, or pro-proteins made in inactive states (e.g. Bacillus pro-σ factors), the most extreme case might be the pqqA peptide, which is overproduced relative to other proteins in the Klebsiella pqq operon and may serve as the substrate for the synthesis of the cofactor PQQ [7,8]. Here, the amino acids glutamate and tyrosine may be cleaved from the peptide backbone and serve as the substrate for cofactor biosynthesis, and the remaining residues may serve as a scaffold.

Some transcribed regions may not encode polypeptides at all. The Ka/Ks test is useless in the identification of small functional RNAs, which can have important roles in cellular metabolism, and may be great in number [9]. Some of the regions designated as small ORFs – even those with genetic evidence for their importance – may act through an RNA product.

The manifold ways small protein products affect cellular metabolism make their identification onerous, and sometimes even comparisons with closely related genomes cannot aid in their unambiguous identification. In these cases, hands-on experimentation could be the only route towards gene discovery and the potentially fascinating insights in molecular biology that can result.

References

1 Ochman, H. (2002) Distinguishing the ORFs from the ELFs: short bacterial genes and the annotation of genomes. Trends Genet. 18, 335–337

2 Yanofsky, C. (1988) Transcription attenuation. J. Biol. Chem. 263, 609–612

3 Stewart, V. and Yanofsky, C. (1986) Role of leader peptide synthesis in tryptophanase operon expression in Escherichia coli K-12. J. Bacteriol. 167, 383–386

4 Gong, F. and Yanofsky, C. (2002) Analysis of tryptophanase operon expression in vitro: accumulation of TnaC-peptidyl-tRNA in a release factor 2-depleted S-30 extract prevents Rho factor action, simulating induction. J. Biol. Chem. 277, 17095–17100

5 Lovett, P.S. and Rogers, E.J. (1996) Ribosome regulation by the nascent peptide. Microbiol. Rev. 60, 366–385

6 Pierard, A. and Schroter, B. (1978) Structure–function relationships in the arginine pathway carbamoylphosphate synthase of Saccharomyces cerevisiae. J. Bacteriol. 134, 167–176

7 Velterop, J.S. et al. (1995) Synthesis of pyrroloquinoline quinone in vivo and in vitro and detection of an intermediate in the biosynthetic pathway. J. Bacteriol. 177, 5088–5098

8 Meulenberg, J.J.M. et al. (1992) Nucleotide sequence and structure of the Klebsiella pneumoniae pqq operon. Mol. Gen. Genet. 232, 284–294

9 Wassarman, K.M. et al. (2001) Identification of novel small RNAs using comparative genomics and microarrays. Genes Dev. 15, 1637–1651

0168-9525/03/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved. PII: S0168-9525(02)00038-0

cGMP signalling: different ways to create a pathway

Jeroen Roelofs1, Janet L. Smith2 and Peter J.M. Van Haastert3

1 Department of Cell Biology, Harvard Medical School, 240 Longwood Avenue, Boston, Massachusetts 02115-5730, USA
2 Boston Biomedical Research Institute, 64 Grove Street, Watertown, Massachusetts 02472-2829, USA
3 Dept of Biochemistry, University of Groningen, Nijenborgh 4, 9747 AG Groningen, The Netherlands

Recently, a novel cGMP signalling cascade was uncovered in Dictyostelium, a eukaryote that diverged from the lineage leading to metazoa after plants and before yeast. In both Dictyostelium and metazoa, the ancient cAMP-binding (cNB) motif of bacterial CAP has been modified and assembled with other domains into cGMP-target proteins. The domain structures of these cGMP targets, as well as the enzymes responsible for cGMP synthesis and degradation, are entirely different between Dictyostelium

Corresponding author: Peter J.M. Van Haastert ([email protected]).
