Special Call for Papers COMPARATIVE GENOMICS Genome-wide ...

29
1 Special Call for Papers COMPARATIVE GENOMICS Genome-wide Analysis of Restriction-Modification System in Unicellular and Filamentous Cyanobacteria Fangqing Zhao 1 , Xiaowen Zhang 1 , Chengwei Liang 1 , jinyu Wu 2 , Qiyu Bao 2 , Song Qin 1 1 Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China 2 Institute of Biomedical Informatics, Wenzhou Medical College, Wenzhou, China Running head Genome-wide Analysis of RM System Corresponding author Song Qin ([email protected]) Institute of Oceanology Chinese Academy of Sciences Qingdao, 266071, China Qiyu Bao ([email protected]) Institute of Biomedical Informatics Wenzhou Medical College Wenzhou, 325000, China Articles in PresS. Physiol Genomics (December 20, 2005). doi:10.1152/physiolgenomics.00255.2005 Copyright © 2005 by the American Physiological Society.

Transcript of Special Call for Papers COMPARATIVE GENOMICS Genome-wide ...

1

Special Call for Papers COMPARATIVE GENOMICS

Genome-wide Analysis of Restriction-Modification System in Unicellular and

Filamentous Cyanobacteria Fangqing Zhao1, Xiaowen Zhang1, Chengwei Liang1, jinyu Wu2, Qiyu Bao2, Song Qin1

1Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China 2Institute of Biomedical Informatics, Wenzhou Medical College, Wenzhou, China

Running head

Genome-wide Analysis of RM System

Corresponding author

Song Qin ([email protected])

Institute of Oceanology

Chinese Academy of Sciences

Qingdao, 266071, China

Qiyu Bao ([email protected])

Institute of Biomedical Informatics

Wenzhou Medical College

Wenzhou, 325000, China

Articles in PresS. Physiol Genomics (December 20, 2005). doi:10.1152/physiolgenomics.00255.2005

Copyright © 2005 by the American Physiological Society.

2

Abstract

Cyanobacteria are an ancient group of Gram-negative bacteria with strong genome size

variation ranging from 1.6Mb to 9.1 Mb. Here, we firstly retrieved all the putative

restriction-modification (RM) genes in the draft genome of Spirulina, and then performed

a range of comparative and bioinformatic analyses on RM genes from unicellular and

filamentous cyanobacterial genomes. We have identified 6 gene clusters containing

putative Type I RMs, and 11 putative Type II RMs or the solitary methyltransferases.

RT-PCR analysis reveals that 6 of 18 methyltranferases are not expressed in Spirulina,

whereas one hsdM gene, with a mutated cognate hsdS, was detected to be expressed. Our

results indicate that the number of RM genes in filamentous cyanobacteria is significantly

higher than that in unicellular species, and this expansion of RM systems in filamentous

cyanobacteria may be related to their wide range of ecological tolerance. Furthermore, a

coevolutionary pattern is found between hsdM and hsdR, with a large number of site pairs

positively or negatively correlated, indicating the functional importance of these pairing

interactions between their tertiary structures. No evidence for positive selection is found

for the majority of RMs, e.g., hsdM, hsdS, hsdR, and Type II REase gene families, while

a group of MTases exhibit a remarkable signature of adaptive evolution. Sites and genes

identified here to have been under positive selection would provide targets for further

research on their structural and functional evaluations.

Keywords

Coevolution, Comparative genomics, Horizontal gene transfer, Molecular evolution,

Spirulina platensis

3

INTRODUCTION

Cyanobacteria are among the oldest life on Earth with the capacity of oxygenic

photosynthesis similar to the process found in higher plants. Cyanobacteria may be

unicellular or filamentous, and they can be found in many different environments ranging

from marine to freshwater habitats and also in soil, on rocks and on plants. Spirulina, a

nonheterocystous, filamentous cyanobacterium, is proposed as an important mariculture

crop both for food and food supplement, and as a potential chemical factory for a variety

of useful products if effective genetic engineering can be accomplished. Unlike other

unicellular (eg. Synechococcus and Synechocysits) and filamentous cyanobacteria (eg.

Anabaena), genetic transformation of Spirulina have had limited success to date (18, 52).

The predominant barrier appears to the presence of restriction-modification (RM)

systems in Spirulina platensis, which may decrease or prevent the uptake of exogenous

DNA (53).

RM systems comprise a restriction endonuclease (REase) activity that specifically

cleaves DNA and a corresponding methyltransferase (MTase) activity that specifically

methylates the host DNA, thereby protecting it from cleavage (4, 55). RM systems can be

subdivided into four groups (I, II, III, and IV) based on the subunit composition, cofactor

requirements, sequence specificity, and cleavage position (40, 55). Moreover, those

MTases not associated with REases are termed solitary MTases. RM systems in the host

genome have been viewed as a defense mechanism against DNA invasion or as the

parasitic selfish genetic elements (22, 42). The selfish gene hypothesis is now well

supported by much evidence from genome analysis and experimentation (23). The

traditional method used to screen for restriction endonucleases is to culture individual

strains and test their extracts for the ability to produce specific fragments from small

DNA molecules (55). A large number of both filamentous and unicellular cyanobacteria

have been shown to contain one or more type II site-specific REases (29). Lyra et al. (29)

employed principal component analysis to demonstrate the relationship between certain

REases and their hosts, and found that all 28 cyanobacterial strains studied contain Type

II REases, with the most common cyanobacterial REase types, AvaII, AvaI, and AsuII.

Matveyev et al. (31) used experimental and bioinformatics approaches to give a full

insight on the MTases of the cyanobacterium Anabaena sp. PCC7120, which has four

4

Type II MTases associated with REases (AvaI-IV), and five solitary MTases.

Comparative analysis of cyanobacterial MTases revealed that the solitary ones appear to

be of ancient origin in cyanobacteria, while the Type II MTases appear to have arrived by

horizontal gene transfer (31).

Now 15 cyanobacterial genomes have been fully sequenced, 3 in the draft format,

and more than 20 are in the process of being sequenced

(http://img.jgi.doe.gov/pub/main.cgi? page=restrictedMicrobes&domain=Bacteria;

http://www.ncbi.nlm.nih.gov/). These genome-sequencing projects undoubtedly bring a

great convenience to search for novel RM genes using bioinformatic tools. We have

generated more than 7Mbp genomic sequences with 8× coverage of redundancy from the

shot-gun genome-sequencing project of Spirulina platensis (Bao et al., unpublished data).

Genome-wide identification of RM systems in Spirulina will help to develop an efficient

gene transfer system in this organism. Here, we reported all the putative RM superfamily

from the draft genome of Spirulina platensis and presented a comparative genomic

analysis of RM genes in the unicellular and filamentous cyanobacteria.

MATERIALS AND METHODS

Computational search for novel RM genes

The cyanobacteria species examined included Synechocystis, Synechococcus,

Prochlorococcus, Anabaena, Nostoc, Spirulina, and Trichodesmium. An initial set of RM

genes was obtained from CyanoBase (http://www.kazusa.or.jp/cyano/cyano.html) and

IMG v.1.1 (http://img.jgi.doe.gov/v1.1/main.cgi). This dataset, including

well-characterized and putative cyanobacterial RMs, was used to construct a query

protein set. Each protein in this query dataset was used to search the potential novel

sequences in all cyanobacteria species with whole genome sequences available, by using

the BLASTP and TBLASTN programs, with e-value<1e-10; the searches were iterated

until convergence. Hits produced by different query sequences were combined and

duplicate records, which were identified on the basis of accession, were removed.

Because RM genes tend to form clusters (55), all cyanobacterial genomic clusters

containing RM sequences were also retrieved and translated into six open reading frames

and examined manually for the presence of the RM motifs in order to discover potential

5

sequences with distant homology. Then profile recognition programs (45, 46), including

PFAM (http://pfam.wustl.edu/) and SMART (http://smart.embl-heidelberg.de/), were

used to determine the presence of RM motifs, and multiple sequence alignments were

used to finally classify a protein as RM homolog. Overall, six motifs (HSDR_N, DEXDc,

Methylase_M, Methylase_S, N6_Mtase, and MethyltransfD12) were used in this study.

All the putative RM genes identified in Spirulina genome have been submitted to the

GenBank. Accession numbers are as follows: DQ239732-DQ239748.

RT-PCR analysis of RM genes in Spirulina

Total RNA was extracted from fresh Spirulina using RNeasy Mini kit (QIAGEN,

Germany) by following manufacture’s protocol. Residual DNA in RNA preparations was

eliminated by digestion with RNase-free DNase (PROMEGA, US). The absence of DNA

was checked by PCR. The purified total RNA was reverse transcribed into cDNA using

AMV reverse transcriptase (TAKARA, China). The following primers used for RT-PCR

were designed based on the coding region of each gene: I-hsdM: forward primer

CGATGCGACGGAGTTTAG, reverse primer GCTTTTGTTCCGCCTTTT; II-hsdM:

forward primer TACTCGGCACGCTGTTTT, reverse primer

CCCTGCCAACCCTTTTAG; III-hsdM: forward primer GCGTTCCTGTTTTTGCAG,

reverse primer TGGTCTTGCCCATATTGC; IV-hsdM: forward primer

GGCATTATCGGCTTACCC, reverse primer GTGGAGGTCTTCCGGTTC; IV-MTase:

forward primer GACCCCTTTGCAGGTAGC, reverse primer

TGTTCGGAATAGGGCAGA; V-hsdM: forward primer GCCTAAATCAGCCCCATC,

reverse primer GCCTCACAATCAGCTTGC; V-MTase: forward primer

ATCACAATCACCACAGAACC, reverse primer ATGTTGAGCCAAAAAGAGC;

VI-hsdR: forward primer TATCGCCTGCTGCGTTAT, reverse primer:

GCGGAGTCCTTGGGTATC; MTase1: forward primer GACGTAGCAACACACTTTC,

reverse primer GCAAATAGAAAAGCCTTGG; MTase2: forward primer

TTAACCCACGAATCAATCC, reverse primer CGAGAGACTGTAAAACCTCC;

MTase3: forward primer GGGAAAAACTTACAACCAC, reverse primer

GGCTAATAAAGGGGGAAC; MTase4: forward primer TGCCCAGTGATAGTTTTTC,

reverse primer CCAGGGCATCAATTACC; MTase5: forward primer

TCGCCTTTATCCCTATCG, reverse primer CATACCATCAGGAGGAACC; MTase6:

6

forward primer GGCCAGAAGCTGAAACC, reverse primer

GAACCCACCAACTCAAACC; MTase7: forward primer AAATTATGCCTCCAAGC,

reverse primer CCTGCCACATATTCAGC; MTase8: forward primer

CCAAACGATAACAGCACC, reverse primer CCAGCGACAGAAAAACC; MTase9:

forward primer TTATCCGCCGCCAATTACC, reverse primer

TGTCCATTTCGCACGTCG; MTase10: forward primer TTGCAGATCCTCCCTACA,

reverse primer AATTAATTTACGTTGTGCG; MTase11: forward primer

TAGCAATTCGCTTGACC, reverse primer AGCATCACCTTGACTCC.

Multiple sequence alignment and phylogenetic analysis

Multiple sequence alignments were carried out using the Clustal W program (48),

and then adjusted manually. Graphical representations of the multiple amino acid

sequence alignments, the sequence logos, were presented using WebLogo (7). To

maximize the number of sites available for analysis, partial sequences and certain

sequences with large deletions were excluded from the analysis. The neighbor-joining

method and maximum parsimony method in MEGA3 (28) were used to construct the

phylogenetic tree, and the reliability of each branch was tested by 1,000 bootstrap

replications.

Molecular evolutionary analysis

Detecting structural interactions and statistical covariance among separate amino

acid sites is of great significance for understanding protein structure and evolution (2, 38).

Such analyses are based on the assumption that functionally significant coordinated

residues in proteins result from physicochemical interactions (e.g., hydrophobic,

electrostatic). These interactions are dependent on the physicochemical properties (e.g.,

volume, charge, and hydrophobicity) of the residues (50). Here, we estimated the

pairwise correlation of residue substitutions among several motifs in hsdM and hsdR.

This approach based on the estimation of the linear correlation coefficients between the

amino acid physicochemical property values, was fully described by Afonnikov and

Kolchanov (1). Different characteristics of amino acids (volume, hydrophobicity, and

polarity) were chosen for coevolution analysis, which reflect physical and chemical

interactions between residues. Different sequence weighting methods (14, 54) were also

used in these tests to avoid sequence bias.

7

One of the most stringent methods to detect positive selection is a comparison of the

rate of nonsynonymous substitutions (dN) to the rate of synonymous substitution (dS). The

ratio between these rates (ω= dN/dS) is a reliable measure of the selective pressure acting

on a protein-coding gene, with ω=1 indicating neutral evolution, ω<1 purifying selection,

and ω>1 positive selection (58). Likelihood ratio tests (LRTs) were performed using

Codeml with the site-specific models (M1/M2, M7/M8) of the PAML package (57).

When parameter estimates under a model allowing positive selection suggest the presence

of a number of sites with ω>1, Bayes theorem was used to calculate their posterior

probabilities (56).

RESULTS

The RM superfamily in Spirulina

In order to identify all members of the RM superfamily, we performed TBLASTN

searches of the Spirulina in-finishing genome database with RM genes for the bacterial

and archaebacterial homologs. Similar searches were run with the Pfam domains

(HSDR_N, DEXDc, Methylase_M, N6_Mtase, Methylase_S, and MethyltransfD12) of

the RM proteins to avoid the exclusion of highly diverged sequences with just limited

conserved motifs. PFAM and SMART domain analyses with the derived sequences

employed as queries were then carried out to eliminate false positives. After above

analyses, a total of 30 putative RM genes were obtained from the Spirulina genome

database (Fig.1, Supplementary Table 1). Type I RM systems comprise three closely

linked subunits, which can distinguish them obviously from other type of RM genes

without further experimental research. In this study, the symbol for Type I RM system

was hsd, thus the genes were hsdR, hsdM, and hsdS, whereas other types of RM genes or

the solitary MTases were uniformly termed as REase or MTase.

Six clusters of the Spirulina genome, ranging from 4.7kb to 11.7kb, were found

containing the putative Type I RM systems (Fig. 1). In cluster I, hsdM (688aa) and hsdS

(392aa) should be co-transcribed from the same promoter, with four nucleotide overlap.

hsdR gene was in the downstream of the hsdM/S gene partitioned by two ORFs (putative

ATP-binding protein and orf1). In cluster II, hsdS subunit was interrupted by a stop

codon (DQ239733, 2642-TAAATCGTG) in the coding frame, which resulted in two

8

separate but frame-integrated specificity domains (target recognition domains, TRDs). In

the downstream of this hsdS gene, a pin homolog encoding a site-specific recombinase

was found on the opposite strand, which could mediate DNA inversion and phase

variation in prokaryotes (10, 51). Two RM loci were found in cluster IV: one was the

putative Type I hsdM/S genes, the other was composed of two ORFs encoding the m4C

type of methyltransferase (IV-MTase) and DUF820. DUF820 sequences form a family of

hypothetical proteins from bacteria that have greatly expanded in a number of

cyanobacterial species. Kinch et al. (19) identified this family should belong to a novel

REase with the conserved motifs that make up the nuclease active site. Hence, this ORF

should be the cognate REase of IV-MTase, although it showed little significant

resemblance to any known Type II REases. All the HsdR and HsdM identified in the

Spirulina genome shared high similarity with the RM proteins in other prokaryotic

organisms. HsdS, however, shared less sequence similarity (5%-33%) from their

counterparts in cyanobacteria. Some of RM genes in Spirulina should be defective, as

judged by the nonsense mutation (e.g. II-hsdS) or the loss of conserved motifs (e.g.

V-hsdM, MTase11).

We also searched the potential type II or solitary MTases containing conserved

MTase sequence motifs (Pfam: MethyltransfD12) and the adjacent putative REases.

Since the sequences of type II MTases possess a moderate level of sequence conservation,

most of them should be identified with the computational searches. In contrast, REases

only share sequence similarity when they have similar recognition specificities (43).

Therefore, only four MTases could be identified to possess cognate REases in the

Spirulina genome (Fig.1), and whether other closely linked ORFs function as the

endonucleases can be clarified by experimental approach. Four cognate REases (REase1,

2, 6, and 7) share significant sequence similarities with known AvaIVAP (GCTNAGC, p

value=8E-37), AgeI (ACCGGT, p value= 2E-81), BsuBI (CTGCAG, p value= 2E-50),

and HindVP (GRCGYC, p value= 1E-82), respectively. These MTases can be annotated

into three groups: m4C (IV-MTase, V-MTase), m5C (MTase1-9), and m6A (MTase10,

11) based on querying the REBASE (41).

RT-PCR analysis of RM genes in Spirulina

Since gene expression patterns can provide important clues for gene function, we

9

employed RT-PCR experiments to characterize gene expression profiles of the Spirulina

methyltransferases. Previous studies revealed that many of RM systems in Anabaena sp.

7120 or Helicobacter pylori J99 were partially or completely inactive (26, 31). Similar

results were also found in the RM systems of Spirulina platensis. As shown in Fig.2, 6 of

18 methyltransferase genes were not expressed in Spirulina. Notably, the hsdM gene in

II-hsd system was expressed, whereas its cognate hsdS gene was defective, with a stop

codon in the coding frame resulting in two split TRD domains. Most of the HsdS subunits

carry two separate TRDs, providing the symmetry for their interaction with the bipartite,

asymmetric DNA target (21). Such mutation in hsdS may cause the inactivation of the

RM system. However, MacWilliams and Bickle (30) found that a truncated HsdS subunit

comprising a single TRD and a set of conserved motifs could also function as a dimer,

specifying the bipartite, symmetric DNA target. Moreover, HsdS subunits can be

exchanged between two families of Type I RMs (44). In some cases, HsdM seems to be

functional with a defective cognate HsdR, which may serve to protect the genome from

the attack by RM systems (15, 47). The expression of II-hsdM indicated that frameshift or

missense mutations in one subunit could not always lead to the inactivation of the whole

RM system. Therefore, this points out the need for a cautious approach in evaluating the

activity of each subunit in RM systems.

Genomic distribution and sequence analysis of RM genes

Computer-based nucleotide and protein searches were performed using different

BLAST search programs at NCBI (http://www.ncbi.nlm.nih.gov/BLAST) and JGI

(http://img.jgi.doe.gov/v1.1/main.cgi?page=geneSearchForm). Protein sequences of

RM genes previously described in REBASE were used as queries for database. All the

putative RMs were listed in the supplementary Table 1. To elucidate the complete

genomic structure of the RM genes, we mapped them onto the unicellular and

filamentous cyanobacterial genomes (Fig. 3). As the Spirulina platensis genomic

sequences were still in the draft format, all the putative RMs were then mapped onto the

chromosome evenly, which do not represent their actual locations.

Sequences giving the better reciprocal BLAST hits were tentatively assumed to

identify homologous counterparts in two species if they could be aligned over at least the

BLAST-Score > 90 and the E-value < 1E-10. By these criteria, all the RM domains

10

derived from the genome of Spirulina platensis, Anabaena sp. 7120, and Anabaena

variabilis ATCC 29413 were examined to have quite different sequence similarities.

Some MTases (MTase1-3) in Anabaena variabilis ATCC 29413 had more homologs with

high similarities in the other two filamentous cyanobacteria, while other MTases have

less or no counterparts in Spirulina and Anabaena sp. 7120. Moreover, no obvious

similarities (usually less than 20%) were found among the specificity genes (HsdS),

which encode polypeptide sequences responsible for the recognition of the target

sequence. A striking example of evolutionary conservation is provided by the genes

encoding Type I RMs in Anabaena sp. 7120 (hsd2 and hsd3) and Anabaena variabilis

ATCC 29413 (hsd2). The amino sequences of these gene products (HsdR and HsdM) are

nearly identical in Anabaena sp. 7120; 96% identity was preserved in HsdR/M between

Anabaena sp. 7120 and Anabaena variabilis ATCC 29413. Interestingly, all three Type I

RMs in Anabaena variabilis ATCC 29413 shared some similarities with the HsdR/M in

Anabaena sp. 7120, whereas the Type I RMs in Spirulina exhibited much less sequence

similarity with those in the Anabaena species.

Phylogenetic analysis

RMs identified from cyanobacteria and other prokaryotic organisms, such as

proteobacteria, archaea, chlorobi, firmicutes, spirochaetes, and actinobacteria, were used

to construct phylogenies to discern the evolutionary history of RM system. The 16s

rDNA phylogeny clearly showed separation of prokaryotes into several monophyletic

clades, and all these branches had significant statistical support (Fig. 5A). The hsdM and

hsdR phylogenies illustrated in Fig. 4, however, exhibited quite different topologies,

where different sources of hsd genes scattered throughout the tree. Although several

monophyletic groups were inferred with strong bootstrap support, these groups did not

correspond to their 16s rDNA phylogeny. Further, in the filamentous species of

cyanobacteria, the hsd genes were usually present in several copies, and based on the tree

topology and their genomic organization these copies might have quite different

evolutionary histories. For example, hsd2 and hsd3 of Anabaena sp. 7120 were adjacent

to each other along the chromosome, and were nearly identical at the amino acids level.

These two copies most likely have evolved through recent duplication, whereas evolution

of other RMs (hsd1, hsd4 and hsd5) is more complex and probably involves horizontal

11

gene transfers.

Unlike the wide dispersed distribution of hsd genes through the phylogeny, most of

the cyanobacterial MTases in Fig. 5B formed a separate clade (77% bootstrap value), with

the exception of six sequences (A7-MTase1, A7-MTase3, Se-MTase1, Se-MTase2,

S6-MTase2, and S6-MTase3). The archaebacterial sequences were located elsewhere on

the tree. Most of the γ-proteobacterial sequences formed a separate clade with 99%

bootstrap support. The exception was the sequences from Moraxella bovis

(PG-BAA03071) and Chromohalobacter salexigens (PG-ZP_00473324) that grouped

with other prokaryotes. Moreover, two viral MTases were clustered with four

β-proteobacterial sequences (100% bootstrap value), in which Burkholderia

(ZP_00500806) was the host of the bacteriophage phiE125 (YP_164085). The MTases in

both genomes shared a great similarity (>99%), indicating the horizontal transfer of

MTases between the bacterial host and the phage. These results evidently suggested

multiple lateral transfers of RM genes both within and between the major clades of the

tree (Fig. 4 and Fig. 5).

We also performed phylogenetic analysis separately for each Pfam domain of HsdR

and HsdM (HSDR_N, DEXDc, Methylase_M, and N6_Mtase), and phylogenies obtained

from these different domains were quite similar to the phylogenetic trees illustrated in

Fig.4 (Data not shown). By contrast, phylogenetic analyses using different domains of

HsdS gave different phylogenetic topologies with no strong bootstrap support (Data not

shown). HsdS usually contains two Pfam domain (Methylase_S) repeats, and little

similarity exists outside of these motifs. The diversification of hsdS genes to achieve

maximal variation of specificity should be driven by strong selective pressures (55).

Evolutionary analysis of RM genes

HsdM and HsdR, as with interacting domains, must co-evolve, to form an integral

and active complex with HsdS subunit. Phylogenetic trees (Fig.4) based on NJ method

showed extensively similarity in clustering of cognate hsdM and hsdR genes, also

indicating the coevolution of two subunits of Type I RMs. Previous studies revealed

several conserved motifs in HsdM, among which motif I (D/E/SXFXGXG) can bind the

AdoMet, and motif IV (N/DPPF/Y/W) can be involved in substrate binding or in the

catalytic activity (34). As for HsdR, several conserved motifs were also found in the

12

DEAD-box domain, which formed a NTP-binding pocket and a portion of nucleic

acid-binding site (13, 34).

We estimated the pairwise correlation of five domains containing above functional

motifs. Different sequence weighting methods (14, 54) used to avoid sequence bias gave

similar results, and only the results derived from the Vingron and Argos (54) approach

were shown. We firstly considered such amino acids property as volume (size) for

coevolution analysis. From Fig. 6, we can see that many site pairs are highly correlated in

a positive or negative way at the 99.9% level. In hsdM-motif I, seven sites (7-11, 13, 15)

were correlated with the sites in the other hsdR-motifs, and these sites mostly resided in

or were surrounding the featured residues. Similar results were found in the hsdM-motif

IV. Sites close in the primary sequence are likely to be close in the tertiary structure, and

are therefore likely to have coevolved. In this study, however, the observed numbers of

positively or negatively coevolving site pairs within the distant sites were significantly

higher than those close pairs, indicating that the amino acids size interactions, especially

those distant site pairs, should play a important role in the tertiary structure formation.

Another amino acids characteristic, hydrophilicity, was also used to perform the

coevolution analyses. From a structural and functional perspective, correlated amino acid

replacements may have occurred among sites to maintain the same hydrophilicity

properties, size constraints, electrostatic charges etc. Many pairs of residues showed high

correlations, and nearly 1/3 of the coevolving site pairs (70 of 212) were identical to

those computed for the residues size-effect analysis (Fig.6). The hydrophilicity effect,

however, was different with the size correlations in the hsdM-motif I, where much fewer

coevolving site pairs were found in this domain. Nearly all the sites in the hsdR-motif I,

Ia, II, III exhibited strong positive or negative correlations, revealing the great

significance of these sites to form a hydrophobic hydrolysis pocket.

To analyze the possibility of positive selection acting on specific codons in the RM

family, likelihood ratio test (LRT) was used to test for variation in the dN/dS ratio among

sites in sets of closely related RM genes. However, most of them were highly divergent,

and it was problematic to align such cases with confidence. For these reasons, we

performed most of our analyses with local clades of 10-15 sequences, which could keep

genetic divergence reasonably small and keep alignments robust for LRT analysis. In the

13

hsdM, hsdS, hsdR, and Type II REase gene families, all site-specific models provided

little or no evidence for positive selection (Data not shown).

In contrast, with the MTase gene family we obtained results indicating positive

selection, and it was estimated that a total of 10.2% sites were likely to be positively

selected (ω=31.1, p<0.001). Twelve highly homologous MTases (Sp-Mtase1, Sp-Mtase4,

Av-Mtase2, A7-Mtase5, A7-Mtase10, A7-Mtase8, S6-Mtase1, Np-402028720,

AAA50432, YP_199246, BAC81824, ZP_00517809), which could keep the alignment

robust for LRT analysis, were used to construct phylogenetic tree. Codon usage bias is

known to affect the estimation of synonymous and nonsynonymous substitution rates (59).

Thus two different codon usage bias models were used in all our analyses: F3×4 and F61.

The results under these two models were quite similar, and those under F3×4 were

presented only. To examine the relationship of sites under positive selection to the protein

structure, we mapped these sites onto the MTase tertiary structure (Fig.7). Firstly, the

sequence alignment of MTase protein sequences (m5C family) previously used for

PAML analysis, was submitted to the protein modeling server: SWISS-MODEL

(http://swissmodel.expasy.org//SWISS-MODEL.html) with PDB-1dctA (39) as the

modeling template. The derived structure using Av-MTase2 as the target sequence was

shown in Fig.7A. Several positively selected residues (482S, 484M, 494G, 507S, 508S)

were located at the MTase C-terminal extension of approximately 50 aa, where no

coordinate residues were found in the template (1dctA), and thus they could not be

mapped onto the tertiary structure. Although the other predicted positively selected sites

appeared to be relatively uniformly distributed along the protein sequence, a clear pattern

was revealed when the sites were examined in relation to the tertiary structure: they were

all exposed amino acids, and were clustered on one face of the dimer towards the nucleic

acids (Fig.7B). Two sites were located in the DNA binding pocket: one was site 376A at

the scaffold region of the small domain; another site 197T was located at one of the DNA

backbone-binding motifs adjacent to the active site (39). Thus, these sites were available

to participate in DNA-protein interactions. Moreover, other sites were located at the helix

D (206Q, 213E, 217N) or the coil linking β3 and β4 (126S, 131I).

DISCUSSIONS

14

To date, more than 100 cyanobacterial strains are known to produce restriction

endonucleases (41). Generally, only one or two restriction endonucleases were found in

each strain of cyanobacteria through conventional methods to test the crude cell extracts

for their restriction activity (29, 37). The main reason is the difficulty of detecting type I

RMs or solitary DNA methyltransferases by biochemical or genetic assay (34, 41).

Genome-wide screening of RM systems based on the genome-sequencing project,

however, leads to a virtual explosion in the number of putative RM genes, which provides

us a new and comprehensive insight into the cyanobacterial RM systems.

In this study, we have identified and analyzed a large number of putative RM genes

from all the finished or ongoing cyanobacterial genomes. From Fig.2 and supplementary

Table 1, we can see that more RM genes are found in filamentous cyanobacteria

(Anabaena, Spirulina, Nostoc) rather than in unicellular species (Synechocystis,

Synechococcus, Prochlorococcus). We also found that the bacterial or archaebacterial

genomes in REBASE with tremendously large number of RM genes are usually

pathogenic bacteria (e.g. Campylobacter jejuni, Helicobacter pylori, Neisseria

gonorrhoeae) or extremophiles (e.g. Chloroflexus aurantiacus, Methanocaldococcus

jannaschii). In spite of a wide variation in RM contents between closely related species,

these pathogenic bacteria or extremophiles have a significantly higher number of RMs

than the other species (T-test, p<0.001). It is considered that such disproportionate

expansion of certain gene families allows these organisms to adapt to a wider range of

environmental conditions, or exploit a greater variety of nutrient sources (25). When we

ruled out the chromosome size effect and calculated the relative number of

defense-related genes out of the putative COGs in cyanobacteria, it is still significantly

greater than that in unicellular species, and similar results were also found in the

cyanobacterial signal transduction systems (p<0.001, data not shown). Unicellular

Prochlorococcus, living in the oligotrophic open ocean, possesses no or a limited number

of RM systems, as well as other genes involved in environmental sensing and regulatory

systems. Dufresne et al. (9) assumed that the major driving force behind genome

reduction within the Prochlorococcus radiation has been a selective process favoring the

adaptation of this organism to its environment. Most of filamentous cyanobacteria exhibit

a wide range of ecological tolerance and are found in freshwater, marine and terrestrial

15

habitats. A good example is Nostoc punctiforme, a filamentous nitrogen-fixing

cyanobacterium, which displays an extraordinarily wide range of vegetative cell

developmental alternatives, physiological properties and ecological niches, including

broad symbiotic competences with plants and fungi (32). The identification of a new

family of putative PD-(D/E)XK nucleases in various cyanobacteria also indicated that the

expansion of such nuclease family should be correlated with their host’s ecological

plasticity (11).

Nobusato et al. (36) analyzed all the RM genes identified from the complete genome

sequences of two Helicobacter pylori strains, and found that these genes were scattered

among diverse groups of bacteria in the phylogeny. Horizontal gene transfer may have a

considerable influence on the widespread distribution of RM systems among bacteria (23).

Similar results were found in the phylogenies of cyanobacterial RMs. However, most of

the Type II or the solitary MTases in cyanobacterial genomes seemed to form a separate

clade, in contrast with the wide dispersed distribution of Type I RMs in the phylogeny.

Matveyev et al. (31) considered the solitary MTases to be the ancient orgin within

cyanobacteria. Here, the MTases that fell out of the cyanobacterial clade were all solitary

ones (Fig.5), which shared higher similarities with their homologs in proteobacteria or

firmicutes, whereas the solitary MTases in Anabaena variabilis ATCC 29413 and

Spirulina platensis were all tightly grouped with the Type II MTases, indicating the

horizontal gene transfer of the solitary MTases between cyanobacteria and other bacteria.

In Spirulina, most of the MTases were identified to be orphan genes with the absence of

their cognate REases in the closely linked loci. One explanation may be a failure

detection of the REase gene because of its high divergence or pseudogenization. The

other concern was that RM genes were sequentially transferred into the host genome,

with the modification gene being transferred first. These solitary genes may acquire a

unique function compared with other MTases in RM systems (23, 47). Most of the Type I

RM systems in cyanobacteria, however, seemed to be transferred en bloc. Three subunits

of Type I RM system, hsdM, hsdS, and hsdR, are usually closely linked. As shown in

Fig.4 and Fig.6, the similar phylogenies of HsdM and HsdR and several highly coevolved

motifs between two subunits suggest that en bloc lateral transfer of such linked genes and

cooperative evolution should be more common in the Type I RM systems. After the host

16

acquires the novel RM genes, sequence permutation or gene duplication seems to be the

major mechanism in the diversification of RM systems (5, 16).

In Type I RM system, HsdR, HsdM and HsdS form a complex (R2M2S1) to perform

both endonuclease and methyltransferase activities (4, 34, 55). In comparison with the

extensive literatures describing the conserved motifs in HsdM and HsdR and elucidating

their functions, covariation analyses of both subunits are rare. Previous studies identified

seven conserved motifs in the DEAD-Box of HsdR, and all these motifs are involved in

the ATP binding/hydrolysis reaction (13). Here, we applied the CRASP method (1) to

estimate correlation coefficients of six conserved domains in hsdM and hsdR proteins,

and found a significant (at the 99.9% level) pairwise correlation among these domains. In

hsdR subunit, motif I and II are directly involved in catalyzing the NTP hydrolysis

reaction; motifs Ia interacts with the sugar-phosphate backbone, whereas motif III is

involved in hydrogen bond and stacking interactions with the DNA bases (13). Korolev et

al. (27) found that residues in motifs Ia and III are in close proximity to a loop containing

the aspartate and glutamate residues in motif II, suggesting a potential role for these

regions in the transduction of energy from the ATPase site to the DNA molecule.

Site-directed mutations in these motifs usually result in greatly decreased ATP hydrolysis

and eliminated helicase activity (8, 13). Here, we identified that a large number of site

pairs in these motifs are positively or negatively correlated, indicating these residues

should function in forming a flexible hydrophobic hydrolysis pocket. It appears likely

that subtle changes in ATP or DNA binding characteristics might arise through minor

conformational changes associated with the structural interactions among these motifs.

However, further experiments are required to examine such amino acid changes for their

functional implications.

RM genes seem to be frequently exchanged between species (24), and to evolve

quickly after invading into the new host (17). To elucidate the selective pressure under

the diversification of RM genes, the likelihood ratio test was employed to identify

whether those common ancestral genes were under positive selection after they were

transferred into the new hosts. Unlike the ‘arm race’ effect occurring on other

defense-related genes, no signature of adaptive evolution was found in the RM systems

except some groups of MTases. Although we cannot rule out that some groups of genes in

17

these families are under positive selection, it is in fact that many genes, especially for

Type I RM systems, are pseudogenes in various states of decomposition (26, 31).

Pairwise dN/dS analysis of Type I RM genes from closely related species also showed

strong signatures of relaxed selective constraints rather than positive selection (Data not

shown). These findings strongly suggest that the evolution of RM genes does not follow

the same pattern as most of the defense-related genes.

An exception is the MTase gene family, which has a total of 10.2% sites with

elevated rates of nonsynonymous substitution. The majority of these residues could be

precisely located on the MTase tertiary structure, which provides a structural

interpretation of adaptive evolution. Two positively selected residues are located in the

DNA binding pocket, whereas the other five residues are located between motif III and

IV (126S, 131I), or between VI and VII (206Q, 213E, 217N) when they were mapped

onto the secondary structure of M.HaeIII (3, 39). Extensive studies revealed that the

variable region between motif VIII and IX determines specificity (6, 20, 33, 39).

However, Timar et al. (49) found that two amino acid replacements between motif VI and

VII resulted in relaxed recognition specificity of SinI DNA-methyltransferase, suggesting

some specificity-determining elements should be located in the large domain. LRT

analysis also suggested these regions were subjected to strong adaptive evolution, and

some residues replacements in these motifs might provide structural variations and

thereby generated novel characteristics for the recognition of new target sites. It still

remains unclear how positive selection really operated on such specificity-determining

domains. Further studies focusing on the intraspecific homologous genes would be useful

to address the role of selective pressure in such families.

ACKNOWLEDGMENTS

This work was supported by grants from Key Innovative Project (KZCX3-SW-215) of

the Chinese Academy of Sciences.

REFERENCES

1. Afonnikov DA and Kolchanov NA. CRASP: a program for analysis of coordinated

substitutions in multiple alignments of protein sequences. Nucleic Acids Res 32:

W64-W68, 2004.

2. Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, and Dress AW. Correlations

18

among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol

Biol Evol 17: 164-178, 2000.

3. Bujnicki JM. Homology modelling of the DNA 5mC methyltransferase M.BssHII. Is

permutation of functional subdomains common to all subfamilies of DNA

methyltransferases? Int J Biol Macromol 27: 195-204, 2000.

4. Bujnicki JM. Understanding the evolution of restriction-modification systems: Clues

from sequence and structure comparisons. Acta Biochim Pol 48: 935-967, 2001.

5. Bujnicki JM. Sequence permutations in the molecular evolution of DNA

methyltransferases. BMC Evol Biol 2: 3, 2002.

6. Cheng X, Kumar S, Posfai J, Pflugrath JW, and Roberts RJ. Crystal structure of

the HhaI methyltransferase complexed with S-adenosylmethionine. Cell 74: 299-307,

1993.

7. Crooks GE, Hon G, Chandonia JM, and Brenner SE. WebLogo: A sequence logo

generator [online]. http://weblogo.berkeley.edu [2004].

8. Davies GP, Powell LM, Webb JL, Cooper LP, and Murray NE. EcoKI with an

amino acid substitution in any one of seven DEAD-box motifs has impaired ATPase and

endonuclease activities. Nucleic Acids Res 26: 4828-4836, 1998.

9. Dufresne A, Garczarek L, and Partensky F. Accelerated evolution associated with

genome reduction in a free-living prokaryote. Genome Biol 6: R14, 2005.

10. Enomoto M, Oosawa K, and Momota H. Mapping of the pin locus coding for a

site-specific recombinase that causes flagellar-phase variation in Escherichia coli K-12. J

Bacteriol 156: 663-668, 1983.

11. Feder M and Bujnicki JM. Identification of a new family of putative PD-(D/E)XK

nucleases with unusual phylogenomic distribution and a new type of the active site. BMC

Genomics 6: 21, 2005.

12. Guex N, Diemand A, and Peitsch MC. Protein modelling for all. Trends Biochem

Sci 24: 364-367, 1999.

13. Hall MC and Matson SW. Helicase motifs: the engine that powers DNA

unwinding. Mol Microbiol 34: 867-877, 1999.

14. Henikoff S and Henikoff JG. Position-based sequence weights. J Mol Biol 243:

574-578, 1994.

19

15. Ibanez M, Alvarez I, Rodriguez-Pena JM, and Rotger R. A ColE1-type plasmid

from Salmonella enteritidis encodes a DNA cytosine methyltransferase. Gene 196:

145-158, 1997.

16. Jeltsch A. Circular permutations in the molecular evolution of DNA

methyltransferases. J Mol Evol 49: 161-164, 1999.

17. Jeltsch A and Pingoud A. Horizontal gene transfer contributes to the wide

distribution and evolution of type II restriction-modification systems. J Mol Evol 42:

91-96, 1996.

18. Kawata Y, Yano S, Kojima H, and Toyomizu M. Transformation of Spirulina

platensis strain C1 (Arthrospira sp. PCC9438) with Tn5 transposase-transposon

DNA-cation liposome complex. Mar Biotechnol 6: 355-363, 2004.

19. Kinch LN, Ginalski K, Rychlewski L, and Grishin NV. Identification of novel

restriction endonuclease-like fold families among hypothetical proteins. Nucleic Acids

Res 33: 3598-3605, 2005.

20. Klimasauskas S, Nelson JL, and Roberts RJ. The sequence specificity domain of

cytosine-C5 methylases. Nucleic Acids Res 19: 6183-6190, 1991.

21. Kneale GG. A symmetrical for the domain structure of type I DNA

methyltransferases. J Mol Biol 243: 1-5, 1994.

22. Kobayashi I. Behavior of restriction-modification systems as selfish mobile elements

and their impact on genome evolution. Nucleic Acids Res 29: 3742-3756, 2001.

23. Kobayashi I. Restriction-modification systems as minimal forms of life. In:

Restriction endonucleases, edited by Pingoud A. Berlin: Springer-Verlag, 2004, p.19-62.

24. Kobayashi I, Nobusato A, Kobayashi-Takahashi N, and Uchiyama I. Shaping the

genome restriction-modification systems as mobile genetic elements. Curr Opin Genet

Dev 9: 649-656, 1999.

25. Konstantinidis KT and Tiedje JM. Trends between gene content and genome size

in prokaryotic species with larger genomes. Proc Natl Acad Sci USA 101: 3160-3165,

2004.

26. Kong H, Lin LF, Porter N, Stickel S, Byrd D, Posfai J, and Roberts RJ.

Functional analysis of putative restriction-modification in the Helicobacter pylori J99

genome. Nucleic Acids Res 28: 3216-3223, 2000.

20

27. Korolev S, Hsieh J, Gauss GH, Lohman TM, and Waksman G. Major domain

swiveling revealed by the crystal structures of complexes of E.coli Rep helicase bound to

single-stranded DNA and ADP. Cell 90: 635-647, 1997.

28. Kumar S, Tamura K, and Nei M. MEGA3: Integrated software for molecular

evolutionary genetics analysis and sequence alignment. Brief Bioinform 5: 150-163,

2004.

29. Lyra C, Halme T, Torsti A-M, Tenkanen T, and Sivonen K. Site-specific

restriction endonuleases in cyanobacteria. J Appl Microbiol 89: 979-991, 2000.

30. MacWilliams MP and Bickle TA. Generation of new DNA binding specificity by

truncation of the type IC EcoDXXI hsdS gene. EMBO J 15: 4775-4783, 1996.

31. Matveyev AV, Young KT, Meng A, and Elhai J. DNA methyltransferases of the

cyanobacterium Anabaena PCC7120. Nucleic Acids Res 29: 1491-1506, 2001.

32. Meeks JC, Elhai J, Thiel T, Potts M, Larimer F, Lamerdin J, Predki P, and

Atlas R. An overview of the genome of Nostoc punctiforme, a multicellular symbiotic

cyanobacterium. Photosynth Res 70: 85-106, 2001.

33. Mi S and Roberts RJ. How M.MspI and M.HpaII decide which base to methylate.

Nucleic Acids Res 20: 4811-4816, 1992.

34. Murray NE. Type I restriction systems: sophisticated molecular machines (a legacy

of Bertani and Weigle). Microbiol Mol Biol Rev 64: 412-434, 2000.

35. Nicholls A, Sharp KA, and Honig B. Protein folding and association: insights from

the interfacial and thermodynamic properties of hydrocarbons. Proteins: Stuct Func and

Genet 11: 281-296, 1991.

36. Nobusato A, Uchiyama I, and Kobayashi I. Diversity of restriction-modification

gene homologues in Helicobacter pylori. Gene 259: 89-98, 2000.

37. Piechula S, Waleron K, Swiatek W, Biedrzycka I, and Podhajska AJ. Mesophilic

cyanobacteria producing thermophilic restriction endonucleases. FEMS Microbiol Lett

198: 135-140, 2001.

38. Pollock DD and Taylor WR. Effectiveness of correlation analysis in identifying

protein residues undergoing correlated evolution. Protein Eng 10: 647-657, 1997.

39. Reinisch KM, Chen L, Verdine GL, and Lipscomb WN. The crystal structure of

HaeIII methyltransferase convalently complexed to DNA: an extrahelical cytosine and

21

rearranged base pairing. Cell 82: 143-153, 1995.

40. Roberts RJ, Belfort M, Bestor T, Bhagwat AS, Bickle TA, Bitinaite J, Blumenthal

RM, Degtyarev SK, Dryden DTF, Dybvig K, Firman K, Gromova ES, Gumport RI,

Halford SE, Hattman S, Heitman J, Hornby DP, Janulaitis A, Jeltsch A, Josephsen J,

Kiss A, Klaenhammer TR, Kobayashi I, Kong H, Kruger DH, Lacks S, Marinus MG,

Miyahara M, Morgan RD, Murray NE, Nagaraja V, Piekarowicz A, Pingoud A,

Raleigh E, Rao DN, Reich N, Repin VE, Selker EU, Shaw P, Stein DC, Stoddard BL,

Szybalski W, Trautner TA, Van Etten JL, Vitor JMB, Wilson GG, and Xu S. A

nomenclature for restriction enzymes, DNA methyltransferases, homing endonucleases

and their genes. Nucleic Acids Res 31: 1805-1812, 2003.

41. Roberts RJ, Vincze T, Posfai J, and Macelis D. REBASE - Restriction enzymes and

DNA methyltransferases. Nucleic Acids Res 33: D230-D232, 2005.

42. Rocha EPC, Danchin A, and Viari A. Evolutionary role of restriction/modification

systems as revealed by comparative genome analysis. Genome Res 11: 946-958, 2001.

43. Ruan H, Lunnen KD, Scott ME, Moran LS, Slatko BE, Pelletier JJ, Hess EJ,

Benner J, Wilson GG, and Xu S-Y. Cloning and sequence comparison of AvaI and

BsoBI restriction-modification systems. Mol Gen Genet 252: 695-699, 1996.

44. Schouler C, Gautier M, Duusko Ehrlich S, and Chopin MC. Combinational

variation of restriction modification specificities in Lactococcus lactis. Mol Microbiol 28:

169-178, 1998.

45. Schultz J, Milpetz F, Bork P, and Ponting CP. SMART, a simple modular

architecture research tool: Identification of signaling domains. Proc Natl Acad Sci USA

95: 5857-5864, 1998.

46. Sonnhammer ELL, Eddy SR, Birney E, Bateman A, and Durbin R. Pfam:

Multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res

26: 320-322, 1998.

47. Takahashi N, Naito Y, Handa N, and Kobayashi I. A DNA methyltransferase can

protect the genome from postdisturbance attack by a restriction-modification gene

complex. J Bacteriol 184: 6100-6108, 2002.

48. Thompson JD, Higgins DG, and Gibson TJ. CLUSTAL W: Improving the

sensitivity of progressive multiple sequence alignment through sequence weighting,

22

position-specific gap penalties and weight matrix choice. Nucleic Acids Res

22:4673-4680, 1994.

49. Timar E, Groma G, Kiss A, and Venetianer P. Changing the recognition specificity

of a DNA-methyltransferase by in vitro evolution. Nucleic Acids Res 32: 3898-3903,

2004.

50. Tomii K and Kanehisa M. Analysis of amino acid indices and mutation matrices for

sequence comparison and structure prediction of proteins. Protein Eng 9:27-36, 1996.

51. Tominaga A. The site-specific recombinase encoded by pinD in Shigella dysenteriae

is due to the presence of a defective Mu prophage. Microbiol 143: 2057-2063, 1997.

52. Toyomizu M, Suzuki K, Kawata Y, Kojima H, and Akiba Y. Effective

transformation of the cyanobacterium Spirulina platensis using electroporation. J Appl

Phycol 13: 209-214, 2001.

53. Tragut V, Xiao J, Bylina EJ, and Borthakur D. Characterization of DNA

restriction-modified systems in Spirulina platensis strain pacifica. J Appl Phycol 7:

561-564, 1995.

54. Vingron M and Argos P. A fast and sensitive multiple sequence alignment algorithm.

CABIOS 5: 115-121, 1989.

55. Wilson GG and Murray NE. Restriction and modification systems. Annu Rev Genet

25: 585-627, 1991.

56. Wong WSW, Yang ZH, Goldman N, and Nielsen R. Accuracy and power of

statistical methods for detecting adaptive evolution in protein coding sequences and for

identifying positively selected sites. Genetics 168: 1041-1051, 2004.

57. Yang ZH. PAML: a program package for phylogenetic analysis by maximum

likelihood. CABIOS 13: 555-556, 1997.

58. Yang Z and Bielawski JP. Statistical methods for detecting molecular adaptation.

Trends Ecol Evol 15: 496-503, 2000.

59. Yang Z, Nielsen R, Goldman N, Pedersen AM. Codon substitution models for

heterogeneous selection pressure at amino acid sites. Genetics 15: 1600-1611, 2000.

23

Figure 1. Organization of the gene clusters potentially encoding RM system of Spirulina platensis.

Arrows represent the direction of translation and the relative sizes of ORFs deduced from analysis of

the nucleotide sequence. Gene names are given on top of the corresponding region. White arrows

indicate the flanking ORFs; other types of arrows indicate different subfamilies of RM system. “×” in

the arrow indicates the nonsense codon. Gene clusters encoding Type I RM genes are marked with

Roman numerals (I-VI); the putative Type II RM genes or solitary MTases are marked with Arabic

numerals (1-11).

24

Figure 2. RT-PCR analysis of the expression of RM genes in Spirulina platensis. See Materials

and Methods for details.

25

Figure 3. Genomic organization of RM genes in unicellular (D: Synechocystis sp. 6803, E: Synechococcus elongtus) and filamentous

cyanobacteria (A: Spirulina platensis, B: Anabaena variabilis ATCC 29413, C: Anabaena sp. 7120). Vertical bars show the locations of RM genes,

with different colors representing different Pfam domains: red- Methylase_S, light green- HSDR_N, purple- DEXDc, blue- Methylase_M, black-

N6_Mtase, orange- MethyltransfD12, dark green- REase (not the Pfam nomenclature). The long horizontal line indicates the chromosome, whereas the

short horizontal lines denote the extranuclear plamids. A vertical bar above a horizontal line indicates the transcriptional direction opposite to that below

a horizontal line. The positions and orders of RM genes in B-E were drawn based on their genome assemblies. Note that Spirulina genome (A) was still

in draft format, and RM genes were mapped onto the chromosome evenly. Connecting lines represent similar domains (Blastp search, score>90 and

E-value<e-9) between genomes. Sequence similarities of among Type I RM genes from C/D/E were implied from the dendrogram constructed through

the NJ method, shown in black. The value in this dendrogram represents the bootstrap value.

26

Figure 4. The concordance of HsdM (left) and HsdR (right) protein phylogeny constructed with

NJ method. Dashed lines link the cognate hsdM and hsdR genes in the same gene cluster. Colored

branches and symbols indicate different sources of RM genes (red: α-proteobacteria, light green:

β-proteobacteria, brown: γ-proteobacteria, purple: δ-proteobacteria, light blue: ε-proteobacteria, green:

firmicutes, grey blue: spirochaetes, grey: chlorobi, blue: cyanobacteria, dark: archaea).

27

Figure 5. Unrooted NJ tree for 16s rDNA (A) and MTase (B) in prokaryotic organisms. Colored

branches and symbols indicate different sources of RM genes (red: α-proteobacteria, light green:

β-proteobacteria, brown: γ-proteobacteria, purple: δ-proteobacteria, light blue: ε-proteobacteria, green:

firmicutes, grey blue: spirochaetes, grey: chlorobi, blue: cyanobacteria, dark: archaea).

28

Figure 6. Pairwise matrix representation of the coevolving site pairs in hsdM and hsdR. Two

amino acids characteristics, size (below the diagonal) and hydrophilicity (above the diagonal), were

used for coevolution analysis. Red color in the matrix indicates the positively coevolving site pairs;

blue color is for the negatively coevolving site pairs. The critical value for the correlation coefficients

is 0.5253 at the 99.9% level. Logos representation of an alignment of the hsdM-motif I and IV, and

hsdR-motif I, Ia, II, III from the sequences listed in Fig.4. Several mostly conserved sites, which were

excluded from coevolution analysis, were added below the logo alignment. The invariable residues

were in capital letter, and variable sites were in lowercase, with the short and black line indicating

their locations. Blue shading shows the coevolving site pairs between hsdM and hsdR, which were

fully discussed in the text.

29

Figure 7. Mapping of the positively selected sites (in blue color) in MTase onto their tertiary

structures. (A) Ribbon diagram of AV-Mtase2 derived from homology modeling. (B) A

representation of the molecular surface of the MTase dimer. DNA molecules are in orange color.

Swiss-PDBviewer (12) and Grasp2 (35) were used for all structural manipulations.