www.sciencemag.org/cgi/content/full/339/6124/1207/DC1
Supplementary Materials for
Gene Transfer from Bacteria and Archaea Facilitated Evolution of an Extremophilic Eukaryote
Gerald Schönknecht,* Wei-Hua Chen, Chad M. Ternes, Guillaume G. Barbier, Roshan P. Shrestha, Mario Stanke, Andrea Bräutigam, Brett J. Baker, Jillian F. Banfield, R. Michael Garavito, Kevin Carr, Curtis Wilkerson, Stefan A. Rensing, David Gagneul, Nicholas E.
Dickenson, Christine Oesterhelt, Martin J. Lercher, Andreas P. M. Weber*
*To whom correspondence should be addressed. E-mail: [email protected] (G.S.);
[email protected] (A.P.M.W.)
Published 8 March 2013, Science 339, 1207 (2013) DOI: 10.1126/science.1231707
This PDF file includes:
Materials and Methods
Supplementary Text
Figs. S1 to S27
Tables S1 to S3
Caption for data table S4
References (21–79)
Other Supplementary Material for this manuscript includes the following: (available at www.sciencemag.org/cgi/content/full/339/6124/1207/DC1)
Data table S4 (Microsoft Excel file)
Materials and Methods
Strains and Media
Galdieria sulphuraria strain 074W was cultivated axenically in minimal mineral medium supplemented with 25 mM Gal at 37°C in the dark as described previously (3).
DNA isolation and sequencing
G. sulphuraria cells were harvested from heterotrophic cultures and ground to a fine powder with a mortar and pestle. Total nucleic acids were extracted by incubating the ground tissue overnight in 50 mM Tris-Cl pH 7.5, 5 mM EDTA, and 1% (w/v) SDS, followed by extraction of proteins with phenol:chloroform:isoamyl alcohol (24:24:1) and precipitation of DNA from the aqueous phase with ethanol. The pellet was dissolved in 10 mM Tris-Cl pH 7.5, 1 mM EDTA, and RNA was removed by incubation with DNase-free RNase, followed by deproteination with phenol:chloroform:isoamyl alcohol and DNA precipitation with ethanol. Nuclear DNA was further purified by CsCl density gradient centrifugation of bis-benzamide-treated total DNA (21).
Three different libraries were generated for sequencing of genomic DNA: (i) a small-insert (approx. 2 kb inserts) shotgun sequencing plasmid library was constructed in pSMART-HC Kan (Lucigen, Middleton, WI; www.lucigene.com); (ii) a Fosmid library containing 40 kbp inserts was constructed by physically shearing genomic DNA and ligating end-repaired, size-fractionated DNA into the Fosmid vector pCC1FOS, followed by packaging into lambda phages; (iii) two BAC libraries (BamHI, HindIII) containing >100 kbp inserts were constructed by the TAMU GENEfinder Genomics Resource Center (Texas A&M University, College Station, TX).
DNA sequencing was conducted using fluorescence-labeled dye terminators on an ABI 3730xl capillary DNA sequencing system. Sequence data and chromatograms were stored on a Geospiza Finch server (Geospiza, Seattle, WA). After cleaning the raw sequence data (base quality > Q20, exclusion of contaminating vector or bacterial sequences), 1,769 and 1,699 sequence reads were obtained from the BamHI and HindIII BAC libraries, respectively, 5,817 sequence reads from the Fosmid library, and 190,911 sequence reads from three different small-insert libraries. A total of 147,097,538 Q20 bases were thus sequenced by Sanger technology, yielding approx. 10-fold genome coverage, based on an approximate genome size of 14 Mb. In addition, we generated 163,727 sequence reads (8.55-fold genome coverage) from physically sheared genomic DNA using a GS20 Genomic Sequencer (Roche, Indianapolis, IN). A hybrid (i.e., Sanger and 454) assembly using the ARACHNE genome assembler (22) yielded 433 scaffolds with an N50 of 172,322 bases and a total scaffold length of 13,712,004 bases, of which 292,650 bases (2.1%) are gaps. mRNA was sequenced as described previously (23).
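The coverage and gap figures quoted above follow from simple ratios, and the N50 statistic has a standard definition. The sketch below is only an illustration: the function names are hypothetical, the toy scaffold lengths are invented, and the N50 function uses the common definition (the scaffold length at which the cumulative length of the largest scaffolds reaches half the assembly), which may differ in detail from what ARACHNE reports.

```python
def fold_coverage(sequenced_bases: int, genome_size: int) -> float:
    """Average number of times each genome position is covered."""
    return sequenced_bases / genome_size


def gap_fraction(gap_bases: int, scaffold_bases: int) -> float:
    """Fraction of the total scaffold length that consists of gaps."""
    return gap_bases / scaffold_bases


def n50(lengths):
    """Length L such that scaffolds of length >= L hold at least half of all bases."""
    total = sum(lengths)
    if total == 0:
        raise ValueError("need at least one scaffold with non-zero length")
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Values taken from the text; 14 Mb is the approximate genome size stated there.
sanger_coverage = fold_coverage(147_097_538, 14_000_000)
gaps = gap_fraction(292_650, 13_712_004)
print(f"Sanger coverage: {sanger_coverage:.1f}-fold")
print(f"Gaps: {100 * gaps:.1f}% of total scaffold length")

# Invented toy scaffold lengths, only to show how N50 behaves.
print("Toy N50:", n50([10, 8, 6, 4, 2]))
```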
Generation of gene models and annotation of protein coding genes
The AUGUSTUS (24) gene prediction program was used for the prediction of genes on the ARACHNE-generated scaffolds. AUGUSTUS was trained with EST-sequences generated by Sanger-technology (25) and by mRNA sequence data generated from two different normalized libraries (cells grown autotrophically and heterotrophically) using a GS20 sequencing system. The genes were then predicted with AUGUSTUS (24) integrating evidence from the mRNA sequencing and from proteins of C. merolae, but also allowing for ab initio predicted models in absence of such data. As a result 6,623 protein coding genes were predicted, 551 with two splice
variants, and three with three splice variants. Routine analyses were performed with splice variant 1, which is most confident according to the model of AUGUSTUS. Functions were assigned to the 6,623 predicted major isoforms of proteins based on sequence similarity to annotated genes from NCBI non-redundant (nr) protein database (26), NCBI Conserved Domain Database (CDD) (27), Arabidopsis thaliana genome (28), C. merolae genome (7), KEGG (29), and UniProt (30) databases. In addition, Gene Ontology (GO) (31), Mercator (32), InterProScan (33), and HHpred (34) were used to identify protein functions. Similarity searching was done with the BLAST program with soft masking being turned on (-F “m S”) (35). A BLAST bit-score cutoff of 50 was used, unless stated otherwise. Membrane transporter proteins were annotated with the TransAAP tool from TransportDB (36), with TransportTP (37), and by BLAST against the Transporter Classification Database (TCDB) (38). SignalP 4.0 (39) was used to search for secretory signal peptides. To test for enrichment of certain functional categories in subsets of annotated proteins High-Throughput GoMiner (40) was used. For this analysis G. sulphuraria proteins were substituted by UniProt (30) identifiers of best BLAST hits.
Generation of protein families
G. sulphuraria gene families were constructed to investigate both the size and functions of proteins associated with these families. We performed an all-against-all BLAST (35) using protein sequences in G. sulphuraria, parsed the bit scores from the BLAST output and submitted them to MCL (Markov Clustering; run with default parameters) to build gene families. Each resulting gene family contains either multiple proteins (table S2) that were derived from a common ancestor through duplication events or only a single protein (singlet). InParanoid version 2 (41) with default parameters was used to search for orthologous relationships between proteins in G. sulphuraria and proteins from 208 sequenced genomes (http://dx.doi.org/10.5061/dryad.84r5q) of common model organisms, thermophilic bacteria and archaea, as well as genomes containing top BLAST hits of G. sulphuraria proteins in the NCBI nr database (26). We chose orthologous groups with score 1 for further analysis when testing for horizontal gene transfer (HGT; see below). To investigate the evolutionary histories of G. sulphuraria protein families, we used a method similar to Merchant et al. (42) to expand G. sulphuraria gene families with orthologous proteins (identified using InParanoid, see above) plus paralogous proteins of the identified orthologs from 43 eukaryotic genomes (http://dx.doi.org/10.5061/dryad.84r5q). These expanded protein families were used for phylogenetic analyses and to compare expansion or reduction of a specific protein family among different eukaryotic species.
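The step from all-against-all BLAST output to MCL input can be sketched as follows. This is a hedged illustration, not the authors' script: it assumes 12-column tabular BLAST output with the bit score in the last column, keeps the best bit score per unordered protein pair, drops self-hits, and applies the bit-score cutoff of 50 mentioned earlier; all three choices are assumptions, and `blast_to_abc` is a hypothetical helper name.

```python
def blast_to_abc(blast_lines, min_bits=50.0):
    """Convert tabular BLAST hits into 'A<TAB>B<TAB>weight' lines for clustering.

    Assumes 12 tab-separated columns per hit, with the bit score in column 12.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        query, subject, bits = fields[0], fields[1], float(fields[11])
        if query == subject or bits < min_bits:
            continue  # assumption: ignore self-hits and weak hits
        pair = tuple(sorted((query, subject)))
        best[pair] = max(bits, best.get(pair, 0.0))
    return [f"{a}\t{b}\t{bits:g}" for (a, b), bits in sorted(best.items())]


# Invented example hits (identifiers and scores are not from the paper).
example = [
    "g1\tg2\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "g2\tg1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-45\t180",
    "g1\tg1\t100.0\t100\t0\t0\t1\t100\t1\t100\t0.0\t500",
    "g1\tg3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.1\t30",
]
for row in blast_to_abc(example):
    print(row)
# The resulting lines could be written to a file and clustered with MCL in its
# label ("abc") input mode; consult the MCL manual for the exact invocation.
```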
Identification of horizontal gene transfer (HGT) candidates
All HGT candidates were identified by phylogenetic analyses. Detailed phylogenetic trees were constructed manually (see below) for most large protein families (i.e., those with ten or more paralogs; table S2), for some protein families with best BLAST hits (NCBI, nr) in bacterial or archaeal sequences, and for some protein families with protein domains (PFAM) and/or annotations characteristic for Bacteria or Archaea. This resulted in the identification of several HGT candidates.
In addition, a genome-wide, systematic screen for HGT candidates was performed using bioinformatics methods. To estimate a lower bound for HGT into G. sulphuraria, we aimed to identify only unequivocal cases of HGT. Proteins giving only hits in bacterial or archaeal sequences when blasted against our 208-species database were further analyzed as described
below (criteria a to e in the next paragraph). For proteins giving best BLAST hits in bacterial or archaeal sequences in addition to hits in eukaryotic sequences, we performed a phylogenetic analysis using the following procedure. A multiple sequence alignment (MSA) for each of the proteins and their orthologous proteins (identified using InParanoid) from the same 208 genomes was constructed using MUSCLE (43), with the maximum number of iterations set to 100, followed by GBLOCKS (44) with parameters ‘-b3=8 -b4=2 -n=y’ to remove poorly aligned regions. We selected the best protein evolution model for each MSA using ProtTest (45), and used it to reconstruct the phylogenetic relationships for the proteins in the MSA with PhyML (46). To select putative HGT candidates with statistical significance, we performed a RELL analysis implemented in the ‘codeml’ command of the PAML package (47) on each phylogenetic tree indicating HGT. This test compares the protein family tree obtained from our analysis, indicating HGT from a bacterium or archaeon into G. sulphuraria, with the best tree that enforces monophyly of eukaryotic sequences. HGT is only assumed if the tree enforcing monophyly of eukaryotic sequences gets a RELL bootstrap support (pRELL) of 5% or less out of 10,000 replicates in this analysis. Following this systematic bioinformatics screen, the phylogenetic tree for each statistically significant (according to the RELL analysis) HGT candidate was manually checked and only accepted when a clear pattern of horizontal gene transfer was observed.
Very stringent criteria were used for our phylogenetic analyses: a) sequences that were too short (<150 amino acids) to build reliable MSAs were not accepted; b) phylogenetic trees with fewer than ten species were excluded; c) genes that were potentially transferred from cyanobacteria were only accepted as HGT candidates when homologs were absent from other photosynthetic eukaryotes, and when the annotation did not indicate a function in photosynthesis, to discriminate against endosymbiotic gene transfer; d) in cases where a corresponding phylogenetic tree did not allow conclusions about the origin of a gene in G. sulphuraria, the gene was removed; e) in order to check whether a potential HGT candidate had significant sequence similarity with proteins from eukaryotic species not included in the 208 genomes used for the systematic bioinformatics screen, each potential HGT candidate was submitted to the NCBI Web BLAST service (nr) and a tree of best BLAST hits was generated by the Tree View option; HGT candidates that were not confirmed by a tree of best BLAST hits were not accepted. In cases where an identified HGT candidate was a member of a protein family with two or more paralogs, and the other paralogs had not been identified as HGT candidates, a detailed phylogenetic tree was constructed for the complete protein family to decide which family members were HGT candidates. To facilitate manual inspection of the phylogenetic trees from the systematic bioinformatics screen, we colored tree branches and leaf labels with predefined rules using ColorTree (48).
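The RELL criterion used in the systematic screen can be illustrated with a minimal sketch. It assumes that per-site log-likelihoods for the two competing trees (the HGT tree and the tree constrained to eukaryotic monophyly) are already available; the function names `rell_support` and `hgt_supported` are hypothetical, ties are counted against the constrained tree, and this is not the codeml implementation.

```python
import random


def rell_support(site_lnl_hgt, site_lnl_constrained, replicates=10_000, seed=0):
    """RELL bootstrap support for the constrained tree.

    Sites are resampled with replacement; the returned value is the fraction of
    replicates in which the constrained tree has the higher summed log-likelihood.
    """
    if len(site_lnl_hgt) != len(site_lnl_constrained):
        raise ValueError("both trees need log-likelihoods for the same sites")
    n_sites = len(site_lnl_hgt)
    if n_sites == 0:
        raise ValueError("no sites given")
    rng = random.Random(seed)
    # Per-site advantage of the constrained tree over the HGT tree.
    diffs = [c - h for h, c in zip(site_lnl_hgt, site_lnl_constrained)]
    wins = 0
    for _ in range(replicates):
        resampled = sum(diffs[rng.randrange(n_sites)] for _ in range(n_sites))
        if resampled > 0:
            wins += 1
    return wins / replicates


def hgt_supported(p_rell_constrained, threshold=0.05):
    """Apply the rule from the text: accept HGT if the constrained tree gets <= 5%."""
    return p_rell_constrained <= threshold


# Invented per-site values in which the HGT tree fits every site slightly better.
hgt_tree = [-1.0] * 50
constrained_tree = [-1.2] * 50
support = rell_support(hgt_tree, constrained_tree, replicates=200)
print(support, hgt_supported(support))
```

Candidates passing this test would still go through the manual checks a) to e) described above.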
Data analysis and generation of phylogenetic trees
Non-linear regression analysis was performed with GraphPad Prism 5.01 (GraphPad Software Inc., La Jolla, CA). Two-tail P values for Fisher's exact tests were calculated as defined by Agresti (49) (at http://www.langsrud.com/stat/fisher.htm ). To generate the phylogenetic trees presented as figures, homologous sequences from protein families (see above) and sequences resulting from BLAST (35) searches (NCBI non-redundant (nr) protein database (26) and KEGG (29)) were collected using MEGA (50). At NCBI BLAST runs were carried out separately for different clades to improve broad phylogenomic sampling. Protein sequences from bacteria or archaea that were identified as potential ‘donor’ organisms for HGT into G. sulphuraria were used as BLAST queries to further improve the resolution of phylogenetic trees. Incomplete or
highly redundant sequences (>90% amino acid identity) were removed. Multiple sequence alignments were generated with T-Coffee in the ‘mcoffee’ and the ‘accurate’ modes (51). High-scoring portions of multiple sequence alignments (T-Coffee score 5 to 9) were extracted (44), followed by comparison of the two resulting alignments (‘mcoffee’ versus ‘accurate’). The better alignment, based on scores (T-Coffee scores from 0 to 100), length, number of invariant positions, and visual inspection, was used to construct a phylogenetic tree. The model of protein evolution that best fits each multiple sequence alignment was estimated with ProtTest (45). The best models were used to generate phylogenetic trees with PhyML 3.0 (46), estimating branch support values by non-parametric bootstrap with 200 replicates, and with the MPI version of MrBayes 3.1 (52). MrBayes was run with twelve chains for 5,000,000 generations, sampling every 100th generation; the first 25% of samples were ignored when calculating parameters and the consensus tree. Phylogenetic trees were visualized in Dendroscope (53), MEGA (50), and EvolView (54). Only support values >50% are given.
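The MrBayes settings translate into sample counts by simple arithmetic. The short sketch below uses a hypothetical helper, `mcmc_sample_counts`, and ignores the initial state at generation 0, so the counts are approximate.

```python
def mcmc_sample_counts(generations, sample_every, burnin_fraction):
    """Return (total samples, discarded burn-in samples, retained samples)."""
    total = generations // sample_every
    discarded = int(total * burnin_fraction)
    return total, discarded, total - discarded


# Settings from the text: 5,000,000 generations, every 100th sampled, 25% burn-in.
total, discarded, kept = mcmc_sample_counts(5_000_000, 100, 0.25)
print(f"{total} samples, {discarded} discarded, {kept} retained")
```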
Supplementary Text
Early divergence of G. sulphuraria and C. merolae
The low degree of colinearity between the genomes of G. sulphuraria and C. merolae (fig. S2) indicates an early divergence of these two species. To obtain a rough estimate of when G. sulphuraria and C. merolae diverged, percent amino acid identity of orthologous gene pairs was compared for genomes from different pairs of species, and normalized cumulative frequencies were plotted (fig. S1). At 50% cumulative frequency (median value), amino acid identity for annotated proteins from the two red algal genomes was 44.9%, compared to 45.4% for the species pair Homo sapiens / Drosophila melanogaster. Taking percent amino acid identity as a rough measure for divergence time, this means that G. sulphuraria and C. merolae split slightly earlier (lower amino acid identity) than H. sapiens and D. melanogaster did. The divergence between vertebrates and insects is dated to about 910 (±300) million years ago (55), indicating that, assuming comparable mutation rates, G. sulphuraria and C. merolae diverged about 900 million years ago.
This early divergence is further supported by a comparison of all annotated proteins in the genomes of G. sulphuraria and C. merolae. Only 1,259 annotated proteins (19%) from G. sulphuraria have proteins from C. merolae as best BLAST hits, and only slightly more than 60% (4,017 proteins) give a significant BLAST hit (score >50) with C. merolae. The latter percentage is comparable to that of BLAST hits with proteins from Arabidopsis thaliana (4,032 proteins), emphasizing the early divergence of the two red algae. These data are in agreement with phylogenetic studies of the Cyanidiophyceae showing a very early divergence of the lineages leading to G. sulphuraria and C. merolae (56, 57). Based on estimates that the Cyanidiophyceae split from the other red algae (Rhodophyta) approximately one billion years ago (57, 58), this also indicates that G. sulphuraria and C. merolae probably diverged about 900 million years ago.
Horizontal gene transfer (HGT)
While HGT has been observed for unicellular eukaryotes (59), its scope and function in eukaryotic evolution are not well understood. To develop an idea about the extent of horizontal gene transfer in G. sulphuraria, and to statistically confirm HGT candidates, an independent, genome-wide screen combining bioinformatics methods and manual inspection was performed (see above). This search was limited to genes that were acquired specifically by
Cyanidiophyceae (G. sulphuraria and C. merolae), and in most cases the genes were only observed in G. sulphuraria and no other eukaryotic organism. To be conservative, we excluded genes that might have been acquired before red algae and green plants split (60) and therefore occur in Cyanidiophyceae as well as green algae and/or land plants. Moreover, we included only genes that most likely have originated from Bacteria or Archaea. As outlined above, we attempted to exclude endosymbiotic gene transfer from cyanobacteria, i.e., the massive gene transfer from the cyanobacterial genome of the evolving plastid into the nuclear genome of the host cell during primary endosymbiosis (61). It seems more likely that a gene of potential cyanobacterial origin only observed in G. sulphuraria originates from HGT, compared to the alternative of endosymbiotic gene transfer followed by loss in all other eukaryotic photosynthetic lineages. Moreover, for genes potentially transferred from cyanobacteria to G. sulphuraria, the phylogenetic position of the descendant of the potential ‘donor’ clade (see below; table S4) was compared with the recently published phylogenetic position of the primary plastid within the cyanobacterial tree (62). HGT candidates for which the potential donor was a close neighbor of the primary plastid were excluded. Similarly, it is unlikely that HGT candidates result from endosymbiotic gene transfer that occurred in the process of mitochondrial evolution after the primary endosymbiosis of an alphaproteobacterium. HGT candidates (only observed in G. sulphuraria) of alphaproteobacterial origin seem more likely to originate from HGT, as the alternative hypothesis of endosymbiotic gene transfer requires subsequent losses in all other eukaryotic lineages. Moreover, no HGT candidate shows similarity to sequences from Rickettsia, which are believed to be similar to the primary endosymbiont (63) (table S4).
178 G. sulphuraria proteins had significant BLAST hits (score > 50) only in bacterial or archaeal sequences, meaning that no RELL analysis (see above) could be performed due to a lack of eukaryotic homologs. Out of those 178 proteins, 110 were accepted as HGT candidates after further inspection using the NCBI Web BLAST service (nr) combined with the Tree View option and criteria as described above. Phylogenetic analyses indicated that those 110 HGT candidates resulted from 25 HGT events. Members of the large Archaeal ATPase families with numerous paralogs (see fig. S11) were included here. 618 proteins had best BLAST hits in bacterial or archaeal sequences. Out of those, RELL analyses for maximum likelihood phylogenetic trees confirmed horizontal gene transfer for 163 proteins with statistical significance (5% significance level). From these 163 proteins, 50 were accepted as HGT candidates after further inspection. Phylogenetic analyses indicated that those 50 HGT candidates resulted from 44 HGT events.
Genes were excluded as HGT candidates during further inspection in case the encoded protein a) was too short (<150 amino acids); b) had too few BLAST hits; c) potentially originated from endosymbiotic gene transfer; d) resulted in a phylogenetic tree that did not allow conclusions; e) had significant sequence similarity with proteins from eukaryotic species not included in the 208 genomes used for the systematic bioinformatics screen, resulting in a phylogenetic tree of best BLAST hits (NCBI nr) that did not confirm HGT. These very stringent criteria for the manual inspection of each HGT candidate were aimed at preventing false positives as far as possible. Out of eight cases where a more detailed manual phylogenetic analysis was performed for a protein family that contained HGT candidates identified by the systematic bioinformatics screen, only one phylogenetic tree indicated a false positive (probably caused by limited phylogenomic sampling).
Together with HGT candidates identified during detailed, manual phylogenetic analyses of large protein families and of protein families showing obvious similarity to bacterial or archaeal
proteins, a total of 337 genes resulting from 75 horizontal gene transfers were detected (see table S4). The genome-wide bioinformatics screen detected 69 HGT events, and missed six out of 18 HGT events that were identified during detailed, manual phylogenetic analyses of protein families from G. sulphuraria. The genome-wide bioinformatics screen thus had a non-negligible number of false negatives. This is even more obvious when total numbers of HGT candidates are compared: out of a total of 337, only 160 HGT candidates were identified by the genome-wide bioinformatics screen. The low sensitivity was accepted in favor of high stringency. Some proteins did not give a sufficient number (>10) of significant BLAST hits (score >50) in our 208-species dataset, as for example the Archaeal ATPases of protein family #1, where just one family member out of 133 was detected. In other cases, best BLAST scores in eukaryotic sequences did not correctly predict the position of a G. sulphuraria protein sequence in a detailed, manually constructed phylogenetic tree, also resulting in false negatives in the systematic bioinformatics screen. The rather stringent criteria for the manual inspection of each HGT candidate that aimed at preventing false positives are likely to have increased the number of false negatives as well. Due to the high stringency, the number of 75 HGT events giving rise to 337 genes should be considered a lower bound for HGT in G. sulphuraria.
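The candidate and event counts reported for the genome-wide screen can be cross-checked with a short tally. All numbers below are taken from the text; the variable names and the "sensitivity" label are only illustrative.

```python
# (HGT candidates, HGT events) for the two routes of the genome-wide screen.
SCREEN = {
    "hits only in bacteria/archaea": (110, 25),
    "RELL-confirmed, with eukaryotic hits": (50, 44),
}
TOTAL_CANDIDATES = 337  # all HGT candidate genes (table S4)
TOTAL_EVENTS = 75       # all inferred HGT events
MISSED_EVENTS = 6       # events found only by manual phylogenetic analysis

screen_candidates = sum(candidates for candidates, _ in SCREEN.values())
screen_events = sum(events for _, events in SCREEN.values())
sensitivity = screen_candidates / TOTAL_CANDIDATES

print(f"Screen: {screen_candidates} candidates from {screen_events} events")
print(f"Events including manual analyses: {screen_events + MISSED_EVENTS}")
print(f"Share of candidates found by the screen: {100 * sensitivity:.1f}%")
```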
As a control for contaminations with bacterial sequences during sample preparation and sequencing, all HGT candidate genes were compared by BLASTN against the NCBI nt database. The only significant hit was with a 'Cyanidium caldarium' (a historic name for G. sulphuraria and related Cyanidiophyceae) gene. All HGT candidates were mapped onto the 433 scaffolds, and the percentage of all genes on each scaffold was compared with the percentage of HGT candidates on each scaffold. This comparison did not show any indication for an enrichment of HGT candidates on specific scaffolds, excluding the possibility that HGT candidates were largely located on small scaffolds that cannot be assembled into the overall genome. A comparison of protein sequence lengths showed little difference between HGT candidates and all other proteins.
While bacterial (or archaeal) genes usually do not have introns, 72.4% of G. sulphuraria genes contain introns (table S1). For genes transferred recently from bacteria or archaea into the G. sulphuraria genome one therefore would expect fewer or no introns. Genes inferred to originate from horizontal gene transfer (table S4) on average have 0.80 introns per gene (55% have no intron), less than half of the 2.06 introns per gene for the entire genome (median 1 intron per gene; fig. S6). This significantly smaller number of introns per gene (P = 0.0012, Mann-Whitney test) is expected for genes that are of bacterial or archaeal origin and thus initially did not contain any introns. Similarly, the GC content of genomes from different organisms can differ significantly, and as a result, genes acquired by HGT can deviate in their GC content from the GC content of the entire genome (64). For G. sulphuraria the average GC content of all 6623 annotated genes is 39.9±0.034% compared to 38.5±0.16% for all 337 HGT candidate genes, and 40.6±0.2% for the 120 HGT candidates without Archaeal ATPases (fig. S7). All 217 Archaeal ATPase genes go back to a single HGT event (fig. S9; possibly by a gene with a rather low GC content; fig. S7), but make up almost two thirds of all genes likely to result from HGT. Their inclusion would thus strongly bias comparisons of HGT vs. non-HGT genes. Comparing the GC content of 120 HGT candidates (without Archaeal ATPase genes) to the entire genome indicates a significantly higher GC content for HGT candidates (P = 0.0030, t-test).
Finally, oligonucleotide usage of genomes from different organisms can differ, and as a result, genes acquired by HGT can deviate in their oligonucleotide usage from the rest of the
genome. To test this, we calculated ‘genomic signatures’ (zero order Markov distributions) using di-, tri-, tetra-, and hexanucleotides (65). The signature for each gene is defined by a vector containing the relative frequencies of all words of length n (n = 2, 3, 4, or 6), divided by the relative frequencies of the individual nucleotides (65). We first calculated the signatures of the coding sequences of all non-HGT genes, and averaged them to define a genomic reference signature. We then scored the signature of each individual gene by calculating the Pearson correlation coefficient (R) between the signature of the gene and the genomic reference signature. The dinucleotide score distribution of the 120 HGT candidates (without the Archaeal ATPases) showed a significantly stronger deviation from the genomic reference signature compared to non-HGT genes (P = 0.00034) (fig. S7). We obtained similar results using signatures of tri-, tetra-, and hexanucleotides (P = 0.00011, 0.0014, and 0.011, respectively). The lower average intron number of HGT candidates (fig. S6), the significantly different GC content (fig. S7), and the deviating oligonucleotide usage (fig. S7) can be taken as additional support that these genes in G. sulphuraria did indeed originate from bacterial or archaeal genes that had no introns, on average slightly higher GC contents, and different oligonucleotide usages compared to the core genome of G. sulphuraria.
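The signature scoring described above can be sketched in a few lines. This is one reading of the definition, stated as an assumption: each word's relative frequency is divided by the product of the relative frequencies of its individual nucleotides (the zero-order Markov expectation). The function names and the toy sequences are hypothetical, and words containing ambiguous bases are simply skipped.

```python
from collections import Counter
from itertools import product
from math import sqrt

BASES = "ACGT"


def signature(seq, n=2):
    """Observed/expected frequency for every word of length n, in ACGT order."""
    seq = seq.upper()
    base_counts = Counter(base for base in seq if base in BASES)
    total_bases = sum(base_counts.values())
    if total_bases == 0:
        raise ValueError("sequence contains no A, C, G or T")
    base_freq = {base: base_counts[base] / total_bases for base in BASES}
    words = Counter()
    for i in range(len(seq) - n + 1):
        word = seq[i:i + n]
        if all(char in BASES for char in word):
            words[word] += 1
    total_words = sum(words.values())
    values = []
    for letters in product(BASES, repeat=n):
        word = "".join(letters)
        expected = 1.0
        for char in word:
            expected *= base_freq[char]
        observed = words[word] / total_words if total_words else 0.0
        values.append(observed / expected if expected > 0 else 0.0)
    return values


def mean_signature(signatures):
    """Element-wise average of several signatures (the genomic reference)."""
    return [sum(column) / len(signatures) for column in zip(*signatures)]


def pearson(x, y):
    """Pearson correlation coefficient R between two equally long vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


# Invented toy coding sequences standing in for non-HGT genes and one candidate.
non_hgt = ["ATGGCTAAAGAAACTTTAGGT" * 5, "ATGAAAGATCTTGAAACTGCT" * 5]
reference = mean_signature([signature(gene) for gene in non_hgt])
score = pearson(signature("ATGCGCGGCCGCGTGCTGGCC" * 5), reference)
print(f"R against reference: {score:.3f}")
```

Lower R values indicate a stronger deviation from the reference signature.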
To gain insight into the possible ‘donor’ organisms from which G. sulphuraria might have acquired genes by HGT, we determined for each HGT candidate the organism containing the protein that gave the best BLAST hit, or that was closest in a phylogenetic tree (table S4). Since most HGT events in G. sulphuraria likely occurred a long time ago, the extant strains identified in this way did probably not transfer genetic material into G. sulphuraria, but are likely to be descendants (or close relatives of descendants) of the original ‘donor’ organisms. This analysis indicates that there is no special bacterial or archaeal clade that mainly ‘donated’ genes to G. sulphuraria. Instead, G. sulphuraria acquired genes from a wide variety of different clades of Archaea and Bacteria (fig. S8).
It is becoming increasingly clear that ecological similarity shapes horizontal gene transfer (66), and therefore one would expect that a large fraction of potential gene ‘donors’ lived in the same extreme environment as G. sulphuraria does (and most likely did at the time of the HGTs). Indeed, when looking at the habitat of the bacterial organisms identified as descendants of possible gene donors, almost one third (22 out of 67) are thermophilic or thermoacidophilic, compared to less than 10% of all sequenced bacterial genomes (67 out of 927 as of February 2010). Thus, there is a significant enrichment in thermophilic and thermoacidophilic bacteria among the potential ‘donors’ for HGT into G. sulphuraria (P = 7.8×10⁻⁹, Fisher's exact test; Odds Ratio = 6.28 = (22/45)/(67/860)). In addition, we calculated the number of protein families that have orthologs in extremophilic bacteria or archaea. This resulted in 66 out of 76 protein families containing HGT candidates, compared to 1648 out of 5117 non-HGT protein families (Odds Ratio = 13.89 = (66/10)/(1648/3469)). Thus, there is a significant enrichment of protein families with orthologs in extremophiles among HGT candidates (P = 1.52×10⁻²², Fisher's exact test).
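The odds ratios above can be recomputed directly from the 2×2 tables given in the text. The sketch below also includes a two-sided Fisher's exact test that sums the probabilities of all tables no more likely than the observed one; this is a common definition and is assumed here, so the resulting P values need not match the reported ones digit for digit. The function names are hypothetical.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a / b) / (c / d)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denominator = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    tolerance = 1 + 1e-7  # guards against floating-point ties
    total = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * tolerance
    )
    return min(1.0, total)


# Thermophilic vs. other 'donors' (22, 45) against all sequenced genomes (67, 860).
print(round(odds_ratio(22, 45, 67, 860), 2))
print(fisher_exact_two_sided(22, 45, 67, 860))

# Protein families with orthologs in extremophiles: HGT (66, 10) vs. non-HGT (1648, 3469).
print(round(odds_ratio(66, 10, 1648, 3469), 2))
print(fisher_exact_two_sided(66, 10, 1648, 3469))
```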
Most HGTs from Bacteria or Archaea into G. sulphuraria seem to be ancient according to our phylogenetic analyses. The resulting protein sequences from G. sulphuraria in most cases branch off relatively ‘deep’ within phylogenetic trees, forming rather long branches, and often subsequent gene duplications into several paralogs are observed. In a few instances, HGT even happened before the G. sulphuraria and C. merolae lineages split (see, e.g., fig. S13). This long evolutionary history within the G. sulphuraria genome probably explains why 45% (152 out of
337) of HGT candidates have acquired introns, even though they are of bacterial or archaeal origin.
Excreted proteins
Using a newly designed rRNA-specific oligonucleotide probe (Cya1208, 5’-AGCCCAGGACATCAAAGG-3’), we detected metabolically active Galdieria spp. cells in a subsurface mine, the Richmond Mine in northern California, by FISH (fluorescence in-situ hybridization) (http://dx.doi.org/10.5061/dryad.84r5q). G. sulphuraria cells growing in environments such as the Richmond Mine are dependent on the uptake of organic nutrients from the environment to drive heterotrophic growth, as the absence of light does not permit photosynthesis. It is reasonable to hypothesize that G. sulphuraria is ‘grazing’ on sulfur-oxidizing microbial communities to satisfy its energy needs, since these bacteria are the sole net producers in this environment. Likely, G. sulphuraria is able to degrade the microbial biofilm by secretion of hydrolytic enzymes that release monosaccharides and other small metabolites from the organic polymeric film matrix. Supporting this hypothesis is the fact that G. sulphuraria possesses a highly cross-linked, protein-rich cell wall (67), frequently encrusted with silica (68), that protects the alga from its own extracellular enzymes. Proteomic analyses have indicated that G. sulphuraria excretes several proteins, which may be related to its heterotrophic life style or may be involved in cell wall metabolism (18, 69). Genome-wide analyses of proteins containing a predicted secretory signal peptide (39, 70) showed more than half (138 of 266, or 52%) annotated as ‘hypothetical protein’, significantly more than in the genome as a whole (36%). Similarly, out of 17 proteins that were identified in a proteomic analysis (18) as being excreted (i.e., at least five times more spectral counts in the medium compared to the soluble fraction), eight were annotated as ‘hypothetical proteins’ due to a lack of similarity to any protein with known function.
Fig. S1.
Molecular divergence between G. sulphuraria and C. merolae. As a measure of molecular divergence, percent amino acid identity of orthologous gene pairs was compared for different species pairs. Sets of orthologous gene pairs were identified by reciprocal best BLAST hits with BLAST scores > 50. Normalized (to 100%) cumulative frequencies are plotted against % amino acid identity for six different species pairs. The two red algal genomes display a slightly lower amino acid identity (median 44.9%) compared to Homo sapiens / Drosophila melanogaster (45.4%). The vertebrate – insect divergence probably occurred about 910 million years ago (55), indicating a similar age for the G. sulphuraria – C. merolae split.
[Plot: cumulative frequency (%) versus % amino acid identity for the species pairs G. sulphuraria / T. brucei, G. sulphuraria / H. sapiens, G. sulphuraria / O. sativa, G. sulphuraria / C. merolae, H. sapiens / D. melanogaster, and H. sapiens / S. purpuratus.]
Fig. S2.
Colinear regions between G. sulphuraria and C. merolae genomes. A colinearity plot of all 20 C. merolae chromosomes (Y axis) against 433 G. sulphuraria scaffolds (X axis) was generated with ColinearScan (http://colinear.cbi.pku.edu.cn/#overview) using a minimum BLAST score of 100. Blue dots indicate orthologous genes identified using InParanoid (41), which are linked by red lines if they are in synteny blocks. Genes in one block are not necessarily next neighbors, but may be separated by other genes, to allow for gene loss, gene creation, and/or minor chromosomal re-arrangements.
[Plot: Cyanidioschyzon merolae chromosomes 1 to 20 (Y axis) versus Galdieria sulphuraria scaffolds (X axis); scale bars, 2 Mb.]
12
Fig. S3.
Distribution of distances between coding regions in different unicellular algae, displayed as normalized (to 100%) cumulative frequencies against length (log-scale). Median values in bp are (from smallest to largest): Ostreococcus tauri, 228 (71); G. sulphuraria, 230; Ostreococcus lucimarinus, 284 (72); Thalassiosira pseudonana, 634.5 (73); Phaeodactylum tricornutum, 687 (10); C. merolae, 1404.5 (7); Chlamydomonas reinhardtii, 1534 (42); and Aureococcus anophagefferens, 1563 (74). Color coding: Chlorophyta, Rhodophyta, Stramenopiles.
[Plot: normalized cumulative frequency (%) against distance between coding regions (bp, log scale from 1 to 100000) for G. sulphuraria, C. merolae, P. tricornutum, T. pseudonana, A. anophagefferens, C. reinhardtii, O. lucimarinus, and O. tauri.]
Fig. S4.
Histogram (log-scale) for the number of introns per gene. The distribution of genes with a certain number of introns is described by an exponential decay function (light blue line),
$y = A \exp(-K x)$, with start amplitude A = 2860±59 and rate constant K = 0.47±0.01.
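As an illustration of the exponential decay model in the Fig. S4 caption, the sketch below evaluates $y = A \exp(-K x)$ and recovers A and K from points generated by the model itself. The fitting procedure is an assumption: the caption does not say how the curve was fitted, so a simple least-squares line through (x, ln y) is used here, and the input is synthetic rather than the real histogram.

```python
import math


def exp_decay(x: float, amplitude: float, rate: float) -> float:
    """Exponential decay y = A * exp(-K * x)."""
    return amplitude * math.exp(-rate * x)


def fit_log_linear(xs, ys):
    """Least-squares line through (x, ln y); returns (A, K). Requires every y > 0.

    ASSUMPTION: log-linear regression stands in for whatever fitting
    routine the authors used, which the caption does not specify.
    """
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    slope = sxl / sxx
    intercept = mean_log - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic points generated from the reported parameters (not the real data).
xs = list(range(0, 11))
ys = [exp_decay(x, 2860.0, 0.47) for x in xs]
A_fit, K_fit = fit_log_linear(xs, ys)
```

With K = 0.47, each additional intron multiplies the expected number of genes by exp(-0.47), that is, by roughly 0.6.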
Fig. S5.
Intron and exon size distributions of the G. sulphuraria genome. Size distributions of 13630 introns (top) and 20050 exons (bottom) are displayed as frequency against logarithmic length (log10 bp). (Top) The distribution of logarithmic intron lengths is described by a simple Gaussian distribution with Ampl = 2796±34, Mean = 1.752±0.0005 (56.5 bp), and SD = 0.0363±0.0005. (Bottom) The distribution of logarithmic exon lengths is more complex and was described by the sum of two Gaussian distributions
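$y = \mathrm{Ampl}_1 \exp\left[-0.5\left(\frac{x - \mathrm{Mean}_1}{\mathrm{SD}_1}\right)^2\right] + \mathrm{Ampl}_2 \exp\left[-0.5\left(\frac{x - \mathrm{Mean}_2}{\mathrm{SD}_2}\right)^2\right]$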
with Ampl1 = 926±13, Mean1 = 2.20±0.01 (158 bp), SD1 = 0.299±0.008, Ampl2 = 422±15, Mean2 = 2.95±0.02 (891 bp), and SD2 = 0.27±0.01. Bin width is 0.02 for the upper and 0.5 for the lower graph; note different X-axis scaling in top and bottom graph.
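The parameters in the Fig. S5 caption are given on a log10(bp) axis, so the lengths in parentheses follow from raising 10 to the reported means. The short sketch below checks that conversion and evaluates the two-Gaussian exon model at its first peak. The function names are illustrative; only the parameter values come from the caption.

```python
import math


def gaussian(x: float, ampl: float, mean: float, sd: float) -> float:
    """Single Gaussian term: Ampl * exp(-0.5 * ((x - Mean) / SD)^2)."""
    return ampl * math.exp(-0.5 * ((x - mean) / sd) ** 2)


def exon_model(log10_length: float) -> float:
    """Sum of two Gaussians with the exon parameters reported in the caption."""
    return (gaussian(log10_length, 926.0, 2.20, 0.299)
            + gaussian(log10_length, 422.0, 2.95, 0.27))


def log10_to_bp(log10_length: float) -> float:
    """Convert a position on the log10(bp) axis back to base pairs."""
    return 10.0 ** log10_length


intron_peak_bp = log10_to_bp(1.752)                     # caption: 56.5 bp
exon_peaks_bp = (log10_to_bp(2.20), log10_to_bp(2.95))  # caption: 158 bp and 891 bp
height_at_first_peak = exon_model(2.20)
```

At x = 2.20 the second Gaussian still contributes a few counts, so the summed curve sits slightly above Ampl1 = 926 there.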
Fig. S6.
Genes originating from horizontal gene transfer have fewer introns. The histograms show the percentage of genes with a given number of introns (compare fig. S4). A comparison of all 6623 annotated genes (blue) with the 337 genes likely to originate from horizontal gene transfer (orange; HGT) shows a significantly lower number of introns in genes originating from HGT (average of 0.8 introns per gene compared to 2.06 for all genes; P = 0.0012, Mann-Whitney test).
[Plot: frequency (0% to 50%) against number of introns per gene (0 to 15) for all genes and for HGT candidates.]
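The Fig. S6 caption compares intron counts with a Mann-Whitney test. As a rough sketch of what that test ranks, the code below computes the U statistic directly from pairwise comparisons, counting ties as one half. It deliberately stops at U and the derived probability: the P value reported in the caption would need a tie-corrected null distribution, which is not reproduced here. The intron counts are invented for illustration.

```python
def mann_whitney_u(sample_a, sample_b) -> float:
    """U statistic for sample_a: number of (a, b) pairs with a > b, ties counted as 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical intron counts per gene (not taken from the genome annotation).
all_genes = [0, 1, 2, 2, 3, 4, 5]
hgt_genes = [0, 0, 1, 1, 2]

u_hgt = mann_whitney_u(hgt_genes, all_genes)
u_all = mann_whitney_u(all_genes, hgt_genes)
# Probability that a randomly drawn HGT gene has more introns than a randomly
# drawn gene from the other sample (ties split evenly).
prob_hgt_greater = u_hgt / (len(hgt_genes) * len(all_genes))
```

A value of `prob_hgt_greater` well below 0.5 corresponds to the pattern described in the caption, with HGT candidates tending to carry fewer introns.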
Fig. S7.
Genes originating from horizontal gene transfer differ in GC content (top) and dinucleotide usage (bottom) from the rest of the genome. Top: Histogram of GC content in 1% intervals. A comparison of all 6623 annotated genes (blue) with the 120 HGT candidates (*excluding the 217 genes encoding Archaeal ATPases; orange) shows a significantly higher GC content of HGT candidates (P = 0.0030, t-test). The 217 genes encoding Archaeal ATPases (grey) show a significantly (P = 0.0035) lower GC content compared to all genes (blue). Bottom: Histogram of dinucleotide usage scores in 0.05 intervals. A comparison of all non-HGT genes (green) with the 120 HGT candidates (*excluding the 217 genes encoding Archaeal ATPases; orange) shows different dinucleotide usage scores (‘genomic signatures’). Dinucleotide usage score was defined as Pearson’s correlation coefficient R between normalized dinucleotide frequencies in each gene and the corresponding mean normalized frequencies across all G. sulphuraria non-HGT genes (for details see text). As all 217 genes encoding Archaeal ATPases result from the transfer of a single gene (fig. S9), their inclusion would strongly bias both analyses (top and bottom), and consequently these genes were excluded from the statistical comparisons.
[Top plot: frequency (%) against % GC (22 to 54) for all genes, HGT candidates*, and Archaeal ATPases. Bottom plot: frequency (%) against R for dinucleotide usage (0.0 to 1.0) for non-HGT genes, HGT candidates*, and Archaeal ATPases.]
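The dinucleotide usage score defined in the Fig. S7 caption (Pearson's R between a gene's normalized dinucleotide frequencies and the mean profile of reference genes) can be sketched as below. One point is an assumption labeled in the code: the caption defers the normalization to the main text, so the widely used odds ratio f_xy / (f_x f_y) is used as a stand-in. The sequences and function names are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"
DINUCLEOTIDES = ["".join(pair) for pair in product(BASES, repeat=2)]


def normalized_dinucleotide_frequencies(seq: str):
    """Return f_xy / (f_x * f_y) for the 16 dinucleotides, in a fixed order.

    ASSUMPTION: this odds-ratio normalization is a common choice and may
    differ from the normalization the authors describe in their text.
    """
    seq = seq.upper()
    mono = {base: 0 for base in BASES}
    di = {pair: 0 for pair in DINUCLEOTIDES}
    for base in seq:
        if base in mono:
            mono[base] += 1
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in di:
            di[pair] += 1
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    values = []
    for pair in DINUCLEOTIDES:
        expected = (mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono)
        observed = di[pair] / n_di
        values.append(observed / expected if expected > 0 else 0.0)
    return values


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def mean_profile(seqs):
    """Mean normalized dinucleotide frequencies across reference genes."""
    profiles = [normalized_dinucleotide_frequencies(s) for s in seqs]
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def dinucleotide_usage_score(gene_seq: str, reference_profile) -> float:
    """Pearson R between one gene's profile and the reference profile."""
    return pearson_r(normalized_dinucleotide_frequencies(gene_seq), reference_profile)


# Hypothetical toy sequences standing in for non-HGT reference genes.
reference_genes = ["ATGGCGCGCATTATAGCGCCGGTTAACGATCGA",
                   "ATGAAATTTGGGCCCATATGCGCGTTAA"]
reference = mean_profile(reference_genes)
score_self = dinucleotide_usage_score(
    reference_genes[0], normalized_dinucleotide_frequencies(reference_genes[0]))
score_other = dinucleotide_usage_score("ATGCCCCGGGGAAAATTTTCGCGATAT", reference)
```

A gene scored against its own profile gives R = 1; genes whose dinucleotide composition departs from the reference profile give lower values, which is the shift the bottom histogram displays for HGT candidates.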
Fig. S8.
Number of horizontal gene transfers from different phyla into the genome of G. sulphuraria. For each horizontal gene transfer the phylum of the ‘donor’ organism from which the gene might have originated was determined by phylogenetic analysis or best BLAST hits (see table S4 for details).
Fig. S9.
Phylogenetic tree of the MNS clade of the STAND class of P-loop ATPases. The unrooted Bayesian (52) tree calculated with a CpREV+I+G model of protein evolution shows posterior probabilities above the branches and PhyML (75) bootstrap support values (using LG+I+G+F) below the branches. Thick branches indicate 1.0 posterior probability. Constraining all eukaryotic sequences into one monophyletic branch outside the Archaeal ATPases resulted in a tree with pRELL = 0.0. For clarity some sub-branches with sequences from the same clade have been collapsed (elongated triangles) with the height of the triangles reflecting the number of taxa included (3 to 22). The tree was constructed from 136 sequences: 41 from G. sulphuraria, all 56 seed sequences of the ‘Arch_ATPase’ family (PF01637) (76), plus sequences mentioned in Leipe et al. (20) plus BLAST (35) hits of these three groups for sequences outside the Archaeal ATPases. Sequences with >90 % identity were removed, and thus only one sequence from protein family #19 is included. Square brackets indicate proteins of monophyletic origin. Six families as established by Leipe et al. (20) are indicated; MJ-type, SSO-type, and PH-type families of the Archaeal ATPases, plus BL0662 family, SpsJ family, and Npun2340/2341 family outside the Archaeal ATPases. Color coding: Archaea, Cyanobacteria, other Bacteria, Rhodophyta, Amoebozoa. Bar represents 0.2 changes per site.
[Tree image. Clade labels include Npun2340/2341 (Cyanobacteria), SpsJ, BL0662 (Actinobacteria), PH0846, Arch_1, SSO1545, MJ0074, Euryarchaeota, Crenarchaeota, Archaeal ATPases, and Galdieria sulphuraria protein families #1a, #1b, #2, and #19.]
Fig. S10.
Manual alignment of Archaeal ATPase domains of proteins from different families of the MNS clade of the STAND class of P-loop ATPases (20). Top to bottom: ten seqs. from the MNS clade not belonging to the Archaeal ATPases (six of the Npun2340/2341 family and four of the SpsJ family), 15 seqs. from the Archaeal ATPases (three each of the five families; fig. S9), and ten seqs. from G. sulphuraria (four each of protein family #1 and #2, two of family #19). An asterisk indicates a column of strict conservation. A poorly conserved region between strand 2 and 3 indicated by -X- was omitted, as was the region between strand 3 and 4 separating the two blocks of alignment shown. Secondary structure elements indicated below the alignment. Amino acids are color coded according to their biochemical properties: hydrophobic (A, F, I, L, M, V in yellow), neutral (N, Q, S, T, W in green), negative (D, E in red), positive (K, R in blue), cysteine (C in olive), glycine (G in fuchsia), histidine (H in teal), tyrosine (Y in lime), proline (P in blue).
[Tree image for Fig. S11: leaves are G. sulphuraria genes (Gasu_ identifiers) grouped into Family #1a, Family #1b, Family #2, and Family #19.]
Fig. S11.
Phylogenetic tree of G. sulphuraria protein families #1, #2, and #19. The unrooted Bayesian (52) tree calculated with a JTT+G model of protein evolution shows posterior probabilities at the branches. PhyML (75) percent bootstrap support values (using JTT+G+F) below the branches, for clarity, are only given for major branches. Thick branches indicate ≥0.99 posterior probability. The tree was constructed from all 101 complete sequences (including those >90% identical) of protein families #1, #2, and #19. Bar represents 0.2 changes per site.
Fig. S12.
Number of Archaeal ATPase genes per 1000 coding genes in genomes of thermophilic and hyperthermophilic organisms as a function of optimum growth temperature. Genomes from 57 thermophilic and 29 hyperthermophilic archaea and bacteria were downloaded from NCBI via the “microbial genomes properties” portal (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). Optimum growth temperatures were obtained from NCBI. If temperatures were given as a range (58-60°C for example), the highest number was chosen (60°C in this case). In total, we obtained valid optimum growth temperatures for 38 thermophilic (including 8 archaea, in red, and 30 bacteria, in cyan) and 26 hyperthermophilic (all archaea) organisms. The HMM model of Archaeal ATPases (accession ID: PF01637.10) was downloaded from the Pfam database (76). HMMER version 3 beta (http://hmmer.janelia.org/) was used to score each protein sequence in each thermophilic and hyperthermophilic species against this HMM model. A protein was considered an Archaeal ATPase if the e-value of the HMM search was less than $10^{-5}$. A trend line (solid) plus correlation coefficient are given for both data sets combined. A trend line for data points from archaea only gave a similar result (dashed line; R = 0.56).
[Plot: Archaeal ATPases per 1000 genes (0 to 15) against optimum growth temperature (°C, 20 to 120); trend line for both data sets combined, R = 0.72.]
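The two bookkeeping rules in the Fig. S12 caption (take the highest value of a temperature range; count a protein as an Archaeal ATPase when its HMM e-value is below the threshold, read here as 10^-5) can be sketched as follows. The code assumes the e-values have already been extracted from the HMM search output into a list; parsing that output is not shown, and the function names and numbers are hypothetical.

```python
def optimum_temperature(text: str) -> float:
    """Parse '60', '60°C', or a range such as '58-60°C'; for a range keep the highest value."""
    cleaned = text.replace("°C", "").strip()
    parts = [float(p) for p in cleaned.split("-") if p.strip()]
    return max(parts)


def atpases_per_1000(evalues, n_coding_genes: int, threshold: float = 1e-5) -> float:
    """Proteins with e-value strictly below the threshold, per 1000 coding genes.

    `evalues` holds one best e-value per protein (hypothetical, pre-parsed input).
    """
    hits = sum(1 for e in evalues if e < threshold)
    return 1000.0 * hits / n_coding_genes


# Toy genome with invented values: four scored proteins out of 2000 coding genes.
toy_evalues = [1e-30, 1e-6, 1e-5, 0.2]
density = atpases_per_1000(toy_evalues, n_coding_genes=2000)
temperature = optimum_temperature("58-60°C")
```

Each genome then contributes one (temperature, density) point to the scatter plot, and the correlation coefficient is computed over those points.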
Fig. S13.
Phylogenetic tree of the monovalent cation/proton antiporter-1 (CPA1, TC 2.A.36) family, indicating a high phylogenetic diversity in G. sulphuraria. The unrooted Bayesian (52) tree calculated with a Blosum62+G model of protein evolution shows posterior probabilities above branches and the PhyML (75) percent bootstrap support (using WAG+I+G+F) below the branches. Thick branches indicate ≥0.99 posterior probability. For clarity some sub-branches have been collapsed (elongated triangles), with the height of the triangles reflecting the number of taxa included (3 to 12). Curly brackets indicate different cellular locations according to TCDB annotations (38, 77). The dashed line separates the upper branch, mainly containing bacteria, from the lower two branches, mainly containing eukaryotes. Color coding: Bacteria, Streptophyta, Chlorophyta, Rhodophyta, Stramenopiles, Alveolata, Excavata, Amoebozoa, Fungi, Animals & Choanoflagellates. Bar represents 0.5 changes per site.
[Tree image. G. sulphuraria sequences shown: Gasu_24490, Gasu_18700, Gasu_50550, Gasu_00640, Gasu_55230. C. merolae sequences shown: CMR382C, CMN286C, CMS152C, CMS154C. Bracket labels: Intracellular; Plasma Membrane; Land Plants {Endosomal}; Land Plants {Vacuolar}; Fungi {Vacuolar}; Animals {Golgi / Intracellular}; Land Plants {SOS1}.]
Fig. S14.
Phylogenetic tree of the S-adenosylmethionine-dependent methyltransferase (SAM) superfamily, indicating horizontal gene transfer of a sarcosine dimethylglycine methyltransferase gene from halophilic cyanobacteria to G. sulphuraria. The unrooted Bayesian (52) tree calculated with a WAG+I+G model of protein evolution shows posterior probabilities above the branches and PhyML (75) percent bootstrap support (using WAG+I+G) below the branches. Thick branches indicate ≥0.99 posterior probability. Square brackets group proteins according to their origin from the two domains of life (Bacteria, Eukaryota). Curly brackets indicate different subfamilies of the SAM superfamily according to NCBI CDD (27, 78). Color coding: Cyanobacteria, other Bacteria, Streptophyta, Chlorophyta, Rhodophyta, Stramenopiles, Alveolata, Animals. Bar represents 0.2 changes per site.
[Tree image. G. sulphuraria sequences shown: Gasu_07580, Gasu_07590, Gasu_06500. Domain labels: Bacteria, Eukaryota. Subfamily labels: Sarcosine-Dimethylglycine Methyltransferase; Phosphoethanolamine N-Methyltransferase; Cyclopropane Fatty Acid Synthase; Tocopherol O-Methyltransferase.]
Fig. S15.
1H-NMR spectra showing choline-O-sulfate, proline betaine, and glycine betaine accumulation in G. sulphuraria. The spectra shown were obtained from crude extracts of heterotrophic cells grown under control conditions (black line) or treated with 500 mM NaCl for 84 h (red line). Extraction and measurement were performed as described earlier (79). Abbreviations used: GB, glycine betaine; PB, proline betaine; COS, choline-O-sulfate; t-but, tert-butanol. Values in ppm assigned to NMR signals are chemical shifts of the 9 protons of the -N+(CH3)3 group of glycine betaine or choline-O-sulfate, of the 3 protons of each methyl group of the -N+(CH3)2 group, or of the methylene protons of these substances when detectable. tert-Butanol was selected as an internal standard for quantification of betaines. The 9 protons of its -C(CH3)3 group give a single signal at 1.200 ppm. The signals at 3.181, 3.217, and 3.251 ppm were taken as specific targets for quantification of choline-O-sulfate, glycine betaine, and proline betaine, respectively. The spectra are presented at the same scale (standardized with tert-butanol).
Fig. S16.
Phylogenetic tree of the anion permease ArsB/NhaD superfamily, indicating horizontal gene transfer of an ArsB gene from bacteria to G. sulphuraria. The unrooted Bayesian (52) tree calculated with a CpREV+I+G model of protein evolution shows posterior probabilities above the branches and PhyML (75) percent bootstrap support (using LG+I+G) below the branches. Thick branches indicate 1.0 posterior probability. Square brackets group proteins according to their origin from Eukaryota, Archaea, or different phyla of Bacteria. Curly brackets indicate different subfamilies of the anion permease ArsB/NhaD superfamily (27). The label ‘Thermophilic/Acidophilic’ marks G. sulphuraria and four bacteria being thermophilic and/or acidophilic, living in similar habitats as G. sulphuraria does. Color coding of major phylogenetic groups: Bacteria, Archaea, Streptophyta, Chlorophyta, Rhodophyta, Stramenopiles, Excavata, Fungi, Animals & Choanoflagellates. Bar represents 0.5 changes per site.
[Tree image. G. sulphuraria sequences shown: Gasu_56050, Gasu_31570. Group labels: Eukaryota, Archaea, Proteobacteria, Firmicutes, Thermophilic/Acidophilic. Subfamily labels include: ArsB, arsenical pump membrane protein; Anion permease ArsB/NhaD; Silicon efflux transporter (anion permease YbiR).]
Fig. S17.
Phylogenetic tree of mercuric reductase. The unrooted Bayesian (52) tree calculated with a WAG+G model of protein evolution shows posterior probabilities above the branches and PhyML (75) percent bootstrap support (using WAG+G+F) below the branches. Thick branches indicate 1.0 posterior probability. For bacterial sequences the class for each species is given in abbreviated form for Gammaproteobacteria and Betaproteobacteria. Color coding of major phylogenetic groups: Bacteria, Rhodophyta. Bar represents 0.1 changes per site.
[Tree image. Rhodophyta sequences shown: Gasu_60470 (Galdieria sulphuraria) and CMJ014C (Cyanidioschyzon merolae). All other leaves are bacterial sequences labeled Gammaprot’, Betaprot’, or Bacilli.]
Fig. S18.
Percent of proteome annotated as membrane transporter for 49 different eukaryotes. The red line indicates the 5% level. Total numbers of transporter proteins were taken from TransportDB (36). The TransAAP tool from TransportDB annotated 368 membrane transport proteins for the G. sulphuraria genome. After manual curation this number was corrected to 247 membrane transport proteins; for consistency, only those were considered here.
[Bar chart: % of proteins (0 to 7) for each species, in plot order: Physcomitrella patens, Arabidopsis thaliana, Oryza sativa, Ostreococcus sp., Ostreococcus tauri, Micromonas pusilla CCMP1545, Micromonas pusilla RCC299, Chlamydomonas reinhardtii, Cyanidioschyzon merolae, Galdieria sulphuraria, Phytophthora infestans, Phytophthora ramorum, Phytophthora sojae, Ectocarpus siliculosus, Aureococcus anophagefferens, Phaeodactylum tricornutum, Thalassiosira pseudonana, Babesia bovis, Babesia equi, Cryptosporidium parvum, Plasmodium falciparum 3D7, Plasmodium vivax, Theileria parva, Toxoplasma gondii B7, Paramecium tetraurelia, Tetrahymena thermophila, Trichomonas vaginalis, Leishmania major, Trypanosoma brucei, Trypanosoma cruzi, Dictyostelium discoideum, Entamoeba dispar, Entamoeba histolytica, Entamoeba invadens, Aspergillus fumigatus, Aspergillus nidulans, Aspergillus oryzae, Neurospora crassa, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Cryptococcus neoformans, Encephalitozoon cuniculi, Brugia malayi, Caenorhabditis elegans, Aedes aegypti, Anopheles gambiae, Drosophila melanogaster, Mus musculus, Homo sapiens.]
[Tree image for Fig. S19. G. sulphuraria sequences shown include Gasu_04230, Gasu_53180, Gasu_59430, Gasu_11560, Gasu_37640, Gasu_49750, and Gasu_57830. Founding-member labels: SPT1, VGT1, SPT2, Frt1. Subfamily labels: TC 2.A.1.1.69, TC 2.A.1.1.45, TC 2.A.1.1.33, TC 2.A.1.1.70.]
Fig. S19.
Phylogenetic tree of the sugar porter (SP) family of the major facilitator superfamily (MFS). The unrooted Bayesian (52) tree calculated with a CpREV+I+G model of protein evolution shows posterior probabilities above the branches and PhyML (75) bootstrap support values (using LG+I+G+F) below the branches. Thick branches indicate 1.0 posterior probability. For clarity most sub-branches have been collapsed (elongated triangles) with the height of the triangles reflecting the number of taxa included (3 to 29). Curly brackets indicate different subfamilies of the sugar porter family according to the Transporter Classification Database (38). The founding member of each subfamily is indicated in white on the collapsed branch where it is included. Color coding of major phylogenetic groups: Archaea, Bacteria, Streptophyta, Chlorophyta, Rhodophyta, Stramenopiles, Alveolata, Excavata, Amoebozoa, Fungi, Choanoflagellates & Animals. Bar represents 0.2 changes per site.
Fig. S20.
Phylogenetic tree of the amino acid / auxin permease (AAAP) family. The unrooted Bayesian (52) tree calculated with a CpREV+I+G model of protein evolution shows posterior probabilities above the branches and PhyML (75) percent bootstrap support (using LG+G+F) below the branches. Thick branches indicate 1.0 posterior probability. For clarity most sub-branches have been collapsed (elongated triangles) with the height of the triangles reflecting the number of taxa included (3 to 46). Curly bracket indicates a subfamily of fungal amino acid permeases (TC 2.A.18.4), which includes all G. sulphuraria sequences, according to best BLAST hits in the Transporter Classification Database (38). Color coding of major phylogenetic groups: Streptophyta, Chlorophyta, Rhodophyta, Stramenopiles, Excavata, Amoebozoa, Fungi, Choanoflagellates & Animals. Bar represents 0.2 changes per site.
[Tree image. G. sulphuraria sequences shown: Gasu_01170, Gasu_19710, Gasu_45890, Gasu_62620. Bracket label: TC 2.A.18.4 (fungal permeases).]
Fig. S21.
Phylogenetic tree of acetate permeases of the YaaH family. The unrooted Bayesian (52) tree calculated with a Blosum62+G model of protein evolution shows posterior probabilities above the branches and PhyML (75) percent bootstrap support (using LG+G+F) below the branches. Thick branches indicate 1.0 posterior probability. Three subfamilies of acetate permeases according to the Transporter Classification Database (38) are indicated, with the founding member of each subfamily indicated in white on the collapsed branch where it is included. Color coding of major phylogenetic groups: Archaea, Bacteria, Streptophyta, Chlorophyta, Rhodophyta, Stramenopiles, Alveolata, Excavata, Fungi. Bar represents 0.2 changes per site.
[Tree image. G. sulphuraria sequences shown: Gasu_14180, Gasu_35520, Gasu_18410, Gasu_00440, Gasu_47190, Gasu_62370, Gasu_57190, Gasu_57950, Gasu_07700, Gasu_07690. Subfamily labels: TC 2.A.96.1.1, TC 2.A.96.1.3, TC 2.A.96.1.4. Founding-member labels: AcpA, YaaH, Ady2. Other labels: Eukaryota, Bacterial origin.]
Fig. S22.
Phylogenetic tree of amino acid-polyamine-organocation (APC) superfamily (TC 2.A.3). The unrooted Bayesian (52) tree calculated with a Blosum62+G model of protein evolution shows posterior probabilities above the branches and PhyML (75) percent bootstrap support (using LG+G+F) below the branches. Thick branches indicate 1.0 posterior probability. Curly brackets indicate different families of the amino acid-polyamine-organocation superfamily, according to best BLAST hits in the Transporter Classification Database (38). Color coding of major phylogenetic groups: Archaea, Bacteria, Streptophyta, Chlorophyta, Rhodophyta, Stramenopiles, Alveolata, Excavata, Amoebozoa, Fungi, Animals. Bar represents 0.2 changes per site.
[Tree image. G. sulphuraria sequences shown: Gasu_23570, Gasu_35110, Gasu_21130, Gasu_65340, Gasu_34450, Gasu_54580, Gasu_28250. Family labels: The Polyamine:H+ Symporter (PHS) family (TC 2.A.3.12); Cationic Amino acid Transporter (CAT) family (TC 2.A.3.3).]
Fig. S23.
Phylogenetic tree of transporters from the major intrinsic protein (MIP; TC 1.A.8) family. The unrooted Bayesian (52) tree calculated with a Blosum62+I+G model of protein evolution shows posterior probabilities above the branches and PhyML (75) percent bootstrap support (using LG+I+G+F) below the branches. Thick branches indicate ≥0.99 posterior probability. The glycerol uptake facilitator (TC 1.A.8.1.1) subfamily of the MIP family according to best BLAST hits at the Transporter Classification Database (38) is indicated by a curly bracket. All other sequences shown belong to different MIP subfamilies. Color coding of major phylogenetic groups: Bacteria, Streptophyta, Chlorophyta, Rhodophyta, Stramenopiles, Alveolata, Excavata, Fungi, Animals & Choanoflagellates. Bar represents 0.2 changes per site.
[Phylogenetic tree image; taxon labels and branch support values are not reproduced here. Bracketed subfamily: Glycerol uptake facilitator (TC 1.A.8.1.1). G. sulphuraria sequences shown: Gasu_62400, Gasu_00080, Gasu_64190, Gasu_32620, Gasu_28080. Scale bar: 0.2.]
Fig. S24.
Phylogenetic tree of glycerol dehydrogenases (EC 1.1.1.6). The unrooted Bayesian (52) tree calculated with a WAG+I+G model of protein evolution shows posterior probabilities above the branches and PhyML (75) percent bootstrap support (using LG+I+G) below the branches. Thick branches indicate 1.0 posterior probability. The collapsed branch labeled ‘Bacteria (mostly Firmicutes)’ contains sequences from two Actinobacteria, two Synergistetes, and 15 Firmicutes. Color coding of major phylogenetic groups: Cyanobacteria, other Bacteria, Rhodophyta, Excavata, Amoebozoa, Fungi. Bar represents 0.2 changes per site.
[Phylogenetic tree image; taxon labels and branch support values are not reproduced here. Collapsed branch: 'Bacteria (mostly Firmicutes)'. G. sulphuraria sequences shown: Gasu_57960, Gasu_03180, Gasu_57980. Scale bar: 0.2.]
Fig. S25.
Phylogenetic tree of the Pi:H+ symporter (PHS) family and related families of the MFS superfamily. The unrooted Bayesian (52) tree calculated with a Blosum62+I+G model of protein evolution shows posterior probabilities above the branches and PhyML (75) percent bootstrap support (using LG+I+G+F) below the branches. Curly brackets indicate different families of the MFS (major facilitator superfamily) according to the Transporter Classification Database (38). Color coding of major phylogenetic groups: Archaea, Bacteria, Streptophyta, Chlorophyta, Rhodophyta, Stramenopiles, Alveolata, Amoebozoa, Fungi, Animals. Bar represents 0.5 changes per site.
[Phylogenetic tree image; taxon labels and branch support values are not reproduced here. Bracketed groups: Phosphate:H+ Symporter (TC 2.A.1.9) and Other MFS Transporter. G. sulphuraria sequences shown: Gasu_10420, Gasu_46230, Gasu_07040, Gasu_24210, Gasu_13330, Gasu_28690, Gasu_23540, Gasu_42700, and one sequence whose identifier is not recoverable. Scale bar: 0.5.]
Fig. S26.
Phylogenetic tree of β-galactosidases (EC 3.2.1.23; glycosyl hydrolases family 35), indicating two different subfamilies in G. sulphuraria. The unrooted Bayesian (52) tree calculated with a WAG+I+G model of protein evolution shows posterior probabilities above the branches and PhyML (75) percent bootstrap support (using LG+I+G) below the branches. Thick branches indicate 1.0 posterior probability. Numbers in brackets give the number of introns for the different G. sulphuraria genes; ‘SP’ indicates a secretory signal peptide according to SignalP 4.0 (39). Color coding of major phylogenetic groups: Bacteria, Streptophyta, Rhodophyta, Stramenopiles, Alveolata, Amoebozoa, Fungi, Animals & Choanoflagellates. Bar represents 0.2 changes per site.
[Phylogenetic tree image; taxon labels and branch support values are not reproduced here. G. sulphuraria sequences shown, with number of introns in brackets: Gasu_40850.1 (10), Gasu_03030.1 (6), Gasu_61020.1 (5), Gasu_31140.1 (1), Gasu_09330.1 (0), Gasu_27490.1 (1) (SP), Gasu_27500.1 (0) (SP). Scale bar: 0.2.]
Fig. S27.
Phylogenetic tree of acid phosphatases (EC 3.1.3.2). The unrooted Bayesian (52) tree calculated with a WAG+I+G model of protein evolution shows posterior probabilities above the branches and PhyML (75) percent bootstrap support (using LG+I+G) below the branches. Thick branches indicate 1.0 posterior probability. ‘SP’ indicates a secretory signal peptide according to SignalP 4.0 (39). The curly bracket marks a group of organisms that are all acidophilic. Color coding of major phylogenetic groups: Cyanobacteria, other Bacteria, Streptophyta, Rhodophyta, Excavata, Fungi. Bar represents 0.2 changes per site.
[Phylogenetic tree image; taxon labels and branch support values are not reproduced here. Bracket label: acidophilic. G. sulphuraria sequences shown: Gasu_56160 (SP), Gasu_24940 (SP), Gasu_54390, Gasu_20960, Gasu_06690. Scale bar: 0.2.]
Table S1.
Comparison of the G. sulphuraria and C. merolae genomes. For intron and exon size distributions in G. sulphuraria, see fig. S5; for a comparison of intergenic distances, see fig. S3.
                                                G. sulphuraria   C. merolae
Genome size (Mb)                                13.7             16.5
GC (%)                                          37.7             55.0
GC (%) coding sequence (excluding introns)      38.6             56.7
CpG occurrence (Obs./Exp.)                      0.715            1.151
CpG islands                                     27               2
Repeat content                                  713              313
Number of rRNA units (18S/5.8S/28S + 5S)        4 + 10           3 + 3
Number of tRNAs                                 85               30
Predicted proteins                              6623             5771
Gene density (kb per gene)                      2.07             2.86
Average gene length (bp)                        1601             1553
Average transcript length (bp)                  1388             1552
Average number of amino acids per polypeptide   421              518
Average number of exons per gene                3.16             1.005
Average exon length (bp)                        417.3            1527.6
Introns                                         13630            26
Genes with introns (%)                          72.4             0.5
Average intron length (bp)                      56.5             248
Median intergenic distance (bp)                 20.0             1404.5
Coding sequence (%)                             77.5             44.9
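As a consistency check on Table S1, the gene density row can be recomputed from the genome size and the number of predicted proteins. The following is only a sketch: it assumes one gene per predicted protein and 1 Mb = 1000 kb, and the function name gene_density_kb is illustrative rather than taken from the original analysis.

```python
def gene_density_kb(genome_size_mb: float, n_genes: int) -> float:
    """Return the average genome length per gene in kb.

    Assumption: each predicted protein corresponds to one gene.
    """
    return genome_size_mb * 1000.0 / n_genes


# Values from Table S1 (genome size in Mb, predicted proteins).
g_sulphuraria = gene_density_kb(13.7, 6623)
c_merolae = gene_density_kb(16.5, 5771)

print(f"G. sulphuraria: {g_sulphuraria:.2f} kb per gene")
print(f"C. merolae:     {c_merolae:.2f} kb per gene")
```

Under these assumptions, both results round to the values listed in the gene density row of the table.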
Table S2.
Large protein families in G. sulphuraria. Families of paralogous proteins were determined by BLASTing the G. sulphuraria genome against itself, followed by MCL clustering (see Materials and Methods). Protein families were ranked by size. When annotations differed for members of one family, the annotation for the majority is given. Representative protein domains are indicated by PF numbers (76) or IPR numbers (33); brackets indicate low significance or hits in only some family members.
Rank  Size  Annotation                                          Protein domains
1     131   Archaeal ATPase                                     PF01637
2     76    Archaeal ATPase                                     PF01637 & PF00536
3     48    Sugar Porter (SP) family of MFS superfamily         PF00083
4     29    serine/threonine protein kinase                     IPR002290
5     26    AAA ATPase                                          PF00004
6     25    DEAD-box RNA helicase                               PF00270 & PF00271
7     22    Pi:H+ symporter (PHS) family of MFS superfamily     PF07690
8     15    Mitochondrial Carrier (MC) family                   PF00153
9     13    Amino Acid / Auxin Permease (AAAP) family           PF01490
10    13    aldehyde dehydrogenase (NAD+)                       PF00171
11    12    cyclin-dependent serine/threonine protein kinase    IPR002290
12    12    ubiquitin-conjugating enzyme E2                     PF00179
13    12    kinesin family member                               PF00225
14    11    ABC transporter, multidrug resistance               PF01061
15    11    ADP-ribosylation factor                             PF00025
16    11    putative acetate transporter                        PF01184
17    11    hypothetical protein                                (PF01113)
18    11    hypothetical protein                                (IPR008238)
19    10    Archaeal ATPase                                     PF01637
20    10    hypothetical protein                                (IPR015598)
Table S3.
Comparison of membrane transport proteins in G. sulphuraria and C. merolae. Annotated membrane transport proteins classified as ‘ATP-Dependent’, ‘Ion Channel’, or ‘Secondary Transporter’ are listed, and the number of transporters per family is given. The classification of transporters into families follows the Transporter Classification Database (38, 77). Species and numbers of transporters per family are as in fig. S18. Transporter numbers were normalized to total number of proteins for each genome before calculating z-scores (z-scores >2.0 in bold).
Transporter class / family                                                         Number       z-score
ATP-Dependent                                                                      60   102    0.58    2.12
  The ATP-binding Cassette (ABC) Superfamily                                       28    49    0.24    1.42
  The Arsenite-Antimonite (ArsA) Efflux Family ATP-hydrolyzing Subunit              3     3    3.44    2.85
  The H+- or Na+-translocating F-type, V-type and A-type ATPase (F-ATPase)          3     3    1.72    1.33
  The H+-translocating Pyrophosphatase (H+-PPase) Family                            1     3    0.27    1.97
  The Type II (General) Secretory Pathway (IISP) Family                             4    11   -0.02    1.82
  The Mitochondrial Protein Translocase (MPT) Family                               11    18    1.01    1.89
  The P-type ATPase (P-ATPase) Superfamily                                         10    15   -0.24    0.52
Ion Channel                                                                        13    19   -0.61   -0.46
  The Ammonia Transporter Channel (Amt) Family                                      2     2    0.38    0.19
  The Anion Channel-forming Bestrophin (Bestrophin) Family                          4     0    3.17   -0.36
  The Intracellular Chloride Channel (CLIC) Family                                  0     4   -0.42    5.11
  The Major Intrinsic Protein (MIP) Family                                          1     5   -0.46    0.84
  The CorA Metal Ion Transporter (MIT) Family                                       0     3   -1.12    0.82
  The Small Conductance Mechanosensitive Ion Channel (MscS) Family                  2     2    0.21    0.10
  The Chloroplast Envelope Anion Channel-forming Tic110 (Tic110) Family             1     1    3.03    2.57
  The Voltage-gated Ion Channel (VIC) Superfamily                                   3     1   -0.49   -0.64
Secondary Transporter                                                             108   225   -0.24    1.08
  The ATP:ADP Antiporter (AAA) Family                                               1     1    0.32    0.24
  The Amino Acid / Auxin Permease (AAAP) Family                                     1    17   -0.76    0.89
  The Anion Exchanger (AE) Family                                                   3     2    1.98    0.81
  The Auxin Efflux Carrier (AEC) Family                                             0     1   -0.75    0.02
  The Amino Acid-Polyamine-Organocation (APC) Family                                2     8   -0.39    0.24
  The Arsenite-Antimonite (ArsB) Efflux Family Transmembrane Subunit                0     2   -0.56    3.55
  The Bile Acid:Na+ Symporter (BASS) Family                                         3     3    1.56    1.27
  The Ca2+:Cation Antiporter (CaCA) Family                                          1     1   -0.70   -0.77
  The Cation Diffusion Facilitator (CDF) Family                                     3     1    0.50   -1.23
  The Chloride Carrier/Channel (ClC) Family                                         2     5    0.34    2.39
  The Monovalent Cation:Proton Antiporter-1 (CPA1) Family                           4     5    1.40    1.68
  The Monovalent Cation:Proton Antiporter-2 (CPA2) Family                           2     2    0.62    0.44
  The Divalent Anion:Na+ Symporter (DASS) Family                                    2     1    0.75   -0.21
  The Drug/Metabolite Transporter (DMT) Superfamily                                18    20    0.79    0.72
  The Folate-Biopterin Transporter (FBT) Family                                     1     0   -0.30   -0.72
  The Glycoside-Pentoside-Hexuronide (GPH):Cation Symporter Family                  1     5   -0.03    2.71
  The Hydroxy/Aromatic Amino Acid Permease (HAAAP) Family                           0     1   -0.51    0.97
  The K+ Uptake Permease (KUP) Family                                               1     2    0.62    1.34
  The Lysosomal Cystine Transporter (LCT) Family                                    2     2    1.86    1.48
  The Mitochondrial Carrier (MC) Family                                            31    34    1.68    1.52
  The Major Facilitator Superfamily (MFS)                                          13    83   -0.62    0.76
  The Multidrug/Oligosaccharidyl-lipid/Polysaccharide (MOP) Flippase Superfamily    2     3   -0.25   -0.06
  The Nucleobase:Cation Symporter-1 (NCS1) Family                                   0     2   -0.46    0.46
  The Nucleobase:Cation Symporter-2 (NCS2) Family                                   0     2   -0.67    1.47
  The NhaD Na+:H+ Antiporter (NhaD) Family                                          1     1    3.72    3.18
  The Metal Ion (Mn2+-iron) Transporter (Nramp) Family                              3     4    2.70    3.28
  The Cytochrome Oxidase Biogenesis (Oxa1) Family                                   2     5    1.12    3.63
  The Inorganic Phosphate Transporter (PiT) Family                                  2     1    1.18   -0.01
  The Proton-dependent Oligopeptide Transporter (POT) Family                        0     1   -0.52   -0.22
  The Resistance-Nodulation-Cell Division (RND) Superfamily                         0     1   -0.93   -0.54
  The Solute:Sodium Symporter (SSS) Family                                          1     1   -0.08   -0.17
  The Sulfate Permease (SulP) Family                                                2     2   -0.11   -0.26
  The Twin Arginine Targeting (Tat) Family                                          0     2   -0.40    2.93
  The Tellurite-resistance/Dicarboxylate Transporter (TDT) Family                   1     1    0.67    0.52
  The K+ Transporter (Trk) Family                                                   1     0    0.80   -0.63
  The Zinc (Zn2+)-Iron (Fe2+) Permease (ZIP) Family                                 2     3   -0.66   -0.28
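The normalization and z-score calculation described in the caption of Table S3 can be sketched as follows. This is a minimal illustration, not the authors' script: the function name family_z_scores, the species labels, and all counts and proteome sizes below are hypothetical, and the use of the sample standard deviation is an assumption, since the caption does not state which estimator was used.

```python
from statistics import mean, stdev


def family_z_scores(counts, proteome_sizes):
    """Return one z-score per species for a single transporter family.

    counts:         species -> number of transporters in the family
    proteome_sizes: species -> total number of proteins in that genome
    """
    # Normalize each count to the total number of proteins of its genome.
    fractions = {sp: counts[sp] / proteome_sizes[sp] for sp in counts}
    mu = mean(fractions.values())
    # Assumption: sample standard deviation across the compared species.
    sigma = stdev(fractions.values())
    return {sp: (frac - mu) / sigma for sp, frac in fractions.items()}


# Hypothetical toy numbers, for illustration only (not data from Table S3).
counts = {"species_a": 10, "species_b": 40, "species_c": 90}
proteome_sizes = {"species_a": 5000, "species_b": 10000, "species_c": 15000}

z = family_z_scores(counts, proteome_sizes)
for species, score in z.items():
    print(f"{species}: z = {score:+.2f}")
```

Normalizing first matters because a larger proteome would otherwise inflate the raw count of a family without that family being over-represented.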
Additional data table S4.
Genes in G. sulphuraria probably originating from horizontal gene transfer (HGT) and possible ‘donors’ from which they descended. From left to right: gene identifiers; protein family (ranked by size); number of genes/paralogs resulting from the gene transfer (a number in brackets indicates that not all genes in this family result from HGT and gives the total number of family members); annotation of the encoded protein; for those cases where a detailed phylogenetic analysis was performed, the figure displaying the phylogenetic tree; organism that is the closest sequenced relative (or descendant) of the ‘donor’ organism from which the gene might have originated, according to best BLAST hits or phylogenetic tree; typical habitat of this ‘donor’ organism, with ‘M’ standing for mesophilic (optimum growth temperature [OGT] between 15°C and 45°C), ‘T’ for thermophilic (OGT 45–80°C), ‘HT’ for hyperthermophilic (OGT > 80°C), and ‘TA’ for thermoacidophilic; systematic position of this ‘donor’ organism with domain, phylum, and order.
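The temperature-based habitat codes defined in this caption can be expressed as a small lookup. The sketch below is illustrative only: the function name habitat_class is hypothetical, the caption leaves the boundary at exactly 45°C ambiguous (here it is assigned to ‘T’ as an assumption), and ‘TA’ is not covered because it also depends on pH, for which the caption gives no threshold.

```python
from typing import Optional


def habitat_class(ogt_celsius: float) -> Optional[str]:
    """Map an optimum growth temperature (OGT, in °C) to a habitat code.

    Returns 'HT', 'T', or 'M' following the caption's ranges, or None
    for an OGT below 15 °C, which the caption does not classify.
    """
    if ogt_celsius > 80:
        return "HT"  # hyperthermophilic: OGT > 80 °C
    if ogt_celsius >= 45:
        return "T"   # thermophilic: 45-80 °C (45 assigned here by assumption)
    if ogt_celsius >= 15:
        return "M"   # mesophilic: 15-45 °C
    return None


for ogt in (37, 60, 95):
    print(ogt, habitat_class(ogt))
```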
References and Notes
1. W. N. Doemel, T. D. Brock, The physiological ecology of Cyanidium caldarium. J. Gen. Microbiol. 67, 17 (1971). doi:10.1099/00221287-67-1-17
2. L. J. Rothschild, R. L. Mancinelli, Life in extreme environments. Nature 409, 1092 (2001). doi:10.1038/35059215 Medline
3. W. Gross, C. Schnarrenberger, Heterotrophic growth of two strains of the acido-thermophilic red alga Galdieria sulphuraria. Plant Cell Physiol. 36, 633 (1995).
4. V. Reeb, D. Bhattacharya, The thermo-acidophilic Cyanidiophyceae (Cyanidiales) in Red Algae in the Genomic Age, J. Seckbach, D. J. Chapman, Eds. (Springer, Netherlands, 2010), pp. 409–426.
5. G. Pinto, C. Ciniglia, C. Cascone, A. Pollio, Species composition of Cyanidiales assemblages in Pisciarelli (Campi Flegrei, Italy) and description of Galdieria phlegrea sp. nov. in Algae and Cyanobacteria in Extreme Environments, J. Seckbach, Ed. (Springer, Netherlands, 2007), vol. 11, pp. 487–502.
6. Materials and methods are available as supplementary materials on Science Online.
7. M. Matsuzaki et al., Genome sequence of the ultrasmall unicellular red alga Cyanidioschyzon merolae 10D. Nature 428, 653 (2004). doi:10.1038/nature02398 Medline
8. H. Innan, F. Kondrashov, The evolution of gene duplications: Classifying and distinguishing between models. Nat. Rev. Genet. 11, 97 (2010). doi:10.1038/nrg2689 Medline
9. T. J. Treangen, E. P. C. Rocha, Horizontal transfer, not duplication, drives the expansion of protein families in prokaryotes. PLoS Genet. 7, e1001284 (2011). doi:10.1371/journal.pgen.1001284 Medline
10. C. Bowler et al., The Phaeodactylum genome reveals the evolutionary history of diatom genomes. Nature 456, 239 (2008). doi:10.1038/nature07410 Medline
11. P. J. Keeling, Functional and ecological impacts of horizontal gene transfer in eukaryotes. Curr. Opin. Genet. Dev. 19, 613 (2009). doi:10.1016/j.gde.2009.10.001 Medline
12. R. Overbeek et al., The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 33, 5691 (2005). doi:10.1093/nar/gki866 Medline
13. E. V. Koonin, Evidence for a family of archaeal ATPases. Science 275, 1489 (1997). doi:10.1126/science.275.5305.1489 Medline
14. J. G. McCoy et al., Discovery of sarcosine dimethylglycine methyltransferase from Galdieria sulphuraria. Proteins Struct. Funct. Bioinformatics 74, 368 (2009). doi:10.1002/prot.22147 Medline
15. I. Enami, H. Akutsu, Y. Kyogoku, Intracellular pH regulation in an acidophilic unicellular alga, Cyanidium caldarium: 31P-NMR determination of intracellular pH. Plant Cell Physiol. 27, 1351 (1986).
16. W. Gross, Ecophysiology of algae living in highly acidic environments. Hydrobiologia 433, 31 (2000). doi:10.1023/A:1004054317446
17. J. Ye, C. Rensing, B. P. Rosen, Y.-G. Zhu, Arsenic biomethylation by photosynthetic organisms. Trends Plant Sci. 17, 155 (2012). doi:10.1016/j.tplants.2011.12.003 Medline
18. C. Oesterhelt, S. Vogelbein, R. P. Shrestha, M. Stanke, A. P. M. Weber, The genome of the thermoacidophilic red microalga Galdieria sulphuraria encodes a small family of secreted class III peroxidases that might be involved in cell wall modification. Planta 227, 353 (2008). doi:10.1007/s00425-007-0622-z Medline
19. C. Pál, B. Papp, M. J. Lercher, Adaptive evolution of bacterial metabolic networks by horizontal gene transfer. Nat. Genet. 37, 1372 (2005). doi:10.1038/ng1686 Medline
20. D. D. Leipe, E. V. Koonin, L. Aravind, STAND, a class of P-loop NTPases including animal and plant regulators of programmed cell death: Multiple, complex domain architectures, unusual phyletic patterns, and evolution by horizontal gene transfer. J. Mol. Biol. 343, 1 (2004). doi:10.1016/j.jmb.2004.08.023 Medline
21. W. L. Chiu et al., Oenothera chloroplast DNA polymorphisms associated with plastome mutator activity. Mol. Gen. Genet. 221, 59 (1990). doi:10.1007/BF00280368
22. S. Batzoglou et al., ARACHNE: A whole-genome shotgun assembler. Genome Res. 12, 177 (2002). doi:10.1101/gr.208902 Medline
23. A. P. M. Weber, K. L. Weber, K. Carr, C. Wilkerson, J. B. Ohlrogge, Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing. Plant Physiol. 144, 32 (2007). doi:10.1104/pp.107.096677 Medline
24. M. Stanke, O. Schöffmann, B. Morgenstern, S. Waack, Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7, 62 (2006). doi:10.1186/1471-2105-7-62 Medline
25. A. P. M. Weber et al., EST-analysis of the thermo-acidophilic red microalga Galdieria sulphuraria reveals potential for lipid A biosynthesis and unveils the pathway of carbon export from rhodoplasts. Plant Mol. Biol. 55, 17 (2004). doi:10.1007/s11103-004-0376-y Medline
26. K. D. Pruitt, T. Tatusova, D. R. Maglott, NCBI reference sequences (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35 (Database issue), D61 (2007). doi:10.1093/nar/gkl842 Medline
27. A. Marchler-Bauer et al., CDD: A Conserved Domain Database for the functional annotation of proteins. Nucleic Acids Res. 39 (Database issue), D225 (2011). doi:10.1093/nar/gkq1189 Medline
28. The Arabidopsis Genome Initiative, Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796 (2000). doi:10.1038/35048692 Medline
29. M. Kanehisa, S. Goto, M. Furumichi, M. Tanabe, M. Hirakawa, KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 38 (Database issue), D355 (2010). doi:10.1093/nar/gkp896 Medline
30. UniProt Consortium, The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 38 (Database issue), D142 (2010). doi:10.1093/nar/gkp846 Medline
31. M. Ashburner et al., Gene ontology: Tool for the unification of biology. Nat. Genet. 25, 25 (2000). doi:10.1038/75556 Medline
32. http://mapman.gabipd.org/web/guest/app/mercator
33. S. Hunter et al., InterPro: The integrative protein signature database. Nucleic Acids Res. 37 (Database issue), D211 (2009). doi:10.1093/nar/gkn785 Medline
34. J. Söding, A. Biegert, A. N. Lupas, The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res. 33 (Web Server issue), W244 (2005). doi:10.1093/nar/gki408 Medline
35. S. F. Altschul et al., Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25, 3389 (1997). doi:10.1093/nar/25.17.3389 Medline
36. Q. Ren, K. Chen, I. T. Paulsen, TransportDB: A comprehensive database resource for cytoplasmic membrane transport systems and outer membrane channels. Nucleic Acids Res. 35 (Database issue), D274 (2007). doi:10.1093/nar/gkl925 Medline
37. H. Li, V. A. Benedito, M. K. Udvardi, P. X. Zhao, TransportTP: A two-phase classification approach for membrane transporter prediction and characterization. BMC Bioinformatics 10, 418 (2009). doi:10.1186/1471-2105-10-418 Medline
38. M. H. Saier, Jr., M. R. Yen, K. Noto, D. G. Tamang, C. Elkan, The Transporter Classification Database: Recent advances. Nucleic Acids Res. 37 (Database issue), D274 (2009). doi:10.1093/nar/gkn862 Medline
39. T. N. Petersen, S. Brunak, G. von Heijne, H. Nielsen, SignalP 4.0: Discriminating signal peptides from transmembrane regions. Nat. Methods 8, 785 (2011). doi:10.1038/nmeth.1701 Medline
40. B. R. Zeeberg et al., High-Throughput GoMiner, an ‘industrial-strength’ integrative gene ontology tool for interpretation of multiple-microarray experiments, with application to studies of Common Variable Immune Deficiency (CVID). BMC Bioinformatics 6, 168 (2005). doi:10.1186/1471-2105-6-168 Medline
41. M. Remm, C. E. V. Storm, E. L. L. Sonnhammer, Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314, 1041 (2001). doi:10.1006/jmbi.2000.5197 Medline
42. S. S. Merchant et al., The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science 318, 245 (2007). doi:10.1126/science.1143609 Medline
43. R. C. Edgar, MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113 (2004). doi:10.1186/1471-2105-5-113 Medline
44. G. Talavera, J. Castresana, Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst. Biol. 56, 564 (2007). doi:10.1080/10635150701472164 Medline
45. F. Abascal, R. Zardoya, D. Posada, ProtTest: Selection of best-fit models of protein evolution. Bioinformatics 21, 2104 (2005). doi:10.1093/bioinformatics/bti263 Medline
46. S. Guindon et al., New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Syst. Biol. 59, 307 (2010). doi:10.1093/sysbio/syq010 Medline
47. Z. Yang, PAML 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586 (2007). doi:10.1093/molbev/msm088 Medline
48. W.-H. Chen, M. J. Lercher, ColorTree: A batch customization tool for phylogenic trees. BMC Res. Notes 2, 155 (2009). doi:10.1186/1756-0500-2-155 Medline
49. A. Agresti, A survey of exact inference for contingency tables. Stat. Sci. 7, 131 (1992). doi:10.1214/ss/1177011454
50. K. Tamura, J. Dudley, M. Nei, S. Kumar, MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol. Biol. Evol. 24, 1596 (2007). doi:10.1093/molbev/msm092 Medline
51. J.-F. Taly et al., Using the T-Coffee package to build multiple sequence alignments of protein, RNA, DNA sequences and 3D structures. Nat. Protoc. 6, 1669 (2011). doi:10.1038/nprot.2011.393 Medline
52. G. Altekar, S. Dwarkadas, J. P. Huelsenbeck, F. Ronquist, Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics 20, 407 (2004). doi:10.1093/bioinformatics/btg427 Medline
53. D. H. Huson et al., Dendroscope: An interactive viewer for large phylogenetic trees. BMC Bioinformatics 8, 460 (2007). doi:10.1186/1471-2105-8-460 Medline
54. H. Zhang, S. Gao, M. J. Lercher, S. Hu, W.-H. Chen, EvolView, an online tool for visualizing, annotating and managing phylogenetic trees. Nucleic Acids Res. 40 (Web Server issue), W569 (2012). doi:10.1093/nar/gks576 Medline
55. J. E. Blair, Animals (Metazoa) in The Timetree of Life, S. B. Hedges, S. Kumar, Eds. (Oxford Univ. Press, New York, 2009), pp. 223–230.
56. C. Ciniglia, H. S. Yoon, A. Pollio, G. Pinto, D. Bhattacharya, Hidden biodiversity of the extremophilic Cyanidiales red algae. Mol. Ecol. 13, 1827 (2004). doi:10.1111/j.1365-294X.2004.02180.x Medline
57. H. S. Yoon, J. D. Hackett, C. Ciniglia, G. Pinto, D. Bhattacharya, A molecular timeline for the origin of photosynthetic eukaryotes. Mol. Biol. Evol. 21, 809 (2004). doi:10.1093/molbev/msh075 Medline
58. D. Bhattacharya, H. S. Yoon, S. B. Hedges, J. D. Hackett, Eukaryotes (Eukaryota) in The Timetree of Life, S. B. Hedges, S. Kumar, Eds. (Oxford Univ. Press, New York, 2009), pp. 116–120.
59. P. J. Keeling, J. D. Palmer, Horizontal gene transfer in eukaryotic evolution. Nat. Rev. Genet. 9, 605 (2008). doi:10.1038/nrg2386 Medline
60. J. Huang, J. P. Gogarten, Concerted gene recruitment in early plant evolution. Genome Biol. 9, R109 (2008). doi:10.1186/gb-2008-9-7-r109 Medline
61. J. N. Timmis, M. A. Ayliffe, C. Y. Huang, W. Martin, Endosymbiotic gene transfer: Organelle genomes forge eukaryotic chromosomes. Nat. Rev. Genet. 5, 123 (2004). doi:10.1038/nrg1271 Medline
62. A. Criscuolo, S. Gribaldo, Large-scale phylogenomic analyses indicate a deep origin of primary plastids within cyanobacteria. Mol. Biol. Evol. 28, 3019 (2011). doi:10.1093/molbev/msr108 Medline
63. S. G. E. Andersson et al., The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396, 133 (1998). doi:10.1038/24094 Medline
64. J. G. Lawrence, H. Ochman, Molecular archaeology of the Escherichia coli genome. Proc. Natl. Acad. Sci. U.S.A. 95, 9413 (1998). doi:10.1073/pnas.95.16.9413 Medline
65. J. Bohlin, E. Skjerve, D. W. Ussery, Reliability and applications of statistical methods based on oligonucleotide frequencies in bacterial and archaeal genomes. BMC Genomics 9, 104 (2008). doi:10.1186/1471-2164-9-104 Medline
66. C. S. Smillie et al., Ecology drives a global network of gene exchange connecting the human microbiome. Nature 480, 241 (2011). doi:10.1038/nature10571 Medline
67. R. W. Bailey, L. A. Staehelin, The chemical composition of isolated cell walls of Cyanidium caldarium. J. Gen. Microbiol. 54, 269 (1968). doi:10.1099/00221287-54-2-269 Medline
68. R. Asada, K. Tazaki, Silica biomineralization of unicellular microbes under strongly acidic conditions. Can. Mineral. 39, 1 (2001). doi:10.2113/gscanmin.39.1.1
69. R. P. Shrestha, A. P. Weber, Acidothermophilic red microalga Galdieria sulphuraria: From genome to an extracellular glucoamylase active at extreme low pH and high temperature. J. Phycol. 43 (suppl. 1), 23 (2007).
70. O. Emanuelsson, S. Brunak, G. von Heijne, H. Nielsen, Locating proteins in the cell using TargetP, SignalP and related tools. Nat. Protoc. 2, 953 (2007). doi:10.1038/nprot.2007.131 Medline
71. E. Derelle et al., Genome analysis of the smallest free-living eukaryote Ostreococcus tauri unveils many unique features. Proc. Natl. Acad. Sci. U.S.A. 103, 11647 (2006). doi:10.1073/pnas.0604795103 Medline
72. B. Palenik et al., The tiny eukaryote Ostreococcus provides genomic insights into the paradox of plankton speciation. Proc. Natl. Acad. Sci. U.S.A. 104, 7705 (2007). doi:10.1073/pnas.0611046104 Medline
73. E. V. Armbrust et al., The genome of the diatom Thalassiosira pseudonana: Ecology, evolution, and metabolism. Science 306, 79 (2004). doi:10.1126/science.1101156 Medline
74. http://genome.jgi-psf.org/Auran1/Auran1.home.html.
75. S. Guindon, O. Gascuel, A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52, 696 (2003). doi:10.1080/10635150390235520 Medline
76. R. D. Finn et al., The Pfam protein families database. Nucleic Acids Res. 38 (Database issue), D211 (2010). doi:10.1093/nar/gkp985 Medline
77. M. H. Saier, Jr., C. V. Tran, R. D. Barabote, TCDB: The Transporter Classification Database for membrane transport protein analyses and information. Nucleic Acids Res. 34 (Database issue), D181 (2006). doi:10.1093/nar/gkj001 Medline
78. A. Marchler-Bauer et al., CDD: A conserved domain database for interactive domain family analysis. Nucleic Acids Res. 35 (Database issue), D237 (2007). doi:10.1093/nar/gkl951 Medline
79. D. Gagneul et al., A reassessment of the function of the so-called compatible solutes in the halophytic plumbaginaceae Limonium latifolium. Plant Physiol. 144, 1598 (2007). doi:10.1104/pp.107.099820 Medline