Honey-bee–associated prokaryotic viral communities reveal wide viral diversity and a profound metabolic coding potential
Ward Deboutte (a,1), Leen Beller (a), Claude Kwe Yinda (a,b), Piet Maes (a), Dirk C. de Graaf (c), and Jelle Matthijnssens (a,1)

(a) Department of Microbiology, Immunology, and Transplantation, Rega Institute for Medical Research, Division of Clinical and Epidemiological Virology, KU Leuven, BE3000 Leuven, Belgium; (b) Rocky Mountain Laboratories, Laboratory of Virology, Virus Ecology Unit, National Institute of Allergy and Infectious Diseases, NIH, Hamilton, MT 59840; and (c) Department of Biochemistry and Microbiology, Laboratory of Molecular Entomology and Bee Pathology, Ghent University, BE9000 Ghent, Belgium

Edited by Nancy A. Moran, The University of Texas, Austin, TX, and approved March 20, 2020 (received for review December 12, 2019)

Honey bees (Apis mellifera) produce an enormous economic value through their pollination activities and play a central role in the biodiversity of entire ecosystems. Recent efforts have revealed the substantial influence that the gut microbiota exert on bee development, food digestion, and homeostasis in general. In this study, deep sequencing was used to characterize prokaryotic viral communities associated with honey bees, which was a blind spot in research up until now. The vast majority of the prokaryotic viral populations are novel at the genus level, and most of the encoded proteins have unknown functions. Nevertheless, genomes of bacteriophages were predicted to infect nearly every major bee-gut bacterium, and functional annotation and auxiliary metabolic gene discovery imply the potential to influence microbial metabolism. Furthermore, previously undiscovered genes involved in the synthesis of secondary metabolic biosynthetic gene clusters reflect a wealth of untapped enzymatic resources hidden in the bee bacteriophage community.

viral metagenomics | prokaryotic viruses | bacteriophages | Apis mellifera

Significance

This study uses viral-like particle purification and subsequent unbiased genome sequencing to identify prokaryotic viruses associated with Apis mellifera. Interestingly, bacteriophages found in honey bees show a high diversity and span different viral taxa. This diversity sharply contrasts with the state-of-the-art knowledge on the relatively simple bee bacterial microbiome. The identification of multiple auxiliary metabolic genes suggests that these bacteriophages possess the coding potential to intervene in essential microbial pathways related to health and possibly also to disease. This study sheds light on a neglected part of the bee microbiota and opens avenues of in vivo research on the interaction of bacteriophages with their bacterial host, which likely has strongly underappreciated consequences on bee health.

Author contributions: W.D., D.C.d.G., and J.M. designed research; W.D., L.B., C.K.Y., and P.M. performed research; W.D. and L.B. analyzed data; and W.D. and J.M. wrote the paper.

The authors declare no competing interest.

This article is a PNAS Direct Submission.

This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

Data deposition: Retrieved prokaryotic viral sequences larger than 5 kb were deposited in the National Center for Biotechnology Information (NCBI) GenBank database (accession numbers available in SI Appendix, Table S18, available on GitHub). Raw reads were deposited in NCBI's Sequence Read Archive (SRA) database under accession no. PRJNA579886 (SRA accession numbers are also available in SI Appendix, Table S18, available on GitHub). Analysis notebooks are available at GitHub (https://github.com/matthijnssenslab/beevir). All intermediate results files and outputs that were generated, as well as the fasta sequences for nucleotides and proteins, are also available through the GitHub repository.

See online for related content such as Commentaries.

(1) To whom correspondence may be addressed. Email: [email protected] or [email protected].

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1921859117/-/DCSupplemental.

First published April 27, 2020.

www.pnas.org/cgi/doi/10.1073/pnas.1921859117 PNAS | May 12, 2020 | vol. 117 | no. 19 | 10511–10519

Pollination is essential to entire ecosystems, and honey bees (Apis mellifera) are considered the most economically important insect pollinators for commercial crops worldwide. Apart from the production of honey and other valuable products, honey bees contribute significantly to insect pollination, of which the economic value has been estimated at €153 billion (1). During the past decades, it has become clear that managed honey-bee colonies are under pressure from a wide variety of stressors, such as parasites (2), bacterial pathogens (3), viral pathogens (4), and others such as chemical stressors (5). Recently, the bee microbiota have received increasing attention, and a number of studies have attempted to characterize the honey-bee-gut microbiome (6, 7). These studies have revealed that the bacterial part of the core honey-bee-gut microbiome is dominated by 5 to 10 different bacterial species. The species that were identified belonged to three different bacterial phyla, namely the Proteobacteria, Firmicutes, and Actinobacteria (6). Transcriptome analysis further provided information on the functional potential encoded by the bacterial gut microbiome (8). From these insights a model was proposed for a microbial metabolic pathway, with different roles for different bacteria. Briefly, glycosidases and peptidases (encoded by the aforementioned core bacterial microbiome) initially break down plant polysaccharides and proteins. These products are further fermented into organic acids, gases, and alcohols, which are then further metabolized by methanogens and Clostridia species. The fact that honey bees cannot survive on unprocessed pollen alone (9) highlights the importance of microbial enzymatic digestion in honey-bee homeostasis. These findings were recently recapitulated in a study employing system-wide metabolomics (10). This study confirms that the bee-gut microbiota play a central role in the digestion and metabolization of pollen-derived components. Further evidence of host–microbe interactions has revealed a positive influence of the bee-gut microbiota not only on host weight gain and the weight of the gut compartments but also on the endogenous expression of genes involved in development and immunity, sucrose sensitivity, and insulin-like signaling (11).

Taken together, these results imply an essential role of the bee-gut microbiota in nutrition availability, bee development, and general homeostasis. This role is further strengthened by the observation that a diet-induced gut bacterial dysbiosis is associated with detrimental effects on development, mortality, and disease susceptibility (12). The fact that both the diet of honey bees and the bacterial diversity present in the honey-bee gut are
much less divergent than its human counterpart led to the proposal to use honey bees as model systems for microbiota research and furthermore as a useful tool in studying the evolution and ecology of host–microbe interactions (13). Despite the recent advances in the knowledge of the honey-bee gut metagenome, the work done on honey-bee–associated bacteriophages has been based on only a few isolates and thus remains biased (14, 15). It is often postulated that (prokaryotic) viruses represent the most prevalent biological units worldwide and execute essential roles within their respective ecosystems. For example, bacteriophages play a significant role in carbon, nitrogen, and phosphorus cycling in the oceans (16) and have been implicated in soil ecology (17). The presence of auxiliary metabolic genes (AMGs: here defined as genes present in bacteriophages, but originating from bacteria with the potential to modulate microbial metabolism) within bacteriophages, and the recent discovery of a communication system resulting in lysis and lysogeny decisions (18), reflect the important influence that these viruses exert on their putative hosts and on ecosystem equilibria in general. In humans, bacteriophages have been used as alternatives for antibiotics (19) and even proposed to be used as biomarkers for numerous conditions (20).

The multitude of functions that prokaryotic viruses can exert within their biosphere, combined with the fact that the honey-bee bacterial microbiome plays a crucial role in bee health, implies that the viral microbiome could play an important role in bee homeostasis as well. In this work we present an initial characterization of the prokaryotic viral microbiome associated with honey bees derived from healthy and weakened colonies using viral-like particle enrichment strategies combined with short-read Illumina sequencing.

Results

Prokaryotic Virus Identification through Next-Generation Sequencing. Samples comprising 300 different colonies of Flemish honey bees, collected in the framework of the EpiloBEE study (21) (SI Appendix, Fig. S1), were enriched for viruses (both DNA and RNA viruses) according to the NetoVIR protocol (22) and sequenced. These samples were initially selected to represent the Flemish population of honey bees as well as possible. In total, 102 pools containing samples from hives that were comparable (derived from healthy or weak colonies) and matched geographically and by subspecies as well as possible were analyzed (SI Appendix, Table S14, available on GitHub). Two bees from each of three colonies were pooled together, except for the last three pools, which contained two bees from one colony. Each pool was assigned 5 million 150-bp end reads and was sequenced using the Illumina NextSEQ platform. This approach yielded a total of 686,940,647 reads, with a median of 5,798,403 reads per pool (minimum: 2,096,600 reads; maximum: 26,307,071 reads). After de novo assembly of the separate libraries, the resulting contigs were collapsed at 95% nucleotide identity over a coverage of 80%, and putative prokaryotic viral sequences were identified using VIRSorter (23) and a lowest-common-ancestor approach using DIAMOND (24). Eukaryotic viruses were omitted by using the virome decontamination mode in VIRSorter and by manually parsing the DIAMOND output. These approaches allowed the identification of 4,842 nonredundant putative prokaryotic viral contigs with a minimum length of 500 bp. Of these contigs, 20 were predicted to be circular (and thus complete genomes) (SI Appendix, Fig. S2). Of these 20 complete genomes, 11 could be assigned to known bacteriophage families (Microviridae, Siphoviridae, Myoviridae, and Podoviridae), and 7 could be assigned to a bacterial host (Bifidobacterium, Bartonella, Lactobacillus, Hafnia, and Pluralibacter) (see below).
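The collapsing step can be illustrated with a minimal sketch. It assumes that pairwise alignments between contigs are already available and that coverage is measured on the shorter contig of each pair; the text states neither the tool used nor the reference sequence for the coverage criterion, so both are assumptions, and the function name `collapse_contigs` is hypothetical.

```python
def collapse_contigs(lengths, hits, min_identity=95.0, min_coverage=0.80):
    """Greedily assign each contig to a representative.

    lengths: dict mapping contig name -> length in bp
    hits:    iterable of (contig_a, contig_b, percent_identity, aligned_bp)
    Coverage is taken relative to the shorter contig (an assumption).
    """
    similar = {name: set() for name in lengths}
    for a, b, identity, aligned_bp in hits:
        if a == b:
            continue
        shorter = min(lengths[a], lengths[b])
        if identity >= min_identity and aligned_bp / shorter >= min_coverage:
            similar[a].add(b)
            similar[b].add(a)

    representative = {}
    # Visit the longest contigs first so that they become the representatives.
    for contig in sorted(lengths, key=lambda c: (-lengths[c], c)):
        if contig in representative:
            continue
        representative[contig] = contig
        for other in similar[contig]:
            representative.setdefault(other, contig)
    return representative
```

Choosing the longest contig as the representative is a common convention rather than something the text specifies.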
Species accumulation curves (assuming that the collapsed contigs reflect distinct viral species) reveal no plateau being reached, implying that, despite the large sampling effort and viral particle enrichment, prokaryotic viral sequence space was not fully probed (Fig. 1A). This observation is also reflected by the strong correlation between contig length and contig tpmean coverage (the average number of reads overlapping each base after removing the 10% most- and least-covered bases) (Spearman correlation coefficient = 0.74, P value < 1 × 10⁻⁴) (SI Appendix, Fig. S3). Reads from individual pools were aligned back to the contig representatives, and the presence of every contig in a pool was evaluated (presence being defined as tpmean coverage > 10). Most contigs larger than 5 kb were shared between fewer than five pools, and 20 contigs were shared between more than five pools (Fig. 1B). Pairwise comparison of the contig sharing between pools is reflected in a network that represents 70 of the 102 pools sequenced (Fig. 1C). The maximum number of contigs shared between two pools was 15. No clear clustering patterns could be observed when applying the Markov Cluster Algorithm (MCL) (25) or k-means clustering. The maximal clique observed in the network contained 21 pools. Next, the dimensionality of the coverage matrix was reduced using principal coordinate analysis (PCoA), and the clustering patterns for sample status (healthy vs. diseased), sampling year (2012 vs. 2013), and location (Belgian provinces) were tested using the Adonis test. No significant effect was observed for sample status, but both location and sampling year were significant, albeit with a low R-squared value (SI Appendix, Fig. S4). Retrieved putative prokaryotic virus contigs show a wide range of guanine–cytosine (GC) percentages, ranging from roughly 25 to 70% (Fig. 1D). After decorating the contigs with prokaryotic virus orthologous groups [pVOGs (26)], roughly half the contigs (2,346 contigs, or 48.5%) had a pVOG vs. open reading frame (ORF) ratio larger than 50% (Fig. 1E). Some of the contigs that fell below this ratio were more than 10 kb in size, suggesting that a large number of the retrieved viral proteins are not represented in the pVOG database. Plotting the average "Virusness" (frequency of a pVOG being present in viruses versus the frequency of a pVOG being present in bacteria) of the annotated pVOGs within a contig showed that a slight majority of the contigs fell above 0.5 (SI Appendix, Fig. S5). The bimodal distribution reflects that a large number of contigs show a clear viral signal (average Virusness close to 1), while the enrichment of contigs with an average Virusness close to zero is a consequence of contigs carrying no detectable pVOG at all (orange dots in SI Appendix, Fig. S5).
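The tpmean coverage and the presence criterion defined above can be sketched as follows. This is an illustration of the stated definition only, assuming that per-base read depths are already available for each contig in each pool; the function names are hypothetical and not taken from the analysis notebooks.

```python
def tpmean(depths, trim=0.10):
    """Mean per-base depth after removing the `trim` fraction of the
    least-covered and the `trim` fraction of the most-covered bases."""
    if not depths:
        return 0.0
    ordered = sorted(depths)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def present_contigs(pool, threshold=10.0):
    """Contigs whose tpmean coverage in this pool exceeds the threshold.

    pool: dict mapping contig name -> list of per-base depths
    """
    return {name for name, depths in pool.items() if tpmean(depths) > threshold}


def shared_contigs(pool_a, pool_b, threshold=10.0):
    """Contigs called present in both pools (one edge of the sharing network)."""
    return present_contigs(pool_a, threshold) & present_contigs(pool_b, threshold)
```

The size of the set returned by `shared_contigs` would correspond to the edge weight between two pools in a sharing network such as the one in Fig. 1C.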

Fig. 1. Bee-associated prokaryotic viruses display a high interindividual diversity and contain a large number of unknown viral proteins. (A) Species accumulation curves as a function of the number of pools sequenced. Vertical lines indicate SDs based on 100 permutations. (B) Swarm plot reflecting putative viral contigs larger than 5 kb that were present in one sample or more (140 in total). Presence is defined as a coverage >10. A dot represents a single contig. The box shows the three quartile values, and the whiskers extend to 1.5 interquartile ranges of the lower and upper quartile. All 140 dots are drawn in the plot. (C) Edge-weighted spring-embedded layout network depicting the samples as nodes and edges as the number of contigs shared between them. Edge thickness reflects the number of contigs. Green nodes depict pools derived from healthy colonies; red nodes depict pools derived from weak colonies. Edge thickness ranges from 1 to 15. (D) GC percentage of all representative putative viral contigs as a function of their log10-transformed coverage in the pool from which the representative was derived. Log10-transformed length is indicated by color intensity. (E) Number of pVOGs found in the putative viral contigs, normalized by the number of predicted ORFs, as a function of their log10-transformed coverage. Log10-transformed length is indicated by color intensity.

Host Assignment of Prokaryotic Viral Contigs. Putative bacteriophage genomic sequences can be linked to their specific host by matching them against the CRISPR-spacer sequences encoded by the host and by transfer RNA (tRNA) similarity. We constructed a bee-specific gut bacterial microbiome dataset by collating data available on IMG/M and from Ellegaard et al. (27) (SI Appendix, Table S15). This collated dataset includes bacterial sequences from six different genera: Lactobacillus, Bifidobacterium, Commensalibacter, Gilliamella, Snodgrassella, and Frischella. To minimize the possibility that any of the putative prokaryotic viral contigs were of bacterial origin, the coding density and frequency of strand shift were calculated for both the bacterial contig set and the bacteriophage contig set (Fig. 2A). Both parameters differed significantly between the two sets. On average, the coding density (defined as number of predicted genes/kilobase) was higher in the virus dataset (2.50) than in the bacterial dataset (1.07) (Mann–Whitney U test, P value = 5 × 10⁻¹⁶⁴). The average frequency of strand shift (defined as the frequency that two neighboring genes start in different frames) was higher in the bacterial dataset (0.84) than in the viral dataset (0.63) (Mann–Whitney U test, P value = 4 × 10⁻⁸⁷). These observations, together with the fact that no single-copy bacterial marker genes [as defined by Lee et al. (28)] could be identified in the putative prokaryotic viral contig set, suggest that the viral contig set contains no or very little bacterial contamination. In total, 76 putative bacteriophage contigs could be linked to specific bacteria using these approaches. These contigs were within the length range of 1 to 107 kb, and four were predicted to be circular (and thus depict full-length genomes). Of these 76 contigs, 32 could be assigned to the genus Lactobacillus, 17 to the genus Gilliamella, and 27 to the genus Bifidobacterium. No viral contigs could be linked to the Frischella, Snodgrassella, and Commensalibacter group of bacteria since these bacterial genomes did not contain any detectable CRISPR array. However, one tRNA hit was found against the genus Frischella (Fig. 2B). The majority of host-called viral contigs were linked to a single bacterium, but 17 of them displayed CRISPR-spacer hits against more than one bacterial species. One putative viral contig could even be linked to five different bacteria, but none of the host-linked viral contigs could be assigned to more than one genus, suggesting a restricted host range. Only five of the putative viral contigs contained a tRNA signature that could be linked to specific bacteria. Two of those contigs gave hits against nearly all of the Bifidobacterium or Lactobacillus species included in this analysis. One of those contigs also gave a hit to the only Frischella species included, although no CRISPR-spacer evidence was found to confirm this. Since bees sample the environment, it cannot be excluded that some of the retrieved viral sequences reflect environmental bacteriophages rather than true bee-gut viruses. To this end, an additional CRISPR-spacer search was run using the spacers present in the CRISPR database (CRISPRdb) (29). These results confirm 19 of the 76 previous hits against the bee-gut–specific bacteria. Furthermore, 50 additional hits were found, of which 32 were for bacterial genera present in the bee gut (6 Lactobacillus hits, 3 Bifidobacterium hits, 18 Bartonella hits, 5 Gilliamella hits). The 18 remaining putative hosts identified potentially reflect environmental bacteria (SI Appendix, Table S19, available on GitHub).
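The two contamination metrics used above can be sketched from a list of predicted genes. The text defines strand shift through the frames of neighboring genes but does not say how a frame was determined; the sketch below assumes a six-frame scheme (strand plus codon-start position modulo 3), and the `Gene` record and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    start: int   # 1-based leftmost coordinate on the contig
    end: int     # 1-based rightmost coordinate on the contig
    strand: str  # '+' or '-'


def coding_density(genes, contig_length_bp):
    """Number of predicted genes per kilobase of contig."""
    return len(genes) / (contig_length_bp / 1000)


def frame_of(gene):
    """Six-frame label: strand plus codon-start position modulo 3 (assumed)."""
    codon_start = gene.start if gene.strand == "+" else gene.end
    return (gene.strand, codon_start % 3)


def strand_shift_frequency(genes):
    """Fraction of neighboring gene pairs that start in different frames."""
    ordered = sorted(genes, key=lambda g: g.start)
    if len(ordered) < 2:
        return 0.0
    shifts = sum(1 for a, b in zip(ordered, ordered[1:]) if frame_of(a) != frame_of(b))
    return shifts / (len(ordered) - 1)
```

Under this six-frame assumption, two genes placed at random would differ in frame 5/6 of the time, which is close to the 0.84 reported for the bacterial set; a lower value, as in the viral set, would indicate that neighboring genes share a frame more often than expected by chance.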

Classification of Prokaryotic Viral Contigs. In an attempt to classifythe newly discovered sequences, we ran vConTACT2 (30) on theputative prokaryotic viral sequences retrieved, using the Pro-karyotic viral REFSEQ 88 database. This method uses gene-sharing

Fig. 2. Retrieved prokaryotic viruses display a significant difference in genomic variables and infect a wide range of known bee-gut bacteria. (A) Frequencyof strand shift in function of coding density (number of ORFs per kilobase). Data from the bacterial dataset are indicated in blue; data from the viral datasetare indicated in orange. Boxplots for individual parameters are also denoted, and asterisks designate significance (Mann–Whitney U test; P value for codingdensity = 5.10−164; P value for strand shift frequency = 4.10−87). The box shows the three quartile values, and the whiskers extend to 1.5 interquartile rangesof the lower and upper quartile. Dots independently drawn fall outside of this range. (B) Maximum-likelihood phylogenetic tree for bacterial sequencesincluded in the host-calling effort. Gray integers indicate bootstrap values. The tree is colored according to bacterial genera. Number of contigs linked to aspecific bacterial species are indicated by the stacked horizontal bar plots (CRISPR-spacer counts and tRNA similarity). Shades of gray indicate the number ofspecific bacterial species that gave hits to a single contig (CRISPR spacers) or indicate a specific viral contig (tRNA similarity). Single contigs displaying CRISPR-spacer hits to multiple bacteria are indicated with colored tax links between the tips. A single color corresponds to a single contig.

10514 | www.pnas.org/cgi/doi/10.1073/pnas.1921859117 Deboutte et al.

Dow

nloa

ded

by g

uest

on

Aug

ust 7

, 202

0

Page 5: Honey-bee–associated prokaryotic viral communities reveal ...Honey-bee–associated prokaryotic viral communities reveal wide viral diversity and a profound metabolic coding potential

networks to taxonomically assign prokaryotic viruses solelybased on their sequence. The algorithm classifies sequenceseither as “singletons” (no shared gene content), “outliers”(weakly connected with a cluster of sequences), or as part of acluster (30). Of the 4,842 nonredundant prokaryotic viral contigs(>500 bp), 3,010 were singletons, 582 were outliers, 181 showedstrong overlap between more than one established cluster (notallowing for their unambiguous classification), and 1,034 couldbe unambiguously clustered (Fig. 3A). The clustered contigs arerepresented by 403 viral genome clusters (which are said to beequivalent to the genus taxonomic level). Of these viral genomeclusters, 368 clusters contained no REFSEQ sequences at all.The remaining 35 clusters (representing 85 contigs) were mostlyrelated to the families Siphoviridae and Myoviridae, although thefamilies Podoviridae, Inoviridae, Microviridae, and Cystoviridaewere also represented (Fig. 3 B and C and SI Appendix, TableS16). The resulting network of clustered genome sequences re-veals the newly discovered sequences as widely dispersedthroughout known REFSEQ sequences (Fig. 3D), despite thefact that this network reflects only about 20% of the recoveredsequences (the remaining sequences are singletons). Of the 537viral contigs that were larger than 5 kb, 71 (13.2%) were sin-gletons, 126 (23.5%) were outliers, 67 (12.5%) showed toomuch overlap to be unambiguously classified, and 273 (50.1%)could be unambiguously clustered. Of the clustered sequences,73 could be assigned to a viral family. Although the relativeamount of assigned contigs versus the other categories washigher in the dataset with large viral contigs (SI Appendix, Fig.S6A) compared to the small contigs (Fig. 3A), the assignment todifferent viral families remained comparable to the full dataset(Fig. 3 B and C vs. SI Appendix, Fig. S6 B and C). 
The networkprojection also revealed that, despite the loss of a substan-tial number of small clusters, the viral diversity remained

widespread. The relative increase in clustered viral sequencesseen only when looking at contigs above 5 kb, combined with theobservation that most of the singleton contigs are shorter inlength than the other groups (SI Appendix, Fig. S7), reflect that ashorter sequence length hampers the classification process in thisdataset. Because viruses lack universal marker genes, and veryfew of the recovered proteins could be clustered together (seebelow), phylogenetic trees were drawn for the five largest proteinclusters (PCs), as identified by vConTACT2. These five largestprotein clusters contained reference proteins annotated as “Ri-bonucleotide reductase” (PC1), “Endonuclease” (PC2), “ssDNAbinding protein” (PC3), “Endonuclease” (PC4), and “Thymidi-late synthase” (PC5). The number of proteins (identified in thisstudy) in each protein cluster was highly variable. PC1 contained27 identified proteins (400 proteins in total), PC2 contained 70identified proteins (389 proteins in total), PC3 contained83 identified proteins (331 proteins in total), PC4 contained 34identified proteins (302 proteins in total), and PC5 contained 8identified proteins (283 proteins in total). The identified se-quences do not fall into distinct clades and seem to be dispersedover the entire phylogenetic spectrum of their respective trees(SI Appendix, Fig. S8A). Given the lack of large protein clusterscontaining many of the identified sequences, the branch lengthsin between all tips on the protein cluster trees were calculatedand linked to the minimum path length between the corre-sponding genomes in the vConTACT2 network. 
These minimum path lengths were calculated using the Bellman–Ford algorithm. Both metrics were significantly positively correlated (Spearman rank correlation coefficient = 0.43; P value = 0.0), although when the pairs were split by type (bee-associated viral contigs only, references only, or bee-associated viral contigs and references combined) the correlation coefficients ranged from 0.44 to 0.82 (SI Appendix, Fig. S8B). These results imply that the distances between connected nodes within the vConTACT2 network can also be interpreted (to some degree) as phylogenetic distances.

Fig. 3. The vast majority of retrieved prokaryotic viruses cannot be classified confidently. (A) Counts indicating the classification status of the putative viral contigs using vConTACT2. “Clustered Assigned” denotes retrieved contigs falling into clusters containing reference sequences; “Clustered Not-Assigned” denotes retrieved contigs falling in clusters without reference sequences. (B) Number of clusters that contained both confidently clustered large contigs and reference sequences. (C) Number of clusters that contained both confidently clustered contigs and reference sequences. (D) Scalable force directed placement layout genome network containing the retrieved clustered putative prokaryotic viruses (red) and the viral family of reference sequences (other colors). The most prevalent viral families are indicated in yellow (Myoviridae), green (Podoviridae), and orange (Siphoviridae).

Deboutte et al. PNAS | May 12, 2020 | vol. 117 | no. 19 | 10515
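The comparison described above, between minimum path lengths in the genome network and branch-length distances on the protein trees, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline (which relied on graph-tool, ete3, and SciPy): the toy network, the unit edge weights, the branch-length values, and the helper names `bellman_ford`, `average_ranks`, and `spearman` are all assumptions made for the example.

```python
from itertools import combinations


def bellman_ford(nodes, edges, source):
    """Shortest path length from `source` to every node of an undirected graph."""
    dist = {n: float("inf") for n in nodes}
    dist[source] = 0.0
    # Each undirected edge is relaxed in both directions.
    directed = list(edges) + [(v, u, w) for u, v, w in edges]
    for _ in range(len(nodes) - 1):
        changed = False
        for u, v, w in directed:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:  # early exit once no distance improves
            break
    return dist


def average_ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical gene-sharing network: four genomes, unit edge weights (assumed).
genomes = ["A", "B", "C", "D"]
network_edges = [("A", "B", 1.0), ("B", "C", 1.0), ("C", "D", 1.0)]

# Hypothetical tip-to-tip branch-length distances from a protein-cluster tree.
branch_length = {
    ("A", "B"): 0.20, ("A", "C"): 0.50, ("A", "D"): 0.90,
    ("B", "C"): 0.30, ("B", "D"): 0.70, ("C", "D"): 0.25,
}

all_paths = {g: bellman_ford(genomes, network_edges, g) for g in genomes}
path_from_a = all_paths["A"]

pairs = list(combinations(genomes, 2))
path_lengths = [all_paths[u][v] for u, v in pairs]
tree_distances = [branch_length[(u, v)] for u, v in pairs]
rho = spearman(path_lengths, tree_distances)
print(f"Spearman rho between path length and branch length: {rho:.2f}")
```

A positive `rho` on real data would correspond to the pattern reported in the text, where genomes that are far apart in the network also tend to be far apart on the protein trees.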

Functional Potential and Selection Signatures of Honey-Bee–Associated Prokaryotic Viral Genes. To gain insights into the functional potential encoded by the retrieved bacteriophages, InterProScan (31) and eggNOG-mapper (32, 33) were utilized for domain annotations. To reduce computational burden, protein sequences from predicted genes were collapsed on 50% amino acid identity before analysis. This procedure reduced the number of putative proteins from 24,420 to 18,747, although the vast majority of clusters comprised fewer than five protein sequences (Fig. 4A and SI Appendix, Fig. S9A). In an attempt to identify AMGs, we blasted the viral protein sequences against the proteins encoded in the same bee-gut bacterial microbiome dataset used for host calling (SI Appendix, Table S15). Prior to blasting, the bacterial protein dataset was clustered using the same parameters as for the viral proteins. A viral protein was considered a genuine AMG when the alignment had an e-value smaller than 1e-5. Of the 18,747 viral protein clusters, 2,744 were identified as AMGs (Fig. 4B). To estimate the proportion of the identified AMGs originating from prophage regions, PHASTER (34) was run on the bacterial contigs (SI Appendix, Table S17). The bacterial proteins found in the AMG search were then checked for whether they fell inside these regions. In total, 45 prophage regions were discovered, and 95 of the 1,506 (roughly 6%) bacterial counterparts of the identified AMGs fell inside these regions. Roughly 65% of the cluster representatives of the viral proteins (12,286) showed significant hits against the eggNOG database, the different databases used by InterProScan, or both (Fig. 4A).
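The AMG criterion stated above (a viral protein cluster with a bacterial hit at an e-value smaller than 1e-5) can be illustrated with a short Python sketch. It assumes the BLAST hits are available in the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, e-value, bit score); the paper does not state the output format used, and the identifiers and hit values below are invented for the example.

```python
E_VALUE_CUTOFF = 1e-5  # threshold stated in the text

# Hypothetical tabular BLAST hits (viral query vs. bacterial subject).
blast_lines = [
    "vir_0001\tbac_0100\t62.5\t120\t45\t0\t1\t120\t5\t124\t3e-40\t150",
    "vir_0001\tbac_0200\t35.0\t80\t52\t2\t10\t89\t1\t80\t2e-06\t48.1",
    "vir_0002\tbac_0300\t28.0\t50\t36\t1\t1\t50\t3\t52\t0.002\t30.2",
    "vir_0003\tbac_0100\t41.0\t200\t118\t3\t1\t200\t1\t198\t1e-5\t90.0",
]


def call_amgs(lines, cutoff=E_VALUE_CUTOFF):
    """Map each viral cluster with a qualifying hit to its bacterial counterparts."""
    amgs = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        query, subject, e_value = fields[0], fields[1], float(fields[10])
        if e_value < cutoff:  # strictly smaller than the cutoff
            amgs.setdefault(query, set()).add(subject)
    return amgs


amgs = call_amgs(blast_lines)
bacterial_counterparts = set().union(*amgs.values()) if amgs else set()
print(f"{len(amgs)} putative AMG(s), {len(bacterial_counterparts)} bacterial counterpart(s)")
```

Keeping the set of bacterial counterparts alongside the viral clusters mirrors the later step in the text, in which those counterparts are checked against predicted prophage regions.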
Of all of the Clusters of Orthologous Groups (COG) categories that the viral proteins could be assigned to, category S had the highest number of clusters assigned (“Function Unknown,” 4,293 clusters), followed by typical viral replication signatures (“Replication, recombination, and repair” [468 clusters], “Cell wall/membrane/envelope biogenesis” [212 clusters], and “Transcription” [166 clusters]) (SI Appendix, Fig. S9B). A complete enumeration of Gene Ontology (GO) accessions plotted into treemaps using REVIGO (35) reveals similar characteristics as the COG categories and a general lack of in-depth annotation of the retrieved viral proteins (SI Appendix, Fig. S10). In an attempt to further elucidate functional potential, the GO accessions were projected onto the pathways of which they are a part using the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway mapper (36) (Fig. 4C). Some of the retrieved pathways reflect functions that could influence bacteria directly and are involved in biofilm formation, quorum sensing, and bacterial chemotaxis. Other represented pathways reflect a wide range of basic metabolic functions, including lipid, carbohydrate, nucleotide, and amino acid metabolism. Interestingly, xenobiotic degradation, glycan biosynthesis, and terpenoid and polyketide metabolism were also represented. The bacterial annotations for the previously defined AMGs were also projected into pathways, and overlapping pathway annotations were identified (Fig. 4C, red-outlined rectangles). Interestingly, nearly all of the identified pathways in the viral protein clusters were represented by the bacterial annotations for the AMG set. This observation further confirms the idea that the prokaryotic viral contigs contain the coding potential to influence bacterial metabolic state and homeostasis.

In an attempt to further characterize the signatures of secondary metabolites, as well as the other pathway functions that reflect the role of secondary metabolites, antiSMASH (37) was run. In total, four gene clusters were identified, all containing one gene with a bacteriocin signature (SI Appendix, Fig. S11). The four genes containing the bacteriocin signature had amino acid similarities with GenBank proteins ranging from 34 to 97%. Of the gene clusters identified, 13 to 53% of neighboring genes showed similarity to other genes represented in the antiSMASH database. Finally, an attempt was made to characterize selection signatures within the encoded genes. To achieve this, single-nucleotide polymorphisms (SNPs) were called per representative contig present in every pool, and nonsynonymous vs. synonymous substitution rates were calculated using SNPGenie (38). The majority of genes had a πN/πS ratio lower than 1, but 52 proteins revealed a positive selection signature in at least one pool (SI Appendix, Fig. S12). Of those 52 proteins, 11 were functionally annotated. These functions included mostly capsid and tail domains, but also transglycosylase and endopeptidase functions (see SI Appendix, Table S18, available on GitHub).
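The selection screen described above, which flags genes whose πN/πS ratio exceeds 1 in at least one pool, can be sketched as follows. This is an illustrative example only: the per-gene, per-pool diversity values are invented, the names `pin_pis_ratio` and `records` are hypothetical, and treating a gene with πS = 0 as "undefined" is an assumption made here rather than a rule taken from the paper or from SNPGenie.

```python
def pin_pis_ratio(pi_n, pi_s):
    """Return piN/piS, or None when piS is zero and the ratio is undefined."""
    if pi_s == 0:
        return None
    return pi_n / pi_s


# Hypothetical (gene, pool, piN, piS) records.
records = [
    ("gene_1", "pool_01", 0.0020, 0.0050),
    ("gene_1", "pool_02", 0.0010, 0.0040),
    ("gene_2", "pool_01", 0.0060, 0.0030),
    ("gene_2", "pool_02", 0.0010, 0.0030),
    ("gene_3", "pool_01", 0.0040, 0.0),
]

# A gene is flagged when its ratio exceeds 1 in at least one pool.
positively_selected = set()
for gene, pool, pi_n, pi_s in records:
    ratio = pin_pis_ratio(pi_n, pi_s)
    if ratio is not None and ratio > 1:
        positively_selected.add(gene)

print(sorted(positively_selected))
```

A ratio above 1 means that nonsynonymous diversity exceeds synonymous diversity, which is the signature the text interprets as positive selection.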

[Fig. 4 graphic not reproduced. Panels A, B, and C; recoverable labels: “Total Count,” “Viruses,” “Bacteria.”]

Fig. 4. Functional annotation reveals a large metabolic overlap between bacterial and prokaryotic virus proteins. (A) Number of viral and bacterial proteins included in the analysis (blue), number of clusters remaining after collapsing on 50% AA identity (orange), and number of protein clusters with a hit through either eggNOG-mapper or InterProScan. (B) Venn diagram depicting the number of protein clusters and the number of putative AMGs identified. (C) Functional network depicting the KEGG pathways represented in the viral protein clusters. Edge weight reflects the number of GO accessions associated with each pathway, ranging from 1 to 43. All edge weights larger than 1 are indicated with a number on the terminal node. Pathways outlined with a red rectangle depict functional pathways found encoded by viral genes that are reflected by the bacterial representatives of the AMGs as well.

10516 | www.pnas.org/cgi/doi/10.1073/pnas.1921859117 Deboutte et al.


Discussion

This study provides an unbiased look at the prokaryotic viral communities associated with honey bees. The fact that the species accumulation curve did not reach a plateau phase implies that the full prokaryotic virus diversity has not been probed, despite the large sampling effort in combination with viral-like particle enrichment. The fact that some pools contained a high number of eukaryotic viral reads [also replicating in the honey-bee gut (39)] might have resulted in a suboptimal probing of the bacteriophages in these pools. The retrieved prokaryotic viral sequences display a large diversity, reflected by the number of annotated genes and the number of pVOGs that could be described. The general lack of classification of the contigs, and the fact that the majority of encoded genes could not be assigned a function, reflect this observation. Furthermore, the non-flattened species accumulation curve in combination with the strong positive correlation between coverage and length implies that many of the contigs described are fragmented genomes. In contrast to the gut bacterial microbiome, the prokaryotic viral communities display a high level of individuality, with very few of the sequences found in many pools. Whether this observation is a consequence of the sampling effort or truly reflects the lack of a core gut virome remains unresolved. No significant difference could be detected between the bacteriophage communities derived from healthy and weak bees, but both location and sampling year had a significant effect (albeit with a low R-squared value). This observation reinforces the idea of high individuality and implies a dynamic nature of the bacteriophage communities. Host assignment of the viral sequences resulted in the assignment of only 1% of the contigs to their respective bacteria.
Because neither of the methods used for host assignment (CRISPR-spacer sequences and tRNA similarity) has a very high sensitivity, it is likely that the number of viral sequences infecting members of the core gut bacterial microbiome is much higher. On the other hand, since entire bees were used for viral discovery, and not dissected guts, it cannot be excluded that some sequences represent soil- or plant-associated phage communities. The overlapping results from the CRISPRdb search and the bee bacterium-specific search revealed that the majority of sequences are indeed true gut-specific bacteriophages, but that environmental “contamination” cannot be ruled out completely. The observation that nearly all of the members of the core gut bacterial microbiome now have viral sequences associated with them reinforces the idea that at least a part of the viral community described here is truly part of the bee-gut virome. Since viral-like particle enrichment techniques never perform perfectly, the question arises whether some of the retrieved sequences could have originated from bacteria. Since bacteriophages can also be integrated into bacterial genomes as prophages, the identification process can be prone to errors. To ensure that the sequences retrieved in this study originate from viruses, and not from bacterial contamination, coding densities and strand shift were calculated and differed significantly from the bacterial dataset. Both parameters were chosen since a new bacteriophage identification algorithm identified these as the most informative in the discrimination between viruses and bacteria (40). Additionally, no bacterial marker genes could be identified with Anvi’o (41), using the single-copy gene bacterial Hidden Markov Model (HMM) profiles defined by Lee (28).
Furthermore, a large number of the GO terms associated with the putative viral proteins contain virus-specific signatures, and the overlap between bacterial and viral protein clusters (defined here as AMGs) remains relatively small. One would expect a much larger overlap between these cluster sets if the dataset were contaminated by bacterial sequences. Taken together, this evidence supports the idea that very few to none of the sequences used in this study are of bacterial origin. Classification of the putative viral sequences resulted in roughly 20% of all of the sequences being clustered into 403 putative viral genera (genome clusters), but many of the clusters contained only two sequences or did not contain a reference genome from an established (International Committee on Taxonomy of Viruses [ICTV]-recognized) virus genus or family. The fact that roughly 50% (273 of 537) of contigs larger than 5 kb could be clustered reflects that a short sequence length can hamper the ability to classify these sequences. Since the accuracy of the vConTACT2 algorithm is estimated at more than 95% (based on ICTV genera) (30), the confidence in the classification performance of the large sequences is high. The proportion of clustered sequences (roughly 20% of all sequences and roughly 50% of sequences larger than 5 kb) is higher than in a similar study on permafrost viruses (17% of sequences larger than 10 kb clustered) (42) and in human gut datasets (18% of sequences larger than 10 kb clustered) (43). Despite the relatively high proportion of clustered sequences, the classification results reflect the strikingly large diversity of prokaryotic viral communities associated with bees and how much of the viral diversity still remains untapped. This is reinforced by the fact that only very few putative bee-associated phage sequences were present in the largest protein clusters created for classification. The same patterns of diversity are also reflected in the protein annotation. Most of the predicted protein clusters remain unannotated. Of the proteins predicted to be under strong directional selection, only 20% could be assigned a function. Since it is probable that these proteins fulfill cornerstone functions in viral replication or important functions in the viral life cycle, the lack of annotation of these proteins reflects how little is known about these processes.
Of the proteins that could be annotated in a meaningful way, the vast majority were specific for nucleic acid processing/metabolism and are most often derived from polymerase sequences, which are often the easiest to identify owing to their very specific domains. Represented pathways contained a plethora of metabolic functions, including carbohydrate, protein, and lipid processing pathways. Many of these functions are also represented by the bacterial counterpart of the bee microbiome, implying that the bee-gut virome contains the coding potential for a vast range of metabolic functions and could directly intervene within the gut ecosystem. The best-represented pathways, such as genetic information processing and nucleotide metabolism, most likely reflect the rewiring strategy of phages to tune the bacterial cell metabolism toward virus replication, which has been described before (44). The lipid and nucleotide metabolism pathways most likely point in the same direction. The presence of environmental information-processing pathways suggests that some of the retrieved bacteriophages have the potential to probe the environment. It has been shown that the two-component system can be exploited by viruses to provide an environmental sensor system (45). The presence of more basal metabolic pathways, such as energy metabolism and carbohydrate metabolism, implies that also in the bee microbiome bacteriophages can modulate the metabolic state rather than hijack the microbial cell and deplete it of resources, as has been shown before (46). Biofilm formation, quorum sensing, and chemotaxis pathways were also represented within the retrieved viral communities, suggesting the potential of the viral communities to interfere in microbial processes on the bacterial population level. The presence of metabolic pathways involved in secondary metabolites and even terpenoids and polyketides raised the question whether any other genes could be involved in bacteria–bacteria interactions.
The discovery of four bacteriocin gene clusters implies that these bacteriophages do not directly influence only their own host and its metabolism but encode the potential to exert an effect on other bacteria in the same ecosystem through their host. Some of the identified bacteriocin genes were rather divergent, and very few of the neighboring genes within the cluster gave any hit at all. These findings imply that, while the essential host–microbe interactions in honey bees are known, the virus–bacteria interactions in the bee gut are highly intertwined. Finally, we can highlight the potential role that the prokaryotic viral community can play in the gut and microbial metabolism and thus indirectly influence bee development, health, and homeostasis.

Materials and Methods

Library Preparation and Next Generation Sequencing. Samples were taken from the Flemish section of the EpiloBEE project from both sampling years (autumn 2012 and 2013) from different hives. In the framework of this study, colony health was determined retrospectively by assessing which colonies survived the winter. Two honey bees per colony were taken and homogenized for 1 min in phosphate-buffered saline, using ceramic beads (Precellys, Bertin Technologies) at 4,000 Hz using a tissue homogenizer (Minilys, Bertin Technologies). Homogenates from three colonies were pooled together using equal volumes for feasibility reasons. Samples were pooled based on status (weak or healthy colonies), subspecies, and location (see SI Appendix, Table S14, available on GitHub). After pooling, the homogenates were prepared for sequencing using the NetoVIR protocol (22). Briefly, homogenates were centrifuged (17,000 × g for 3 min), filtered (0.8 μm), and treated with a mixture of Benzonase (Novagen) and micrococcal nuclease (New England Biolabs).

Next, nucleic acids (both DNA and RNA) were extracted using the RNA viral extraction kit (Qiagen), reverse-transcribed, and amplified using the WTA2 kit (Sigma-Aldrich) and prepared for sequencing using the Nextera XT kit (Illumina). Libraries were quantified using a Qubit fluorometer, and insert sizes were assessed using a Bioanalyzer (Agilent). Only libraries that had molarities above 4 nM and an average size of 300 bp or more were considered for sequencing. Paired-end sequencing was performed using the Illumina NextSeq platform, assigning 5 million clusters per pool (10 million reads with a base length of 150). In total, 102 pools were sequenced (representing 300 colonies).

Read Processing, Bacteriophage Contig Identification, and Classification. Reads were clipped using Trimmomatic (version 0.38) (47), removing WTA2 and Nextera XT adapters and cropping the leading 19 bases and the trailing 15 bases. Reads were trimmed using a sliding window of 4 with a PHRED score cutoff of 20 and a minimum size of 50 bp. Trimmed reads were assembled using SPAdes (version 3.12.0) (48) on the metagenomic setting with k-mer sizes of 21, 33, 55, and 77. Resulting contigs larger than 500 bp were clustered on 95% nucleotide identity over a coverage of 80% using ClusterGenomes (https://bitbucket.org/MAVERICLab/docker-clustergenomes). Putative prokaryotic viral sequences were identified using VirSorter (version 1.05) (23) and by including sequences that had a lowest-common ancestor [as assigned by KronaTools (version 2.7.1) (49)] to any prokaryotic viral family, after alignment with the nonredundant protein database (downloaded September 30, 2018) from the National Center for Biotechnology Information (NCBI). Reads were mapped back to these representative putative prokaryotic viral sequences using bwa-mem (50), and the resulting bam files were postprocessed using BamM (https://github.com/Ecogenomics/BamM), allowing only alignments with 95% nucleotide identity over 90% of the length. Coverages were calculated from these postprocessed bam files with BamM using the tpmean counting option. Dimension reduction was performed on the coverage matrix using the PCoA function implemented in the ape package in R (51) and formally tested using the adonis test implemented in the vegan package in R (52). Predicted viral sequences were classified using the BLASTP mode incorporated in vConTACT2 (30), using the Prokaryotic Viral RefSeq 88 database, MCL (25) for protein clustering, and ClusterONE (53) for genome clustering. The resulting vConTACT2 network was processed using the graph-tool library (54) in Python.
Phylogenetic trees for the five biggest protein clusters were created by aligning the protein sequences with MAFFT (L-INS-i setting) and trimming the resulting alignment with trimAl (version 1.2) using the gappyout preset. Trees were subsequently created with RAxML (version 8.2.12) (55) using automatic model selection. Statistics from the phylogenetic trees were processed using the ete3 toolkit (56), implemented in Python.
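The read-trimming parameters given above (19 leading and 15 trailing bases cropped, a sliding window of 4 with a PHRED cutoff of 20, and a minimum length of 50 bp) can be illustrated with a simplified Python sketch. This is not Trimmomatic's implementation, which differs in detail: the order of operations, the rule of cutting at the start of the first failing window, and the function names `sliding_window_cut` and `trim_read` are assumptions made for the example.

```python
def sliding_window_cut(quals, window=4, min_mean_q=20):
    """Number of leading bases to keep: cut at the first window whose mean quality is too low."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_mean_q:
            return start
    return len(quals)


def trim_read(seq, quals, head=19, tail=15, window=4, min_mean_q=20, min_len=50):
    """Crop fixed head/tail, apply the sliding window, then enforce a minimum length."""
    if len(seq) <= head + tail:
        return None
    seq = seq[head:len(seq) - tail]
    quals = quals[head:len(quals) - tail]
    keep = sliding_window_cut(quals, window, min_mean_q)
    seq, quals = seq[:keep], quals[:keep]
    if len(seq) < min_len:
        return None  # read discarded as too short
    return seq, quals


# Hypothetical 150-base reads with per-base PHRED scores.
good_read = trim_read("A" * 150, [30] * 150)
decaying_read = trim_read("A" * 150, [30] * 100 + [10] * 50)
short_read = trim_read("A" * 80, [30] * 80)

for name, result in [("good", good_read), ("decaying", decaying_read), ("short", short_read)]:
    print(name, "discarded" if result is None else f"kept {len(result[0])} bases")
```

With 150-base reads, the fixed crops alone remove 34 bases, so the minimum-length filter mainly affects reads whose quality drops early.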

Host Calling. Bacterial sequences were retrieved from the IMG/M database (JGI) by using the query “honey bee” and complemented with the sequences from Ellegaard and Engel (27) (SI Appendix, Table S15). CRISPR spacers were predicted from these bacterial sequences using MINCED (version 0.2.0) (57). This CRISPR-spacer collection was subsequently blasted on the nucleotide level against a database containing the retrieved bacteriophage sequences, using the blastN algorithm with the additional settings -ungapped and -perc_identity 100. These settings are more conservative than usual (58), but were selected to achieve the highest possible specificity at the expense of sensitivity. In parallel, tRNA genes were predicted from the retrieved bacteriophage sequences using Aragorn (version 1.2.38) (59) and blasted against the bacterial sequences using the blastN algorithm with an e-value cutoff of 1e-5. To estimate how many of the retrieved viral sequences are derived from the environment rather than reflecting true bee-gut bacteriophages, an additional analysis using CRISPR spacers from the CRISPRdb (29) was run using the same blastN parameters as before. A concatenated protein alignment for the bacterial sequences was created with Anvi’o (version 5) (41), using the “phylogenomics” workflow. The resulting alignment was trimmed using trimAl (version 1.2), using the gappyout preset. Protein models were calculated with ProtTest3 (version 3.4.2) (60), and the phylogeny was created using RAxML (version 8.2.12) (55) under the LG + I + G + F model. The resulting tree was visualized using ggtree (version 3.10) (61).
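The idea behind the conservative spacer search described above (ungapped matches at 100% identity) can be illustrated with exact substring matching in Python. This is a stand-in for the BLAST search and not a reimplementation of it: it requires the full spacer to match on either strand, which is an assumption stricter than a local alignment, and the sequences, host labels, and the function name `assign_hosts` are invented for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def assign_hosts(spacers_by_host, contigs):
    """Link a contig to a host when a full spacer matches exactly on either strand."""
    hits = {}
    for contig_id, contig_seq in contigs.items():
        for host, spacers in spacers_by_host.items():
            for spacer in spacers:
                if spacer in contig_seq or reverse_complement(spacer) in contig_seq:
                    hits.setdefault(contig_id, set()).add(host)
    return hits


# Hypothetical spacers predicted from two bacterial genomes.
spacer_a = "ACGTTGCAAGGCTAGCTTAC"
spacer_b = "GGATCCTTAGCAATCGGTCA"
spacers_by_host = {"host_A": [spacer_a], "host_B": [spacer_b]}

# Hypothetical viral contigs: forward match, reverse-strand match, one mismatch.
contigs = {
    "contig_1": "TTGACC" + spacer_a + "GGTACA",
    "contig_2": "CCATG" + reverse_complement(spacer_b) + "TTAGC",
    "contig_3": "TTGACC" + "ACGTTGCAAGGCTAGCTAAC" + "GGTACA",
}

hosts = assign_hosts(spacers_by_host, contigs)
print(hosts)
```

The third contig differs from the first spacer at a single position and is left unassigned, which shows how the strict setting trades sensitivity for specificity.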

Functional Analysis. Putative viral genes were predicted with Prodigal, using the bacterial genetic code (translation table 11). Resulting proteins were clustered using CD-HIT (version 4.8.1) (62) with a threshold of 50% amino acid (AA) similarity. Bacterial genes (from the aforementioned bacterial dataset) were predicted and clustered via the same pipeline. Representative protein sequences were analyzed using InterProScan (version 5.30–69.0) (31) and eggNOG-mapper (version 1.0.2) (32). Downstream analysis was performed with REVIGO (35) and the KEGG Pathway Maps (36). Prophage regions were identified in the bacterial contigs using PHASTER (34). Antimicrobial functions were extracted from the InterProScan and eggNOG-mapper output and complemented using antiSMASH (version 5.0.0) (37) output. SNPs were called using freebayes (version 1.2.0) (63) on the filtered bam files using flags -X, -u, and -p 1. Resulting VCF files were filtered for a quality threshold of 20. SNP statistics were subsequently calculated using SNPGenie (38).
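The VCF quality filter mentioned above can be sketched in plain Python. The sketch assumes that "a quality threshold of 20" means keeping records whose QUAL field (the sixth tab-separated column of a VCF record) is at least 20, and that records with a missing QUAL (".") are dropped; the paper does not state either detail, and the records and the function name `filter_vcf_lines` are invented for the example.

```python
def filter_vcf_lines(lines, min_qual=20.0):
    """Keep header lines and records whose QUAL column is at least `min_qual`."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # meta-information and column header lines
            continue
        qual = line.rstrip("\n").split("\t")[5]
        if qual != "." and float(qual) >= min_qual:
            kept.append(line)
    return kept


# Hypothetical VCF content.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "contig_1\t120\t.\tA\tG\t35.2\t.\tDP=40",
    "contig_1\t300\t.\tC\tT\t12.0\t.\tDP=8",
    "contig_2\t45\t.\tG\tA\t.\t.\tDP=3",
    "contig_2\t77\t.\tT\tC\t20\t.\tDP=25",
]

filtered = filter_vcf_lines(vcf_lines)
records_kept = [line for line in filtered if not line.startswith("#")]
print(f"{len(records_kept)} of 4 records pass the quality filter")
```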

Statistics. The species accumulation curve was calculated using the “specaccum” function within vegan, R version 3.5.3 (52), using all 102 sequenced pools. The differences in coding density and strand shift frequency between viral and bacterial genes, as well as the difference in contig length between clustered contigs and singleton contigs, were calculated with Python using the two-tailed Mann–Whitney U test implemented in the SciPy library. For the coding density and strand shift frequency analysis, 24,420 predicted viral genes and 58,704 predicted bacterial genes were used. For the clustered-contig versus singleton-contig length difference, 1,034 clustered contigs and 3,010 singleton contigs were used. Correlations were calculated with the Spearman’s rank-order correlation implemented in the SciPy library. For the correlation between coverage and length, 4,842 putative viral contigs were used. Correlations between branch length distance and node distances in the network were calculated using 224,714 pairs.
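The logic of a species accumulation curve can be illustrated with a small Python sketch. It uses random permutations of the pool order, which is one common way to build such a curve; the paper used the “specaccum” function in the R package vegan and does not state which accumulation method was chosen, so this is an assumption. The pool contents and the function name `species_accumulation` are invented for the example.

```python
import random


def species_accumulation(pools, permutations=100, seed=0):
    """Mean number of distinct contigs seen after 1, 2, ..., n pools, over random pool orders."""
    rng = random.Random(seed)
    totals = [0] * len(pools)
    for _ in range(permutations):
        order = list(pools)
        rng.shuffle(order)
        seen = set()
        for i, pool in enumerate(order):
            seen |= pool
            totals[i] += len(seen)
    return [t / permutations for t in totals]


# Hypothetical presence data: the set of viral contigs detected in each pool.
pools = [
    {"c1", "c2"},
    {"c2", "c3"},
    {"c4"},
    {"c1", "c4", "c5"},
]

curve = species_accumulation(pools, permutations=200, seed=1)
last_gain = curve[-1] - curve[-2]
print("accumulation curve:", [round(v, 2) for v in curve])
print("gain from the final pool:", round(last_gain, 2))
```

A curve that is still rising when the last pool is added, as reported in the Discussion, indicates that further sampling would be expected to reveal additional viral sequences.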

Data Accessibility. Retrieved prokaryotic viral sequences larger than 5 kb were submitted to NCBI GenBank (accession numbers available in SI Appendix, Table S19, available on GitHub). Raw reads were deposited in NCBI’s Sequence Read Archive (SRA) database under project accession no. PRJNA579886 (SRA accession numbers are also available in SI Appendix, Table S19, available on GitHub). Analysis notebooks have been deposited on GitHub (https://github.com/Matthijnssenslab/beevir). All intermediate result files and outputs generated, as well as the fasta sequences for nucleotides and proteins, are also available through the GitHub repository. A complete overview of the wet laboratory work and data-processing pipeline is given in SI Appendix, Fig. S13.

ACKNOWLEDGMENTS. This project was funded by the Belgian Federal Public Service Health, Food Chain Safety and Environment (grant RF 16/6306 ViroBee). We thank the EpiloBEE study (European Commission) for providing access to the samples. L.B. was funded by Research Foundation Flanders (FWO). C.K.Y. was funded by the Interfaculty Council for the Development Cooperation (IRO) from the KU Leuven. The computing power for this work was provided by the Flemish Supercomputer Centre (VSC) and financed by the FWO and the Department of Economy, Science and Innovation (EWI) of the Flemish government.

1. D. Vanengelsdorp, M. D. Meixner, A historical review of managed honey bee populations in Europe and the United States and the factors that may affect them. J. Invertebr. Pathol. 103, S80–S95 (2010).

2. N. Forfert et al., Parasites and pathogens of the honeybee (Apis mellifera) and their influence on inter-colonial transmission. PLoS One 10, e0140337 (2015).

3. A. Fünfhaus, J. Ebeling, E. Genersch, Bacterial pathogens of bees. Curr. Opin. Insect Sci. 26, 89–96 (2018).

4. A. J. McMenamin, M. L. Flenniken, Recently identified bee viruses and their impact on bee pollinators. Curr. Opin. Insect Sci. 26, 120–129 (2018).

5. E. A. D. Mitchell et al., A worldwide survey of neonicotinoids in honey. Science 358, 109–111 (2017).

6. K. M. Ellegaard, P. Engel, Genomic diversity landscape of the honey bee gut microbiota. Nat. Commun. 10, 446 (2019).

7. T. Regan et al., Characterisation of the British honey bee metagenome. Nat. Commun. 9, 4995 (2018).


8. F. J. Lee, D. B. Rusch, F. J. Stewart, H. R. Mattila, I. L. G. Newton, Saccharide breakdown and fermentation by the honey bee gut microbiome. Environ. Microbiol. 17, 796–815 (2015).

9. M. H. Haydak, Honey bee nutrition. Annu. Rev. Entomol. 15, 143–156 (1970).

10. L. Kešnerová et al., Disentangling metabolic functions of bacteria in the honey bee gut. PLoS Biol. 15, e2003467 (2017).

11. H. Zheng, J. E. Powell, M. I. Steele, C. Dietrich, N. A. Moran, Honeybee gut microbiota promotes host weight gain via bacterial metabolism and hormonal signaling. Proc. Natl. Acad. Sci. U.S.A. 114, 4775–4780 (2017).

12. P. W. Maes, P. A. P. Rodrigues, R. Oliver, B. M. Mott, K. E. Anderson, Diet-related gut bacterial dysbiosis correlates with impaired development, increased mortality and Nosema disease in the honeybee (Apis mellifera). Mol. Ecol. 25, 5439–5450 (2016).

13. P. Engel et al., The bee microbiome: Impact on bee health and model for evolution and ecology of host-microbe interactions. MBio 7, e02164–e15 (2016).

14. T. S. Brady et al., Bacteriophages as an alternative to conventional antibiotic use for the prevention or treatment of Paenibacillus larvae in honeybee hives. J. Invertebr. Pathol. 150, 94–100 (2017).

15. B. D. Merrill, J. H. Grose, D. P. Breakwell, S. H. Burnett, Characterization of Paenibacillus larvae bacteriophages and their genomic relationships to firmicute bacteriophages. BMC Genomics 15, 745 (2014).

16. M. Breitbart, C. Bonnain, K. Malki, N. A. Sawaya, Phage puppet masters of the marine microbial realm. Nat. Microbiol. 3, 754–766 (2018).

17. G. Trubl et al., Soil viruses are underexplored players in ecosystem carbon processing. mSystems 3, e00076-18 (2018).

18. Z. Erez et al., Communication between viruses guides lysis-lysogeny decisions. Nature 541, 488–493 (2017).

19. R. M. Dedrick et al., Engineered bacteriophages for treatment of a patient with a disseminated drug-resistant Mycobacterium abscessus. Nat. Med. 25, 730–733 (2019).

20. B. Bakhshinejad, S. Ghiasvand, Bacteriophages in the human gut: Our fellow travelers throughout life and potential biomarkers of health or disease. Virus Res. 240, 47–55 (2017).

21. A. Jacques et al.; EPILOBEE Consortium, A pan-European epidemiological study reveals honey bee colony survival depends on beekeeper education and disease control. PLoS One 12, e0172591 (2017).

22. N. Conceição-Neto et al., Modular approach to customise sample preparation procedures for viral metagenomics: A reproducible protocol for virome analysis. Sci. Rep. 5, 16532 (2015).

23. S. Roux, F. Enault, B. L. Hurwitz, M. B. Sullivan, VirSorter: Mining viral signal from microbial genomic data. PeerJ 3, e985 (2015).

24. B. Buchfink, C. Xie, D. H. Huson, Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).

25. S. Van Dongen, Graph clustering via a discrete uncoupling process. SIAM J. Matrix Anal. Appl. 30, 121–141 (2008).

26. A. L. Grazziotin, E. V. Koonin, D. M. Kristensen, Prokaryotic virus orthologous groups (pVOGs): A resource for comparative genomics and protein family annotation. Nucleic Acids Res. 45, D491–D498 (2017).

27. K. M. Ellegaard, P. Engel, New reference genome sequences for 17 bacterial strains of the honey bee gut microbiota. Microbiol. Resour. Announc. 7, e00834-18 (2018).

28. M. D. Lee, GToTree: A user-friendly workflow for phylogenomics. Bioinformatics 35, 4162–4164 (2019).

29. I. Grissa, G. Vergnaud, C. Pourcel, The CRISPRdb database and tools to display CRISPRs and to generate dictionaries of spacers and repeats. BMC Bioinformatics 8, 172 (2007).

30. H. Bin Jang et al., Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat. Biotechnol. 37, 632–639 (2019).

31. A. L. Mitchell et al., InterPro in 2019: Improving coverage, classification and access toprotein sequence annotations. Nucleic Acids Res. 47, D351–D360 (2019).

32. J. Huerta-Cepas et al., Fast genome-wide functional annotation through orthologyassignment by eggNOG-mapper. Mol. Biol. Evol. 34, 2115–2122 (2017).

33. J. Huerta-Cepas et al., eggNOG 5.0: A hierarchical, functionally and phylogeneticallyannotated orthology resource based on 5090 organisms and 2502 viruses. NucleicAcids Res. 47, D309–D314 (2019).

34. D. Arndt et al., PHASTER: A better, faster version of the PHAST phage search tool.Nucleic Acids Res. 44, W16–W21 (2016).

35. F. Supek, M. Bošnjak, N. Skunca, T. Smuc, REVIGO summarizes and visualizes long listsof gene ontology terms. PLoS One 6, e21800 (2011).

36. M. Kanehisa, S. Goto, KEGG: Kyoto encyclopedia of genes and genomes. Nucleic AcidsRes. 28, 27–30 (2000).

37. K. Blin et al., antiSMASH 5.0: Updates to the secondary metabolite genome miningpipeline. Nucleic Acids Res. 47, W81–W87 (2019).

38. C. W. Nelson, L. H. Moncla, A. L. Hughes, SNPGenie: Estimating evolutionary parameters to detect natural selection using pooled next-generation sequencing data. Bioinformatics 31, 3709–3711 (2015).

39. H. F. Boncristiani, Jr, G. Di Prisco, J. S. Pettis, M. Hamilton, Y. P. Chen, Molecular approaches to the analysis of deformed wing virus replication and pathogenesis in the honey bee, Apis mellifera. Virol. J. 6, 221 (2009).

40. D. Amgarten, L. P. P. Braga, A. M. da Silva, J. C. Setubal, MARVEL, a tool for prediction of bacteriophage sequences in metagenomic bins. Front. Genet. 9, 304 (2018).

41. A. M. Eren et al., Anvi’o: An advanced analysis and visualization platform for ’omics data. PeerJ 3, e1319 (2015).

42. J. B. Emerson et al., Host-linked soil viral ecology along a permafrost thaw gradient. Nat. Microbiol. 3, 870–880 (2018).

43. A. C. Gregory, O. Zablocki, A. Howell, B. Bolduc, M. B. Sullivan, The human gut virome database. bioRxiv:10.1101/655910 (2 July 2019).

44. H. Enav, Y. Mandel-Gutfreund, O. Béjà, Comparative metagenomic analyses reveal viral-induced shifts of host metabolism towards nucleotide biosynthesis. Microbiome 2, 9 (2014).

45. Q. Zeng, S. W. Chisholm, Marine viruses exploit their host’s two-component regulatory system in response to resource limitation. Curr. Biol. 22, 124–128 (2012).

46. J. De Smet et al., High coverage metabolomics analysis reveals phage-specific alterations to Pseudomonas aeruginosa physiology during infection. ISME J. 10, 1823–1835 (2016).

47. A. M. Bolger, M. Lohse, B. Usadel, Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).

48. A. Bankevich et al., SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).

49. B. D. Ondov, N. H. Bergman, A. M. Phillippy, Interactive metagenomic visualization in a Web browser. BMC Bioinformatics 12, 385 (2011).

50. H. Li, R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

51. E. Paradis, K. Schliep, Ape 5.0: An environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35, 526–528 (2019).

52. J. Oksanen et al., vegan: Community Ecology Package. https://cran.r-project.org/web/packages/vegan/index.html. Accessed 30 September 2019.

53. T. Nepusz, H. Yu, A. Paccanaro, Detecting overlapping protein complexes in protein-protein interaction networks. Nat. Methods 9, 471–472 (2012).

54. T. P. Peixoto, The Graph-Tool Python Library (2017). https://figshare.com/articles/graph_tool/1164194. Accessed 8 October 2019.

55. A. Stamatakis, RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).

56. J. Huerta-Cepas, F. Serra, P. Bork, ETE 3: Reconstruction, analysis, and visualization of phylogenomic data. Mol. Biol. Evol. 33, 1635–1638 (2016).

57. C. Bland et al., CRISPR recognition tool (CRT): A tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 8, 209 (2007).

58. R. A. Edwards, K. McNair, K. Faust, J. Raes, B. E. Dutilh, Computational approaches to predict bacteriophage-host relationships. FEMS Microbiol. Rev. 40, 258–272 (2016).

59. D. Laslett, B. Canback, ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 32, 11–16 (2004).

60. D. Darriba, G. L. Taboada, R. Doallo, D. Posada, ProtTest 3: Fast selection of best-fit models of protein evolution. Bioinformatics 27, 1164–1165 (2011).

61. G. Yu, T. T.-Y. Lam, H. Zhu, Y. Guan, Two methods for mapping and visualizing associated data on phylogeny using ggtree. Mol. Biol. Evol. 35, 3041–3043 (2018).

62. L. Fu, B. Niu, Z. Zhu, S. Wu, W. Li, CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).

63. E. Garrison, G. Marth, Haplotype-based variant detection from short-read sequencing. arXiv:1207.3907 (21 September 2019).

Deboutte et al. PNAS | May 12, 2020 | vol. 117 | no. 19 | 10519
