BARM and BalticMicrobeDB, a reference metagenome and interface to meta-omic data for the Baltic Sea

Johannes Alneberg, John Sundh, Christin Bennke, Sara Beier, Daniel Lundin, Luisa W. Hugerth, Jarone Pinhassi, Veljo Kisand, Lasse Riemann, Klaus Jürgens, Matthias Labrenz, Anders F. Andersson

Affiliations

KTH Royal Institute of Technology, Science for Life Laboratory, School of Biotechnology, Stockholm, Sweden
Johannes Alneberg, Anders F. Andersson, Luisa W. Hugerth

Centre for Ecology and Evolution in Microbial Model Systems, Linnaeus University, Kalmar, Sweden
Daniel Lundin, Jarone Pinhassi

Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
John Sundh

Leibniz Institute for Baltic Sea Research, Warnemünde, Germany
Christin Bennke, Sara Beier, Klaus Jürgens, Matthias Labrenz

Centre for Translational Microbiome Research, Department of Molecular, Tumor and Cell Biology, Karolinska Institutet, Science for Life Laboratory, Solna, Sweden
Luisa W. Hugerth

University of Tartu, Institute of Technology, Tartu, Estonia
Veljo Kisand

Marine Biological Section, Department of Biology, University of Copenhagen, Helsingør, Denmark
Lasse Riemann


Abstract

The Baltic Sea is one of the world’s largest brackish water bodies and is characterised by pronounced physicochemical gradients where microbes are the main biogeochemical catalysts. Meta-omic methods provide rich information on the composition of, and activities within, microbial ecosystems, but are computationally heavy to perform. We here present the BAltic Sea Reference Metagenome (BARM), complete with annotated genes to facilitate further studies with much less computational effort. The assembly is constructed using 2.6 billion metagenomic reads from 81 water samples, spanning both spatial and temporal dimensions, and contains 6.8 million genes that have been annotated for function and taxonomy. The assembly is useful as a reference, facilitating taxonomic and functional annotation of additional samples by simply mapping their reads against the assembly. This capability is demonstrated by the successful mapping and annotation of 24 external samples. In addition, we present a public web interface, BalticMicrobeDB, for interactive exploratory analysis of the dataset.

Background & Summary

The Baltic Sea is a semi-enclosed inland sea characterized by strong physicochemical gradients, in particular horizontal and vertical salinity and oxygen gradients, and pronounced seasonal dynamics [1]. This ecosystem is also heavily impacted by anthropogenic eutrophication, manifested in e.g. harmful phytoplankton blooms and large areas with anoxic bottom waters [2]. Due to their key roles in biogeochemical cycles, microbial communities are particularly interesting to study in this ecosystem [3–11]. One of the most comprehensive methods to characterize the taxonomic and functional composition of microbial communities is through metagenomics, and specifically by metagenomic assembly, which enables high precision and sensitivity for both taxonomic and functional annotation [12]. These annotations can be quantified in individual samples by mapping short reads from samples that either were included in the assembly or constitute external samples. For some microbiomes, particularly those associated with the human body, extensive sequencing efforts have been undertaken to construct reference gene catalogues that are publicly available and can be utilized by others [13–15]. Large-scale metagenomic datasets also exist for the global ocean, such as the Tara Oceans dataset [15]. However, although the brackish Baltic Sea is composed of a mixture of marine- and freshwater-like lineages [5,7,10], these are genetically distinct from their relatives [8], which hinders efficient read mapping to fresh- and marine-water metagenomes.


We here present a Baltic Sea metagenome co-assembly (BARM; BAltic sea Reference Metagenome) with annotated genes constructed from three sets of samples, selected to cover variation over geography, depth and season (Table 1, Fig. 1; Data Citation 1). After preprocessing of the reads, the 81 samples combined contained 586 billion bases in 2.6 billion read pairs. To allow the assembly of genes also from genomes having low abundance in individual samples, data from all samples were co-assembled. The resulting co-assembly consisted of 14 billion bases distributed over 22 million contigs. Out of these contigs, 2.4 million were longer than or equal to 1 kilobase. Functional and taxonomic annotation of genes is computationally demanding. For this reason, and since longer contigs were deemed to be more trustworthy, only genes found on contigs of at least 1 kilobase were subjected to functional and taxonomic annotation; 6.8 million genes were found on these. For functional analysis, several database sources were chosen: Pfam [16], TIGRFAM (http://www.jcvi.org/cgi-bin/tigrfams/index.cgi), EggNOG [17] and dbCAN [18]. Additionally, enzyme commission (EC) numbers [19] were extracted based on the EggNOG assignments. Through mapping, the short reads were then used to quantify the individual genes over all the different samples, and the counts were summarized per annotation identifier (ID) for each respective annotation source. The mapping rates for the different sample groups and annotation sources are summarized in Fig. 2, where 24 samples from a published metagenomic study [20] (Data Citation 2) of the Baltic Sea are also included to illustrate the capability of BARM to work as a reference gene catalogue for the Baltic Sea.

Along with the dataset, a public web interface (BalticMicrobeDB) was constructed to facilitate exploratory analysis of the data (https://barm.scilifelab.se). Through this interface it is possible to view counts of functional and taxonomic annotations over the different sample groups. Moreover, it is possible to search for functional annotations based on their descriptive texts and choose to view or download the counts for only those matching the search query. The annotated assembly presented here is a rich resource for further exploitation of the published datasets, facilitated through the web interface, but could also function as a reference metagenome assembly for the Baltic Sea, decreasing the computational demands for the analysis of new metagenome and metatranscriptome samples, and serve as reference for metaproteome analyses.


Methods

Sampling, DNA Extraction and Sequencing

The 37 surface-water (2 m depth) samples from the 2012 time series (March to December) from the Linnaeus Microbial Observatory (LMO) station, located 10 km off the east coast of Öland where the maximum depth is 47 m, have been described in Hugerth et al. (2015) [8] (Data Citation 3). Briefly, after prefiltration through 3.0 μm, DNA was extracted from 0.2 μm Sterivex™ cartridge filters (Millipore) using the protocol described in Riemann et al. (2000) [21] and sequenced on one HiSeq high-output flowcell with an average of 31.9 million paired-end reads per sample.

The 30 transect samples were taken during a cruise initiated by the Leibniz Institute for Baltic Sea Research, Warnemünde, on the R/V Alkor, carried out for the EU-BONUS BLUEPRINT project from June 4 to June 17, 2014. Samples for DNA analyses were collected using a compact CTD (profiling instrument that records conductivity, oxygen, temperature and depth) SBE 911 Plus with an SBE-rosette SBE 32 (SeaBird Electronics Inc., USA) equipped with 18 x 10 L FreeFlow-PWS-samplers (HYDRO-BIOS, Kiel, Germany). Water was sampled from oxic zones, in the range from 2 to 242 m depth, within the salinity gradient of the Baltic Sea. For DNA analysis, 1 L of seawater was directly filtered onto a 47 mm Durapore membrane filter with 0.2 µm pore size (GVWP04700, Merck Millipore, Darmstadt, Germany) by a vacuum of <300 mbar. Subsequently, the filters were folded, flash frozen using liquid nitrogen and stored at -80 °C until further processing. DNA was extracted using a modified protocol of the QIAamp DNA Mini Kit (51304, Qiagen, Hilden, Germany) with an initial bead-beating step and a cleanup and concentration process using the Zymo gDNA Clean and Concentrator Kit (D4010, Zymo Research Europe, Freiburg, Germany). The concentration and quality of the eluted DNA were assured by gel electrophoresis and the Bioanalyzer DNA 12000 kit (5067-1508, Agilent Technologies, Santa Clara, USA). The samples were sequenced at the National Genomics Infrastructure at Science for Life Laboratory, Stockholm, Sweden, using a full HiSeq 2500 high-output flowcell producing an average of 69.5 million paired-end reads per sample.

The redoxcline samples consist of samples from station Boknis Eck [22], located at the entrance of the Eckernförde Bay in the southwestern Baltic Sea, and from station TF0271 at the Gotland Deep in the eastern Gotland Basin. The Boknis Eck station was sampled on September 23, 2014 on the R/V Littorina during routine monitoring activities performed monthly by the GEOMAR Helmholtz Centre for Ocean Research Kiel. Due to windy conditions before the sampling day, the water at the Boknis Eck station was mixed over most of the water column and only the bottom water was sulfidic. Water was sampled from the mixed oxygenated layer and from the sulfidic bottom water, and was captured on 3 μm pore size membrane filters (Whatman, Maidstone, UK) followed by 0.2 μm pore size Sterivex-GV filters (Millipore, Billerica, Massachusetts, USA). The Gotland Basin was sampled during the cruise EMB087 on the R/V Elisabeth Mann Borgese on October 18 and October 26, 2014. The samples from October 18 were taken in the context of an experiment close to the oxic-anoxic interface from suboxic and anoxic water layers and were captured directly on 0.2 μm pore size Durapore membrane filters (Whatman, Maidstone, UK). The samples from October 26 were taken to cover different zones in the redox gradient (suboxic, oxic-anoxic interface, upper sulfidic, lower sulfidic) and were captured first on 3 μm pore size membrane filters (Whatman, Maidstone, UK) followed by 0.2 μm pore size Sterivex-GV filters.

DNA from water captured on 3 μm pore size membrane filters and 0.2 μm Sterivex-GV filters was extracted using the QIAamp DNA Mini Kit (Qiagen, Hilden, Germany): ATL buffer was added to filter pieces together with 200 μm low-binding Zirconium beads (OPS Diagnostics, Lebanon, NY, USA) and the suspension was vortexed for 5 minutes at maximum speed. Subsequently, proteinase K was added and the suspension was incubated for approximately 1 h at 56 °C before continuing DNA extraction by following the manufacturer’s instructions. Nucleic acids from Gotland Basin water sampled on October 18 on 0.2 μm pore size membrane filters were extracted using the AllPrep DNA/RNA Mini Kit (Qiagen, Hilden, Germany). As before, filters were vortexed together with Zirconium beads in RLT buffer before continuing nucleic acid extraction by following the manufacturer’s protocol. The concentration and quality of the eluted DNA were assured by gel electrophoresis. The samples were sequenced on a single HiSeq 2500 lane producing an average of 20.7 million paired-end reads per sample.

All sequencing libraries (including LMO) were prepared with the Rubicon ThruPlex kit (Rubicon Genomics, Ann Arbor, Michigan, USA) according to the instructions of the manufacturer.

Preprocessing and Assembly

The quality of the reads was checked and visualized with FastQC [23] through MultiQC [24], and low-quality bases were trimmed with cutadapt [25] using Phred score 15 as a cutoff. Adapter sequences were also removed using cutadapt, keeping only read pairs where both reads in the pair were longer than 31 bases. Preprocessed reads were then assembled using Megahit [26] version 1.0.2 with default parameters, including k-mer sizes 21, 41, 61, 81 and 99.

Exclusively for the 30 samples from the transect cruise, genomic material (20 ng per L of seawater) from a known genome of Thermus thermophilus (strain HB8), which is not expected to be present in the Baltic Sea naturally, was added after filtration but prior to the DNA extraction, serving as an internal standard to enable absolute quantifications. Aligning all contigs from the metagenome assembly against this reference genome showed that 84.1% of the genome was recovered within contigs aligning with an average identity of 99.82%. These additional genome contigs were kept in the reference assembly, but reads aligning to the reference genome were filtered out before the quantification steps and before uploading the processed reads to the European Nucleotide Archive (ENA) (Data Citation 4).

Functional Annotation

Genes were predicted on all contigs using Prodigal [27] version 2.6.3 with the ‘--meta’ tag, which potentially uses different coding tables for different contigs. Genes located on contigs longer than or equal to 1 kilobase, identified with the script toolbox/scripts/fasta_lengths.py, were used for functional and taxonomic annotation. For functional annotation, the databases EggNOG [17], Pfam [16], TIGRFAM (http://www.jcvi.org/cgi-bin/tigrfams/index.cgi) and dbCAN [18] were chosen. Furthermore, EC numbers [19] were extracted from the EggNOG annotations.

To annotate genes with EggNOG [17] ids, the EggNOG hmm file for all organisms, NOG.hmm.tar.gz, version 4.5, was downloaded from http://eggnogdb.embl.de/download/eggnog_4.5/data/NOG/. For performance reasons, hmmsearch was used instead of hmmscan [28], initially removing all hits with an E-value > 0.0001. To select a maximum of one annotation per gene, the hit with the highest score was chosen using the script toolbox/scripts/hmmer_filtering/keep_top_score.py. Information about each annotation was downloaded from http://eggnogdb.embl.de/download/eggnog_4.5/data/NOG/NOG.annotations.tsv.gz.

An Enzyme Commission (EC) number [19] was assigned to each EggNOG through the Uniprot [29] proteins included in the EggNOG model, if a majority of its EC-assigned members were assigned to that EC. Note that proteins could have multiple EC numbers assigned and therefore some EggNOGs were assigned multiple EC numbers. The files needed for the conversion were eggnog4.protein_id_conversion.tsv.gz (downloaded from http://eggnogdb.embl.de/download/eggnog_4.5/ on January 9th 2017) and NOG.members.tsv.gz (downloaded from http://eggnogdb.embl.de/download/eggnog_4.5/data/NOG/ on January 9th 2017). The protein id conversion file gives EC numbers per reference protein and the members file gives the reference proteins that build each model. The protein with taxa id 400682 and protein id “PAC” was removed from the protein id conversion file since it had 695 EC entries, and likewise the protein with taxa id 7070 and protein id “TCOGS2”, with 686 EC entries. The protein id with the third most entries had 6 entries and therefore the two others were deemed outliers. The suspected reason is that these entries belong to different genes for these genomes, but there was no way to resolve this and the EC-number assignment for each EggNOG was deemed not to be affected by it. Given the assignment of EC numbers per EggNOG, the assignment per gene was done with toolbox/scripts/assign_ec_from_nog.py.

Annotation against the dbCAN [18] (DataBase for automated Carbohydrate-active enzyme ANnotation) database was performed using version 5 (downloaded from http://csbl.bmb.uga.edu/dbCAN/download.php). Following the instructions from dbCAN (downloaded from http://csbl.bmb.uga.edu/dbCAN/download/readme.txt), hmmscan [28] was used with the option --domtblout and the result was further treated with the recommended script hmmscan-parser.sh from dbCAN (a copy of the script used is available within toolbox/third_party_scripts/dbcan/hmmscan-parser.sh), requiring a covered fraction of the HMM larger than 0.3 and keeping long alignments (>80 amino acids) if the E-value was less than 1e-5 and short alignments if the E-value was less than 1e-3. An additional script, toolbox/hmmer_filtering/dbcan_strict_filtering.py, was applied, choosing to follow the dbCAN recommendations for bacteria and keeping annotations with an E-value less than 1e-18 and alignment coverage greater than 0.35. To allow for more than a single domain within a gene, any annotation which fulfilled these criteria was kept. Information about each annotation was collected (downloaded from http://csbl.bmb.uga.edu/dbCAN/download/FamInfo.txt).

Annotation against Pfam [16] version 30.0 was conducted with the script pfam_scan.pl supplied at ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools for version 28.0, using hmmer version 3.1b1 [28]. To allow for more than a single domain within a gene, any annotation which fulfilled these criteria was kept. Information about each annotation was collected as columns 1, 2 and 4 from the file pfamA.txt.gz downloaded from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam30.0/database_files/ on January 11th 2017.

Annotation against TIGRFAM version 15 was performed using hmmsearch (v. 3.1b2) [28] with the --cut_tc argument to filter models by trusted cutoff. For each protein sequence, the best scoring HMM was selected using hmmparse.py, available at https://github.com/johnne/biotools/blob/master/scripts/hmmparse.py
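The majority rule used to transfer EC numbers from reference proteins to EggNOGs can be illustrated with a minimal sketch. This is not the code in toolbox/scripts; the function name `assign_ecs_to_nogs` and the in-memory dictionaries are hypothetical stand-ins for the parsed members and protein id conversion files, and "majority" is assumed here to mean strictly more than half of the members that carry at least one EC number.

```python
from collections import Counter


def assign_ecs_to_nogs(nog_members, protein_ecs):
    """Return {nog_id: sorted list of EC numbers} using a majority vote.

    nog_members: {nog_id: [protein_id, ...]}   (hypothetical parsed members file)
    protein_ecs: {protein_id: [ec, ...]}       (hypothetical parsed conversion file)
    """
    nog_ecs = {}
    for nog_id, members in nog_members.items():
        # Only members that carry at least one EC number take part in the vote.
        annotated = [p for p in members if protein_ecs.get(p)]
        if not annotated:
            continue
        # A protein with several EC numbers votes once for each of them.
        votes = Counter(ec for p in annotated for ec in set(protein_ecs[p]))
        # Assumption: "majority" means strictly more than half of the voters.
        majority = sorted(ec for ec, n in votes.items() if n > len(annotated) / 2)
        if majority:
            nog_ecs[nog_id] = majority
    return nog_ecs


if __name__ == "__main__":
    members = {"NOG1": ["a", "b", "c", "d"], "NOG2": ["e", "f"]}
    ecs = {
        "a": ["1.1.1.1"],
        "b": ["1.1.1.1", "2.7.7.7"],
        "c": ["2.7.7.7"],
        "e": ["3.1.1.1"],
        "f": ["4.1.1.1"],
    }
    print(assign_ecs_to_nogs(members, ecs))
```

Because a protein can vote for several EC numbers, one EggNOG can end up with more than one EC number, as noted in the text. An EggNOG whose annotated members disagree evenly receives none.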


Taxonomic annotation

The method used to assign taxonomy was chosen in order to assign as many contigs as possible to a taxonomy while still keeping false positives to a low level. As the number of sequences in reference databases closely related to the genomes in our samples was expected to be low [8], amino acid sequences from the assembly were used to compare against other amino acid sequences in the reference database, enabling higher sensitivity (due to the more conserved nature of amino acid sequences). This comparison was done using Diamond version 0.8.26 [30] with the parameters “--seg yes”, “--sensitive” and “--top 10” against the NCBI nr database downloaded December 2nd 2016. The code used to assign taxonomy from the Diamond search was based on an original available in the DESMAN package [31] and the modified version of the code is available as the script toolbox/scripts/taxonomy_from_genes_to_contigs/lca_per_contig.py.

The assignment was done as follows: all reported hits from the Diamond search were given a weight based on the aligned fraction of the query and the percentage identity of the alignment. At each taxonomic level, if the sum of the weights for one taxon was greater than half the sum of all weights, the gene was assigned to that taxon as long as the percentage identity was high enough. The levels for the percentage identity were set to 40% at superkingdom level, 50% at phylum level, 60% at class level, 70% at order level, 80% at family level, 90% at genus level and 95% at species level. Taxonomic assignments were set per contig to the most detailed level where consensus for at least 50% of the weights of the preliminary gene assignments could be achieved. Genes without taxonomic annotation were ignored. The shared assignment was propagated to all genes present on that contig. In this way, all genes present on one contig will always share the taxonomic assignment. If no single superkingdom accounted for a majority of the gene assignment weights for a contig, the contig was left unassigned.
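The gene-level step of this procedure can be sketched as follows. This is an illustration under stated assumptions, not the logic of lca_per_contig.py: the exact weight formula is not given in the text, so the sketch assumes weight = aligned fraction × identity / 100, and it assumes that "percentage identity high enough" refers to the best identity among the hits supporting the winning taxon. The function name `assign_gene` and the hit dictionaries are hypothetical.

```python
from collections import defaultdict

LEVELS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]
# Identity thresholds (percent) per level, as listed in the text.
MIN_IDENTITY = {
    "superkingdom": 40, "phylum": 50, "class": 60, "order": 70,
    "family": 80, "genus": 90, "species": 95,
}


def hit_weight(hit):
    # Assumed weight: aligned fraction of the query times fractional identity.
    return hit["aligned_fraction"] * hit["identity"] / 100


def assign_gene(hits):
    """Return the lineage (list of taxa, top level first) supported by the hits.

    Each hit is a dict with 'lineage' ({level: taxon}), 'identity' (percent)
    and 'aligned_fraction' (0 to 1).
    """
    total = sum(hit_weight(h) for h in hits)
    assignment = []
    for level in LEVELS:
        weights = defaultdict(float)
        best_identity = defaultdict(float)
        for h in hits:
            taxon = h["lineage"].get(level)
            if taxon is None:
                continue  # hit carries no taxon at this level
            weights[taxon] += hit_weight(h)
            best_identity[taxon] = max(best_identity[taxon], h["identity"])
        if not weights:
            break
        taxon, weight = max(weights.items(), key=lambda kv: kv[1])
        # The winning taxon needs more than half of all hit weights and a
        # sufficiently high identity; otherwise stop at the previous level.
        if weight > total / 2 and best_identity[taxon] >= MIN_IDENTITY[level]:
            assignment.append(taxon)
        else:
            break
    return assignment


if __name__ == "__main__":
    hits = [
        {"lineage": {"superkingdom": "Bacteria", "phylum": "Proteobacteria",
                     "class": "Alphaproteobacteria"},
         "identity": 92.0, "aligned_fraction": 1.0},
        {"lineage": {"superkingdom": "Bacteria", "phylum": "Proteobacteria",
                     "class": "Gammaproteobacteria"},
         "identity": 60.0, "aligned_fraction": 0.5},
        {"lineage": {"superkingdom": "Bacteria", "phylum": "Proteobacteria",
                     "class": "Alphaproteobacteria"},
         "identity": 55.0, "aligned_fraction": 0.8},
    ]
    print(assign_gene(hits))
```

The contig-level consensus described in the text would apply the same "more than half of the weights" idea to the preliminary gene assignments on each contig; it is omitted here to keep the sketch short.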

Quantification and Normalization

To use the metagenome assembly as a reference assembly, individual samples are functionally and taxonomically annotated by quantifying the different annotations present in the assembly. This is done by mapping all short reads against the assembly and quantifying genes, and thereby any associated annotation, with the number of reads mapping to them. More specifically, bowtie2 [32] version 2.2.6 was used with the parameter “--local” for mapping, duplicated reads were removed with picard version 1.118, bam-file sorting was done with Samtools [33] version 1.3, and the htseq-count script from htseq [34] version 0.6.1 was used to get raw counts per gene.


Counts per annotation were obtained by summing all counts for genes annotated with each respective annotation. When quantifying annotation types where multiple annotations were allowed for a single gene (dbCAN and Pfam), some genes contributed several times to the quantities. This was kept in order to facilitate analysis of differential abundance for the individual annotations. Along with raw counts of reads for each annotation type and taxonomy, a count normalized by gene length and number of reads was also calculated. Borrowing the formula and the term from transcriptomics, we calculate TPM (Transcripts Per Million) values [35]:

$$\mathrm{TPM}_g = \frac{r_g \cdot \mathit{rl} \cdot 10^{6}}{\mathit{fl}_g \cdot T}$$

$$T = \sum_{g \in G} \frac{r_g \cdot \mathit{rl}}{\mathit{fl}_g}$$

where $r_g$ is the number of reads mapped to gene $g$ from the sample, $\mathit{rl}$ is the average read length for the sample, $\mathit{fl}_g$ is the length of the gene and $G$ is the set of all genes. $T$ is a convenience variable for the indicated sum over all genes.
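As a worked illustration of the TPM formula and of the per-annotation summation, the following minimal sketch computes TPM values per gene and then sums them per annotation. The function names `tpm` and `sum_per_annotation`, the gene ids and the toy numbers are hypothetical and are not taken from the pipeline repositories.

```python
def tpm(read_counts, gene_lengths, read_length):
    """TPM per gene: r_g * rl * 1e6 / (fl_g * T), with T = sum_g r_g * rl / fl_g."""
    rates = {g: read_counts[g] * read_length / gene_lengths[g] for g in read_counts}
    total = sum(rates.values())  # the convenience variable T
    if total == 0:
        return {g: 0.0 for g in rates}
    return {g: rate * 1e6 / total for g, rate in rates.items()}


def sum_per_annotation(gene_values, gene_annotations):
    """Sum per-gene values over every annotation attached to each gene.

    A gene with several annotations (as allowed for dbCAN and Pfam)
    contributes its value once to each of them.
    """
    totals = {}
    for gene, value in gene_values.items():
        for annotation in gene_annotations.get(gene, ()):
            totals[annotation] = totals.get(annotation, 0.0) + value
    return totals


if __name__ == "__main__":
    counts = {"gene1": 100, "gene2": 300, "gene3": 0}
    lengths = {"gene1": 1000, "gene2": 3000, "gene3": 500}
    per_gene = tpm(counts, lengths, read_length=100)
    annotations = {"gene1": ["PF00124"], "gene2": ["PF00124", "PF00001"]}
    print(per_gene)
    print(sum_per_annotation(per_gene, annotations))
```

Since the read length appears in both the numerator and in T, it cancels for a single sample. The TPM values of all genes in a sample sum to one million, whereas per-annotation sums can exceed that when genes carry several annotations.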

Code availability

Code used to preprocess reads, assemble contigs and annotate genes is publicly available at https://github.com/EnvGen/BLUEPRINT_pipeline, containing the pipeline definition of the workflows used, https://github.com/EnvGen/snakemake-workflows, where the snakemake rules are specified in order to build the command used for each step, and the branch BARM_publication of https://github.com/EnvGen/toolbox, for custom scripts. Scripts within the latter repository that have been used have been indicated throughout the text.

Data Records

The preprocessed sequencing reads from the Transect and Redoxcline samples were submitted to ENA hosted by EMBL-EBI under the study accession number PRJEB22997 (Data Citation 4). The raw reads from LMO were published elsewhere [8] and are accessible at NCBI (Data Citation 3). Contig, gene and protein sequences from the co-assembly of the Transect, Redoxcline and LMO samples, as well as quantification tables, contextual data for the samples, and the annotations for each gene are accessible on Figshare (Data Citation 1). The raw sequencing reads from the external samples used for evaluation were also published elsewhere [36] and are accessible at NCBI (Data Citation 2).

Technical Validation

The mapping rates for all samples included in the reference assembly are shown in Fig. 2, where the majority of samples included in the assembly reach a level above 80%. This serves as a validation of the completeness of the metagenome assembly. The fraction of reads that did not map to the co-assembly, and were hence not assembled past the 200 bases length cutoff, most likely originates from low-abundance species, or species with high intraspecies diversity generating fragmented assemblies. The mapping rate of the external samples shows the capability of this assembly to serve as a reference metagenome assembly for the Baltic Sea. These external samples [36] were collected in a different year (2011) and at a station (58.82 N 17.63 E) separate from where the samples included in the assembly were taken. This represents a realistic scenario where BARM is used as a reference metagenome for the Baltic Sea. The mapping rates vary with the filter fractions, where reads originating from the largest (3.0-200 μm) and smallest (<0.1 μm) fractions displayed lower rates than the two intermediate fractions (0.1-0.8 μm and 0.8-3.0 μm), indicating that picoplankton are better represented in BARM than larger eukaryotic plankton and viruses.

Assignment rates for different annotation types, as shown in Fig. 3, are in the majority of cases below 10% of the total number of reads, which is expected since only genes on long contigs (representing 40% of the bases of the total assembly) were predicted and subjected to annotation. The fraction of reads annotated among reads mapping to genes included in the annotation procedure reaches well over 30% for Pfam and shows the generality of that database as compared to e.g. dbCAN, a much more niched resource, which reaches only around 2% of reads mapping to genes included in the annotation.

The functional annotation was further validated through an NMDS plot (Fig. 4) based on the EggNOG annotations of the transect data. Depth was found to be negatively correlated with the first dimension (Spearman’s rank correlation ρ = -0.73, P = 5.4e-06) and salinity was negatively correlated with the second dimension (Spearman’s rank correlation ρ = -0.77, P = 2.4e-06). These two environmental parameters have previously been found to correlate strongly with the microbial community in the Baltic Sea [5], which strengthens our trust in the EggNOG annotations. Furthermore, analyzing a single annotation with a known function, namely the photosynthetic reaction centre protein (PF00124), we could see a strong negative correlation with sampling depth over the thirty transect samples (Spearman correlation coefficient ρ = -0.87, P = 3.1e-10).

The taxonomic annotation was validated by inspecting the taxonomic profile of the transect samples. The same dominant prokaryotic taxonomic groups were observed as in previous pan-Baltic amplicon sequencing and metagenomic studies [5,7,10,11], and the overall trends were conserved, with an increase in Alpha- and Gammaproteobacteria and a decrease in Actinobacteria and Betaproteobacteria with increasing salinity levels (Fig. 5). Among the predicted proteins in BARM, 98% lacked hits with amino acid identities above 95%, hence potentially representing species for which sequenced genomes are lacking [37]. 31% of the sequences lacked significant hits (E-value > 1) and potentially correspond to novel protein families.

Usage Notes

A publicly available repository at https://github.com/EnvGen/BARM_tools hosts instructions and a pipeline on how to quantify genes and their annotations within BARM for any kind of Baltic Sea metagenomic and metatranscriptomic samples. The web interface BalticMicrobeDB, available to the public at http://barm.scilifelab.se, can be used to explore and access data for the three sample sets that the assembly is based upon. At the index page, the user can choose whether to access functional annotations or taxonomic annotations. For the functional annotations, the user can select specific annotation sources and identifiers and select the sample groups for which the counts will be displayed. Furthermore, a text search over the identifiers and the descriptions of the annotations can be used to create a custom table of counts over the selected samples. For taxonomic annotations, counts for the top level superkingdom are first presented but the user can unfold a taxonomic tree to select any taxon to view counts for.

Acknowledgements

This work resulted from the BONUS Blueprint project supported by BONUS (Art 185), funded jointly by the EU and the Swedish Research Council FORMAS, The Federal Ministry of Education and Research (BMBF), Estonian Research Council, and Danish Council for Independent Research. It was also funded by the Swedish Research Council VR (grant 2011-5689) through a grant to A.F.A. Computations were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) through the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX). DNA sequencing was conducted at the Swedish National Genomics Infrastructure (NGI) at Science for Life Laboratory (SciLifeLab) in Stockholm. The Redoxcline samples were made possible through generous support from the Boknis Eck coastal time series station (http://www.bokniseck.de) run by the Chemical Oceanography Research Unit at GEOMAR.

Author Contributions

J.A., J.S. and A.F.A. designed the analysis workflow. J.A., D.L. and A.F.A. designed the BalticMicrobeDB structure. J.A. and J.S. implemented the analysis workflow. V.K. tested and improved the workflow. J.A. implemented BalticMicrobeDB. J.A. and J.S. conducted the analysis. C.B., S.B., M.L., and K.J. conducted sampling and extracted DNA. J.A. and A.F.A. wrote the manuscript with contributions from all authors. All authors read and approved the final version of the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Anders F. Andersson.


Figures

Figure 1: A map showing the locations for all stations where samples were taken.

Figure 1. The three sample groups included in the assembly (Transect, LMO and Redoxcline) are displayed together with the external sample set [20] (External), all groups indicated with different markers. The colour of the marker indicates the salinity of the water sample while the size indicates the depth at which it was taken. The background color indicates depth (from white to dark blue), with contour lines drawn with 50 m intervals. The map was generated using the Marmap package [38] in R [39] with bathymetric data from the ETOPO1 dataset hosted on the NOAA server [40].


Figure 2: Mapping rates divided on different sample groups.

Figure 2. Mapping rates are calculated by Bowtie2 [32] as the “overall alignment rate”. The three first sample groups; LMO 2012 (N=37, 0.2-3.0 μm), Baltic Transect 2014 (N=30, >0.2 μm) and Baltic Redoxcline 2014 (N=6, 0.2-3.0 μm; N=6, >3.0 μm; N=2, >0.2 μm) were included in the assembly, while the four last sample groups; External Samples <0.1 μm (N=6), External Samples 0.1-0.8 μm (N=6), External Samples 0.8-3.0 μm (N=6) and External Samples 3.0-200 μm (N=6) were not. The size intervals of the external samples indicate filter pore sizes used to tentatively separate viruses, free-living prokaryotes, and small and larger particles as well as Eukaryotic cells, respectively [36]. Created using Matplotlib [41] and Seaborn [42].


Figure 3: Fraction of reads mapping to genes annotated with respective database.

Figure 3. Only genes identified on contigs longer than 1 kilobase were subjected to annotation, defining the ‘included genes’ category. N=81 for all categories. Created using Matplotlib [41] and Seaborn [42].


Figure 4: NMDS of EggNOG annotations.

Figure 4. Non-metric dimensional scaling (NMDS) of the 30 samples included in the Transect sample group based on EggNOG annotation. Samples are colored and sized according to salinity and depth, respectively. Created using Matplotlib [41] and Seaborn [42].


Figure 5: Taxonomic profiles of the 10 transect samples obtained from surface waters.

Figure 5. Numbers on x-axis indicate salinity, given in practical salinity units (PSU), and are sorted with increasing salinity to the right. Created using Matplotlib [41] and Seaborn [42].

References 1. Snoeijs-Leijonmalm,P.&Andrén,E.WhyistheBalticSeasospecialtolivein?

BiologicalOceanographyoftheBalticSea23–84(Springer,Dordrecht,2017).2. Blenckner,T.,Österblom,H.,Larsson,P.,Andersson,A.&Elmgren,R.BalticSea

ecosystem-basedmanagementunderclimatechange:Synthesisandfuturechallenges.Ambio44Suppl3,507–515(2015).

3. Riemann,L.etal.Thenativebacterioplanktoncommunityinthecentralbalticseaisinfluencedbyfreshwaterbacterialspecies.Appl.Environ.Microbiol.74,503–515(2008).

4. Andersson,A.F.,Riemann,L.&Bertilsson,S.PyrosequencingrevealscontrastingseasonaldynamicsoftaxawithinBalticSeabacterioplanktoncommunities.ISMEJ.4,171–181(2010).

5. Herlemann,D.P.etal.Transitionsinbacterialcommunitiesalongthe2000kmsalinitygradientoftheBalticSea.ISMEJ.5,1571–1579(2011).

6. Thureborn,P.etal.AmetagenomicstransectintothedeepestpointoftheBalticSearevealsclearstratificationofmicrobialfunctionalcapacities.PLoSOne8,e74983(2013).

Page 18: BARM and BalticMicrobeDB, a reference metagenome and ...1205852/FULLTEXT01.pdf · Moreover, it is possible to search for functional annotations based on their descriptive texts and

18

7. Dupont, C. L. et al. Functional tradeoffs underpin salinity-driven divergence in microbial community composition. PLoS One 9, e89549 (2014).

8. Hugerth, L. W. et al. Metagenome-assembled genomes uncover a global brackish microbiome. Genome Biol. 16, 279 (2015).

9. Lindh, M. V. et al. Disentangling seasonal bacterioplankton population dynamics by high-frequency sampling. Environ. Microbiol. 17, 2459–2476 (2015).

10. Hu, Y. O. O., Karlson, B., Charvet, S. & Andersson, A. F. Diversity of Pico- to Mesoplankton along the 2000 km Salinity Gradient of the Baltic Sea. Front. Microbiol. 7, 679 (2016).

11. Herlemann, D. P. R., Lundin, D., Andersson, A. F., Labrenz, M. & Jürgens, K. Phylogenetic Signals of Salinity and Season in Bacterial Community Composition Across the Salinity Gradient of the Baltic Sea. Front. Microbiol. 7, 1883 (2016).

12. Darzi, Y., Falony, G., Vieira-Silva, S. & Raes, J. Towards biome-specific analysis of meta-omics data. ISME J. 10, 1025–1028 (2016).

13. Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).

14. Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834–841 (2014).

15. Sunagawa, S. et al. Structure and function of the global ocean microbiome. Science 348, 1261359 (2015).

16. Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279–85 (2016).

17. Huerta-Cepas, J. et al. EGGNOG 4.5: A hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 44, D286–D293 (2016).

18. Yin, Y. et al. dbCAN: a web resource for automated carbohydrate-active enzyme annotation. Nucleic Acids Res. 40, W445–51 (2012).

19. Webb, E. C. & Others. Enzyme nomenclature 1992. Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes. (Academic Press, 1992).

20. Asplund-Samuelsson, J. et al. Diversity and Expression of Bacterial Metacaspases in an Aquatic Ecosystem. Front. Microbiol. 7, 1043 (2016).

21. Riemann, L., Steward, G. F. & Azam, F. Dynamics of bacterial community composition and activity during a mesocosm diatom bloom. Appl. Environ. Microbiol. 66, 578–587 (2000).

22. Bange, H. W. & Malien, F. Hydrochemistry from time series station Boknis Eck from 1957 to 2014. doi:10.1594/PANGAEA.855693 (2015).

23. Andrews, S. & Others. FastQC: a quality control tool for high throughput sequence data. (2010).

24. Ewels, P., Magnusson, M., Lundin, S. & Käller, M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048 (2016).

25. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12 (2011).

26. Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).

27. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).

28. Eddy, S. R. Accelerated Profile HMM Searches. PLoS Comput. Biol. 7, e1002195 (2011).

29. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–12 (2015).

30. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).

31. Quince, C. et al. DESMAN: a new tool for de novo extraction of strains from metagenomes. Genome Biol. 18, 181 (2017).

32. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

33. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

34. Anders, S., Pyl, P. T. & Huber, W. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).

35. Wagner, G. P., Kin, K. & Lynch, V. J. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131, 281–285 (2012).

36. Larsson, J. et al. Picocyanobacteria containing a novel pigment gene cluster dominate the brackish water Baltic Sea. ISME J. 8, 1892–1903 (2014).

37. Konstantinidis, K. T. & Tiedje, J. M. Towards a genome-based taxonomy for prokaryotes. J. Bacteriol. 187, 6258–6264 (2005).

38. Pante, E. & Simon-Bouhet, B. marmap: A package for importing, plotting and analyzing bathymetric and topographic data in R. PLoS One 8, e73051 (2013).

39. The R Core-team. R: A language and environment for statistical computing. (R Foundation for Statistical Computing, 2014).

40. Amante, C. & Eakins, B. W. ETOPO1 1 arc-minute global relief model: procedures, data sources and analysis. (US Department of Commerce, National Oceanic and Atmospheric Administration, National Environmental Satellite, Data, and Information Service, National Geophysical Data Center, Marine Geology and Geophysics Division Colorado, 2009).

41. Hunter, J. D. Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng. 9, 90–95 (2007).

42. Waskom, M. et al. seaborn v0.7.0. Zenodo, https://doi.org/10.5281/zenodo.45133 (2016).

Data Citations

1. Alneberg, J. & Andersson, A. F. Figshare https://doi.org/10.6084/m9.figshare.c.3831631 (2017) preview-link: https://figshare.com/s/3171d2e7c0f470fc488a

2. NCBI BioProject PRJNA322246 (2016)

3. NCBI BioProject PRJNA273799 (2015)

4. European Nucleotide Archive PRJEB22997 (2017)