Reproducible Bioinformatics Project: A community for reproducible bioinformatics analysis pipelines
Neha Kulkarni1, Luca Alessandrì1, Riccardo Panero1, Maddalena Arigoni1, Martina Olivero2, Francesca Cordero3$, Marco Beccuti3 and Raffaele A Calogero1$

1 Dept. of Molecular Biotechnology and Health Sciences, University of Torino, Torino, Italy
2 Dept. of Oncology, University of Torino, Candiolo, Italy
3 Dept. of Computer Sciences, University of Torino, Torino, Italy

Neha Kulkarni [email protected]
Luca Alessandrì [email protected]
Riccardo Panero [email protected]
Maddalena Arigoni [email protected]
Martina Olivero [email protected]
Francesca Cordero [email protected]
Marco Beccuti [email protected]
Raffaele A Calogero [email protected]

$ Corresponding author

bioRxiv preprint first posted online Dec. 26, 2017; doi: http://dx.doi.org/10.1101/239947. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.

Abstract
Background: Reproducibility of research is a key element of modern science and is mandatory for any industrial application. It represents the ability to replicate an experiment independently of the location and the operator. Therefore, a study can be considered reproducible only if all the data used are available and the computational analysis workflow is clearly described. However, to reproduce a complex bioinformatics analysis today, the raw data and a list of the tools used in the workflow might not be enough to guarantee the reproducibility of the results obtained. Indeed, different releases of the same tools and/or of the system libraries used by such tools might lead to subtle reproducibility issues.
Results: To address this challenge, we established the Reproducible Bioinformatics Project (RBP), a non-profit and open-source project whose aim is to provide a schema and an infrastructure, based on docker images and an R package, to deliver reproducible results in Bioinformatics. One or more Docker images are defined for a workflow (typically one for each task), while the workflow implementation is handled via R functions embedded in a package available in a github repository. Thus, a bioinformatician participating in the project first has to integrate her/his workflow modules into Docker image(s), exploiting an Ubuntu docker image developed ad hoc by RBP to make this task easier. Secondly, the workflow implementation must be realized in R according to an R-skeleton function made available by RBP to guarantee homogeneity and reusability among different RBP functions. Moreover, she/he has to provide an R vignette explaining the package functionality, together with an example dataset that can be used to improve user confidence in using the workflow.
Conclusions: Reproducible Bioinformatics Project provides a general schema and an infrastructure to distribute robust and reproducible workflows. Thus, it guarantees final users the ability to repeat any analysis consistently, independently of the UNIX-like architecture used.
Keywords
Reproducible research, docker, whole transcriptome sequencing, miRNA sequencing, ChIP sequencing, community, SNV.
Background
Recently Baker and Lithgow [1, 2] highlighted the problem of reproducibility in research. Reproducibility issues affect, to different extents, a large portion of the scientific fields [1]. Since bioinformatics nowadays plays an important role in many biological and medical studies [3], a great effort must be made to make such computational analyses reproducible [4, 5]. Reproducibility issues in bioinformatics might be due to the short half-life of bioinformatics software, the complexity of the pipelines, the uncontrolled effects induced by changes in the system libraries, the incompleteness or imprecision of workflow descriptions, etc. To deal with reproducibility issues in Bioinformatics, Sandve [5] suggested ten good practice rules for the development of a computational workflow (Table 1). A community that fulfils some of the rules suggested by Sandve is the Bioconductor [6] project, which provides version control for a large number of genomics/bioinformatics packages. In this way, old releases of any Bioconductor package are kept available for users. However, Bioconductor does not cover all the steps of every possible bioinformatics workflow, e.g. in an RNAseq workflow the fastq trimming and alignment steps are generally done using tools not implemented in Bioconductor. BaseSpace [7, 8] and Galaxy [9] are examples of commercial and open-source solutions, respectively, which partially fulfil Sandve's rules. Furthermore, the workflows implemented in such environments cannot be heavily customized, e.g. BaseSpace has strict rules for application submission. Moreover, cloud applications such as BaseSpace have to cope with legal and ethical issues [10]. On the other hand, Galaxy does not provide standardized metadata to annotate workflows.
Recently, container technology, a lightweight OS-level virtualization, was explored in the area of Bioinformatics to ease the distribution, use and maintenance of bioinformatics software [11-13]. Indeed, since applications and their dependencies are packaged together in the container image, users do not have to download and install all the dependencies required by an application, thus avoiding all the cases where the dependencies are not well documented or not available at all. Moreover, problems related to version conflicts or updates of the system libraries do not occur, because the containers are isolated from the rest of the operating system.
Among the available container platforms, Docker (http://www.docker.com) is becoming the de facto standard environment to quickly compose, create, deploy, scale and oversee containerized applications under Linux. Its strengths are a high degree of portability, which allows users to register and share containers over various hosts in private and public repositories, more effective resource use, and faster deployment compared with other software.
Although Menegidio [13], da Veiga [11] and Kim [12] provided a large collection of bioinformatics instruments based on Docker technology, we are still missing a community delivering to bioinformaticians a controlled but flexible framework to distribute Docker-based workflows under the umbrella of a reproducibility framework. Here, we describe the implementation of the Reproducible Bioinformatics Project (RBP, http://reproducible-bioinformatics.org/), which aims to distribute to the bioinformatics community docker-based applications under the reproducibility framework proposed by Sandve [5]. RBP accepts simple docker implementations of bioinformatics software (e.g. a docker image embedding the bwa aligner tool), implementations of complex pipelines involving the use of multiple docker images (e.g. an RNAseq workflow providing all the steps for an analysis, from the quality control of the fastq files to differential expression), as well as demonstrative workflows (i.e. docker images embedding the full bioinformatics workflow used in a publication) intended to provide the ability to reproduce published data.
Implementation
The Reproducible Bioinformatics Project (RBP) reference web page is reproducible-bioinformatics.org. The project is based on three modules (Figure 1): (i) the docker4seq R package (https://github.com/kendomaniac/docker4seq), (ii) docker images (https://hub.docker.com/u/repbioinfo/), and (iii) 4SeqGUI (https://github.com/mbeccuti/4SeqGUI).
The docker4seq package provides the connection between users and docker containers. Docker4seq is organized in two branches: stable and development. A module (R function(s)/docker container(s)) moves from the development to the stable branch when it fulfils the 10 rules suggested by Sandve [5] for good bioinformatics practice (Table 1).
The function skeleton.R in docker4seq provides a prototype for building a docker controlling function. Acknowledgment of the developer's work is provided within the structure of skeleton.R, which includes a field indicating the developer's affiliation and contact email. An Ubuntu image is available in the docker image repository docker.io/repbioinfo as a prototype for the creation of a docker image compliant with the RBP specifications. Developers are free to use this prototype or to adapt a different Linux docker distribution for their application. Docker images designed by the core developers of RBP are located in docker.io/repbioinfo (docker.com), whereas images developed by third parties can be placed in any public-access docker repository. RBP requires that any operation involving the use of R/Bioconductor packages or of external software be implemented in a docker container. Only reformatting actions, e.g. table assembly, data reordering, etc., can be handled outside a docker image.
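As a rough illustration of the rule above, a controlling function can do little more than assemble and launch a `docker run` command, so that the analysis itself executes inside the container. The Python sketch below only builds the argument list and does not start any container. The image name, tag, mount point and script path are hypothetical placeholders, not actual RBP images, and the code is not taken from docker4seq (whose controlling functions are written in R).

```python
from pathlib import PurePosixPath


def build_docker_command(image, tag, host_dir, command, container_dir="/data"):
    """Assemble a `docker run` argument list for one analysis step.

    Only the command line is built here; nothing is executed.
    """
    host_dir = PurePosixPath(host_dir)
    if not host_dir.is_absolute():
        raise ValueError("host_dir must be an absolute path")
    return [
        "docker", "run",
        "--rm",                               # remove the container when it exits
        "-v", f"{host_dir}:{container_dir}",  # mount the data folder in the container
        f"{image}:{tag}",                     # pin an explicit image tag
        *command,                             # command executed inside the container
    ]


# Hypothetical image, tag and script, used only for illustration.
cmd = build_docker_command(
    image="example/aligner",
    tag="2017.01",
    host_dir="/home/user/fastq",
    command=["sh", "/bin/run_step.sh", "/data"],
)
print(" ".join(cmd))
```

Pinning the tag explicitly, rather than relying on a default tag, is what ties a given run to one specific image release.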
Any new RBP module (R function(s)/docker image(s)) must be associated with an explanatory vignette, accessible online as an html document, and with a set of test data, also accessible online. Thus, all the instruments needed to gain confidence in the module functionalities are provided to the final user.
Docker images are labelled with the extension YYYY.NN, where YYYY is the year of insertion in the stable version and NN is a progressive number. YYYY changes only if the program(s) implemented in the docker image are updated, because any such update will affect the reproducibility of the workflow. Previous version(s) will remain available in the repository. NN refers to changes in the docker image that do not affect the reproducibility of the workflow.
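The labelling scheme can be made concrete with a small sketch that parses a YYYY.NN label and checks whether two labels differ only in the NN part, i.e. whether they are expected to give the same results. This is an illustration written in Python, not part of docker4seq. The two-digit width of NN is an assumption read off the notation, and the function names are hypothetical.

```python
import re
from typing import NamedTuple

# Assumption: NN is written with exactly two digits, as the notation suggests.
TAG_PATTERN = re.compile(r"^(\d{4})\.(\d{2})$")


class ImageTag(NamedTuple):
    year: int      # YYYY: changes when the embedded program(s) are updated
    revision: int  # NN: changes that do not affect reproducibility


def parse_tag(tag: str) -> ImageTag:
    """Split a YYYY.NN label into its two numeric parts."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"tag {tag!r} does not follow the YYYY.NN scheme")
    return ImageTag(int(match.group(1)), int(match.group(2)))


def same_reproducibility_class(tag_a: str, tag_b: str) -> bool:
    """True when two labels differ at most in NN, so results should match."""
    return parse_tag(tag_a).year == parse_tag(tag_b).year


print(parse_tag("2017.01"))
print(same_reproducibility_class("2017.01", "2017.02"))
print(same_reproducibility_class("2017.02", "2018.01"))
```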
A new module can be submitted to info@reproducible-bioinformatics.org, and the RBP core team will verify its compliance with Sandve's [5] rules. Once validated, the R functions controlling the new module are inserted in the docker4seq stable release. Partially validated modules will be placed in the development branch and moved to the stable one when compliance with Sandve's rules is fulfilled.
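The branch assignment described here amounts to a simple checklist decision. The Python sketch below expresses it under the assumption that compliance is recorded as the set of fulfilled rule numbers from Table 1. In practice the verification is carried out by the RBP core team, so the function and its name are purely hypothetical.

```python
SANDVE_RULES = tuple(range(1, 11))  # rule numbers 1..10, as listed in Table 1


def assign_branch(fulfilled_rules):
    """Return the target branch and the list of rules still missing.

    A module goes to 'stable' only when all ten rules are fulfilled;
    otherwise it stays in 'development'.
    """
    fulfilled = set(fulfilled_rules)
    unknown = fulfilled - set(SANDVE_RULES)
    if unknown:
        raise ValueError(f"unknown rule numbers: {sorted(unknown)}")
    missing = [rule for rule in SANDVE_RULES if rule not in fulfilled]
    branch = "stable" if not missing else "development"
    return branch, missing


print(assign_branch(range(1, 11)))
print(assign_branch([1, 2, 3, 4, 10]))
```

Returning the missing rules alongside the branch gives a submitter a direct indication of what remains to be done before promotion.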
4SeqGUI is a Java-based graphical interface to docker4seq functions. It is designed to provide a GUI to users with limited knowledge of R scripting. Currently the GUI embeds only general-purpose workflows, such as the RNAseq, miRNAseq and ChIPseq workflows.
Results
The stable branch of the docker4seq R package contains all the R functions required to handle all the steps of the RNAseq workflow (Fig. 2A), the ChIPseq workflow (Fig. 2B), and the miRNAseq workflow (Fig. 2C). Docker4seq also provides a wrapper function for the Illumina bcl2fastq tool, to convert the Illumina sequencer output into demultiplexed fastq files (Fig. 2). The fastq files can then be handled with any of the three workflows. The counts table produced by the RNAseq or miRNAseq workflows can be used for data visualization (pca, principal component analysis function), to evaluate the statistical power of the experiment (experimentPower function), to define the optimal sample size of the experiment for the detection of differentially expressed genes (sampleSize function) and to detect differentially expressed genes/transcripts (wrapperDeseq2 function). Sample size/statistical power estimation and differential expression are calculated via the RnaSeqSampleSize [14] and DESeq2 [15] Bioconductor packages, respectively.
In the development branch, the main effort of the core developers is focused on providing workflows for DNA and RNA somatic variant calling. The DNA variant calling workflow embeds the pre-processing procedure suggested by the GATK best practice (Fig. 3A). RNAseq data preparation for variant calling (Fig. 3C) requires the use of the STAR 2-step procedure [16], which provides significantly increased sensitivity to novel splice junctions. Then, after sorting and duplicate marking, OPOSSUM [17] is used to remove intronic regions and to merge overlapping reads. We have also implemented a specific procedure (Fig. 3B), based on the xenome software [18], to discriminate between human reads and mouse host reads in the sequences produced by the analysis of patient-derived xenografts (PDX, [19]). As part of the somatic variant calling workflow we are implementing MUTECT 1 and 2 [20] (Fig. 4A) to call somatic variants, as well as PLATYPUS [21] to extract information on joint-sample SNVs (Fig. 4B).
We are also expanding the RNAseq module by adding the reference-free Salmon aligner [22], which uses less memory for the alignment task than STAR while providing similar results [23].
Finally, the HashClone framework (accepted for publication in BMC Bioinformatics), a new suite of bioinformatics tools providing B-cell clonality assessment and minimal residual disease (MRD) monitoring over time from deep sequencing data, was integrated in the docker4seq package. In particular, a parallel version of the standard HashClone workflow (Fig. 5) was developed exploiting the docker architecture.
All the modules described above are implemented in 18 docker images deposited in the docker hub (https://hub.docker.com/u/repbioinfo/).
As part of the RBP we have also developed a GUI, 4SeqGUI (https://github.com/mbeccuti/4SeqGUI). The GUI is implemented in Java and can be used to run the whole transcriptome sequencing workflow (Fig. 2A), the ChIP sequencing workflow (Fig. 2B), and the miRNA sequencing workflow (Fig. 2C).
Discussion
Bioinformatics workflows are becoming an essential part of many research papers. However, the absence of clear and well-defined rules on code distribution makes the results of most published research unreproducible [24]. Recently, Almugbel and coworkers [25] described an interesting infrastructure to embed Bioconductor-based packages. However, Bioconductor does not cover all the steps of every possible bioinformatics workflow, and thus provides a limited framework for developing complex pipelines. In contrast, RBP represents a new instrument that expands the idea of Almugbel [25], providing a more flexible infrastructure that allows the bioinformatics community to share their work under the guidance of rules that guarantee inter-laboratory reproducibility and do not limit docker implementations to Bioconductor packages. RBP core developers created frameworks for RNA/miRNA quantification and analysis. A ChIPseq workflow was also developed, and variant calling workflows for DNA and RNA are under active development. A distinctive feature of RBP is the acceptance of demonstrative workflows, i.e. bioinformatics procedures described in a biological/medical paper. A demonstrative workflow is wrapped in a docker image and is supported by a tutorial, which describes step by step how the analysis is done, to guarantee the reproducibility of published data.
Availability and requirements
Project name: Reproducible Bioinformatics Project
Project home page: http://reproducible-bioinformatics.org
Operating system: UNIX-like
Programming language: R
Other requirements: docker version 17.05.0-ce or higher
License: GPL.

Declarations
Competing interests
None

Funding
This work has been supported by the EPIGEN FLAG PROJECT.

Authors' contributions
NK and LA contributed equally to the development of the miRNA workflow and all the other tools. RP and FC developed the RNAseq workflow and refined the ChIPseq workflow. MA and MO performed application testing. MB and RAC developed the rules for submitting tools and workflows to the Reproducible Bioinformatics community. RAC and MB equally supervised the overall work.

Figure captions

Figure 1: Reproducible Bioinformatics Project structure.

Figure 2: Workflows available in the stable branch of docker4seq. A) Whole transcriptome sequencing workflow, B) ChIP sequencing workflow, and C) miRNA sequencing workflow. The names followed by parentheses are the docker4seq functions used to execute the analysis steps. Black indicates elements shared by more than one workflow.

Figure 3: Variant calling workflows under refinement in the development branch of docker4seq. A) SNV calling in the DNA workflow. The function snvPreprocessing requires that users provide their own copy of the GATK software, because of Broad Institute license restrictions. This function returns a sorted bam file, with duplicates marked, after GATK indel realignment and quality recalibration. B) Data preprocessing for samples derived from Patient Derived Xenografts (PDX). The xenome function discriminates between the mouse host reads and the human tumor reads; DNA or RNA SNV calling workflows can then be applied. C) SNV calling in the RNA workflow. The function star2steps generates a sorted bam, in which duplicates are marked, which is then processed by opossum for removal of intronic regions and merging of overlapping reads. The names followed by parentheses are the docker4seq functions used to execute the analysis steps. Black indicates elements shared by more than one workflow.

Figure 4: Variant calling workflows under development in the development branch of docker4seq. A) Somatic SNV detection using GATK MUTECT 1 or 2. B) Platypus-based joint mutation caller. Dashed blocks are not yet implemented.

Figure 5: HashClone pipeline. The HashClone strategy is organized in three steps. The first step (red box) detects k-mers in all patients' samples. The second step (green box) focuses on the generation of sequence signatures, leading to the identification of the set of putative clones present in each of the patients' samples. The third step (blue box) is used for the characterization and evaluation of the cancer clones.

References
1. Baker M: 1,500 scientists lift the lid on reproducibility. Nature 2016, 533(7604):452-454.
2. Lithgow GJ, Driscoll M, Phillips P: A long journey to reproducible results. Nature 2017, 548(7668):387-388.
3. Searls DB: The roots of bioinformatics. PLoS computational biology 2010, 6(6):e1000809.
4. Kanwal S, Khan FZ, Lonie A, Sinnott RO: Investigating reproducibility and tracking provenance - A genomic workflow case study. BMC bioinformatics 2017, 18(1):337.
5. Sandve GK, Nekrutenko A, Taylor J, Hovig E: Ten simple rules for reproducible computational research. PLoS computational biology 2013, 9(10):e1003285.
6. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J et al: Bioconductor: open software development for computational biology and bioinformatics. Genome biology 2004, 5(10):R80.
7. Colombo AR, J. Triche TJ, Ramsingh G: Arkas: Rapid reproducible RNAseq analysis. F1000Res 2017, 6:586.
8. Van Neste C, Gansemans Y, De Coninck D, Van Hoofstat D, Van Criekinge W, Deforce D, Van Nieuwerburgh F: Forensic massively parallel sequencing data analysis tool: Implementation of MyFLq as a standalone web- and Illumina BaseSpace((R))-application. Forensic Sci Int Genet 2015, 15:2-7.
9. Digan W, Countouris H, Barritault M, Baudoin D, Laurent-Puig P, Blons H, Burgun A, Rance B: An Architecture for Genomics Analysis in a Clinical Setting Using Galaxy and Docker. Gigascience 2017.
10. Dove ES, Joly Y, Tasse AM, Public Population Project in G, Society International Steering C, International Cancer Genome Consortium E, Policy C, Knoppers BM: Genomic cloud computing: legal and ethical points to consider. European journal of human genetics: EJHG 2015, 23(10):1271-1278.
11. da Veiga Leprevost F, Gruning BA, Alves Aflitos S, Rost HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J et al: BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics 2017, 33(16):2580-2582.
12. Kim B, Ali T, Lijeron C, Afgan E, Krampis K: Bio-Docklets: virtualization containers for single-step execution of NGS pipelines. Gigascience 2017, 6(8):1-7.
13. Menegidio FB, Jabes DL, Costa de Oliveira R, Nunes LR: Dugong: a Docker image, based on Ubuntu Linux, focused on reproducibility and replicability for bioinformatics analyses. Bioinformatics 2017.
14. Ching T, Huang S, Garmire LX: Power analysis and sample size estimation for RNA-Seq differential expression. RNA 2014, 20(11):1684-1696.
15. Love MI, Huber W, Anders S: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology 2014, 15(12):550.
16. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR: STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29(1):15-21.
17. Oikkonen L, Lise S: Making the most of RNA-seq: Pre-processing sequencing data with Opossum for reliable SNP variant detection. Wellcome Open Res 2017, 2:6.
18. Conway T, Wazny J, Bromage A, Tymms M, Sooraj D, Williams ED, Beresford-Smith B: Xenome--a tool for classifying reads from xenograft samples. Bioinformatics 2012, 28(12):i172-178.
19. Siolas D, Hannon GJ: Patient-derived tumor xenografts: transforming clinical samples into mouse models. Cancer research 2013, 73(17):5315-5319.
20. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, Gabriel S, Meyerson M, Lander ES, Getz G: Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature biotechnology 2013, 31(3):213-219.
21. Rimmer A, Phan H, Mathieson I, Iqbal Z, Twigg SRF, Consortium WGS, Wilkie AOM, McVean G, Lunter G: Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nature genetics 2014, 46(8):912-918.
22. Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C: Salmon provides fast and bias-aware quantification of transcript expression. Nature methods 2017, 14(4):417-419.
23. Zhang C, Zhang B, Lin LL, Zhao S: Evaluation and comparison of computational tools for RNA-seq isoform quantification. BMC genomics 2017, 18(1):583.
24. Hothorn T, Leisch F: Case studies in reproducibility. Briefings in bioinformatics 2011, 12(3):288-300.
25. Almugbel R, Hung LH, Hu J, Almutairy A, Ortogero N, Tamta Y, Yeung KY: Reproducible Bioconductor workflows using browser-based interactive notebooks and containers. J Am Med Inform Assoc 2017.

Tables

Table 1: Good practice bioinformatics rules, derived from Sandve et al. [5]
1 For Every Result, Keep Track of How It Was Produced
2 Avoid Manual Data Manipulation Steps
3 Archive the Exact Versions of All External Programs Used
4 Version Control All Custom Scripts
5 Record All Intermediate Results, When Possible in Standardized Formats
6 For Analyses That Include Randomness, Note Underlying Random Seeds
7 Always Store Raw Data behind Plots
8 Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
9 Connect Textual Statements to Underlying Results
10 Provide Public Access to Scripts, Runs, and Results

Figures

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5