2014 nicta-reproducibility


Description: talk at NICTA on reproducibility.

Transcript of 2014 nicta-reproducibility

1. Openness and reproducibility in computational science: tools, approaches, and thought patterns. C. Titus Brown, ctb@msu.edu. October 16, 2014.

2. Hello! Assistant Professor @ MSU; Microbiology; Computer Science; etc. => UC Davis VetMed in 2015. More information at: ged.msu.edu/ github.com/ged-lab/ ivory.idyll.org/blog/ @ctitusbrown

3. The challenges of non-model sequencing: missing or low-quality genome reference; evolutionarily distant organisms. Most extant computational tools focus on model organisms: they assume low polymorphism (internal variation), a reference genome, somewhat reliable functional annotation, and more significant compute infrastructure, and so cannot easily or directly be used on critters of interest.

4. Shotgun sequencing & assembly. [Image credits: http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/]

5. Shotgun sequencing analysis goals: Assembly (what is the text?): produces new genomes & transcriptomes; gene discovery for enzymes, drug targets, etc. Counting (how many copies of each book?): measure gene expression levels, protein-DNA interactions. Variant calling (how does each edition vary?): discover genetic variation (genotyping, linkage studies); allele-specific expression analysis.

6. Assembly:

    It was the best of times, it was the wor
    , it was the worst of times, it was the
    isdom, it was the age of foolishness
    mes, it was the age of wisdom, it was th

    =>

    It was the best of times, it was the worst of times,
    it was the age of wisdom, it was the age of foolishness

...but for lots and lots of fragments! (A toy Python sketch of this kind of overlap merging follows slide 16 below.)

7. Shared low-level fragments may not reach the threshold for assembly. Lamprey mRNAseq: [figure]

8. Assembly graphs scale with data size, not information. Conway T. C., Bromage A. J., Bioinformatics 2011;27:479-486.

9. Practical memory measurements (soil). Velvet measurements (Adina Howe). [figure]

10. Data set size and cost: $1000 gets you ~200m reads, or about 20-80 GB of data, in about a week. >1000 labs are doing this regularly. Each data set analysis is essentially custom. Analyses are data intensive and memory intensive.

11. Efficient data structures & algorithms: efficient online counting of k-mers; trimming reads on abundance; efficient De Bruijn graph representations; read abundance normalization.

12. Raw data (~10-100 GB) -> Analysis -> "Information" (~1 GB). Shotgun sequencing is massively redundant; can we eliminate redundancy while retaining information? Analog: JPEG lossy compression. (Also on the slide: compression to ~2 GB, and downstream database & integration.)

13. Sparse collections of k-mers can be stored efficiently in Bloom filters. Pell et al., 2012, PNAS; doi: 10.1073/pnas.1121464109. (A minimal Bloom filter sketch also follows slide 16 below.)

14. Data structures & algorithms papers: "These are not the k-mers you are looking for", Zhang et al., PLoS One, 2014. "Scaling metagenome sequence assembly with probabilistic de Bruijn graphs", Pell et al., PNAS, 2012. "A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data", Brown et al., arXiv:1203.4802.

15. Data analysis papers: "Tackling soil diversity with the assembly of large, complex metagenomes", Howe et al., PNAS, 2014. Assembling novel ascidian genomes & transcriptomes: Stolfi et al. (eLife 2014), Lowe et al. (in prep). "A de novo lamprey transcriptome from large scale multi-tissue mRNAseq", Scott et al., in prep.

16. Lab approach (not intentional, but working out): novel data structures and algorithms -> implement at scale -> apply to real biological problems.
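Slide 6's fragment example can be made concrete. Below is a toy Python sketch of greedy overlap assembly, assuming error-free fragments and exact overlaps; real assemblers (including those discussed in this talk) instead build De Bruijn graphs over k-mers, so this illustrates the problem, not khmer's method:

    def overlap(a, b):
        # Length of the longest suffix of a that is also a prefix of b.
        for n in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:n]):
                return n
        return 0

    def greedy_assemble(fragments):
        # Repeatedly merge the pair of fragments with the largest overlap.
        frags = list(fragments)
        while len(frags) > 1:
            n, a, b = max((overlap(x, y), x, y)
                          for x in frags for y in frags if x is not y)
            if n == 0:
                break  # no overlaps left to merge
            frags.remove(a)
            frags.remove(b)
            frags.append(a + b[n:])
        return frags

    fragments = [
        "It was the best of times, it was the wor",
        ", it was the worst of times, it was the",
        "isdom, it was the age of foolishness",
        "mes, it was the age of wisdom, it was th",
    ]
    print(greedy_assemble(fragments)[0])

Running this reconstructs the full Dickens sentence from the four fragments; with billions of short, error-containing reads, this all-pairs approach is hopeless, which is why the talk's data structures matter.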
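And a minimal sketch of slide 13's idea: a Bloom filter stores a set of k-mers in a fixed-size bit array, with no false negatives and a tunable false-positive rate. The sizes and md5-based hashing below are illustrative choices, not khmer's actual implementation:

    import hashlib

    class KmerBloomFilter:
        # Fixed-memory, probabilistic set of k-mers (no false negatives).
        def __init__(self, size=10**6, num_hashes=4):
            self.size = size
            self.num_hashes = num_hashes
            self.bits = bytearray(size // 8 + 1)

        def _positions(self, kmer):
            for seed in range(self.num_hashes):
                h = hashlib.md5(f"{seed}:{kmer}".encode()).hexdigest()
                yield int(h, 16) % self.size

        def add(self, kmer):
            for pos in self._positions(kmer):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, kmer):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(kmer))

    K = 5
    bf = KmerBloomFilter()
    seq = "ATGGCATTGACC"
    for i in range(len(seq) - K + 1):
        bf.add(seq[i:i + K])
    print("ATGGC" in bf)  # True
    print("GGGGG" in bf)  # False (with high probability)

The memory footprint is fixed up front regardless of how many reads stream past, which is the point of slide 8: the graph structure can be made to scale with information rather than data size.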
17. This leads to good things: efficient online counting of k-mers; trimming reads on abundance; efficient De Bruijn graph representations; read abundance normalization. (khmer software)

18. Current research around that same core (khmer software): streaming algorithms for assembly, variant calling, and error correction; efficient graph labeling & exploration; cloud assembly protocols; efficient search for target genes; data set partitioning approaches; assembly-free comparison of data sets; HMM-guided assembly. (Sketches of online k-mer counting and of read abundance normalization follow slide 31 below.)

19. Testing & version control: the not-so-secret sauce. High test coverage, grown over time. "Stupidity driven testing": we write tests for bugs after we find them and before we fix them. Pull requests & continuous integration: does your proposed merge break tests? Pull requests & code review: does new code meet our minimal coding (etc.) requirements? Note: spellchecking!!! (A toy regression test follows slide 31 below.)

20. Our novel research enables this: novel data structures and algorithms permit low(er)-memory data analysis and liberate analyses from specialized hardware.

21. Running entirely within the cloud: ~40 hours for the complete data set on an AWS m1.xlarge. (See PyCon 2014 talk; video and blog post.) [memory-usage figure]

22. On the novel research side: novel data structures and algorithms; permit low(er) memory data analysis; liberate analyses from specialized hardware. This last bit? => reproducibility.

23. Reproducibility! "Scientific progress relies on reproducibility of analysis." (Aristotle, Nature, 322 BCE.) "There is no such thing as reproducible science. There is only science, and not science." (someone on Twitter; Fernando Perez?)

24. Disclaimer: I am not a researcher of reproducibility, merely a practitioner. Please take my points below as an argument and not as research conclusions. (But I'm right.)

25. Replication vs reproducibility: I will not clearly distinguish them, but there are important differences. Replication: someone using the same data and the same tools gets the same results. Reproduction: someone using different data and/or different tools gets the same result. The former is much easier; the latter is much stronger. Science is failing even mere replication!? So, mostly I will talk about how we make our analyses replicable.

26. My usual intro: We practice open science! Everything discussed here: Code: github.com/ged-lab/, BSD license. Blog: http://ivory.idyll.org/blog ("titus brown blog"). Twitter: @ctitusbrown. Grants on lab Web site: http://ged.msu.edu/research.html. Preprints available. Everything is >80% reproducible.

27. (Repeat of the previous slide.)

28. My lab & the diginorm paper: all our code was already on github; much of our data analysis was already in the cloud; our figures were already made in IPython Notebook; our paper was already in LaTeX.

29. IPython Notebook: data + code => notebook.

30. My lab & the diginorm paper: all our code was already on github; much of our data analysis was already in the cloud; our figures were already made in IPython Notebook; our paper was already in LaTeX. Why not push a bit more and make it easily reproducible? This involved writing a tutorial. And that's it.

31. To reproduce our paper (repository and data URLs elided in the transcript):

    git clone ... && python setup.py install
    git clone ... && cd pipeline
    wget ... && tar xzf ...
    make && cd ../notebook && make
    cd ../ && make
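Slides 17-18 keep returning to "efficient online counting of k-mers". khmer does this with a fixed-memory, CountMin-sketch-style counting structure; here is a rough Python sketch of that idea (the table sizes and md5 hashing are illustrative, not khmer's internals):

    import hashlib

    class CountMinKmers:
        # Fixed-memory k-mer counter; counts can be over- but never
        # under-estimated.
        def __init__(self, width=10**5, depth=4):
            self.width = width
            self.tables = [[0] * width for _ in range(depth)]

        def _pos(self, kmer, row):
            h = hashlib.md5(f"{row}:{kmer}".encode()).hexdigest()
            return int(h, 16) % self.width

        def add(self, kmer):
            for row, table in enumerate(self.tables):
                table[self._pos(kmer, row)] += 1

        def count(self, kmer):
            # Taking the minimum across rows limits hash-collision inflation.
            return min(table[self._pos(kmer, row)]
                       for row, table in enumerate(self.tables))

    cm = CountMinKmers()
    for kmer in ["ATGGC", "TGGCA", "ATGGC"]:
        cm.add(kmer)
    print(cm.count("ATGGC"))  # 2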
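Slide 19's "stupidity driven testing" is easy to illustrate: when a bug surfaces, first capture it as a failing test, then fix the code so the test passes, and keep the test forever. A hypothetical example (the function and the bug are invented for illustration; they are not from khmer):

    # test_revcomp.py -- run with pytest.

    COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def reverse_complement(seq):
        # The fix: normalize case before complementing.
        return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

    def test_reverse_complement():
        assert reverse_complement("ATGC") == "GCAT"

    def test_lowercase_regression():
        # Written when the (hypothetical) bug was found -- lowercase input
        # raised KeyError -- and *before* the fix landed; it has guarded
        # against regressions on every pull request since.
        assert reverse_complement("atgc") == "GCAT"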
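The diginorm paper itself (slides 28-31; Brown et al., arXiv:1203.4802) describes a streaming algorithm: estimate each read's coverage as the median count of its k-mers in an online count table, and keep the read only if that estimate is still below a cutoff. A compact sketch, using a plain dict where khmer would use a fixed-memory counter like the one sketched above:

    from collections import defaultdict

    def diginorm(reads, k=20, cutoff=20):
        # Stream reads; keep a read only while its estimated coverage
        # (median count of its k-mers so far) is below the cutoff.
        counts = defaultdict(int)
        for read in reads:
            kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
            if not kmers:
                continue  # read shorter than k
            median = sorted(counts[km] for km in kmers)[len(kmers) // 2]
            if median < cutoff:
                for km in kmers:
                    counts[km] += 1
                yield read

Reads from over-sampled regions stop being kept once their k-mers have been seen about `cutoff` times, which is what discards redundancy while retaining information (slide 12).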
32. Now standard in lab. Our papers now have: source hosted on github; data hosted there or on AWS; long-running data analysis => make; graphing and data digestion => IPython Notebook (also in github). (Qingpeng Zhang)

33. Research process: generate new results and encode them in a Makefile -> summarize in IPython Notebook -> discuss, explore -> push to github.

34. Literate graphing & interactive exploration. [notebook screenshot] (A sketch of such a notebook cell follows at the end of the transcript.)

35. The process: We start with pipeline reproducibility, baked into lab culture: use git by default; write scripts. Community of practice! Use standard open source approaches, so OSS developers learn it easily. Enables easy collaboration within the lab. Valuable learning tool!

36. Growing & refining the process: now moving to Ubuntu Long-Term Support + install instructions. Everything is as automated as is convenient. Students are expected to communicate with me in IPython Notebooks. Trying to avoid building (or even using) new repro tools. Avoid maintenance burden as much as possible.

37. 1. Use a standard OS; provide install instructions. We provide install and execution instructions for Ubuntu Long-Term Support release 14.04: supported through 2017 and beyond. Avoid pre-configured virtual machines! They lock you into specific cloud homes, and they challenge remixability and extensibility.

38. 2. Automate. Literate graphing is now easy with knitr and IPython Notebook. Build automation with make, or whatever; to first order, it does not matter what tools you use. Explicit is better than implicit: make it easy to understand what you're doing and how to extend it.

39. k-mer counting paper (Ubuntu 14.04, git, make, IPython Notebook, LaTeX). [figure]

40. Time from publication of KAnalyze to our 100% reproducible re-evaluation? ~8 hours.

41. 3. Protocols, not pipelines. STOP HIDING THE ANALYSIS STEPS.

42. Write down what you're doing: https://khmer-protocols.readthedocs.org/

43. ...and add automated end-to-end tests; c.f. "literate ReSTing". (A sketch of the idea follows at the end of the transcript.)

44. 4. Drive sustainable software development with use cases...

45. ...that are explicit,

46. ...versioned,

47. ...and automated.

48. 5. Invest in automated, reproducible workflows. Kalamazoo metagenome protocol run on mock data from Shakya et al., 2013:

    Genome reference (%)
                Quality filtered   Diginorm   Partition   Reinflation
    Velvet      -                  80.90      83.64       84.57
    IDBA        90.96              91.38      90.52       88.80
    SPAdes      90.42              90.35      89.57       90.02

    Mis-assembled contig length
                Quality filtered   Diginorm   Partition   Reinflation
    Velvet      -                  52071358   44730449    45381867
    IDBA        21777032           20807513   17159671    18684159
    SPAdes      28238787           21506019   14247392    18851571

Also! Tip o' the hat to Michael Barton, nucleotid.es.

49. Automation enables super fun paper reviews! "What a nice new transcriptome assembler! Interesting how it doesn't perform that well on my 10 test data sets." "Hey, so you make these claims, but I ran your code, and..." [transcript truncated here]
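On the "literate graphing" workflow (slides 32-34 and 38): the long-running pipeline, driven by make, writes small summary files, and the notebook only reads those summaries and draws the paper's figures. A sketch of such a notebook cell; the file name and column names are hypothetical:

    # A typical "literate graphing" cell: `make` has already produced
    # assembly-stats.csv (hypothetical name); the notebook just plots it.
    import csv
    import matplotlib.pyplot as plt

    with open("assembly-stats.csv") as fp:
        rows = list(csv.DictReader(fp))

    x = [float(r["coverage"]) for r in rows]
    y = [float(r["n50"]) for r in rows]

    plt.plot(x, y, "o-")
    plt.xlabel("coverage")
    plt.ylabel("contig N50")
    plt.title("Assembly quality vs coverage")
    plt.savefig("figure2.pdf")  # the exact PDF pulled into the LaTeX paper

Because the notebook lives in github next to the Makefile, anyone can regenerate the figure from the summary data, or rerun the whole pipeline and then the notebook.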
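Slide 43's "literate ReSTing" means the protocol document is itself the test: extract the shell commands from a reStructuredText protocol and run them end to end, failing loudly if any step breaks. A rough sketch of the idea; the command-extraction convention below is an assumption for illustration, not how khmer-protocols actually implements it:

    # Run every indented "$ command" line in a ReST protocol, in order,
    # stopping with an error if any step fails.
    import re
    import subprocess
    import sys

    def run_protocol(rst_path):
        with open(rst_path) as fp:
            text = fp.read()
        # Assumed convention: indented lines beginning "$ " are commands.
        commands = re.findall(r"^\s+\$ (.+)$", text, flags=re.M)
        for cmd in commands:
            print("+", cmd)
            subprocess.run(cmd, shell=True, check=True)  # raises on failure

    if __name__ == "__main__":
        run_protocol(sys.argv[1])

Wiring this into continuous integration is what turns a written-down protocol into an automated end-to-end test.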