2014 manchester-reproducibility

Transcript of 2014 manchester-reproducibility

  • Six ways to Sunday: approaches to computational reproducibility in non-model system sequence analysis. C. Titus Brown ctb@msu.edu May 21, 2014
  • Hello! Assistant Professor; Microbiology; Computer Science; etc. More information at: ged.msu.edu/ github.com/ged-lab/ ivory.idyll.org/blog/ @ctitusbrown
  • The challenges of non-model sequencing: Missing or low quality genome reference. Evolutionarily distant. Most extant computational tools focus on model organisms o Assume low polymorphism (internal variation) o Assume reference genome o Assume somewhat reliable functional annotation o Require significant compute infrastructure and cannot easily or directly be used on critters of interest.
  • Shotgun sequencing & assembly http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
  • Shotgun sequencing analysis goals: Assembly (what is the text?) o Produces new genomes & transcriptomes. o Gene discovery for enzymes, drug targets, etc. Counting (how many copies of each book?) o Measure gene expression levels, protein-DNA interactions Variant calling (how does each edition vary?) o Discover genetic variation: genotyping, linkage studies o Allele-specific expression analysis.
  • Assembly: "It was the best of times, it was the wor" ", it was the worst of times, it was the" "isdom, it was the age of foolishness" "mes, it was the age of wisdom, it was th" => "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness" ...but for lots and lots of fragments!
  • Shared low-level fragments may not reach the threshold for assembly. (Lamprey mRNAseq example.)
  • Introducing k-mers: CCGATTGCACTGGACCGA
  • > 1000 labs doing this regularly. Each data set analysis is ~custom. Analyses are data intensive and memory intensive.
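A k-mer is simply every overlapping substring of length k in a sequence. A minimal sketch (the slide does not state a k, so a small illustrative value is used here):

```python
def kmers(seq, k):
    """Yield all overlapping k-mers of a sequence, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

# A sequence of length L yields L - k + 1 k-mers:
print(list(kmers("CCGAT", 3)))  # ['CCG', 'CGA', 'GAT']
```

Counting these fixed-length words, rather than aligning whole reads, is what makes the memory-efficient data structures in the following slides possible.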
  • Efficient data structures & algorithms: Efficient online counting of k-mers Trimming reads on abundance Efficient De Bruijn graph representations Read abundance normalization
  • Shotgun sequencing is massively redundant; can we eliminate redundancy while retaining information? (Analogy: JPEG lossy compression.) Raw data (~10-100 GB) => Compression (~2 GB) => "Information" (~1 GB) => Analysis, database & integration.
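The "read abundance normalization" idea (digital normalization) discards a read when its k-mers have already been seen often enough. A toy sketch of the approach, using an exact `Counter` where khmer itself uses probabilistic counting structures; the cutoff and k values here are illustrative, not the paper's:

```python
from collections import Counter
from statistics import median

def normalize_by_median(reads, k=20, cutoff=20):
    """Keep a read only if the median abundance of its k-mers,
    among reads kept so far, is below the cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept

# 100 identical reads collapse to one; distinct reads survive.
print(len(normalize_by_median(["A" * 25] * 100, k=20, cutoff=3)))  # 1
```

This is the sense in which the compression is "lossy like JPEG": redundant coverage is thrown away while low-abundance novelty is retained.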
  • Sparse collections of k-mers can be stored efficiently in Bloom filters Pell et al., 2012, PNAS; doi: 10.1073/pnas.1121464109
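A Bloom filter stores set membership in a fixed-size bit array with multiple hash functions, at the cost of a tunable false-positive rate and no false negatives. A self-contained sketch (sizes and hash scheme are illustrative, not those of Pell et al.):

```python
import hashlib

class BloomFilter:
    """Approximate-membership set for k-mers."""

    def __init__(self, size=1_000_003, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per slot, for simplicity

    def _indices(self, kmer):
        # Derive num_hashes positions by seeding a cryptographic hash.
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, kmer):
        for i in self._indices(kmer):
            self.bits[i] = 1

    def __contains__(self, kmer):
        # May rarely return True for an absent k-mer (false positive),
        # but never False for a present one.
        return all(self.bits[i] for i in self._indices(kmer))
```

Because only bits are stored, sparse k-mer sets fit in a small, fixed memory footprint regardless of k.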
  • Data structures & algorithms papers These are not the k-mers you are looking for, Zhang et al., arXiv 1309.2975, in review. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs, Pell et al., PNAS 2012. A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data, Brown et al., arXiv 1203.4802, under revision.
  • Data analysis papers Tackling soil diversity with the assembly of large, complex metagenomes, Howe et al., PNAS, 2014. Assembling novel ascidian genomes & transcriptomes, Lowe et al., in prep. A de novo lamprey transcriptome from large scale multi-tissue mRNAseq, Scott et al., in prep.
  • Lab approach not intentional, but working out. Novel data structures and algorithms Implement at scale Apply to real biological problems
  • This leads to good things. Efficient online counting of k-mers Trimming reads on abundance Efficient De Bruijn graph representations Read abundance normalization (khmer software)
  • Efcient online counting of k-mers Trimming reads on abundance Efcient De Bruijn graph representations Read abundance normalization Streaming algorithms for assembly, variant calling, and error correction Cloud assembly protocols Efcient graph labeling & exploration Data set partitioning approaches Assembly-free comparison of data sets HMM-guided assembly Efcient search for target genes Currentresearch (khmer software)
  • Testing & version control: the not-so-secret sauce. High test coverage, grown over time. Stupidity-driven testing: we write tests for bugs after we find them and before we fix them. Pull requests & continuous integration: does your proposed merge break tests? Pull requests & code review: does new code meet our minimal coding requirements? o Note: spellchecking!!!
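"Stupidity-driven testing" means each discovered bug is first pinned down by a failing regression test, then fixed so the test passes. A hypothetical example (this helper and its bug are invented for illustration, not taken from khmer):

```python
# Suppose a reverse-complement helper once crashed on lowercase
# input. First we capture the bug in a test, then apply the fix.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    # The fix: normalize case before looking up complements.
    return "".join(COMPLEMENT[b] for b in reversed(seq.upper()))

def test_reverse_complement_lowercase():
    # Regression test written when the bug was found; it fails
    # against the pre-fix code and passes afterward.
    assert reverse_complement("acgt") == "ACGT"

test_reverse_complement_lowercase()
```

Run under continuous integration, such tests make each fixed bug a permanent guard against regressions in every future pull request.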
  • On the novel research side: Novel data structures and algorithms; Permit low(er) memory data analysis; Liberate analyses from specialized hardware.
  • Running entirely w/in cloud: complete data; AWS m1.xlarge; ~40 hours. (See PyCon 2014 talk; video and blog post.)
  • On the novel research side: Novel data structures and algorithms; Permit low(er) memory data analysis; Liberate analyses from specialized hardware. This last bit? => reproducibility.
  • Reproducibility! "Scientific progress relies on reproducibility of analysis." (Aristotle, Nature, 322 BCE.) "There is no such thing as reproducible science. There is only science, and not science." -- someone on Twitter (Fernando Perez?)
  • Disclaimer: Not a researcher of reproducibility! Merely a practitioner. Please take my points below as an argument and not as research conclusions. (But I'm right.)
  • My usual intro: We practice open science! Everything discussed here: Code: github.com/ged-lab/ ; BSD license Blog: http://ivory.idyll.org/blog (titus brown blog) Twitter: @ctitusbrown Grants on Lab Web site: http://ged.msu.edu/research.html Preprints available. Everything is > 80% reproducible.
  • My lab & the diginorm paper. All our code was already on github; Much of our data analysis was already in the cloud; Our figures were already made in IPython Notebook Our paper was already in LaTeX
  • IPython Notebook: data + code
  • My lab & the diginorm paper. All our code was already on github; Much of our data analysis was already in the cloud; Our figures were already made in IPython Notebook; Our paper was already in LaTeX. Why not push a bit more and make it easily reproducible? This involved writing a tutorial. And that's it.
  • To reproduce our paper: git clone && python setup.py install git clone cd pipeline wget && tar xzf make && cd ../notebook && make cd ../ && make
  • Now standard in lab -- All our papers now have: Source hosted on github; Data hosted there or on AWS; Long running data analysis => make; Graphing and data digestion => IPython Notebook (also in github). Qingpeng Zhang
  • Research process: Generate new results; encode in Makefile => Summarize in IPython Notebook => Push to github => Discuss, explore.
  • Literate graphing & interactive exploration
  • The process We start with pipeline reproducibility Baked into lab culture; default use git; write scripts Community of practice! Use standard open source approaches, so OSS developers learn it easily. Enables easy collaboration w/in lab Valuable learning tool!
  • Growing & refining the process Now moving to Ubuntu Long-Term Support + install instructions. Everything is as automated as is convenient. Students expected to communicate with me in IPython Notebooks. Trying to avoid building (or even using) new tools. Avoid maintenance burden as much as possible.
  • 1. Use standard OS; provide install instructions Providing install, execute for Ubuntu Long-Term Support release 14.04: supported through 2017 and beyond. Avoid pre-configured virtual machines! o Locks you into specific cloud homes. o Ch