2014 toronto-torbug
Building khmer, a platform for research in scalable sequence analysis
C. Titus [email protected]
Hello!
Assistant Professor; Microbiology; Computer Science; etc.
More information at:
• ged.msu.edu/
• github.com/ged-lab/
• ivory.idyll.org/blog/
• @ctitusbrown
Introducing k-mers
CCGATTGCACTGGACCGA (<- read)
CCGATTGCAC CGATTGCACT GATTGCACTG ATTGCACTGG TTGCACTGGA TGCACTGGAC GCACTGGACC CACTGGACCG ACTGGACCGA
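A minimal sketch of the slide above: sliding a window of length k along the read yields its k-mers. The function name `kmers` is illustrative, and k = 10 is inferred from the length of the k-mers on the slide.

```python
def kmers(read, k):
    """Return every length-k substring of the read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "CCGATTGCACTGGACCGA"

# Print each k-mer under the position it came from in the read.
print(read)
for offset, kmer in enumerate(kmers(read, 10)):
    print(" " * offset + kmer)
```

A read of length L contains L - k + 1 k-mers, so shorter reads or larger k give fewer of them.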
K-mers give you an implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
CATGGACCGATTGCACTGGACCGATGCACGGACCG
(with no accounting for mismatches or indels)
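A rough sketch of what "implicit alignment" means here: if two reads share a k-mer, the positions of that k-mer give their relative offset without running an aligner. It assumes the run-together sequence on the slide is made of separate reads, split as shown in the code; the helper name `shared_offset` and the choice k = 10 are illustrative.

```python
def shared_offset(read_a, read_b, k):
    """Return (i, j) where read_a[i:i+k] == read_b[j:j+k], or None.

    Only exact k-mer matches are found, so mismatches and indels
    are not accounted for.
    """
    positions = {}
    for j in range(len(read_b) - k + 1):
        # Remember the first position of each k-mer in read_b.
        positions.setdefault(read_b[j:j + k], j)
    for i in range(len(read_a) - k + 1):
        j = positions.get(read_a[i:i + k])
        if j is not None:
            return i, j
    return None


# Assumed split of the slide's sequence into two reads.
read_1 = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
read_2 = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"

print(shared_offset(read_1, read_2, 10))
```

The difference j - i is the shift needed to line the two reads up.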
De Bruijn graphs – assemble on overlaps
J.R. Miller et al. / Genomics (2010)
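As a hedged illustration of "assemble on overlaps", here is the textbook edge-centric de Bruijn construction: each k-mer becomes an edge from its first k-1 bases to its last k-1 bases. This is a teaching sketch, not khmer's graph representation; the function name `de_bruijn` and the toy reads are made up for the example.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two overlapping toy reads produce the same graph as the sequence
# they were sampled from.
from_reads = de_bruijn(["ACGT", "CGTA"], 3)
from_source = de_bruijn(["ACGTA"], 3)
print(dict(from_reads))
```

Overlapping reads never need to be compared pairwise: shared k-mers land on the same nodes and edges automatically.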
The problem with k-mers
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTCGACCGATGCACGGTACCG
Each sequencing error results in k novel k-mers!
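A small check of the claim, assuming the erroneous read is the second half of the slide's sequence (one substitution, G to C) and using k = 10 as in the earlier slides. The helper name `kmer_set` is illustrative.

```python
def kmer_set(read, k):
    """Return the set of distinct k-mers in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


k = 10
correct = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
with_error = "CATGGACCGATTGCACTCGACCGATGCACGGTACCG"  # one G -> C substitution

# k-mers that exist only because of the error.
novel = kmer_set(with_error, k) - kmer_set(correct, k)
print(len(novel))
```

Every k-mer window that covers the erroneous base is new, so an error in the interior of a read adds k novel k-mers; an error closer than k bases to a read end is covered by fewer windows and adds fewer.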
Conway T C, Bromage A J. Bioinformatics 2011;27:479-486
Assembly graphs scale with data size, not information.
Practical memory measurements (soil)
Velvet measurements (Adina Howe)
Counting k-mers efficiently (RAM)
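The figure for this slide is lost. One memory-efficient way to count k-mers is a Count-Min Sketch: several fixed-size counter tables, with the reported count being the minimum over tables. The sketch below is an illustration under that assumption, not khmer's code; the class name, table sizes, and the use of SHA-1 as the hash are all choices made for the example.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: may overestimate, never underestimates."""

    def __init__(self, table_sizes):
        # Memory is fixed up front, independent of how many k-mers arrive.
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        digest = hashlib.sha1(kmer.encode()).digest()
        value = int.from_bytes(digest[:8], "big")
        # One hash value, reduced modulo each (different) table size.
        return [value % len(table) for table in self.tables]

    def add(self, kmer):
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] += 1

    def count(self, kmer):
        slots = self._slots(kmer)
        return min(table[slot] for table, slot in zip(self.tables, slots))


sketch = CountMinSketch([1009, 1013, 1019])
for _ in range(3):
    sketch.add("CCGATTGCAC")
print(sketch.count("CCGATTGCAC"))
```

Collisions can only inflate a counter, so taking the minimum across tables gives the least-inflated estimate. Counts can be queried at any point while data is still arriving.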
This leads to good things.
Data structures & algorithms papers
• “These are not the k-mers you are looking for…”, Zhang et al., arXiv 1309.2975, in review.
• “Scaling metagenome sequence assembly with probabilistic de Bruijn graphs”, Pell et al., PNAS 2012.
• “A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data”, Brown et al., arXiv 1203.4802, under revision.
Data analysis papers
• “Tackling soil diversity with the assembly of large, complex metagenomes”, Howe et al., PNAS, 2014.
• Assembling novel ascidian genomes & transcriptomes, Lowe et al., in prep.
• A de novo lamprey transcriptome from large scale multi-tissue mRNAseq, Scott et al., in prep.
Lab approach – not intentional, but working out.
This leads to good things.
(khmer software)
Current research
(khmer software)
How is this feasible?!
Representative half-arsed lab software development
A not-insane way to do software development
Testing & version control – the not so secret sauce
• High test coverage – grown over time.
• Stupidity driven testing – we write tests for bugs after we find them and before we fix them.
• Pull requests & continuous integration – does your proposed merge break tests?
• Pull requests & code review – does new code meet our minimal coding etc. requirements?
  o Note: spellchecking!!!
Integration testing
• khmer is designed to work with other packages.
• For releases >= 1.0, we have now added acceptance tests to make sure that khmer works OK with other packages.
• These acceptance tests are based on integration tests, which in turn come from an education & documentation effort…
khmer-protocols
khmer-protocols:
• Provide standard “cheap” assembly protocols for the cloud.
• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers)
• Open, versioned, forkable, citable….
Literate testing
• Our shell-command tutorials for bioinformatics can now be executed in an automated fashion – commands are extracted automatically into shell scripts.
• See: github.com/ged-lab/literate-resting/.
• Tremendously improves peace of mind and confidence moving forward!
Leigh Sheneman
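A rough sketch of the idea of pulling commands out of a tutorial, assuming the tutorials are reStructuredText and that commands live in indented literal blocks introduced by `::`. This is not the literate-resting implementation; the function name, the sample tutorial, and its URL are hypothetical.

```python
def extract_commands(rst_text):
    """Collect indented literal-block lines that follow a '::' marker."""
    commands = []
    in_block = False
    for line in rst_text.splitlines():
        if in_block:
            if line.startswith((" ", "\t")):
                commands.append(line.strip())
                continue
            if not line.strip():
                continue  # blank lines do not end a literal block
            in_block = False  # first unindented text line ends the block
        if line.rstrip().endswith("::"):
            in_block = True
    return commands


# Hypothetical tutorial text, for illustration only.
tutorial = """Download the data::

   curl -O http://example.com/reads.fq.gz

Then count the lines::

   gunzip -c reads.fq.gz | wc -l

Done.
"""

# 'set -e' makes the generated script stop at the first failing command.
script = "\n".join(["set -e"] + extract_commands(tutorial))
print(script)
```

Running the generated script under continuous integration turns the documentation itself into a test.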
Doing things right => #awesomesauce
Benchmarking protocols
Data subset; AWS m1.xlarge
~1 hour
(See PyCon 2014 talk; video and blog post.)
Benchmarking protocols
Complete data; AWS m1.xlarge
~40 hours
(See PyCon 2014 talk; video and blog post.)
Current research
Genomic intervals shared between data sets
Qingpeng Zhang
* Assembly free!
Error correction via graph alignment
Jason Pell and Jordan Fish
Error correction on simulated E. coli data
1% error rate, 100x coverage.
Jordan Fish and Jason Pell
           TP (corrected)      FP (mistakes)   TN (OK)        FN (missed)
ideal      3,469,834 (99.1%)    8,186          460,655,449     31,731  (0.9%)
1-pass     2,827,839 (80.8%)   30,254          460,633,381    673,726 (19.2%)
1.2-pass   3,403,171 (97.2%)    8,764          460,654,871     98,394  (2.8%)
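The percentages in the table appear to be fractions of all erroneous bases, TP / (TP + FN) and FN / (TP + FN). A short check using the counts from the table; the dictionary layout and function name are illustrative.

```python
# (TP, FP, TN, FN) counts copied from the table above.
results = {
    "ideal": (3469834, 8186, 460655449, 31731),
    "1-pass": (2827839, 30254, 460633381, 673726),
    "1.2-pass": (3403171, 8764, 460654871, 98394),
}


def corrected_fraction(tp, fn):
    """Fraction of erroneous bases that were corrected."""
    return tp / (tp + fn)


for name, (tp, fp, tn, fn) in results.items():
    corrected = corrected_fraction(tp, fn)
    print(f"{name:9s} corrected {corrected:.1%}  missed {1 - corrected:.1%}")
```

In every row TP + FN is the same total (the number of erroneous bases), and FP + TN is the number of correct bases, which is a useful sanity check on the table.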
Single pass, reference free, tunable, streaming online variant calling.
Streaming, online variant calling.
See NIH BIG DATA grant, http://ged.msu.edu/.
Novelty… to what power?
• “Novelty” requirements for “high impact publishing”:
  o Must do novel algorithm development
  o …and apply to novel and interesting data sets.
  o (See Josh Bloom, https://medium.com/tech-talk/dd88857f662)
• We’ve taken on the additional challenge of trying to develop and maintain a core set of functionality in research software: novelty cubed? :)
Reproducibility
Scientific progress relies on reproducibility of analysis. (Aristotle, Nature, 322 BCE.)
All our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis => ‘make’
• Graphing and data digestion => IPython Notebook (also in github)
Qingpeng Zhang
Concluding thoughts
• API is destiny – without online counting, diginorm & streaming approaches would not have been possible.
• Tackle the hard problems – engineering optimization would not have gotten us very far.
• Testing lets us scale development & process – which means when something works, we can run with it.
Caveats
• Expense and effort – you can spend an infinite amount of time on infrastructure & process!
  o Advice: choose techniques that address actual pain points.
  o (See: “Best Practices in Scientific Computing”, Wilson et al., 2014)
• Funders and reviewers just don’t care – adopt good software practices for yourself, not others.
  o Advice: briefly mention keywords in grants, papers.
• Advisors just don’t care – see above.
  o These are 90% true statements :>
Can we crowdsource bioinformatics?
We already are! Bioinformatics is already a tremendously open and collaborative endeavor. (Let’s take advantage of it!)
“It’s as if somewhere, out there, is a collection of totally free software that can do a far better job than ours can, with open, published methods, great support networks and fantastic tutorials. But that’s madness – who on Earth would create such an amazing resource?”
- http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/
Thanks!
Prospective: sequencing tumor cells
• Goal: phylogenetically reconstruct causal “driver mutations” in face of passenger mutations.
• 1000 cells x 3 Gbp x 20x coverage: 60 Tbp of sequence.
• Most of this data will be redundant and not useful.
• Developing diginorm-based algorithms to eliminate data while retaining variant information.
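The data-volume estimate in the list above is a straight multiplication; the variable names are illustrative.

```python
cells = 1000
genome_bp = 3 * 10**9   # 3 Gbp per cell
coverage = 20           # each base sequenced about 20 times

total_bp = cells * genome_bp * coverage
print(total_bp / 10**12, "Tbp")
```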
Where are we taking this?
• Streaming online algorithms only look at data ~once.
• Diginorm is streaming, online…
• Conceptually, can move many aspects of sequence analysis into streaming mode.
=> Extraordinary potential for computational efficiency.
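A minimal sketch of digital normalization as a streaming, online filter: each read is examined once, and it is kept only if the median count of its k-mers, among reads kept so far, is still below a cutoff. Exact counting with a `Counter` stands in for a memory-efficient counter here, and the function name, default k, and default cutoff are illustrative choices, not khmer's settings.

```python
from collections import Counter
from statistics import median


def diginorm(reads, k=20, cutoff=20):
    """Yield reads whose estimated coverage is still below the cutoff."""
    counts = Counter()
    for read in reads:  # single pass; nothing is revisited
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        # Median k-mer count approximates the coverage seen so far.
        if median(counts[kmer] for kmer in kmers) < cutoff:
            counts.update(kmers)  # only kept reads are counted
            yield read


# Toy input: five copies of one read plus one unrelated read.
reads = ["ACGTTGCAAG"] * 5 + ["TTTTCCCCGG"]
kept = list(diginorm(reads, k=4, cutoff=2))
print(kept)
```

Because the decision for each read depends only on counts accumulated so far, the filter can run on data as it streams in, discarding redundant reads while retaining reads from regions not yet seen at the target coverage.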