2014 toronto-torbug

Building khmer, a platform for research in scalable sequence analysis C. Titus Brown [email protected]

Transcript of 2014 toronto-torbug

Page 1: 2014 toronto-torbug

Building khmer, a platform for research in scalable sequence analysis

C. Titus Brown [email protected]

Page 2: 2014 toronto-torbug

Hello!

Assistant Professor; Microbiology; Computer Science; etc.

More information at:

• ged.msu.edu/

• github.com/ged-lab/

• ivory.idyll.org/blog/

• @ctitusbrown

Page 3: 2014 toronto-torbug

Introducing k-mers

CCGATTGCACTGGACCGA (<- read)

CCGATTGCAC CGATTGCACT GATTGCACTG ATTGCACTGG TTGCACTGGA TGCACTGGAC GCACTGGACC CACTGGACCG ACTGGACCGA
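The decomposition above can be reproduced in a few lines. This is a minimal sketch, not khmer's own code; the function name `kmers` is an illustrative choice, and k = 10 is inferred from the length of the k-mers on the slide.

```python
def kmers(read, k):
    """Return every length-k substring of `read`, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "CCGATTGCACTGGACCGA"  # the read shown on the slide

# Print each 10-mer indented by its offset, so the overlaps line up.
for offset, kmer in enumerate(kmers(read, 10)):
    print(" " * offset + kmer)
```

A read of length L contains L - k + 1 k-mers, so this 18-base read yields nine 10-mers.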

Page 4: 2014 toronto-torbug

K-mers give you an implicit alignment

CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC

CATGGACCGATTGCACTGGACCGATGCACGGTACCG

Page 5: 2014 toronto-torbug

K-mers give you an implicit alignment

CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC

CATGGACCGATTGCACTGGACCGATGCACGGTACCG

CATGGACCGATTGCACTGGACCGATGCACGGACCG

(with no accounting for mismatches or indels)
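To make "implicit alignment" concrete: if two reads share a k-mer, the difference between its positions in each read tells you how the reads overlap. The sketch below assumes the slide's sequence splits into the two reads shown in the code; the function names are hypothetical, and only exact k-mer matches are used, so mismatches and indels are ignored, as the slide says.

```python
from collections import Counter


def kmer_positions(read, k):
    """Map each k-mer in `read` to the position of its first occurrence."""
    positions = {}
    for i in range(len(read) - k + 1):
        positions.setdefault(read[i:i + k], i)
    return positions


def implied_offsets(read_a, read_b, k):
    """Count, for every k-mer shared by the two reads, the offset
    (position in read_b minus position in read_a) that it implies."""
    pos_a = kmer_positions(read_a, k)
    offsets = Counter()
    for j in range(len(read_b) - k + 1):
        kmer = read_b[j:j + k]
        if kmer in pos_a:
            offsets[j - pos_a[kmer]] += 1
    return offsets


read_1 = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
read_2 = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"

# The most frequently implied offset is the alignment the shared k-mers vote for.
best_offset, votes = implied_offsets(read_1, read_2, 10).most_common(1)[0]
print(best_offset, votes)
```

Here the shared 10-mers agree that read_1 begins 6 bases into read_2, without any explicit alignment step.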

Page 6: 2014 toronto-torbug

De Bruijn graphs – assemble on overlaps

J.R. Miller et al. / Genomics (2010)
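A toy version of assembling on overlaps, as a sketch only: nodes are (k-1)-mers, each k-mer contributes one edge, and a contig is spelled out by following edges while the path is unambiguous. The function names, the choice k = 6, and the two overlapping fragments (cut from the read on the "Introducing k-mers" slide) are illustrative assumptions; real assemblers also handle reverse complements, errors, and branching.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Follow edges from `start` while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (successor,) = graph[node]
        if successor in seen:  # stop rather than loop forever on a cycle
            break
        contig += successor[-1]
        seen.add(successor)
        node = successor
    return contig


# Two overlapping fragments of CCGATTGCACTGGACCGA.
fragments = ["CCGATTGCACTGG", "TTGCACTGGACCGA"]
graph = de_bruijn(fragments, k=6)
print(walk(graph, "CCGAT"))  # CCGATTGCACTGGACCGA
```

Neither fragment contains the whole sequence, but their shared k-mers join the two paths in the graph.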

Page 7: 2014 toronto-torbug

The problem with k-mers

CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC

CATGGACCGATTGCACTCGACCGATGCACGGTACCG

Each sequencing error results in k novel k-mers!
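The claim is easy to check by set difference. In this sketch the "correct" read is the second read from the earlier alignment slides and the "erroneous" read is the one on this slide, which differs by a single G→C substitution; the split of the slide's sequence into reads and the helper name `kmer_set` are assumptions.

```python
def kmer_set(read, k):
    """Return the set of distinct k-mers in `read`."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


k = 10
correct = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
erroneous = "CATGGACCGATTGCACTCGACCGATGCACGGTACCG"  # one G -> C substitution

# k-mers that exist only because of the error.
novel = kmer_set(erroneous, k) - kmer_set(correct, k)
print(len(novel))  # 10, i.e. k
```

Every k-mer that overlaps the bad base is new to the graph. An error within k - 1 bases of a read end is covered by fewer than k k-mers, so k is the upper bound per substitution.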

Page 8: 2014 toronto-torbug

Conway T C , Bromage A J Bioinformatics 2011;27:479-486


Assembly graphs scale with data size, not information.

Page 9: 2014 toronto-torbug

Practical memory measurements (soil)

Velvet measurements (Adina Howe)

Page 10: 2014 toronto-torbug

Counting k-mers efficiently (RAM)
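One fixed-memory way to count k-mers is a Count-Min Sketch, the kind of probabilistic structure described in Zhang et al. (listed under "Data structures & algorithms papers"). The following is a minimal sketch of the idea, not khmer's implementation: the class name, the table sizes, and the use of SHA-1 as the hash are all illustrative assumptions.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: memory is fixed by the table sizes, not by
    the number of distinct k-mers seen."""

    def __init__(self, table_sizes):
        # One table per size; distinct (ideally prime) sizes make
        # collisions in one table unlikely to repeat in another.
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        digest = hashlib.sha1(kmer.encode("ascii")).digest()
        value = int.from_bytes(digest[:8], "big")
        return [value % len(table) for table in self.tables]

    def add(self, kmer):
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] += 1

    def count(self, kmer):
        # Collisions can only inflate a slot, so the minimum across
        # tables is the tightest available estimate.
        return min(table[slot]
                   for table, slot in zip(self.tables, self._slots(kmer)))


sketch = CountMinSketch([1009, 1013, 1019])
read = "CCGATTGCACTGGACCGA"
for i in range(len(read) - 10 + 1):
    sketch.add(read[i:i + 10])
print(sketch.count("CCGATTGCAC"))
```

The estimate can be too high but never too low, and because counts are available while data is still arriving, the structure supports online use.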

Page 11: 2014 toronto-torbug

This leads to good things.

Page 12: 2014 toronto-torbug

Data structures & algorithms papers

• “These are not the k-mers you are looking for…”, Zhang et al., arXiv 1309.2975, in review.

• “Scaling metagenome sequence assembly with probabilistic de Bruijn graphs”, Pell et al., PNAS 2012.

• “A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data”, Brown et al., arXiv 1203.4802, under revision.

Page 13: 2014 toronto-torbug

Data analysis papers

• “Tackling soil diversity with the assembly of large, complex metagenomes”, Howe et al., PNAS, 2014.

• Assembling novel ascidian genomes & transcriptomes, Lowe et al., in prep.

• A de novo lamprey transcriptome from large scale multi-tissue mRNAseq, Scott et al., in prep.

Page 14: 2014 toronto-torbug

Lab approach – not intentional, but working out.

Page 15: 2014 toronto-torbug

This leads to good things.

(khmer software)

Page 16: 2014 toronto-torbug

Current research

(khmer software)

Page 17: 2014 toronto-torbug

How is this feasible?!

Representative half-arsed lab software development

Page 18: 2014 toronto-torbug

A not-insane way to do software development

Page 19: 2014 toronto-torbug

A not-insane way to do software development

Page 20: 2014 toronto-torbug

Testing & version control – the not so secret sauce

• High test coverage – grown over time.

• Stupidity driven testing – we write tests for bugs after we find them and before we fix them.

• Pull requests & continuous integration – does your proposed merge break tests?

• Pull requests & code review – does new code meet our minimal coding etc requirements?

  o Note: spellchecking!!!

Page 21: 2014 toronto-torbug

Integration testing

• khmer is designed to work with other packages.

• For releases >= 1.0, we have now added acceptance tests to make sure that khmer works OK with other packages.

• These acceptance tests are based on integration tests, which in turn come from an education & documentation effort…

Page 22: 2014 toronto-torbug

khmer-protocols

Page 23: 2014 toronto-torbug

khmer-protocols:

• Provide standard “cheap” assembly protocols for the cloud.

• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers).

• Open, versioned, forkable, citable….

Page 24: 2014 toronto-torbug

Literate testing

• Our shell-command tutorials for bioinformatics can now be executed in an automated fashion – commands are extracted automatically into shell scripts.

• See: github.com/ged-lab/literate-resting/.

• Tremendously improves peace of mind and confidence moving forward!

Leigh Sheneman
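The extraction step can be sketched roughly as follows. This is not the literate-resting code: it assumes, for illustration, that tutorials mark shell commands as indented blocks after a line ending in `::`, and the function name, the sample tutorial text, and the example.org URL are all hypothetical.

```python
def extract_commands(tutorial_text):
    """Collect the indented lines that follow a line ending in '::'."""
    commands, in_block = [], False
    for line in tutorial_text.splitlines():
        if not in_block:
            in_block = line.rstrip().endswith("::")
        elif not line.strip():
            continue  # blank lines do not end a block
        elif line.startswith((" ", "\t")):
            commands.append(line.strip())
        else:
            # Unindented prose ends the block (and may open the next one).
            in_block = line.rstrip().endswith("::")
    return commands


tutorial = """\
Download the data::

   curl -O http://example.org/reads.fq.gz
   gunzip reads.fq.gz

Then count the lines::

   wc -l reads.fq

Done.
"""

# 'set -e' makes the generated script stop at the first failing command,
# which is what turns a tutorial into a test.
script = "#!/bin/bash\nset -e\n" + "\n".join(extract_commands(tutorial)) + "\n"
print(script)
```

Running the generated script on a clean machine then checks that the tutorial still works end to end.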

Page 25: 2014 toronto-torbug

Doing things right => #awesomesauce

Page 26: 2014 toronto-torbug

Benchmarking protocols

Data subset; AWS m1.xlarge

~1 hour

(See PyCon 2014 talk; video and blog post.)

Page 27: 2014 toronto-torbug

Benchmarking protocols

Complete data; AWS m1.xlarge

~40 hours

(See PyCon 2014 talk; video and blog post.)

Page 28: 2014 toronto-torbug

Current research

Page 29: 2014 toronto-torbug

Genomic intervals shared between data sets

Qingpeng Zhang

* Assembly free!

Page 30: 2014 toronto-torbug

Error correction via graph alignment

Jason Pell and Jordan Fish

Page 31: 2014 toronto-torbug

Error correction on simulated E. coli data

1% error rate, 100x coverage.

Jordan Fish and Jason Pell

            TP (corrected)       FP (mistakes)   TN (OK)        FN (missed)

ideal       3,469,834 (99.1%)    8,186           460,655,449    31,731 (0.9%)

1-pass      2,827,839 (80.8%)    30,254          460,633,381    673,726 (19.2%)

1.2-pass    3,403,171 (97.2%)    8,764           460,654,871    98,394 (2.8%)
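The percentages in the table are consistent with TP and FN expressed as fractions of all true errors (TP + FN). A short check, using only the counts from the slide; the dictionary layout and function name are illustrative.

```python
results = {
    "ideal":    dict(tp=3_469_834, fp=8_186,  tn=460_655_449, fn=31_731),
    "1-pass":   dict(tp=2_827_839, fp=30_254, tn=460_633_381, fn=673_726),
    "1.2-pass": dict(tp=3_403_171, fp=8_764,  tn=460_654_871, fn=98_394),
}


def fraction_corrected(r):
    """Share of true errors that were corrected: TP / (TP + FN)."""
    return r["tp"] / (r["tp"] + r["fn"])


for name, r in results.items():
    corrected = fraction_corrected(r)
    print(f"{name:9s} corrected {corrected:.1%}  missed {1 - corrected:.1%}")
```

All three rows have the same number of true errors (3,501,565) and the same total number of bases, as expected for three runs over the same simulated data set.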

Page 32: 2014 toronto-torbug

Single pass, reference free, tunable, streaming online variant calling.

Streaming, online variant calling.

See NIH BIG DATA grant, http://ged.msu.edu/.

Page 33: 2014 toronto-torbug

Novelty… to what power?

• “Novelty” requirements for “high impact publishing”:

  o Must do novel algorithm development

  o …and apply to novel and interesting data sets.

  o (See Josh Bloom, https://medium.com/tech-talk/dd88857f662)

• We’ve taken on the additional challenge of trying to develop and maintain a core set of functionality in research software: novelty cubed? :)

Page 34: 2014 toronto-torbug

Reproducibility

Scientific progress relies on reproducibility of analysis. (Aristotle, Nature, 322 BCE.)

All our papers now have:

• Source hosted on github;

• Data hosted there or on AWS;

• Long running data analysis => ‘make’

• Graphing and data digestion => IPython Notebook (also in github)

Qingpeng Zhang

Page 35: 2014 toronto-torbug

Concluding thoughts

• API is destiny – without online counting, diginorm & streaming approaches would not have been possible.

• Tackle the hard problems – engineering optimization would not have gotten us very far.

• Testing lets us scale development & process – which means when something works, we can run with it.

Page 36: 2014 toronto-torbug

Caveats

• Expense and effort – you can spend an infinite amount of time on infrastructure & process!

  o Advice: choose techniques that address actual pain points.

  o (See: “Best Practices in Scientific Computing”, Wilson et al., 2014)

• Funders and reviewers just don’t care – adopt good software practices for yourself, not others.

  o Advice: briefly mention keywords in grants, papers.

• Advisors just don’t care – see above.

  o These are 90% true statements :>

Page 37: 2014 toronto-torbug

Can we crowdsource bioinformatics?

We already are! Bioinformatics is already a tremendously open and collaborative endeavor. (Let’s take advantage of it!)

“It’s as if somewhere, out there, is a collection of totally free software that can do a far better job than ours can, with open, published methods, great support networks and fantastic tutorials. But that’s madness – who on Earth would create such an amazing resource?”

- http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/

Page 38: 2014 toronto-torbug

Thanks!

Page 39: 2014 toronto-torbug

Prospective: sequencing tumor cells

• Goal: phylogenetically reconstruct causal “driver mutations” in face of passenger mutations.

• 1000 cells x 3 Gbp x 20 coverage: 60 Tbp of sequence.

• Most of this data will be redundant and not useful.

• Developing diginorm-based algorithms to eliminate data while retaining variant information.
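The 60 Tbp figure follows directly from the three numbers on the slide; as a quick check (variable names are illustrative):

```python
cells = 1000
genome_size_bp = 3_000_000_000  # 3 Gbp per cell
coverage = 20                   # 20x sequencing depth

total_bp = cells * genome_size_bp * coverage
print(total_bp / 1e12, "Tbp")  # 60.0 Tbp
```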

Page 40: 2014 toronto-torbug

Where are we taking this?

• Streaming online algorithms only look at data ~once.

• Diginorm is streaming, online…

• Conceptually, can move many aspects of sequence analysis into streaming mode.

=> Extraordinary potential for computational efficiency.
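To illustrate what "streaming, online" means here, the following is a minimal sketch of the digital normalization idea from Brown et al. (arXiv 1203.4802): look at each read once, estimate its coverage from the median count of its k-mers so far, and keep it only if that estimate is below a cutoff. It is not khmer's implementation. An exact `Counter` stands in for khmer's fixed-memory probabilistic counting, and the function name and default parameters are assumptions for the example.

```python
from collections import Counter
from statistics import median


def diginorm(reads, k=20, cutoff=20):
    """Yield the reads worth keeping, in a single pass over `reads`."""
    counts = Counter()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        # Median k-mer count so far = estimated coverage of this read.
        if median(counts[kmer] for kmer in kmers) < cutoff:
            counts.update(kmers)  # only kept reads contribute to the counts
            yield read


# Fifty copies of one read are redundant beyond the cutoff.
reads = ["CCGATTGCACTGGACCGA"] * 50
kept = list(diginorm(reads, k=10, cutoff=5))
print(len(kept))  # 5
```

Because the decision for each read depends only on counts accumulated so far, the input can be an iterator over a file or a network stream, and discarded reads never need to be stored.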