2014 toronto-torbug

Building khmer, a platform for research in scalable sequence analysis

C. Titus Brown

ctb@msu.edu

Hello!

Assistant Professor; Microbiology; Computer Science; etc.

More information at:

• ged.msu.edu/

• github.com/ged-lab/

• ivory.idyll.org/blog/

• @ctitusbrown

Introducing k-mers

CCGATTGCACTGGACCGA (<- read)

CCGATTGCAC CGATTGCACT GATTGCACTG ATTGCACTGG TTGCACTGGA TGCACTGGAC GCACTGGACC CACTGGACCG ACTGGACCGA
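A minimal sketch of the sliding window behind this slide. The function name `kmers` is hypothetical, and k = 10 is inferred from the length of the k-mers shown. An 18-base read yields 18 - 10 + 1 = 9 overlapping k-mers.

```python
def kmers(seq, k):
    """Return every length-k substring of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


read = "CCGATTGCACTGGACCGA"   # the read from the slide
for kmer in kmers(read, 10):  # k = 10 is inferred, not stated on the slide
    print(kmer)
```

A read shorter than k simply produces an empty list.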

K-mers give you an implicit alignment

CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC

CATGGACCGATTGCACTGGACCGATGCACGGTACCG

K-mers give you an implicit alignment

CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC

CATGGACCGATTGCACTGGACCGATGCACGGTACCG

CATGGACCGATTGCACTGGACCGATGCACGGACCG

(with no accounting for mismatches or indels)
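One way to see the "implicit alignment": every k-mer shared by two reads votes for a relative offset between them. The sketch below is an illustration only. The function names are hypothetical, k = 10 is an assumption, and the two reads are my split of the run-together sequence on the slide.

```python
from collections import Counter


def kmer_positions(seq, k):
    """Map each k-mer to the first position where it occurs in seq."""
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], i)
    return positions


def implied_offset(read_a, read_b, k):
    """Most common (position in a - position in b) over shared k-mers, or None."""
    pos_a = kmer_positions(read_a, k)
    pos_b = kmer_positions(read_b, k)
    votes = Counter(pos_a[km] - pos_b[km] for km in pos_a.keys() & pos_b.keys())
    return votes.most_common(1)[0][0] if votes else None


# Assumed split of the concatenated reads shown on the slide.
read_a = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
read_b = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
print(implied_offset(read_a, read_b, 10))  # -6: read_a starts 6 bases into read_b
```

Only exact k-mer matches vote, so a mismatch or indel removes the votes of every k-mer that spans it.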

De Bruijn graphs – assemble on overlaps

J.R. Miller et al. / Genomics (2010)

The problem with k-mers

CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC

CATGGACCGATTGCACTCGACCGATGCACGGTACCG

Each sequencing error results in k novel k-mers!
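A quick check of this claim on the slide's own example, assuming k = 10. The substitution (G to C at 0-based position 17) reproduces the error-containing read on the slide. `kmer_set` is a hypothetical helper name.

```python
def kmer_set(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


k = 10  # assumed k-mer size
true_read = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
# One substitution, G -> C at 0-based position 17, as in the read on the slide.
error_read = true_read[:17] + "C" + true_read[18:]

novel = kmer_set(error_read, k) - kmer_set(true_read, k)
print(len(novel))  # 10, i.e. k novel k-mers from a single error
```

An error closer than k - 1 bases to either end of the read is covered by fewer windows, so k is an upper bound per error.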

Conway T C , Bromage A J Bioinformatics 2011;27:479-486

Assembly graphs scale with data size, not information.

Practical memory measurements (soil)

Velvet measurements (Adina Howe)

Counting k-mers efficiently (RAM)
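This slide is a figure, so the following is only a generic illustration of counting in fixed memory: a minimal Count-Min Sketch. It is not khmer's implementation, and whether it matches the figure is an assumption. The class name, table sizes and the salted-BLAKE2b hashing are all my choices.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter: may overestimate, never underestimates."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One hash per row, obtained by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=row.to_bytes(8, "little")).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def count(self, item):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._slots(item))


sketch = CountMinSketch(width=1000, depth=4)   # illustrative sizes
read = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
for i in range(len(read) - 10 + 1):
    sketch.add(read[i:i + 10])
print(sketch.count("CCGATTGCAC"))
```

Memory is width x depth counters regardless of how many distinct k-mers are seen. The price is a one-sided error that grows as the tables fill.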

This leads to good things.

Data structures & algorithms papers

• “These are not the k-mers you are looking for…”, Zhang et al., arXiv 1309.2975, in review.

• “Scaling metagenome sequence assembly with probabilistic de Bruijn graphs”, Pell et al., PNAS 2012.

• “A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data”, Brown et al., arXiv 1203.4802, under revision.

Data analysis papers

• “Tackling soil diversity with the assembly of large, complex metagenomes”, Howe et al., PNAS, 2014.

• Assembling novel ascidian genomes & transcriptomes, Lowe et al., in prep.

• A de novo lamprey transcriptome from large scale multi-tissue mRNAseq, Scott et al., in prep.

Lab approach – not intentional, but working

This leads to good things.

(khmer software)

How is this feasible?!

Representative half-arsed lab software development

A not-insane way to do software development

Testing & version control – the not so secret sauce

• High test coverage - grown over time.

• Stupidity driven testing – we write tests for bugs after we find them and before we fix them.

• Pull requests & continuous integration – does your proposed merge break tests?

• Pull requests & code review – does new code meet our minimal coding etc. requirements?

  o Note: spellchecking!!!

Integration testing

• khmer is designed to work with other packages.

• For releases >= 1.0, we have now added acceptance tests to make sure that khmer works OK with other packages.

• These acceptance tests are based on integration tests, that in turn come from an education & documentation effort…

khmer-protocols

khmer-protocols:

• Provide standard “cheap” assembly protocols for the cloud.

• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers)

• Open, versioned, forkable, citable….

Literate testing

• Our shell-command tutorials for bioinformatics can now be executed in an automated fashion – commands are extracted automatically into shell scripts.

• See: github.com/ged-lab/literate-resting/.

• Tremendously improves peace of mind and confidence moving forward!
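A rough sketch of the extraction idea, not the literate-resting tool itself. It assumes the tutorials are reStructuredText and that commands sit in indented literal blocks after a paragraph ending in `::`. The function name, the sample tutorial and the file name `reads.fa` are hypothetical.

```python
def extract_commands(rst_text):
    """Collect indented lines that follow a paragraph ending in '::'."""
    commands = []
    armed = False  # True once the previous paragraph ended with '::'
    for line in rst_text.splitlines():
        if not line.strip():
            continue                      # blank lines never change state
        indented = line[0] in " \t"
        if armed and indented:
            commands.append(line.strip())
        else:
            armed = line.rstrip().endswith("::")
    return commands


tutorial = """\
Set up a working directory::

   mkdir -p work
   cd work

Then count the lines in the input file::

   wc -l reads.fa

That is all.
"""

# Stop at the first failing command, so a broken tutorial step fails the run.
script = "#!/bin/bash\nset -e\n" + "\n".join(extract_commands(tutorial)) + "\n"
print(script)
```

Running the generated script under continuous integration is what turns a tutorial into a test.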

Leigh Sheneman

Doing things right => #awesomesauce

Benchmarking protocols

Data subset; AWS m1.xlarge

~1 hour

(See PyCon 2014 talk; video and blog post.)

Benchmarking protocols

Complete data; AWS m1.xlarge

~40 hours

(See PyCon 2014 talk; video and blog post.)

Genomic intervals shared between data

Qingpeng Zhang

* Assembly free!

Error correction via graph alignment

Jason Pell and Jordan Fish

Error correction on simulated E. coli data

1% error rate, 100x coverage.

Jordan Fish and Jason Pell

            TP (corrected)       FP (mistakes)   TN (OK)        FN (missed)
ideal       3,469,834 (99.1%)    8,186           460,655,449    31,731 (0.9%)
1-pass      2,827,839 (80.8%)    30,254          460,633,381    673,726 (19.2%)
1.2-pass    3,403,171 (97.2%)    8,764           460,654,871    98,394 (2.8%)
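The percentages are not defined on the slide. Reading them as shares of all true errors, TP / (TP + FN) and FN / (TP + FN), is my inference, and the sketch below checks that it reproduces them. The counts are copied from the table.

```python
results = {
    # name: (TP, FP, TN, FN), copied from the table
    "ideal":    (3_469_834, 8_186, 460_655_449, 31_731),
    "1-pass":   (2_827_839, 30_254, 460_633_381, 673_726),
    "1.2-pass": (3_403_171, 8_764, 460_654_871, 98_394),
}

for name, (tp, fp, tn, fn) in results.items():
    errors = tp + fn                 # positions that really contained an error
    corrected = 100 * tp / errors    # share of errors fixed
    missed = 100 * fn / errors       # share of errors left in place
    total = tp + fp + tn + fn        # every position examined
    print(f"{name:9s} corrected {corrected:.1f}%  missed {missed:.1f}%  total {total:,}")
```

Each row sums to the same 464,165,200 positions, and the rounded shares match the table (99.1/0.9, 80.8/19.2, 97.2/2.8).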

Single pass, reference free, tunable, streaming online variant calling.

Streaming, online variant calling.

See NIH BIG DATA grant, http://ged.msu.edu/.

Novelty… to what power?

• “Novelty” requirements for “high impact publishing”:

  o Must do novel algorithm development

  o …and apply to novel and interesting data sets.

  o (See Josh Bloom, https://medium.com/tech-talk/dd88857f662)

• We’ve taken on the additional challenge of trying to develop and maintain a core set of functionality in research software: novelty cubed? :)

Reproducibility

Scientific progress relies on reproducibility of analysis. (Aristotle, Nature, 322 BCE.)

All our papers now have:

• Source hosted on github;

• Data hosted there or on AWS;

• Long running data analysis => ‘make’

• Graphing and data digestion => IPython Notebook (also in github)

Qingpeng Zhang

Concluding thoughts

• API is destiny – without online counting, diginorm & streaming approaches would not have been possible.

• Tackle the hard problems – engineering optimization would not have gotten us very far.

• Testing lets us scale development & process – which means when something works, we can run with it.

Caveats

• Expense and effort – you can spend an infinite amount of time on infrastructure & process!

  o Advice: choose techniques that address actual pain points.

  o (See: “Best Practices in Scientific Computing”, Wilson et al., 2014)

• Funders and reviewers just don’t care – adopt good software practices for yourself, not others.

  o Advice: briefly mention keywords in grants, papers.

• Advisors just don’t care – see above.

  o These are 90% true statements :>

Can we crowdsource bioinformatics?

We already are! Bioinformatics is already a tremendously open and collaborative endeavor. (Let’s take advantage of

“It’s as if somewhere, out there, is a collection of totally free software that can do a far better job than ours can, with open, published methods, great support networks and fantastic tutorials. But that’s madness – who on Earth would create such an amazing resource?”

- http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/

Thanks!

Prospective: sequencing tumor cells

• Goal: phylogenetically reconstruct causal “driver mutations” in face of passenger mutations.

• 1000 cells x 3 Gbp x 20 coverage: 60 Tbp of sequence.

• Most of this data will be redundant and not useful.

• Developing diginorm-based algorithms to eliminate data while retaining variant information.

Where are we taking this?

• Streaming online algorithms only look at data ~once.

• Diginorm is streaming, online…

• Conceptually, can move many aspects of sequence analysis into streaming mode.

=> Extraordinary potential for computational efficiency.
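A minimal sketch of why diginorm counts as streaming and online, following the published idea (Brown et al., cited earlier): each read is looked at once and kept only if its median k-mer count so far is below a cutoff. The exact `Counter` is a stand-in for a memory-efficient counter, and the function name and the default `k` and `cutoff` values are illustrative assumptions, not khmer's API.

```python
from collections import Counter
from statistics import median


def diginorm(reads, k=20, cutoff=20):
    """Yield reads whose median k-mer count so far is below cutoff (single pass)."""
    counts = Counter()  # exact counts; stand-in for a fixed-memory counter
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue                  # read shorter than k: nothing to count
        if median(counts[km] for km in kmers) < cutoff:
            counts.update(kmers)      # only kept reads contribute to the counts
            yield read


# Fifty identical reads: once coverage reaches the cutoff, the rest are dropped.
reads = ["CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"] * 50
kept = list(diginorm(reads, k=10, cutoff=5))
print(len(kept))  # 5
```

Because the keep/discard decision uses only what has been seen so far, the function works on any iterator of reads and never needs the whole data set in memory.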
