2015 vancouver-vanbug

Building a platform for bioinformatics: some exciting

new directions for khmer.

C. Titus Brown

[email protected]

March 12, 2015

Hello!

Associate Professor (#tenure!);

School of Veterinary Medicine

University of California, Davis.

More information at:

• ged.msu.edu/ (URL needs to be updated :)

• github.com/ged-lab/

• ivory.idyll.org/blog/

• @ctitusbrown

Warnings

This talk contains information that may constitute

“forward-looking statements.” Generally, the words

“believe,” “expect,” “intend,” “estimate,”

“anticipate,” “project,” “will” and similar expressions

identify forward-looking statements, which generally

are not historical in nature.

I have been advised to put this disclaimer in as well:

Dr. Brown is not currently under treatment for any

disorders related to megalomania.

Introducing k-mers

CCGATTGCACTGGACCGA (<- read)

CCGATTGCAC

CGATTGCACT

GATTGCACTG

ATTGCACTGG

TTGCACTGGA

TGCACTGGAC

GCACTGGACC

CACTGGACCG

ACTGGACCGA
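The sliding window above can be sketched in a few lines of Python. This is only an illustration: the helper name `kmers` is made up here and is not part of khmer. With k = 10, the 18-base read yields 18 - 10 + 1 = 9 overlapping k-mers.

```python
def kmers(sequence, k):
    """Return every overlapping substring of length k, in read order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


read = "CCGATTGCACTGGACCGA"

# Print each k-mer indented by its offset so the overlaps line up under the read.
print(read)
for offset, kmer in enumerate(kmers(read, 10)):
    print(" " * offset + kmer)
```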

De Bruijn graphs: assemble on overlaps

J.R. Miller et al. / Genomics (2010)
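A toy sketch of the idea, assuming the common formulation in which nodes are k-mers and an edge joins two k-mers that sit next to each other in some read. The function names, the two example reads (cut from the 18-base read on the previous slide), and the naive walk are all illustrative; real assemblers handle branches, reverse complements, and errors.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def walk(graph, start):
    """Extend a contig from `start` while there is exactly one way forward."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop on a cycle rather than loop forever
            break
        seen.add(node)
        contig += node[-1]
    return contig


# Two reads that share the 10-mer ATTGCACTGG.
reads = ["CCGATTGCACTGG", "ATTGCACTGGACCGA"]
graph = de_bruijn_edges(reads, 10)
print(walk(graph, "CCGATTGCAC"))
```

The two reads are never compared to each other directly; they join up because they contribute the same node to the graph.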

K-mers give you an implicit alignment

CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC

CATGGACCGATTGCACTGGACCGATGCACGGTACCG

K-mers give you an implicit alignment


CATGGACCGATTGCACTGGACCGATGCACGGTACCG

CATGGACCGATTGCACTGGACCGATGCACGGACCG

(with no accounting for mismatches or indels)

The problem with k-mers


CATGGACCGATTGCACTCGACCGATGCACGGTACCG

Each sequencing error results in up to k novel k-mers!
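A quick check of that claim, using the error-containing read above and the error-free version from the alignment slide. The value k = 10 and the helper name `kmer_set` are assumptions made for illustration. Every window of length k that covers the bad base becomes a k-mer the true sequence never contained.

```python
def kmer_set(sequence, k):
    """Return the set of distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


true_read = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
error_read = "CATGGACCGATTGCACTCGACCGATGCACGGTACCG"  # one G -> C substitution
k = 10

# k-mers seen only because of the error.
novel = kmer_set(error_read, k) - kmer_set(true_read, k)
print(len(novel), "novel k-mers from a single substitution")
```

An error closer than k bases to either end of the read is covered by fewer windows, so it creates fewer than k novel k-mers.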

The opportunity:


CATGGACCGATTGCACTCGACCGATGCACGGTACCG

The graph contains information about errors (can be used for error trimming in reads).

The graph also contains information about variants (can be used for variant calling).

Conway T C, Bromage A J. Bioinformatics 2011;27:479-486

© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions,

please email: [email protected]

One big challenge: scalability!

De Bruijn graph size scales with # errors.



Memory usage ~ “real” variation + number of errors

Number of errors ~ size of data set

Goals

• Initial goal: can we assemble large data sets??

• Longer-term goal: can we find efficient (De Bruijn?)

graph-based approaches to sequence analysis?

First attempt: compressible De Bruijn graphs

[Figure panels: 1%, 5%, 10%, 15%]

Pell et al., 2012

Can use Bloom filters to store

De Bruijn graph structures.

=> Overall structure

remains as you squish graphs

down.
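A minimal sketch of the idea, assuming a textbook Bloom filter built on `hashlib`. It is not khmer's actual data structure, and the class and function names are invented here. Only k-mers (nodes) are stored; edges are implicit and are found by asking whether each of the four possible next k-mers is present.

```python
import hashlib


class BloomFilter:
    """Set membership with false positives but no false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def neighbors(bloom, kmer):
    """Possible successors of a k-mer; may include false positives."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


read, k = "CCGATTGCACTGGACCGA", 10
bloom = BloomFilter(num_bits=1024, num_hashes=3)
for i in range(len(read) - k + 1):
    bloom.add(read[i:i + k])

print(neighbors(bloom, "CCGATTGCAC"))
```

Shrinking `num_bits` saves memory at the cost of more false positives, which show up as spurious nodes and edges in the graph.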

Technical challenges met (and defeated)

• Exhaustive in-memory traversal of graphs containing

5-15 billion nodes.

• Sequencing technology introduces false

connections in graph.

• Implementation lets us scale ~20x over other

approaches, but this is not enough.

• Although, see Minia assembler (Chikhi et al.)

Pell et al., 2012

Second attempt: diginorm




Random sampling => deep sampling

needed

Typically 10-100x needed for robust recovery (30-300 Gbp for human)

Actual coverage varies widely from the average.

Low coverage introduces unavoidable breaks.

But! Shotgun sequencing is very redundant!

Lots of the high coverage simply isn’t needed.

(unnecessary data)

Digital normalization

Contig assembly now scales with underlying genome size

• Transcriptomes, microbial genomes incl MDA,

and most metagenomes can be assembled in

under 50 GB of RAM, with identical or improved

results.

• Memory efficiency is improved by use of a CountMin

Sketch.

Brown et al., 2012, arXiv.
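For readers who have not met the data structure, here is a generic Count-Min Sketch, written as a sketch under assumptions: the class name, the table sizes, and the `hashlib`-based hashing are illustrative and do not reflect khmer's internals. Each k-mer increments one counter per row, and the estimate is the minimum over rows, so collisions can only inflate a count, never lower it.

```python
import hashlib


class CountMinSketch:
    """Approximate counter using fixed memory; estimates never undercount."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One column per row, derived by salting a single hash function.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._columns(item):
            self.tables[row][col] += 1

    def count(self, item):
        return min(self.tables[row][col] for row, col in self._columns(item))


sketch = CountMinSketch(width=2048, depth=4)
for _ in range(5):
    sketch.add("CCGATTGCAC")
sketch.add("CGATTGCACT")

print(sketch.count("CCGATTGCAC"))  # at least 5; collisions can only raise it
```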

Diginorm is simple:
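A minimal sketch of the idea, following the description in Brown et al., 2012: estimate a read's coverage as the median count of its k-mers, keep the read only if that estimate is below a cutoff, and count k-mers only from the reads that are kept. The function names and parameter values are illustrative. An exact `Counter` stands in for the CountMin Sketch, and reverse complements are ignored.

```python
from collections import Counter
from statistics import median


def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def normalize_by_median(reads, k, cutoff):
    """Yield reads whose estimated coverage is still below `cutoff`."""
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:  # read shorter than k
            continue
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads add to the counts
            yield read


# Ten identical reads stand in for a deeply covered region.
reads = ["CCGATTGCACTGGACCGA"] * 10
kept = list(normalize_by_median(reads, k=5, cutoff=3))
print(len(kept), "of", len(reads), "reads kept")
```

Each read is seen once and then either kept or dropped, which is what makes the approach streaming.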

Diginorm is only a good start:

• Diginorm alters the coverage of the data

set.

• Diginorm also discards lots of data!

• Various other infelicities…

  o Repeats go away!

  o Coverage estimation approach ~poor.

Diginorm is a good start:

• Diginorm works on genomes,

metagenomes, and transcriptomes;

• Diginorm is streaming and uses

sublinear space.

Third attempt: a semi-streaming

framework for sequence analysis

https://github.com/ged-lab/2014-streaming/

Diginorm can detect graph saturation

Zhang et al., submitted.

This generically permits semi-streaming approaches.


e.g. E. coli analysis => ~1.2 pass,

sublinear memory


=> Efficient k-mer error trimming.


(This all works on metagenomes & transcriptomes, too.)
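One way k-mer error trimming can work, sketched under assumptions: once a region is well covered, a k-mer seen only once or twice is probably an error, so the read is cut just before the base that introduces the first low-abundance k-mer. The helper names, k = 10, and the threshold of 2 are made up for illustration, and a plain `Counter` stands in for khmer's counting structure.

```python
from collections import Counter


def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def trim_at_first_rare_kmer(read, counts, k, min_count):
    """Keep the prefix of `read` that precedes the first low-count k-mer's last base."""
    for i, kmer in enumerate(kmers(read, k)):
        if counts[kmer] < min_count:
            return read[:i + k - 1]
    return read


true_read = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
error_read = "CATGGACCGATTGCACTCGACCGATGCACGGTACCG"  # substitution at index 17
k = 10

# Five good copies plus one bad one: error k-mers are seen exactly once.
counts = Counter()
for read in [true_read] * 5 + [error_read]:
    counts.update(kmers(read, k))

print(trim_at_first_rare_kmer(error_read, counts, k, min_count=2))
```

Here the trimmed read ends right before the erroneous base, and an error-free read passes through unchanged.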

Moving some sequence analysis to streaming.

~1.2 pass, sublinear memory


(a) Two-pass; reduced memory.

First pass: digital normalization - reduced set of k-mers.

Second pass: spectral analysis of data with reduced k-mer set.

(b) Few-pass; reduced memory.

First pass: collection of low-abundance reads + analysis of saturated reads.

Second pass: analysis of collected low-abundance reads.

(c) Online; streaming.

First pass: collection of low-abundance reads + analysis of saturated reads.
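The few-pass variant can be sketched roughly as follows. This is an illustration under assumptions, not the implementation in the 2014-streaming repository: names and parameters are invented, an exact `Counter` replaces the sublinear counting structure, and deferred reads are held in a list rather than written to disk.

```python
from collections import Counter
from statistics import median


def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def semi_streaming(reads, k, cutoff):
    """Yield reads once the graph around them is saturated (coverage >= cutoff)."""
    counts = Counter()
    deferred = []

    def coverage(read):
        return median(counts[km] for km in kmers(read, k))

    # First pass: build up counts; analyze reads from saturated regions at once.
    for read in reads:
        if len(read) < k:
            continue
        if coverage(read) < cutoff:
            counts.update(kmers(read, k))
            deferred.append(read)  # low abundance so far; revisit later
        else:
            yield read

    # Second, partial pass: only the deferred reads are looked at again.
    for read in deferred:
        if coverage(read) >= cutoff:
            yield read


reads = ["CCGATTGCACTGGACCGA"] * 10 + ["ATATCGCGTAGCTAGGCT"]
ready = list(semi_streaming(reads, k=5, cutoff=3))
print(len(ready), "reads ready for analysis;", len(reads) - len(ready), "left as low coverage")
```

Only the first few reads from each region are deferred, so the second pass touches a small fraction of the data. That is where a figure like "~1.2 pass" comes from.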

Sublinear time/space read error analysis


Read error profile from mouse mRNAseq (cf. Grabherr et al., 2011).

Another simple algorithm.


So, that’s pretty cool, right?

• We provide simple time- and memory-efficient approaches for k-mer spectral analysis of large data sets.

• These semi-streaming approaches provide a general framework for applying k-mer spectral approaches to all (deep) sequencing data, including genomes, metagenomes, and RNAseq.

• The khmer software provides a functional and reasonably efficient reference implementation, freely available under the BSD license and actively developed at github.com/ged-lab/.

Stream all the things! (1/2)

Stream all the things! (2/2)

But that’s not all!

Buy now, and you can also get sequence-to-graph alignment for the low, low price of free!*

graph = khmer.new_counting_hash(…)
aligner = khmer.ReadAligner(graph, trusted=5)
score, graph_align, read_align, is_truncated = \
    aligner.align(seq)

* Terms and conditions may apply. Not all source code fully works :)

Pair-HMM-based graph alignment

Jordan Fish and Michael Crusoe

(Full model)

Jordan Fish and Michael Crusoe

This is a general API…

Many potential uses:

• Error correction;

• Variant calling;

• Counting (to replace mapping) & allelic counts;

• Align to multiple references;

• Tackle strain variation and polyploidy;

• Building consensus graphs from shallow population

sequencing;

• Consensus graph building from multiple read types;

• Protein-guided graph search (BlastGraph & Xander)

Whole-genome variant calling

Graphalign is still alpha.

• We don’t understand parameters well.

• Unoptimized.

• Not yet competitive with existing approaches.

• Broadly applicable!

• Hope to engage w/broader community, soon.

Concluding thoughts #1

• None of our theory is particularly limited to De Bruijn

graphs, although our implementation is deeply tied

to them at the moment.

• We view these ideas (streaming; graphs) as a

potentially substantial improvement over current

mainstream approaches.

• We are not alone – there is a larger community

exploring these approaches! (GA4GH, esp.)

Concluding thoughts #2

• Our implementations are usable but not yet terribly optimized.

• We are moving khmer towards a platform for providing reference implementations of these approaches, as well as for research and development.

• We are interested in providing components with decent performance & statistical guarantees, for fun and profit.

• Python and C++ FTW!

Thanks!

Please contact me at [email protected]!

