2012 stamps-mbl-2

Metagenome assembly – part IIC. Titus [email protected]

Warnings

This talk contains forward looking statements. These forward-looking statements can be identified by terminology such as

“will”, “expects”, and “believes”.

-- Safe Harbor provisions of the

U.S. Private Securities Litigation Act

“Making predictions is difficult, especially if

they’re about the future.”

-- Attributed to Niels Bohr

The computational conundrum

More data => better.

and

More data => computationally more challenging.

Reads vs edges (memory) in de Bruijn graphs

Conway T C , Bromage A J Bioinformatics 2011;27:479-486

© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

2. Big data sets require big machines

For even relatively small data sets, metagenomic assemblers scale poorly.

Memory usage ~ “real” variation + number of errors

Number of errors ~ size of data set

Size of data set == big!!

(Estimated 6 weeks x 3 TB of RAM to do 300gb soil sample, with a slightly modified conventional assembler.)

Soil is full of uncultured microbes

Randy Jackson

SAMPLING LOCATIONS

Great Prairie sampling design

Reference core

Soil cores: 1 inch diameter 4 inches deep (litter and roots removed)

• Spatial samples: 16S rRNA, nifH

• Reference sample sequenced (small unmixed sample)

Reference bulk soil: stored for additional “omics” and metadata

10 M

10 M

1 M

1 M

1 cM

1 cM

100 600 1100 1600 2100 2600 3100 3600 4100 4600 5100 5600 6100 6600 7100 7600 81000

200

400

600

800

1000

1200

1400

1600

1800

2000

Iowa CornIowa_Native_PrairieKansas CornKansas_Native_PrairieWisconsin CornWisconsin Native PrairieWisconsin Restored PrairieWisconsin Switchgrass

Number of Sequences

Num

ber o

f OTU

s

Soil contains thousands to millions of species(“Collector’s curves” of ~species)

The set of questions for soil -- discovery

What’s there?

Is it really that complex a community?

How “deep” do we need to sequence to sample thoroughly and systematically?

What organisms and gene functions are present, including non-canonical carbon and nitrogen cycling pathways?

What kind of organismal and functional overlap is there between different sites? (Total sampling needed?)

How is ecological complexity created & maintained?

How does ecological complexity respond to perturbation?

Why are we applying short-read sequencing to this problem!?

Short-read sampling is deep and quantitative. Statistical argument: your ability to observe rare

organisms – your sensitivity of measurement – is directly related to the number of independent sequences you take.

Longer reads (PacBio, 454, Ion Torrent) are less informative.

Majority of metagenome studies going forward will make use of Illumina.

BUT this kind of sequence is challenging to analyze.

BUT, BUT this kind of sequence is necessary for high complexity environments.

Challenges of short-read analysis

Low signal for functional analysis; no linkage at all.

High error rates.

Massive volume.

Rapidly changing technology.

Several approaches but we have settled on assembly.

Our “Grand Challenge” dataset

Approach 1: Partitioning

Split reads into “bins” belonging to different source species.

Can do this based almost entirely on connectivity of sequences.

Partitioning for scaling

Can be done in ~10x less memory than assembly.

Partition at low k and assemble exactly at any higher k (DBG).

Partitions can then be assembled independently Multiple processors -> scaling Multiple k, coverage -> improved assembly Multiple assembly packages (tailored to high variation, etc.)

Can eliminate small partitions/contigs in the partitioning phase.

An incredibly convenient approach enabling divide & conquer approaches across the board.

Technical challenges met (and defeated)

Novel data structure properties elucidated via percolation theory analysis (Pell et al., PNAS, 2012)

Exhaustive in-memory traversal of graphs containing 5-15 billion nodes.

Sequencing technology introduces false connections in graph (Howe et al., in prep.)

Only 20x improvement in assembly scaling .

Approach 2: Digital normalization

Suppose you have a dilution factor of A (10) to B(1). To

get 10x of B you need to get 100x of A! Overkill!!

This 100x will consume disk space and, because of

errors, memory.

(NOVEL)

Digital normalization discards redundant reads prior to assembly.

This removes reads and decreases data size, eliminates errors from removed reads, and normalizes coverage across loci.

Digital normalization algorithm

for read in dataset:

if median_kmer_count(read) < CUTOFF:

update_kmer_counts(read)

save(read)

else:

# discard read

Note, single pass; fixed memory.

Downsample based on de Bruijn graph structure (which can be derived online)

Shotgun data is often (1) high coverage and (2) biased in coverage.

(MD amplified)

Digital normalization fixes all that.

Normalizes coverage

Discards redundancy

Eliminates majority oferrors

Scales assembly dramatically.

Assembly is 98% identical.

Digital normalization retains information, while discarding data and errors

Other key points

Virtually identical contig assembly; scaffolding works but is not yet cookie-cutter.

Digital normalization changes the way de Bruijn graph assembly scales from the size of your data set to the size of the source sample.

Always lower memory than assembly: we never collect most erroneous k-mers.

Digital normalization can be done once – and then assembly parameter exploration can be done.

Quotable quotes.

Comment: “This looks like a great solution for people who can’t afford real computers”.

OK, but:

“Buying ever bigger computers is a great solution for people who don’t want to think hard.”

To be less snide: both kinds of scaling are needed, of course.

Why use diginorm?

Use the cloud to assemble any microbial genomes incl. single-cell, many eukaryotic genomes, most mRNAseq, and many metagenomes.

Seems to provide leverage on addressing many biological or sample prep problems (single-cell & genome amplification MDA; metagenome; heterozygosity).

And, well, the general idea of locus specific graph analysis solves lots of things…

Some interim concluding thoughts

Digital normalization-like approaches provide a path to solving the majority of assembly scaling problems, and will enable assembly on current cloud computing hardware. This is not true for highly diverse metagenome environments… For soil, we estimate that we need 50 Tbp / gram soil. Sigh.

Biologists and bioinformaticians hate: Throwing away data Caveats in bioinformatics papers (which reviewers like, note)

Digital normalization also discards abundance information.

Evaluating sensitivity & specificity

E. coli @ 10x + soil

98.5% of E. coli

Example

Dethlefsen shotgun data set / Relman lab

251 m reads / 16gb FASTQ gzipped~ 24 hrs, < 32 gb of RAM for full pipeline -- $24 on Amazon EC2

(reads => final assembly + mapping)

Assembly stats:

58,224 contigs > 1000 bp (average 3kb)summing to 190 mb genomic

~38 microbial genomes worth of DNA~65% of reads mapped back to assembly

What do we get for soil?

Total Assembly Total Contigs % Reads

AssembledPredicted

protein coding

rplb genes

2.5 bill 4.5 mill 19% 5.3 mill 391

3.5 bill 5.9 mill 22% 6.8 mill 466

This estimates number of species ^

Adina Howe

Putting it in perspective:Total equivalent of ~1200 bacterial genomesHuman genome ~3 billion bp

Coverage of Assemblies

Corn Prairie

Nearest reference in NCBI

Most abundant contigs in Iowa corn metagenome:Unknown; alpha/beta hydrolase (Streptomyces sp. S4); unknown; unknown; hypothetical protein HMP (Clostridium clostridioforme)

Most abundant contigs in Iowa prairie metagenome:hypothetical protein (Rhodanobacter sp. 2APBS1); hypothetical protein (Oryza sativa Japonica); outer membrane adhesin like proteiin (Solitalea canadensis) ; alcohol dehydrogenase zinc-binding domain protein (Ktedonobacter racemifer); alcohol dehydrogenase GroES domain protein (Ktedonobacter racemifer)

(Done with MEGAN)

How many soil samples do we need to sequence??

(Cumulative)Adina Howe

Overlap between Iowa prairie & Iowa corn is significant!

Extracting whole genomes?

So far, we have only assembled contigs, but not whole genomes.

Can entire genomes beassembled from metagenomicdata?

Iverson et al. (2012), fromthe Armbrust lab, contains a

technique for scaffoldingmetagenome contigs into~whole genomes. YES.

Perspective: the coming infopocalypse

Assembling about $20k worth of data, we can generate approximately 700 microbial genomes worth of data. (This is only going to go up in yield/$$, note.)

Most of these assembled genomic contigs

(and genes) do not belong to studied

organisms.

What the heck do they do??

More thoughts on assembly

Illumina is the only game in town for sequencing complex microbial populations, but dealing with the data (volume, errors) is tricky. This problem is being solved, by us and others.

We’re working to make it as close to push button as possible, with objectively argued parameters and tools, and methods for evaluating new tools and sequencing types.

The community is working on dealing with data downstream of sequencing & assembly. Most pipelines were built around 454 data – long reads, and

relatively few of them. With Illumina, we can get both long contigs and quantitative

information about their abundance. This necessitates changes to pipelines like MG-RAST and HUMANn.

The interpretation challenge

For soil, we have generated approximately 1200 bacterial genomes worth of assembled genomic DNA from two soil samples.

The vast majority of this genomic DNA contains unknown genes with largely unknown function.

Most annotations of gene function & interaction are from a few phylogenetically limited model organisms Est 98% of annotations are computationally inferred: transferred from

model organisms to genomic sequence, using homology. Can these annotations be transferred? (Probably not.)

This will be the biggest sequence analysis challenge of the next 50 years.

Concluding thoughts on “assembly”

We can handle all the data (modulo another year or so of engineering.) Bring it on!

Our approaches let us (& you) assemble pretty much anything, much more easily than before. (Single cell, microbial genomes, transcriptomes, eukaryotic genomes, metagenomes, BAC sequencing…)

Seriously. No more problemo. Done. Finished. Kaput.

So now what? Validation. Interpretation and building general tools. Interpretation relies on annotation… (Uh oh.)

What are future needs?

High-quality, medium+ throughput annotation of genomes? Extrapolating from model organisms is both immensely important

and yet lacking. Strong phylogenetic sampling bias in existing annotations.

Synthetic biology for investigating non-model organisms?

(Cleverness in experimental biology doesn’t scale )

Integration of microbiology, community ecology/evolution modeling, and data analysis.

Replication fu

In December 2011, I met Wes McKinney on a train and he convinced me that I should look at IPython Notebook.

This is an interactive Web notebook for data analysis…

Hey, neat! We can use this for replication! All of our figures can be regenerated from scratch, on an EC2

instance, using a Makefile (data pipeline) and IPython Notebook (figure generation).

Everything is version controlled. Honestly not much work, and will be less the next time.

So… how’d that go?

People who already cared thought it was nifty.

http://ivory.idyll.org/blog/replication-i.html

Almost nobody else cares ;( Presub enquiry to editor: “Be sure that your paper can be reproduced.” Uh,

please read my letter to the end? “Could you improve your Makefile? I want to reimplement diginorm in

another language and reuse your pipeline, but your Makefile is a mess.”

Incredibly useful, nonetheless. Already part of undergraduate and graduate training in my lab; helping us and others with next parpes; etc. etc. etc.

Life is way too short to waste on unnecessarily replicating your own workflows, much less other people’s.

http://ivory.idyll.org/blog/replication-i.html

Acknowledgements

Lab members involvedCollaborators

Adina Howe (w/Tiedje) Jason Pell Arend Hintze Rosangela Canino-Koning Qingpeng Zhang Elijah Lowe Likit Preeyanon Jiarong Guo Tim Brom Kanchan Pavangadkar Eric McDonald

Jim Tiedje, MSU

Billie Swalla, UW

Janet Jansson, LBNL

Susannah Tringe, JGI

Funding

USDA NIFA; NSF IOS; BEACON.

Current research in my labSolving the rest of your problems

Preliminary functional analysis

Search SSU rRNA gene in Illumina data

1. Randomly sequencing about 100bp long DNA in microbial genomes;

2. Everything is sequenced;

3. Not limited by primers or PCR bias;

4. Data mining is the challenge;

10^3

10^610^7 10^4

Reads #

SSU rRNA Gene length

Expected SSU RNA genefragments

Genome length

Classification: Pyrotag vs shotgun

RDP-pyrotag-SSUsilva-pyrotag-SSUsilva-shotgun-SSU

Primers used in 454 Titanium sequencing of SSU rRNA gene, using E.coli as an example. Consensus sequences of the primer region from Illumina reads suggest 1) searching method is good and 2)primer bias is minimal at the current E-value cutoff.

Start:907 End:1402

Forward

Reverse

1542 bp

Sequence logo of short reads at forward primer region:

AAACTYAAAKGAATTGACGGCurrent forward primer

GYACACACCGCCCGTCurrent reverse primer (reverse complement)

Sequence logo of short reads at reverse primer region:

CowRumen – JGI 16s primer mismatches

postion A T C G Total1G 0.001 0.001 0.002 0.996 121542T 0.002 0.983 0.003 0.012 121693G 0.001 0.001 0.002 0.995 121664C 0.001 0.001 0.996 0.002 121435C 0.003 0.001 0.994 0.002 121836A 0.986 0 0.008 0.005 122097G 0.001 0.001 0.002 0.996 121898C 0.001 0.001 0.996 0.002 121989A 0.978 0.001 0.017 0.004 12230

10G 0.001 0 0.002 0.997 1223111C 0.001 0.001 0.996 0.002 1219812C 0.002 0.001 0.994 0.003 1218513G 0 0 0.002 0.997 1219014C 0.001 0.001 0.995 0.003 1219515G 0.001 0.001 0 0.998 1221316G 0.001 0.001 0 0.998 1220617T 0.002 0.974 0.003 0.021 1217118A 0.99 0.001 0.006 0.003 1215019A 0.995 0.001 0.002 0.002 12106

Running HMMs over de Bruijn graphs(=> cross validation)

hmmgs: Assemble based on good-scoring HMM paths through the graph.

Independent of other assemblers; very sensitive, specific.

95% of hmmgs rplB domains are present in our partitioned assemblies.

GA

C AC

C

AC

TG

TAAT

AG

TT

CT

CT

TC

CTA

Jordan Fish, Qiong Wang, and Jim Cole (RDP)

Streaming error correction.

We can do error trimming of genomic, MDA, transcriptomic, metagenomic data in < 2 passes, fixed memory.

We have just submitted a proposal to adapt Euler or Quake-like error correction (e.g. spectral alignment problem) to this

framework.

Side note: error correction is the biggest “data” problem left in

sequencing.

Both for mapping & assembly.

End:1402

Consensus of short reads at

reverse primer region:

Figure. Primers used in 454 Titanium sequencing of 16S rRNA gene, using E.coli as an

example. Consensus sequences of the primer region from Illumina reads suggest primer bias

is minimal at the current E-value cutoff.

Start:907Forward

1542 bp

Consensus of short reads at

forward primer region:

AAACTYAAAKGAATTGACGG

Current forward primer

Supplemental: abundance filtering is very lossy.

Total

3.8x partition

8.2x partition

Largest partition

0.010.0

20.030.0

40.050.0

60.070.0

80.090.0

100.0

Percent loss from abundance filtering (all >= 2)

contigsbp

Percentage lost

Comparing assemblers

Comparing assemblies / dendrogram

Integrating modeling into data analysis?

2012 stamps-mbl-2

Technology

Transcript of 2012 stamps-mbl-2