2012 stamps-mbl-2
-
Upload
ctitusbrown -
Category
Technology
-
view
1.013 -
download
0
Transcript of 2012 stamps-mbl-2
Metagenome assembly – part IIC. Titus [email protected]
Warnings
This talk contains forward looking statements. These forward-looking statements can be identified by terminology such as
“will”, “expects”, and “believes”.
-- Safe Harbor provisions of the
U.S. Private Securities Litigation Act
“Making predictions is difficult, especially if
they’re about the future.”
-- Attributed to Niels Bohr
The computational conundrum
More data => better.
and
More data => computationally more challenging.
Reads vs edges (memory) in de Bruijn graphs
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
2. Big data sets require big machines
For even relatively small data sets, metagenomic assemblers scale poorly.
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
Size of data set == big!!
(Estimated 6 weeks x 3 TB of RAM to do 300gb soil sample, with a slightly modified conventional assembler.)
Soil is full of uncultured microbes
Randy Jackson
SAMPLING LOCATIONS
Great Prairie sampling design
Reference core
Soil cores: 1 inch diameter 4 inches deep (litter and roots removed)
• Spatial samples: 16S rRNA, nifH
• Reference sample sequenced (small unmixed sample)
Reference bulk soil: stored for additional “omics” and metadata
10 M
10 M
1 M
1 M
1 cM
1 cM
100 600 1100 1600 2100 2600 3100 3600 4100 4600 5100 5600 6100 6600 7100 7600 81000
200
400
600
800
1000
1200
1400
1600
1800
2000
Iowa CornIowa_Native_PrairieKansas CornKansas_Native_PrairieWisconsin CornWisconsin Native PrairieWisconsin Restored PrairieWisconsin Switchgrass
Number of Sequences
Num
ber o
f OTU
s
Soil contains thousands to millions of species(“Collector’s curves” of ~species)
The set of questions for soil -- discovery
What’s there?
Is it really that complex a community?
How “deep” do we need to sequence to sample thoroughly and systematically?
What organisms and gene functions are present, including non-canonical carbon and nitrogen cycling pathways?
What kind of organismal and functional overlap is there between different sites? (Total sampling needed?)
How is ecological complexity created & maintained?
How does ecological complexity respond to perturbation?
Why are we applying short-read sequencing to this problem!?
Short-read sampling is deep and quantitative. Statistical argument: your ability to observe rare
organisms – your sensitivity of measurement – is directly related to the number of independent sequences you take.
Longer reads (PacBio, 454, Ion Torrent) are less informative.
Majority of metagenome studies going forward will make use of Illumina.
BUT this kind of sequence is challenging to analyze.
BUT, BUT this kind of sequence is necessary for high complexity environments.
Challenges of short-read analysis
Low signal for functional analysis; no linkage at all.
High error rates.
Massive volume.
Rapidly changing technology.
Several approaches but we have settled on assembly.
Our “Grand Challenge” dataset
Approach 1: Partitioning
Split reads into “bins” belonging to different source species.
Can do this based almost entirely on connectivity of sequences.
Partitioning for scaling
Can be done in ~10x less memory than assembly.
Partition at low k and assemble exactly at any higher k (DBG).
Partitions can then be assembled independently Multiple processors -> scaling Multiple k, coverage -> improved assembly Multiple assembly packages (tailored to high variation, etc.)
Can eliminate small partitions/contigs in the partitioning phase.
An incredibly convenient approach enabling divide & conquer approaches across the board.
Technical challenges met (and defeated)
Novel data structure properties elucidated via percolation theory analysis (Pell et al., PNAS, 2012)
Exhaustive in-memory traversal of graphs containing 5-15 billion nodes.
Sequencing technology introduces false connections in graph (Howe et al., in prep.)
Only 20x improvement in assembly scaling .
Approach 2: Digital normalization
Suppose you have a dilution factor of A (10) to B(1). To
get 10x of B you need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of
errors, memory.
(NOVEL)
Digital normalization discards redundant reads prior to assembly.
This removes reads and decreases data size, eliminates errors from removed reads, and normalizes coverage across loci.
Digital normalization algorithm
for read in dataset:
if median_kmer_count(read) < CUTOFF:
update_kmer_counts(read)
save(read)
else:
# discard read
Note, single pass; fixed memory.
Downsample based on de Bruijn graph structure (which can be derived online)
Shotgun data is often (1) high coverage and (2) biased in coverage.
(MD amplified)
Digital normalization fixes all that.
Normalizes coverage
Discards redundancy
Eliminates majority oferrors
Scales assembly dramatically.
Assembly is 98% identical.
Digital normalization retains information, while discarding data and errors
Other key points
Virtually identical contig assembly; scaffolding works but is not yet cookie-cutter.
Digital normalization changes the way de Bruijn graph assembly scales from the size of your data set to the size of the source sample.
Always lower memory than assembly: we never collect most erroneous k-mers.
Digital normalization can be done once – and then assembly parameter exploration can be done.
Quotable quotes.
Comment: “This looks like a great solution for people who can’t afford real computers”.
OK, but:
“Buying ever bigger computers is a great solution for people who don’t want to think hard.”
To be less snide: both kinds of scaling are needed, of course.
Why use diginorm?
Use the cloud to assemble any microbial genomes incl. single-cell, many eukaryotic genomes, most mRNAseq, and many metagenomes.
Seems to provide leverage on addressing many biological or sample prep problems (single-cell & genome amplification MDA; metagenome; heterozygosity).
And, well, the general idea of locus specific graph analysis solves lots of things…
Some interim concluding thoughts
Digital normalization-like approaches provide a path to solving the majority of assembly scaling problems, and will enable assembly on current cloud computing hardware. This is not true for highly diverse metagenome environments… For soil, we estimate that we need 50 Tbp / gram soil. Sigh.
Biologists and bioinformaticians hate: Throwing away data Caveats in bioinformatics papers (which reviewers like, note)
Digital normalization also discards abundance information.
Evaluating sensitivity & specificity
E. coli @ 10x + soil
98.5% of E. coli
Example
Dethlefsen shotgun data set / Relman lab
251 m reads / 16gb FASTQ gzipped~ 24 hrs, < 32 gb of RAM for full pipeline -- $24 on Amazon EC2
(reads => final assembly + mapping)
Assembly stats:
58,224 contigs > 1000 bp (average 3kb)summing to 190 mb genomic
~38 microbial genomes worth of DNA~65% of reads mapped back to assembly
What do we get for soil?
Total Assembly Total Contigs % Reads
AssembledPredicted
protein coding
rplb genes
2.5 bill 4.5 mill 19% 5.3 mill 391
3.5 bill 5.9 mill 22% 6.8 mill 466
This estimates number of species ^
Adina Howe
Putting it in perspective:Total equivalent of ~1200 bacterial genomesHuman genome ~3 billion bp
Coverage of Assemblies
Corn Prairie
Nearest reference in NCBI
Most abundant contigs in Iowa corn metagenome:Unknown; alpha/beta hydrolase (Streptomyces sp. S4); unknown; unknown; hypothetical protein HMP (Clostridium clostridioforme)
Most abundant contigs in Iowa prairie metagenome:hypothetical protein (Rhodanobacter sp. 2APBS1); hypothetical protein (Oryza sativa Japonica); outer membrane adhesin like proteiin (Solitalea canadensis) ; alcohol dehydrogenase zinc-binding domain protein (Ktedonobacter racemifer); alcohol dehydrogenase GroES domain protein (Ktedonobacter racemifer)
(Done with MEGAN)
How many soil samples do we need to sequence??
(Cumulative)Adina Howe
Overlap between Iowa prairie & Iowa corn is significant!
Extracting whole genomes?
So far, we have only assembled contigs, but not whole genomes.
Can entire genomes beassembled from metagenomicdata?
Iverson et al. (2012), fromthe Armbrust lab, contains a
technique for scaffoldingmetagenome contigs into~whole genomes. YES.
Perspective: the coming infopocalypse
Assembling about $20k worth of data, we can generate approximately 700 microbial genomes worth of data. (This is only going to go up in yield/$$, note.)
Most of these assembled genomic contigs
(and genes) do not belong to studied
organisms.
What the heck do they do??
More thoughts on assembly
Illumina is the only game in town for sequencing complex microbial populations, but dealing with the data (volume, errors) is tricky. This problem is being solved, by us and others.
We’re working to make it as close to push button as possible, with objectively argued parameters and tools, and methods for evaluating new tools and sequencing types.
The community is working on dealing with data downstream of sequencing & assembly. Most pipelines were built around 454 data – long reads, and
relatively few of them. With Illumina, we can get both long contigs and quantitative
information about their abundance. This necessitates changes to pipelines like MG-RAST and HUMANn.
The interpretation challenge
For soil, we have generated approximately 1200 bacterial genomes worth of assembled genomic DNA from two soil samples.
The vast majority of this genomic DNA contains unknown genes with largely unknown function.
Most annotations of gene function & interaction are from a few phylogenetically limited model organisms Est 98% of annotations are computationally inferred: transferred from
model organisms to genomic sequence, using homology. Can these annotations be transferred? (Probably not.)
This will be the biggest sequence analysis challenge of the next 50 years.
Concluding thoughts on “assembly”
We can handle all the data (modulo another year or so of engineering.) Bring it on!
Our approaches let us (& you) assemble pretty much anything, much more easily than before. (Single cell, microbial genomes, transcriptomes, eukaryotic genomes, metagenomes, BAC sequencing…)
Seriously. No more problemo. Done. Finished. Kaput.
So now what? Validation. Interpretation and building general tools. Interpretation relies on annotation… (Uh oh.)
What are future needs?
High-quality, medium+ throughput annotation of genomes? Extrapolating from model organisms is both immensely important
and yet lacking. Strong phylogenetic sampling bias in existing annotations.
Synthetic biology for investigating non-model organisms?
(Cleverness in experimental biology doesn’t scale )
Integration of microbiology, community ecology/evolution modeling, and data analysis.
Replication fu
In December 2011, I met Wes McKinney on a train and he convinced me that I should look at IPython Notebook.
This is an interactive Web notebook for data analysis…
Hey, neat! We can use this for replication! All of our figures can be regenerated from scratch, on an EC2
instance, using a Makefile (data pipeline) and IPython Notebook (figure generation).
Everything is version controlled. Honestly not much work, and will be less the next time.
So… how’d that go?
People who already cared thought it was nifty.
http://ivory.idyll.org/blog/replication-i.html
Almost nobody else cares ;( Presub enquiry to editor: “Be sure that your paper can be reproduced.” Uh,
please read my letter to the end? “Could you improve your Makefile? I want to reimplement diginorm in
another language and reuse your pipeline, but your Makefile is a mess.”
Incredibly useful, nonetheless. Already part of undergraduate and graduate training in my lab; helping us and others with next parpes; etc. etc. etc.
Life is way too short to waste on unnecessarily replicating your own workflows, much less other people’s.
Acknowledgements
Lab members involvedCollaborators
Adina Howe (w/Tiedje) Jason Pell Arend Hintze Rosangela Canino-Koning Qingpeng Zhang Elijah Lowe Likit Preeyanon Jiarong Guo Tim Brom Kanchan Pavangadkar Eric McDonald
Jim Tiedje, MSU
Billie Swalla, UW
Janet Jansson, LBNL
Susannah Tringe, JGI
Funding
USDA NIFA; NSF IOS; BEACON.
Current research in my labSolving the rest of your problems
Preliminary functional analysis
Search SSU rRNA gene in Illumina data
1. Randomly sequencing about 100bp long DNA in microbial genomes;
2. Everything is sequenced;
3. Not limited by primers or PCR bias;
4. Data mining is the challenge;
10^3
10^610^7 10^4
Reads #
SSU rRNA Gene length
Expected SSU RNA genefragments
Genome length
Classification: Pyrotag vs shotgun
RDP-pyrotag-SSUsilva-pyrotag-SSUsilva-shotgun-SSU
Primers used in 454 Titanium sequencing of SSU rRNA gene, using E.coli as an example. Consensus sequences of the primer region from Illumina reads suggest 1) searching method is good and 2)primer bias is minimal at the current E-value cutoff.
Start:907 End:1402
Forward
Reverse
1542 bp
Sequence logo of short reads at forward primer region:
AAACTYAAAKGAATTGACGGCurrent forward primer
GYACACACCGCCCGTCurrent reverse primer (reverse complement)
Sequence logo of short reads at reverse primer region:
CowRumen – JGI 16s primer mismatches
postion A T C G Total1G 0.001 0.001 0.002 0.996 121542T 0.002 0.983 0.003 0.012 121693G 0.001 0.001 0.002 0.995 121664C 0.001 0.001 0.996 0.002 121435C 0.003 0.001 0.994 0.002 121836A 0.986 0 0.008 0.005 122097G 0.001 0.001 0.002 0.996 121898C 0.001 0.001 0.996 0.002 121989A 0.978 0.001 0.017 0.004 12230
10G 0.001 0 0.002 0.997 1223111C 0.001 0.001 0.996 0.002 1219812C 0.002 0.001 0.994 0.003 1218513G 0 0 0.002 0.997 1219014C 0.001 0.001 0.995 0.003 1219515G 0.001 0.001 0 0.998 1221316G 0.001 0.001 0 0.998 1220617T 0.002 0.974 0.003 0.021 1217118A 0.99 0.001 0.006 0.003 1215019A 0.995 0.001 0.002 0.002 12106
Running HMMs over de Bruijn graphs(=> cross validation)
hmmgs: Assemble based on good-scoring HMM paths through the graph.
Independent of other assemblers; very sensitive, specific.
95% of hmmgs rplB domains are present in our partitioned assemblies.
GA
C AC
C
AC
TG
TAAT
AG
TT
CT
CT
TC
CTA
Jordan Fish, Qiong Wang, and Jim Cole (RDP)
Streaming error correction.
We can do error trimming of genomic, MDA, transcriptomic, metagenomic data in < 2 passes, fixed memory.
We have just submitted a proposal to adapt Euler or Quake-like error correction (e.g. spectral alignment problem) to this
framework.
Side note: error correction is the biggest “data” problem left in
sequencing.
Both for mapping & assembly.
End:1402
Consensus of short reads at
reverse primer region:
Figure. Primers used in 454 Titanium sequencing of 16S rRNA gene, using E.coli as an
example. Consensus sequences of the primer region from Illumina reads suggest primer bias
is minimal at the current E-value cutoff.
Start:907Forward
1542 bp
Consensus of short reads at
forward primer region:
AAACTYAAAKGAATTGACGG
Current forward primer
Supplemental: abundance filtering is very lossy.
Total
3.8x partition
8.2x partition
Largest partition
0.010.0
20.030.0
40.050.0
60.070.0
80.090.0
100.0
Percent loss from abundance filtering (all >= 2)
contigsbp
Percentage lost
Comparing assemblers
Comparing assemblies / dendrogram
Integrating modeling into data analysis?