Metagenomics assembly: the good, the bad, and the ugly
Mihai Pop
Shotgun sequencing
shearing
sequencing
assembly
original DNA (hopefully)
Why assembly?

Since     Technology                              Read length             Throughput/run (run time)   Throughput/hour   Cost/run
1977-     Sanger sequencing                       1000-2000 bp            400-500 kbp (4 hr)          100 kbp           $200
2005-     454 pyrosequencing                      400 bp                  500 Mbp (4 hr)              >100 Mbp          $17,000
2006-     Illumina/Solexa                         50-100 bp               250 Gbp (7-10 days)         1 Gbp             $27,000
2007-     ABI SOLiD                               35-50 bp                6-20 Gbp (3 days)           75-250 Mbp        $3-5,000
2012-     Pacific Biosciences (single molecule)   ~10-20 kbp, 15% error   3 Mbp (3 hours)             1-3 Mbp           $2,500
2014ish   Oxford Nanopore (single molecule)       ?                       ?                           ?                 ?
Genome sizes:
Viruses ~100 kbp
Bacteria ~1-5 Mbp
Most Eukarya ~100s of Mbp
Human ~3 Gbp
A tale of assembly
it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
Assembling two cities
the age of foolishness
best of times it
it was the best
it was the best
it was the age
it was the worst
was the best of
was the best of
it was the age
the best of times
of times it was
times it was the
was the worst of
the worst of times
worst of times it
of times it was
times it was the
of times it was
it was the age
was the age of
it was the age
was the age of
the age of wisdom
age of wisdom it
of wisdom it was
wisdom it was the
Greedy algorithm
• Compute all pairwise overlaps
• *Pick best (e.g. in terms of alignment score) overlap
• Join corresponding reads
• Repeat from * until no more joins possible

Basis for many popular assemblers: phrap, TIGR Assembler, CAP, etc.
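The greedy steps above can be sketched in a few lines of Python. This is a toy illustration on word-level reads, assuming exact suffix-prefix matches stand in for alignment scores. The names `overlap` and `greedy_assemble` are made up for this sketch and are not taken from phrap, TIGR Assembler, or CAP.

```python
def overlap(a, b):
    """Length (in words) of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def greedy_assemble(reads, min_overlap=1):
    """Repeatedly join the pair of reads with the largest exact overlap."""
    # Duplicate reads carry no extra information here; sorting keeps runs repeatable.
    contigs = [tuple(r.split()) for r in sorted(set(reads))]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):          # compute all pairwise overlaps
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b)
                    if n > best_len:             # pick the best overlap
                        best_len, best_i, best_j = n, i, j
        if best_len < min_overlap:               # no more joins possible
            break
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)                   # join the corresponding reads
    return [" ".join(c) for c in contigs]


demo = greedy_assemble(["the quick brown", "quick brown fox", "brown fox jumps"])
print(demo)
```

On reads without repeats the result is a single contig. On the "it was the ..." reads, several overlaps tie for best, so the outcome depends on which tie is taken first.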
Greedy approach gets 'stuck'
the age of foolishness
best of times it
it was the best
it was the best
it was the age
it was the worst
was the best of
was the best of
it was the age
the best of times
of times it was
times it was the
the worst of times
worst of times it
of times it was
of times it was
it was the age
it was the age
was the age of
the age of wisdom
age of wisdom it
of wisdom it was
wisdom it was the
Graph-based approaches
best of times it
it was the best
it was the age
it was the worst
was the best of
the best of times
times it was the
of times it was
was the worst of
the worst of times
worst of times it
of times it was
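One way to turn the reads above into a graph is a de Bruijn-style construction over words: every run of k-1 words is a node, and every run of k words seen in a read adds an edge. The slide does not say which graph model it has in mind, so the construction, the choice k = 3, and the names `word_graph` and `branching_nodes` are assumptions made for this sketch.

```python
from collections import defaultdict

READS = [
    "best of times it", "it was the best", "it was the age", "it was the worst",
    "was the best of", "the best of times", "times it was the", "of times it was",
    "was the worst of", "the worst of times", "worst of times it",
]


def word_graph(reads, k=3):
    """Map each (k-1)-word node to the set of nodes that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        words = read.split()
        for i in range(len(words) - k + 1):
            kmer = tuple(words[i:i + k])
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. places where the path is ambiguous."""
    return {node: nxt for node, nxt in graph.items() if len(nxt) > 1}


branches = branching_nodes(word_graph(READS, k=3))
for node, successors in branches.items():
    print(" ".join(node), "->", sorted(" ".join(s) for s in successors))
```

Duplicate reads collapse into the same edges, so coverage does not change the shape of the graph. The only ambiguity in this read set is the choice that follows "was the".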
(meta)genome assembly is impossible
it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
[Figure: graph of the sentence with node labels "it was the", "of times", "best", "worst", "age of", "wisdom", "foolishness"]
it was the best of times it was the age of wisdom it was the age of foolishness it was the worst of times
Mycoplasma genitalium, 25 bp reads
Kingsford et al., BMC Bioinformatics 2010
Puzzle
• 13 pieces
  – 6,227,020,800 ways to order them
  – 8,192 ways to split them in two layers
  – 3,538,944 ways to arrange them into 6 "rows" of two pieces each and one with three
  – ...
• Why is it hard?
  – Constraint: need to fit everything in 8 in³ (a 0.5 x 8 x 2 in prism)
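The first two counts and the box volume follow from simple arithmetic, checked below as a quick sanity check. The third count (3,538,944) depends on how the slide counts row arrangements, which is not stated, so it is not derived here.

```python
import math

pieces = 13

orderings = math.factorial(pieces)   # 13! ways to order 13 distinct pieces
layer_splits = 2 ** pieces           # each piece goes in layer 1 or layer 2
box_volume = 0.5 * 8 * 2             # prism dimensions in inches

print(f"{orderings:,} orderings")        # 6,227,020,800 orderings
print(f"{layer_splits:,} layer splits")  # 8,192 layer splits
print(f"{box_volume} cubic inches")      # 8.0 cubic inches
```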
Some algorithms
• Greedy
  – take longest piece and put it in the box
  – fit as much as possible on same row
  – repeat with the remaining longest piece
  – etc...
• Pick through each of the possible orderings of pieces
  – put pieces in order in the box as they fit
  – eventually you'll hit on the right order...
A simpler solution
Violate implicit constraint – pieces must stay intact
3072 different solutions (I think)
http://www.elversonpuzzle.com/
$11.95 + S&H
Read length matters
[Figure: graph of the sentence with node labels "it was the", "of times", "best", "worst", "age of", "wisdom", "foolishness"]
it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
Read = 6 “words” (> length of repeat)
it was the best of times it, times it was the worst of, times it was the age of, was the age of wisdom it, ...
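To check the claim that 6-word reads are longer than the repeats, the longest repeated phrase in the sentence can be found by brute force. This is a minimal sketch; the helper names `repeated_ngrams` and `longest_repeat` are made up for illustration.

```python
from collections import Counter

SENTENCE = ("it was the best of times it was the worst of times "
            "it was the age of wisdom it was the age of foolishness")


def repeated_ngrams(words, n):
    """All n-word phrases that occur more than once."""
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {" ".join(gram) for gram, count in counts.items() if count > 1}


def longest_repeat(text):
    """Length of the longest repeated phrase, and the phrases of that length."""
    words = text.split()
    for n in range(len(words) - 1, 0, -1):
        found = repeated_ngrams(words, n)
        if found:
            return n, found
    return 0, set()


length, phrases = longest_repeat(SENTENCE)
print(length, sorted(phrases))
```

The longest repeats here are five words long ("of times it was the" and "it was the age of"), so a 6-word read is longer than any repeat in this sentence.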
Read length matters
k = 50 k = 1,000 k = 5,000
Does anyone see the mistake in the picture?
Read length matters...
Nagarajan, Pop. J. Comp. Biol. 2009, Kingsford et al., BMC Bioinformatics 2010
• Reads (much) longer than repeats – assembly trivial
• Reads roughly equal to repeats – assembly computationally difficult (NP-hard)
• Reads shorter than repeats – assembly undetermined
Number of possible reconstructions exponential in # of repeats
What are repeats?
Isolate genome
Metagenome
In metagenomes repeats are approximately genome-sized
Haplotype phasing with unknown number of haplotypes
Metagenomic questions
• What is the relative abundance of organism X versus organisms Y and Z?
• What proportion of organisms of type X have pathogenicity island P?
• Is pathogenicity island P only found in organism X or also in organisms Y and Z?
E. coli ETEC, EPEC, EAEC, EHEC, ...
Shiga toxin in Shigella or E. coli, ...
What happens at low coverage?
• Can you solve the puzzle if you remove one or a few pieces?
• What can you say about the "solution"?
• How about a genome? What happens if you have gaps in coverage?
Lack of coverage leads to errors
[Figure: graph of the sentence with node labels "it was the", "of times", "best", "worst", "age of", "wisdom", "foolishness"]
it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
it was the worst of times it, times it was the worst of, times it was the age of, was the age of foolishness it
it was the worst of times it was the age of foolishness
The log puzzle solution is not consistent with the actual correct solution
Mate-pairs don't really help
Assembly is impossible
• Long reads (even 10 kbp) insufficient as repeats are as long as genomes (100s of kbp to Mbp)
• Errors impossible to avoid in low coverage genomes
• Mate-pairs don't help
• Computationally, assembly is very very hard
WHY BOTHER?
Assembly as compression
Stool sample SRS049995
– in: 11.2 Gbp
– out: 174 Mbp + 20 Mbp (unassembled reads)
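The size of the reduction for this sample follows directly from the figures above. This is a back-of-the-envelope calculation, assuming "Gbp" and "Mbp" mean 10^9 and 10^6 base pairs.

```python
reads_in_bp = 11.2e9      # input reads
assembled_bp = 174e6      # assembled output
unassembled_bp = 20e6     # reads left unassembled

output_bp = assembled_bp + unassembled_bp
compression_ratio = reads_in_bp / output_bp

print(f"output: {output_bp / 1e6:.0f} Mbp")
print(f"compression: about {compression_ratio:.0f}x")
```

That is roughly a 58-fold reduction in the amount of sequence that downstream steps have to process.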
[Figure: workflow diagram with boxes labeled Reads, Assembly, ORF calling, Metaphyler (twice), functional profiling, pathway analysis, storage, etc.]
Liu et al. BMC Genomics 2011, Treangen et al. Gen. Biol. 2014
Interesting genomes
• Most microbes are not easily cultured and only known by 16S rDNA signature
  – e.g. RDP grew from ~80,000 (v. 10.4) to 2.1 million (v. 10.28)
  – only ~10,000 sequences from type strains
  – only ~150,000 sequences from isolate genomes
  – metagenomic assembly is the only way to get the rest
• Clinical studies reveal interesting 16S patterns – what do the genomes do?
Strain structure matters
Genes are easy...
Metagenomic assembly: technical issues
Main challenges in metagenomic assembly
● Difficult to find repeats
  – coverage vs. over-representation
  – within-genome vs. across-genome repeats
● High genomic variation
  – sequencing experiment has ~10^15 cells, i.e., each read comes from a different cell
  – phages, transposons, etc. affect only a fraction of the population even in 'homogeneous' strains
K-mer size and why it matters
● overlaps versus errors or variants
  – if k is too large, reads cannot be "stitched" together
● repeats
  – if k is too small, unrelated reads get linked to each other
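Both failure modes can be seen by counting the k-mers two reads share. The three reads below are invented for illustration: `read_a` and `read_b` truly overlap by 7 bases, while `read_c` is unrelated.

```python
def kmers(seq, k):
    """Set of all substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(a, b, k):
    """k-mers present in both reads; a graph-based assembler links reads through these."""
    return kmers(a, k) & kmers(b, k)


read_a = "ATGGCGTGCA"
read_b = "GCGTGCAATC"   # overlaps the last 7 bases of read_a
read_c = "TTTGCAGGG"    # unrelated read

for k in (3, 5, 8):
    print(k, len(shared_kmers(read_a, read_b, k)), len(shared_kmers(read_a, read_c, k)))
```

With k = 8 the true overlap (7 bases) is shorter than k, so the two overlapping reads share nothing. With k = 3 the unrelated read is linked through "TGC" and "GCA". An intermediate k = 5 keeps the true link and drops the spurious one.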
Repeat-detection/removal
● Bambus 2 scaffolder (local coverage, graph theoretic arguments)
● IDBA-UD (local coverage, mate-pair information)
Polymorphisms in community
● IDBA-UD – 'smooth' out variants
● MetaVelvet – attempts to decompose the graph, then assemble haploid genomes based on coverage concordance
● MaryGold – detects some polymorphisms
● Anvi'o
Reference-guided assembly
● The entire read is used to determine placement – repeats are less of an issue
● MetaCompass (https://github.com/marbl/MetaCompass)
Likelihood based assembly
● Find string of letters that maximizes likelihood of reads
● Common 'trick' in speech/language processing
● Main approach for quasi-species assembly
  – ShoRAH
  – ViSpA
● 16S reconstruction
  – EMIRGE
● Metagenomic assembly
  – Genovo
● Won't cover it here but keep your ears open
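The idea of scoring a candidate sequence by the likelihood of the reads can be sketched with a deliberately simple read model. Every modeling choice here is an assumption for illustration (uniform start position, independent substitution errors at a fixed rate, no indels) and does not describe how ShoRAH, ViSpA, EMIRGE, or Genovo work internally.

```python
import math


def read_log_likelihood(read, genome, error_rate=0.01):
    """log P(read | genome) under a uniform start position and substitution-only errors."""
    n_starts = len(genome) - len(read) + 1
    if n_starts <= 0:
        return float("-inf")
    total = 0.0
    for start in range(n_starts):
        p = 1.0
        for r, g in zip(read, genome[start:start + len(read)]):
            p *= (1 - error_rate) if r == g else error_rate / 3
        total += p
    return math.log(total / n_starts)


def assembly_log_likelihood(reads, genome, error_rate=0.01):
    """Sum of per-read log-likelihoods, treating reads as independent."""
    return sum(read_log_likelihood(r, genome, error_rate) for r in reads)


reads = ["ACGTT", "GTTGC", "TTGCA"]
candidate_1 = "ACGTTGCA"   # consistent with every read
candidate_2 = "ACGTAGCA"   # differs at one base covered by every read

score_1 = assembly_log_likelihood(reads, candidate_1)
score_2 = assembly_log_likelihood(reads, candidate_2)
print(score_1, score_2)
```

A likelihood-based assembler searches over candidate sequences for the one with the highest score. Here the candidate that agrees with all reads scores higher than the one carrying a mismatch.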
New information: correlation across samples
Quince – Concoct
Borenstein – Metagenomic deconvolution
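The signal behind cross-sample correlation is that contigs from the same organism should rise and fall in coverage together across samples. The sketch below is a toy illustration of that signal only, not the method used by either tool above. The coverage values and the contig names are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length coverage profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical mean coverage of three contigs in four samples.
coverage = {
    "contig_1": [10, 20, 5, 40],
    "contig_2": [11, 19, 6, 42],
    "contig_3": [30, 2, 25, 3],
}

r_12 = pearson(coverage["contig_1"], coverage["contig_2"])
r_13 = pearson(coverage["contig_1"], coverage["contig_3"])
print(f"contig_1 vs contig_2: {r_12:.2f}")
print(f"contig_1 vs contig_3: {r_13:.2f}")
```

Contigs 1 and 2 track each other closely and are candidates for the same genome, while contig 3 follows a different abundance pattern.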
HMP 16S vs MetaHit MGS
Catabacter hongkongiensis
Christensenella minuta
Christensenellaceae
Assembly is just a small piece of the puzzle
● Taxonomic assessment
● Gene finding
● Motif/variant detection
● etc.
● The individual analyses can feed into each other
  – taxonomic assessment can help define assembly strategy
  – gene finding can highlight errors
  – etc.
metAMOS
● Integrated pipeline for metagenomic assembly (mothur/Qiime for WGS analysis)
  – assembly, scaffolding, gene finding, taxonomic profiling, ...
  – builds upon other open-source tools
  – modular pipeline design using Ruffus
● Specialized metagenomic analyses (through Bambus 2)
  – coverage-independent repeat detection
  – genomic variant detection
Koren et al. Bioinformatics 2011, Treangen et al. Genome Biology 2013
A bit about validation
What are errors?
• Chimeric contigs/scaffolds (due to repeats or mixed organisms)
• Incorrect consensus calls
• Missing information
  – contigs/scaffolds broken up unnecessarily
  – missing variants
• Software errors
  – 15-50 bugs/1000 lines of code
  – Celera Assembler – 300,000 loc
Checking results
• Contiguity is just part of the story
• N50 doesn't make sense
  – better measure: size to xx Mbp, number to xx Mbp
• Errors need to be taken into account
  – hard to do without a reference
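The two kinds of contiguity measure can be compared side by side. This is a minimal sketch; it reads "size to xx Mbp, number to xx Mbp" as the length of the contig at which the cumulative size (largest first) reaches a fixed target, and the number of contigs needed to get there. That reading, and the function names, are assumptions.

```python
def n50(lengths):
    """Contig length at which the cumulative size reaches half of the assembly total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def size_and_number_to(lengths, target_bp):
    """(contig length, contig count) at which the cumulative size reaches target_bp."""
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target_bp:
            return length, count
    return None  # the assembly is smaller than the target


contigs = [100, 80, 60, 40, 20]
print(n50(contigs))
print(size_and_number_to(contigs, 200))
```

N50 depends on the total assembly size, which is not comparable between metagenomic assemblies. A fixed target gives all assemblies of the same sample a common yardstick.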
Practical modeling - assembly invariants
• Basic principles (modulo errors)
  – overlapping reads must agree
  – mate-pairs must be consistently placed in assembly
  – coverage must match statistical process that generated the data
  – all reads must be used
  – assembly must be as contiguous as possible
• These assumptions mostly break/should be relaxed in metagenomic data
Schatz et al, Genome Biology 2008
Model-based testing
[Figure: an unknown genome passes through "magic" (biological, biochemical, biophysical, signal processing, etc.) to give reads; the assembler ("computational magic") produces an assembly; a "model of magic" is applied to the assembly and the result is compared with the reads ("Same?")]
Modeling approach...aside
• Originally proposed by Gene Myers (early-mid 90s)
• Used in several metagenomic assemblers
  – Genovo (Laserson et al.)
  – ViSpA (Westbrooks et al.)
  – ShoRAH (Zagordi et al.)
• Idea of single number reflecting assembly quality

Genovo 'Score denovo':

$$\sum_i \mathrm{SWScore}_i - 2 \cdot \mathrm{length}(\mathrm{contigs}) + 2 \cdot \mathrm{minOvl} \cdot \mathrm{num}(\mathrm{contigs})$$

Note: need to know where every read is placed/could be placed!
(information rarely produced by 'modern' assemblers)
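The score on this slide is simple to evaluate once each read's Smith-Waterman score against the assembly is known. The sketch below takes those scores as given and assumes that length(contigs) means the total contig length in bases. The function name and the input numbers are invented for illustration.

```python
def score_denovo(sw_scores, contig_lengths, min_ovl):
    """Sum of read alignment scores, penalized by assembly length, credited per contig."""
    alignment_term = sum(sw_scores)
    length_penalty = 2 * sum(contig_lengths)
    contig_credit = 2 * min_ovl * len(contig_lengths)
    return alignment_term - length_penalty + contig_credit


# Hypothetical inputs: three reads placed on a single 60 bp contig, minOvl = 20.
example_score = score_denovo(sw_scores=[50, 48, 52], contig_lengths=[60], min_ovl=20)
print(example_score)
```

The alignment scores are the costly input: they require knowing where every read is or could be placed, which is exactly the information the note above says is rarely produced.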
ALE: Clark, S. et al. Bioinformatics 29(4): 435-443.
LAP: Ghodsi, M., et al. BMC Res Notes 6(1): 334.
CGAL: Rahman, A. and L. Pachter (2013). Genome Biol 14(1): R8.
https://github.com/marbl/VALET
Misassembly found by VALET
Conclusions
• Genome assembly, in general, is well studied, and very hard
• Key lesson: Garbage In Garbage Out (data more important than algorithm)
• Metagenomics offers valuable problems and information (e.g., scaffolding with cross-sample correlations)
• Key: formalize and tackle sub-problems of interest to biologists
  – gene identification/clustering
  – comparative analysis wrt reference
  – reconstruct specific organism (rather than entire metagenome)
• Validation and standards are critical