Metagenomics: the theory of assembly (and not only)
![Page 1: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/1.jpg)
Metagenomics: the theory of assembly (and not only)
Mihai Pop
![Page 2: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/2.jpg)
Metagenomics
• 1-5% of all bacteria can be cultured
  – standard microbiology gives us a skewed view of the world
• Culture-free approaches
  – 16S rRNA sequencing
  – random sequencing of entire population
• 16S rRNA sequencing
  – tells us about relative diversity of organisms
  – no information about what these organisms do
  – 16S is a multi-copy gene, so true abundances are difficult to estimate
• Random sequencing
  – potential to explore entire genomic content
  – requires deep sequencing (expensive)
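The multi-copy point can be made concrete with a minimal sketch. It assumes the 16S copy number of each taxon is known, which is often not the case in practice; the taxon names, read counts, and copy numbers below are hypothetical.

```python
def copy_number_corrected_abundance(read_counts, copy_numbers):
    """Turn 16S read counts into relative abundances corrected for gene copy number."""
    # Dividing by the copy number estimates how many cells produced the reads.
    cell_estimates = {
        taxon: read_counts[taxon] / copy_numbers[taxon] for taxon in read_counts
    }
    total = sum(cell_estimates.values())
    return {taxon: value / total for taxon, value in cell_estimates.items()}


# Hypothetical numbers: taxon_A carries 7 copies of the 16S gene, taxon_B only 1.
read_counts = {"taxon_A": 700, "taxon_B": 100}
copy_numbers = {"taxon_A": 7, "taxon_B": 1}

raw = {t: c / sum(read_counts.values()) for t, c in read_counts.items()}
corrected = copy_number_corrected_abundance(read_counts, copy_numbers)
print(raw)        # taxon_A dominates the raw read counts
print(corrected)  # after correction both taxa are equally abundant
```

Without the correction, taxon_A looks seven times more abundant than taxon_B even though both contribute the same estimated number of cells.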
![Page 3: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/3.jpg)
Why do we care?
• Bacteria are everywhere in the environment
• They are not all evil
• Bacteria can be quite useful
  – energy
  – bio-remediation
  – drug development
• By the commonly cited estimate, our bodies contain 1 order of magnitude more bacterial cells than human cells (more recent estimates put the ratio closer to 1:1)
  – critical to infant development (immune system, GI-tract)
  – provide essential nutrients (vitamin K, B12, essential amino-acids)
  – help digest complex molecules: starches, plant material
  – imbalances in normal bacterial populations correlate with disease
• Human microbiome project - nihroadmap.nih.gov/hmp/
![Page 4: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/4.jpg)
So...what are we looking for (17th century)?
![Page 5: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/5.jpg)
So...what are we looking for (21st century)?
>F4BT0V001CZSIM rank=0000138 x=1110.0 y=2700.0 length=57
ACTGCTCTCATGCTGCCTCCCGTAGGAGTGCCTCCCTGAGCCAGGATCAAACGTCTG
>F4BT0V001BBJQS rank=0000155 x=424.0 y=1826.0 length=47
ACTGACTGCATGCTGCCTCCCGTAGGAGTGCCTCCCTGCGCCATCAA
>F4BT0V001EDG35 rank=0000182 x=1676.0 y=2387.0 length=44
ACTGACTGCATGCTGCCTCCCGTAGGAGTCGCCGTCCTCGACNC
>F4BT0V001D2HQQ rank=0000196 x=1551.0 y=1984.0 length=42
ACTGACTGCATGCTGCCTCCCGTAGGAGTGCCGTCCCTCGAC
>F4BT0V001CM392 rank=0000206 x=966.0 y=1240.0 length=82
AANCAGCTCTCATGCTCGCCCTGACTTGGCATGTGTTAAGCCTGTAGGCTAGCGTTCATCCCTGAGCCAGGATCAAACTCTG
>F4BT0V001EIMFX rank=0000250 x=1735.0 y=907.0 length=46
ACTGACTGCATGCTGCCTCCCGTAGGAGTGTCGCGCCATCAGACTG
>F4BT0V001ENDKR rank=0000262 x=1789.0 y=1513.0 length=56
GACACTGTCATGCTGCCTCCCGTAGGAGTGCCTCCCTGAGCCAGGATCAAACTCTG
>F4BT0V001D91MI rank=0000288 x=1637.0 y=2088.0 length=56
ACTGCTCTCATGCTGCCTCCCGTAGGAGTGCCTCCCTGAGCCAGGATCAAACTCTG
>F4BT0V001D0Y5G rank=0000341 x=1534.0 y=866.0 length=75
GTCTGTGACATGCTGCCTCCCGTAGGAGTCTACACAAGTTGTGGCCCAGAACCACTGAGCCAGGATCAAACTCTG
>F4BT0V001EMLE1 rank=0000365 x=1780.0 y=1883.0 length=84
ACTGACTGCATGCTGCCTCCCGTAGGAGTGCCTCCCTGCGCCATCAATGCTGCATGCTGCTCCCTGAGCCAGGATCAAACTCTG
![Page 6: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/6.jpg)
Same or different?
• Is it real or noise?
• Is it the same 'organism' that I've seen before?
• What does it do?
Leeuwenhoek asked the same questions
![Page 7: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/7.jpg)
Same or different in the sequence world
• Is it real or noise?
  – sequencing error correction
  – detection of experimental artifacts (contamination, chimeras, etc.)
• Is it the same organism I've seen before?
  – database searches
• What does it do?
  – more database searches
• Same broad analysis for 16S and Whole (meta)Genome Sequencing
  – the devil is in the details
![Page 8: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/8.jpg)
What is a species?
• Concept is generally ill defined and impossible to define by sequence alone
From: Eur J Clin Microbiol Infect Dis (2012) 31:899–904
![Page 9: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/9.jpg)
Same vs. different, 16S vs WGS?
16S WGS
![Page 10: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/10.jpg)
Metagenome assembly
![Page 11: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/11.jpg)
(meta)genome assembly is impossible
![Page 12: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/12.jpg)
(meta)genome assembly is impossible
actually...it's all about information
![Page 13: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/13.jpg)
Conservation of information
data in → Algorithm → data out
I(in) >= I(out)
genome → Sequencing → reads → Assembler → assembly
I(genome) >= I(reads) >= I(assembly)
![Page 14: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/14.jpg)
Mycoplasma genitalium, 25 bp reads
Kingsford et al., BMC Bioinformatics 2010
![Page 15: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/15.jpg)
Read length matters
it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
Read = 3 “words” (<= length of repeat)
[Figure: assembly graph with nodes "it was the", "best", "worst", "of times", "age of", "wisdom", "foolishness"]
Reads: it was the, was the best, was the worst, was the age, the age of, ...
![Page 16: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/16.jpg)
Read length matters
it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
Read = 6 “words” (> length of repeat)
[Figure: assembly graph with nodes "it was the", "best", "worst", "of times", "age of", "wisdom", "foolishness"]
Reads: it was the best of times it, times it was the worst of, times it was the age of, was the age of wisdom it, ...
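The effect of read length on this sentence can be explored with a small sketch. It assumes a word-level de Bruijn graph, which is only one possible formulation and not necessarily the one drawn on the slides: every k-word read becomes an edge between its first and last k-1 words. The function names are made up for illustration.

```python
from collections import defaultdict


def word_de_bruijn(text, k):
    """Build a de Bruijn graph whose edges are k-word reads and whose
    nodes are the (k-1)-word prefixes and suffixes of those reads."""
    words = text.split()
    graph = defaultdict(set)
    for i in range(len(words) - k + 1):
        read = tuple(words[i:i + k])
        graph[read[:-1]].add(read[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. places where the
    reconstruction is ambiguous."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


def smallest_unambiguous_k(text, max_k):
    """Smallest read length (in words) for which the graph has no branches."""
    for k in range(2, max_k + 1):
        if not branching_nodes(word_de_bruijn(text, k)):
            return k
    return None


text = ("it was the best of times it was the worst of times "
        "it was the age of wisdom it was the age of foolishness")

print(branching_nodes(word_de_bruijn(text, 3)))
print(smallest_unambiguous_k(text, 10))
```

With 3-word reads the graph branches after "was the" and after "age of". Under this particular formulation the branches disappear only once the k-1 word overlap is longer than the longest repeat, which is 5 words here ("of times it was the" and "it was the age of").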
![Page 17: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/17.jpg)
Read length matters
[Figure panels: k = 50, k = 1,000, k = 5,000]
![Page 18: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/18.jpg)
Read length matters...
Nagarajan, Pop. J. Comp. Biol. 2009, Kingsford et al., BMC Bioinformatics 2010
• Reads (much) longer than repeats – assembly trivial
• Reads roughly equal to repeats – assembly computationally difficult (NP-hard)
• Reads shorter than repeats – assembly undetermined
Number of possible reconstructions exponential in # of repeats
![Page 19: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/19.jpg)
What are repeats?
Isolate genome
Metagenome
In metagenomes repeats are approximately genome-sized
Haplotype phasing with unknown number of haplotypes
![Page 20: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/20.jpg)
Metagenomic questions
• What is the relative abundance of organism X versus organisms Y and Z?
• What proportion of organisms of type X have pathogenicity island P?
• Is pathogenicity island P only found in organism X or also in organisms Y and Z?
E. coli ETEC, EPEC, EAEC, EHEC, ...
Shiga toxin in Shigella or E. coli, ...
![Page 21: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/21.jpg)
it was the best of times it was the worst of times
it was the age of wisdom it was the age of foolishness
it was the epoch of belief it was the epoch of incredulity
it was the season of light it was the season of darkness
it was the spring of hope it was the winter of despair
[Figure: assembly graph with nodes "it was the", "best", "worst", "of times", "age of", "wisdom", "foolishness", "epoch of", "belief", "incredulity", "season of", "light", "darkness", "spring of hope", "winter of despair"]
![Page 22: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/22.jpg)
it was the best of times it was the worst of times
it was the age of wisdom it was the age of foolishness
it was the epoch of belief it was the epoch of incredulity
it was the season of light it was the season of darkness
it was the spring of hope it was the winter of despair
stable non-descript it was the likeliest thing upon
peculiarity was that it was the faintness of solitude
quite sure that it was the prisoner
Paris as it was the episcopal mode among
but it was the old scared lost look
moreover it was the spot to which
that night it was the fourteenth of August

it was the      17
that it was      2
the age of       2
the epoch of     2
the season of    2
was the age      2
was the epoch    2
was the season   2
was the spot     1
...

But... coverage doesn't work in metagenomics, single cell genomics, etc.
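The counts on this slide can be reproduced with a short sketch that tallies every 3-word window in each text fragment. Treating each fragment independently (no windows spanning two fragments) is an assumption made here for simplicity, and `count_word_kmers` is a made-up name.

```python
from collections import Counter


def count_word_kmers(fragments, k=3):
    """Count every k-word window within each fragment."""
    counts = Counter()
    for fragment in fragments:
        words = fragment.lower().split()
        for i in range(len(words) - k + 1):
            counts[" ".join(words[i:i + k])] += 1
    return counts


fragments = [
    "it was the best of times it was the worst of times",
    "it was the age of wisdom it was the age of foolishness",
    "it was the epoch of belief it was the epoch of incredulity",
    "it was the season of light it was the season of darkness",
    "it was the spring of hope it was the winter of despair",
    "stable non-descript it was the likeliest thing upon",
    "peculiarity was that it was the faintness of solitude",
    "quite sure that it was the prisoner",
    "Paris as it was the episcopal mode among",
    "but it was the old scared lost look",
    "moreover it was the spot to which",
    "that night it was the fourteenth of August",
]

counts = count_word_kmers(fragments)
for kmer, count in counts.most_common(5):
    print(f"{kmer:<16}{count}")
```

An over-represented window such as "it was the" is a repeat candidate in a single genome sequenced at even depth. In a metagenome the same signal could equally come from an abundant organism, which is why coverage alone is unreliable there.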
![Page 23: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/23.jpg)
Lack of coverage leads to errors
it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
[Figure: assembly graph with nodes "it was the", "best", "worst", "of times", "age of", "wisdom", "foolishness"]
Reads: it was the worst of times it, times it was the worst of, times it was the age of, was the age of foolishness it
Assembly: it was the worst of times it was the age of foolishness
![Page 24: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/24.jpg)
Assembly is impossible
• Long reads (even 10 kbp) insufficient as repeats are as long as genomes (100s of kbp to Mbp)
• Errors impossible to avoid in low coverage genomes
• Computationally, assembly is very very hard
WHY BOTHER?
![Page 25: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/25.jpg)
Assembly as compression
Stool sample SRS049995
– in: 11.2 Gbp
– out: 174 Mbp + 20 Mbp (unassembled reads)
[Figure: workflow diagram with nodes Reads, MetaPhyler, Assembly, MetaPhyler, ORF calling, functional profiling, pathway analysis, storage, etc.]
Liu et al. BMC Genomics 2011, Treangen et al. in preparation
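The size of the reduction for this sample follows directly from the numbers on the slide. The sketch below only does that arithmetic; it measures base pairs kept, not file sizes, and `compression_ratio` is a made-up helper name.

```python
def compression_ratio(input_bp, output_bp):
    """How many input bases are represented by each output base."""
    return input_bp / output_bp


input_bp = 11.2e9        # raw reads for stool sample SRS049995
assembled_bp = 174e6     # assembled contigs
unassembled_bp = 20e6    # reads left out of the assembly

ratio = compression_ratio(input_bp, assembled_bp + unassembled_bp)
print(f"{ratio:.1f}-fold reduction")
```

The result is a bit under 58-fold, so downstream steps such as ORF calling or functional profiling run on under 2% of the original data volume.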
![Page 26: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/26.jpg)
Interesting genomes
• Most microbes are not easily cultured and only known by 16S rDNA signature
  – e.g. RDP grew from ~80,000 (v. 10.4) to 2.1 million (v. 10.28)
  – only ~10,000 sequences from type strains
  – only ~150,000 sequences from isolate genomes
  – metagenomic assembly is the only way to get the rest
• Clinical studies reveal interesting 16S patterns
OTU: Gammaproteobacteria;Pasteurellales;Pasteurellaceae;???
association with diarrhea p=10-135,13-fold increase in cases
![Page 27: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/27.jpg)
Strain structure matters
![Page 28: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/28.jpg)
Metagenomic assembly: technical issues
![Page 29: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/29.jpg)
Main challenges in metagenomic assembly
● Difficult to find repeats
  – coverage vs. over-representation
  – within-genome vs. across-genome repeats
● High genomic variation
  – sequencing experiment has ~10^15 cells, i.e., each read comes from a different cell
  – phages, transposons, etc. affect only a fraction of the population even in 'homogeneous' strains
![Page 30: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/30.jpg)
Some solutions
● Coverage-independent repeat detection/removal
  – Bambus 2 scaffolder (local coverage, graph theoretic arguments)
  – IDBA-UD (local coverage, mate-pair information)
● Polymorphisms in the community
  – Bambus 2 scaffolder: preserve variants that don't 'tangle' the graph
  – IDBA-UD: 'smooth' out variants
  – MetaVelvet: attempt to decompose the graph, then assemble haploid genomes based on coverage concordance
Note: none of these features have been fully evaluated in a realistic setting.
![Page 31: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/31.jpg)
Aside: Repeat resolution with mate-pairs
● Mate-pairs: pairs of sequencing reads separated by an approximately known distance
● Commonly generated in (double barrelled) shotgun sequencing experiments
● Key idea: find (unique) path through assembly graph consistent with length of a mate-pair
![Page 32: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/32.jpg)
Caveats
● Finding path consistent with mate-pair length: NP-hard
● Heuristic: Mate-pair "useful" if shortest path between ends is consistent with mate-pair length
● A mate-pair "disambiguates" a section of the graph if there is a unique shortest path consistent with the mate-pair length
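The shortest-path heuristic can be sketched as follows. This is an illustration under simplifying assumptions, not the Bambus 2 implementation: contigs are graph nodes, the distance between two contigs is the total length of the contigs strictly between them, and a mate-pair counts as "useful" when that distance falls within a tolerance of the expected gap. All names and numbers are hypothetical.

```python
import heapq


def shortest_gap(graph, lengths, source, target):
    """Dijkstra over node weights: total length of the contigs strictly
    between source and target on the shortest path, or None if unreachable."""
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt in graph.get(node, ()):
            # Passing through an intermediate contig adds its length;
            # arriving at the target adds nothing.
            cost = dist + (0 if nxt == target else lengths[nxt])
            if cost < best.get(nxt, float("inf")):
                best[nxt] = cost
                heapq.heappush(heap, (cost, nxt))
    return None


def mate_pair_is_useful(graph, lengths, source, target, expected_gap, tolerance):
    """True if the shortest path between the mates' contigs matches the library."""
    gap = shortest_gap(graph, lengths, source, target)
    return gap is not None and abs(gap - expected_gap) <= tolerance


# Hypothetical graph: A reaches B through a short repeat R or a long contig X.
graph = {"A": ["R", "X"], "R": ["B", "C"], "X": ["B"]}
lengths = {"A": 2000, "B": 1500, "C": 1800, "R": 500, "X": 3000}

print(shortest_gap(graph, lengths, "A", "B"))
print(mate_pair_is_useful(graph, lengths, "A", "B", expected_gap=600, tolerance=200))
print(mate_pair_is_useful(graph, lengths, "A", "B", expected_gap=3000, tolerance=200))
```

The last call shows the price of the heuristic: a longer path through X would match a 3,000 bp gap, but only the shortest path is examined, so that mate-pair is ignored rather than searched for exhaustively.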
![Page 33: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/33.jpg)
To infinity and beyond
![Page 34: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/34.jpg)
Key insight
● Non-trivial nodes can only be resolved by mate-pairs that span them tightly
● Idea: pick the most useful mate-pairs
![Page 35: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/35.jpg)
Tuning leads to better assemblies
[Figure: assemblies from standard vs. tuned libraries; panel labels "Average improvement: 47.52%" and "Average improvement: 82.7%"]
![Page 36: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/36.jpg)
Likelihood based assembly
● Find string of letters that maximizes likelihood of reads
● Common 'trick' in speech/language processing
● Main approach for quasi-species assembly
  – Shorah
  – Vispa
● 16S reconstruction
  – Emirge
● Metagenomic assembly
  – Genovo
● Won't cover it here but keep your ears open
![Page 37: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/37.jpg)
Assembly is just a small piece of the puzzle
● Taxonomic assessment
● Gene Finding
● Motif/variant detection
● etc.
● The individual analyses can feed into each other
  – taxonomic assessment can help define assembly strategy
  – gene finding can highlight errors
  – etc.
![Page 38: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/38.jpg)
Example: MetaCompass – comparative assembler
Liu et al. in preparation
![Page 39: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/39.jpg)
Aside: taxonomic classification with MetaPhyler
● WGS sequences classified against marker gene database (rpoB, recA, etc.)
● Blast-based classifier
● Different classifier built for each gene and taxonomic level
● Classifier automatically adjusts for alignment length
● Works with both DNA and protein data
metaphyler.cbcb.umd.edu
![Page 40: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/40.jpg)
metAMOS
● Integrated pipeline for metagenomic assembly (mothur/Qiime for WGS analysis)
  – assembly, scaffolding, gene finding, taxonomic profiling, ...
  – builds upon other open-source tools
  – modular pipeline design using Ruffus
● Specialized metagenomic analyses (through Bambus 2)
  – coverage-independent repeat detection
  – genomic variant detection
Koren et al. Bioinformatics 2011, Treangen et al. Genome Biology 2013
![Page 41: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/41.jpg)
Does it work?
![Page 42: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/42.jpg)
The nitty gritty details
![Page 43: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/43.jpg)
Introduction
• Metagenomic assembly is still an active area of research
• No metagenomic "sequencher" yet available/possible
• Data-sets can be huge
  – typical HMP data-set: ~8 Gbp
  – download time: > 1 hour
  – uncompressing (bzip2): ~25 minutes
  – stringent alignment to reference (e.g. assembly): > 25 minutes
• Assemblers need a lot of memory (>> 4 GB)
  – 26 CPU, 68 GB RAM: $2/hour @ Amazon EC2
  – 48 CPU, 64 GB RAM: ~$30 K @ Dell
  – 256 GB RAM: ~$50 K @ Dell
  (compare to Illumina instrument price)
![Page 44: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/44.jpg)
What you need for assembly
• Sequencing reads
  – fasta
  – fastq (horrible format...but a standard)
  – .sra (even worse than fastq but favored by NCBI)
• Library information
  – not always easy to figure out (lab estimates way off)
  – script provided (compute_mates.pl) for estimating library size from alignments to 16S rRNA
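The idea of estimating library size from alignments can be sketched as below. This is not the compute_mates.pl script mentioned on the slide, whose internals are not shown here. It assumes both mates of each pair have already been aligned to the same reference sequence and that the outer coordinates of each pair are available. Function and variable names are hypothetical.

```python
import statistics


def estimate_library_size(mate_alignments):
    """Estimate insert size from pairs of (forward_start, reverse_end)
    coordinates on the same reference. Returns (median, standard deviation)."""
    sizes = [abs(end - start) for start, end in mate_alignments]
    return statistics.median(sizes), statistics.pstdev(sizes)


# Hypothetical alignments of three mate-pairs to one 16S reference.
alignments = [(100, 400), (50, 360), (200, 490)]
median_size, spread = estimate_library_size(alignments)
print(median_size, round(spread, 1))
```

The median is used as the size estimate because a few chimeric or mis-aligned pairs would otherwise pull a mean far away from the true library size.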
![Page 45: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/45.jpg)
Preprocessing
• Guiding principle: Garbage In, Garbage Out
• In general: by throwing away data, assembly can only improve
  – quality trimming
  – removal of reads that are too short
  – removal of technical duplicates (artifact of many sequencing technologies)
  – removal of contaminant sequences (e.g. human DNA)
• Don't be shocked if > 20% of the data are thrown out
• Converting files is often challenging
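Three of the filters listed above can be sketched in a few lines. The sketch assumes reads are already parsed into (name, sequence, quality list) tuples, trims only the low-quality 3' tail, and treats exact sequence identity as a crude stand-in for technical duplicates. Contaminant removal is left out because it needs an aligner. The thresholds and names are illustrative, not recommended settings.

```python
def trim_low_quality_tail(seq, quals, min_qual):
    """Drop bases from the 3' end until one has quality >= min_qual."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    return seq[:end], quals[:end]


def preprocess(reads, min_qual=20, min_len=30):
    """Quality-trim, drop short reads, and drop exact duplicate sequences."""
    seen = set()
    kept = []
    for name, seq, quals in reads:
        seq, quals = trim_low_quality_tail(seq, quals, min_qual)
        if len(seq) < min_len:
            continue  # too short after trimming
        if seq in seen:
            continue  # exact duplicate of a read already kept
        seen.add(seq)
        kept.append((name, seq, quals))
    return kept


# Hypothetical toy reads; min_len is lowered so the example stays readable.
reads = [
    ("r1", "ACGTACGT", [30] * 8),
    ("r2", "ACGTACGT", [30] * 8),                  # duplicate of r1
    ("r3", "ACGTAC", [30, 30, 30, 5, 5, 5]),       # too short once trimmed
    ("r4", "TTTTGGGG", [30] * 6 + [10, 10]),       # loses its last two bases
]
kept = preprocess(reads, min_qual=20, min_len=5)
print([name for name, _, _ in kept])
```

Here half of the toy reads are discarded, which is in line with the warning that losing a large fraction of the data is normal.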
![Page 46: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/46.jpg)
Checking results
• Contiguity is just part of the story
• N50 doesn't make sense
  – better measure: size to xx Mbp, number to xx Mbp
• Errors need to be taken into account
  – hard to do without a reference
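One way to read the N50 remark is that N50 depends on the total assembly size, which is unknown and assembler-dependent for a metagenome. The sketch below contrasts N50 with a fixed-target measure: the contig size and contig count at which the largest contigs add up to a chosen number of bases. The function names and toy contig lengths are made up.

```python
def n50(lengths):
    """Length of the contig at which the largest contigs cover half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def size_and_number_at(lengths, target_bp):
    """(contig size, number of contigs) at which the largest contigs reach
    target_bp in total, or None if the assembly is smaller than target_bp."""
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target_bp:
            return length, count
    return None


# Two hypothetical assemblies with identical large contigs;
# assembly_a additionally reports twenty small contigs.
assembly_a = [500, 400, 300] + [50] * 20
assembly_b = [500, 400, 300]

print(n50(assembly_a), n50(assembly_b))
print(size_and_number_at(assembly_a, 1000), size_and_number_at(assembly_b, 1000))
```

Discarding the small contigs raises N50 from 300 to 400 although nothing about the large contigs changed, while the fixed-target measure gives the same answer for both assemblies.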
![Page 47: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/47.jpg)
A bit about validation
![Page 48: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/48.jpg)
What are errors?
• Chimeric contigs/scaffolds (due to repeats or mixed organisms)
• Incorrect consensus calls
• Missing information
  – contigs/scaffolds broken up unnecessarily
  – missing variants
• Software errors
  – 15-50 bugs/1000 lines of code
  – Celera Assembler: 300,000 loc, i.e., roughly 4,500 to 15,000 bugs at that rate
![Page 49: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/49.jpg)
Model-based testing...2
[Figure: Unknown genome → Magic (biological, biochemical, biophysical, signal processing, etc.) → Reads → Assembler (computational magic) → Assembly; Assembly → Model of Magic (biological, biochemical, biophysical, signal processing, etc.) → modeled reads, compared with the actual reads (Same?)]
![Page 50: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/50.jpg)
Modeling approach...aside
• Originally proposed by Gene Myers (early-mid 90s)
• Used in several metagenomic assemblers
  – Genovo (Laserson et al.)
  – Vispa (Westbrooks et al.)
  – Shorah (Zagordi et al.)
• Idea of single number reflecting assembly quality
Genovo 'Score denovo':
$$\mathrm{Score}_{\mathrm{denovo}} = \sum_i \mathrm{SWScore}_i - 2 \cdot \mathrm{length}(\mathrm{contigs}) + 2 \cdot \mathrm{minOvl} \cdot \mathrm{num}(\mathrm{contigs})$$
Note: need to know where every read is placed/could be placed!
(information rarely produced by 'modern' assemblers)
ALE: Clark, S. et al. Bioinformatics 29(4): 435-443.
LAP: Ghodsi, M., et al. BMC Res Notes 6(1): 334.
CGAL: Rahman, A. and L. Pachter (2013). Genome Biol 14(1): R8.
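The Genovo-style score on this slide is the sum of per-read alignment scores, minus twice the total contig length, plus twice the minimum overlap times the number of contigs. The sketch below evaluates that expression directly. It assumes the per-read Smith-Waterman scores have already been computed by some aligner, and the inputs are hypothetical values rather than output from Genovo.

```python
def score_denovo(sw_scores, contig_lengths, min_overlap):
    """Single-number assembly score: reward read alignments, penalize total
    assembly length, and give back a fixed allowance per contig."""
    alignment_term = sum(sw_scores)
    length_penalty = 2 * sum(contig_lengths)
    contig_allowance = 2 * min_overlap * len(contig_lengths)
    return alignment_term - length_penalty + contig_allowance


# Hypothetical inputs: two aligned reads, 100 bp of assembly, minimum overlap 20.
one_contig = score_denovo([50, 40], [100], min_overlap=20)
two_contigs = score_denovo([50, 40], [60, 40], min_overlap=20)
print(one_contig, two_contigs)
```

Scores are only comparable between assemblies of the same read set, and computing the alignment term requires knowing where every read is placed, which is the practical obstacle noted above.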
![Page 51: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/51.jpg)
Probabilities correlate with reference validation
Data from Assemblathon 1 Earl et al. Genome Research 2011, Ghodsi et al., 2013.
![Page 52: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/52.jpg)
Sub-sampling based validation
![Page 53: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/53.jpg)
Used to tune assembly parameters
![Page 54: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/54.jpg)
Practical modeling - assembly invariants
• Basic principles (modulo errors)
  – overlapping reads must agree
  – mate-pairs must be consistently placed in assembly
  – coverage must match statistical process that generated the data
  – all reads must be used
  – assembly must be as contiguous as possible
• These assumptions mostly break/should be relaxed in metagenomic data
• Maximum likelihood approach should still work (though harder to interpret)
Schatz et al, Genome Biology 2008
![Page 55: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/55.jpg)
![Page 56: Metagenomics: the theory of assembly (and not only)](https://reader031.fdocuments.net/reader031/viewer/2022012409/616a456011a7b741a350adf0/html5/thumbnails/56.jpg)
Conclusions
• Genome assembly, in general, is well studied, and very hard
• Key lesson: Garbage In Garbage Out (data more important than algorithm)
• Metagenomics offers valuable problems and information (e.g., scaffolding with cross-sample correlations)
• Key: formalize and tackle sub-problems of interest to biologists
  – gene identification/clustering
  – comparative analysis wrt reference
  – reconstruct specific organism (rather than entire metagenome)
• Validation and standards are critical