Metagenomics assembly: the good, the bad, and the ugly

Mihai Pop

Shotgun sequencing

shearing

sequencing

assembly

original DNA (hopefully)

Why assembly?

Since     Technology                                Read length               Throughput/run           Throughput/hour   Cost/run
1977-     Sanger sequencing                         1000-2000 bp              400-500 kbp (4 hr)       100 kbp           $200
2005-     454 pyrosequencing                        400 bp                    500 Mbp (4 hr)           >100 Mbp          $17,000
2006-     Illumina/Solexa                           50-100 bp                 250 Gbp (7-10 days)      1 Gbp             $27,000
2007-     ABI SOLiD                                 35-50 bp                  6-20 Gbp (3 days)        75-250 Mbp        $3-5,000
2012-     Pacific Biosciences (single molecule)     ~10-20 kbp, 15% error     3 Mbp (3 hours)          1-3 Mbp           $2,500
2014ish   Oxford Nanopore (single molecule)         ?                         ?                        ?                 ?

Viruses ~100 kbp
Bacteria ~1-5 Mbp
Most Eukarya ~100s of Mbp
Human ~3 Gbp

A tale of assembly

it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness

Assembling two cities

the age of foolishness

best of times it

it was the best

it was the best

it was the age

it was the worst

was the best of

was the best of

it was the age

the best of times

of times it was

times it was the

was the worst of

the worst of times

worst of times it

of times it was

times it was the

of times it was

it was the age

was the age of

it was the age

was the age of

the age of wisdom

age of wisdom it

of wisdom it was

wisdom it was the

Greedy algorithm
• Compute all pairwise overlaps
• *Pick best (e.g. in terms of alignment score) overlap
• Join corresponding reads
• Repeat from * until no more joins possible

Basis for many popular assemblers: phrap, TIGR Assembler, CAP, etc.
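The greedy steps listed above can be sketched on word-based reads like those in the "two cities" example. This is a minimal illustration, not how phrap, TIGR Assembler, or CAP are implemented: the function names `overlap` and `greedy_assemble` are hypothetical, the "score" of an overlap is assumed to be simply its length in words, and reads fully contained in another read are not handled.

```python
def overlap(a, b):
    """Length (in words) of the longest proper suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def greedy_assemble(reads, min_overlap=1):
    """Repeatedly join the pair of sequences with the longest overlap."""
    # Drop exact duplicate reads, keep first-seen order.
    contigs = [tuple(r.split()) for r in dict.fromkeys(reads)]
    while True:
        best = (0, None, None)
        # Compute all pairwise overlaps and remember the best one.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n < min_overlap:
            break  # no more joins possible
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)
    return [" ".join(c) for c in contigs]


reads = ["the quick brown fox", "brown fox jumps over",
         "jumps over the lazy", "the lazy dog"]
contigs = greedy_assemble(reads)
```

On reads without repeats, as in this toy input, the joins are unambiguous. On the repetitive Dickens reads, ties between equally good overlaps are broken arbitrarily, which is one way a greedy assembler can go wrong.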

Greedy approach gets 'stuck'

the age of foolishness

best of times it

it was the best

it was the best

it was the age

it was the worst

was the best of

was the best of

it was the age

the best of times

of times it was

times it was the

the worst of times

worst of times it

of times it was

of times it was

it was the age

it was the age

was the age of

the age of wisdom

age of wisdom it

of wisdom it was

wisdom it was the

Graph-based approaches

best of times it

it was the best

it was the age

it was the worst

was the best of

the best of times

times it was the

of times it was

was the worst of

the worst of times

worst of times it

of times it was
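One way to picture a graph-based approach is to link each read's first three words to its last three words, and then look for nodes with more than one successor. The sketch below does this for the distinct reads on this slide. It is an illustration under simplifying assumptions (error-free, word-based reads); `build_graph` and `branching_nodes` are hypothetical names, not functions from any assembler.

```python
from collections import defaultdict


def build_graph(reads):
    """Map each read's prefix (all but the last word) to the suffixes that follow it."""
    graph = defaultdict(set)
    for read in set(reads):
        words = tuple(read.split())
        graph[words[:-1]].add(words[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one distinct successor, i.e. where the path is ambiguous."""
    return {node: succ for node, succ in graph.items() if len(succ) > 1}


slide_reads = [
    "best of times it", "it was the best", "it was the age", "it was the worst",
    "was the best of", "the best of times", "times it was the", "of times it was",
    "was the worst of", "the worst of times", "worst of times it",
]
graph = build_graph(slide_reads)
ambiguous = branching_nodes(graph)
```

For these reads the only branching node is "it was the", which can be followed by "best", "age", or "worst". The repeat shows up as a fork in the graph rather than as a wrong join.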

(meta)genome assembly is impossible

it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness

[Figure: graph with nodes "it was the", "of times", "best", "worst", "age of", "wisdom", "foolishness"]

it was the best of times it was the age of wisdom it was the age of foolishness it was the worst of times

Mycoplasma genitalium, 25 bp reads

Kingsford et al., BMC Bioinformatics 2010

Puzzle
• 13 pieces
  – 6,227,020,800 ways to order them
  – 8,192 ways to split them in two layers
  – 3,538,944 ways to arrange them into 6 "rows" of two pieces each and one with three
  – ...
• Why is it hard?
  – Constraint: need to fit everything in 8 in³ – 0.5x8x2 prism
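The first two counts in the puzzle list follow from elementary combinatorics: 13 pieces can be ordered in 13! ways, and assigning each piece to one of two layers gives 2^13 splits. A quick check (the third count depends on details of the puzzle and is not rederived here):

```python
import math

orderings = math.factorial(13)  # ways to order 13 distinct pieces
layer_splits = 2 ** 13          # each piece goes to layer 1 or layer 2
```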

Some algorithms
• Greedy
  – take longest piece and put it in the box
  – fit as much as possible on same row
  – repeat with the remaining longest piece
  – etc...
• Pick through each of the possible orderings of pieces
  – put pieces in order in the box as they fit
  – eventually you'll hit on the right order...

A simpler solution

Violate implicit constraint – pieces must stay intact

3072 different solutions (I think)

http://www.elversonpuzzle.com/

$11.95 + S&H

Read length matters

[Figure: graph with nodes "it was the", "of times", "best", "worst", "age of", "wisdom", "foolishness"]

it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness

Read = 6 “words” (> length of repeat)

it was the best of times it, times it was the worst of, times it was the age of, was the age of wisdom it, ...
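To see where the "6 words" threshold comes from, one can measure the longest phrase that occurs more than once in the sentence. The sketch below is a brute-force illustration (`longest_repeat` is a hypothetical helper, quadratic or worse in the text length, fine for a 24-word example but not for genomes).

```python
def longest_repeat(words):
    """Length of the longest run of words that occurs at least twice."""
    n = len(words)
    for size in range(n - 1, 0, -1):
        seen = set()
        for i in range(n - size + 1):
            gram = tuple(words[i:i + size])
            if gram in seen:
                return size
            seen.add(gram)
    return 0


text = ("it was the best of times it was the worst of times "
        "it was the age of wisdom it was the age of foolishness")
repeat_length = longest_repeat(text.split())
```

Here the longest repeats ("of times it was the" and "it was the age of") are 5 words long, so every 6-word window of the sentence is unique.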

Read length matters

[Figure panels: k = 50, k = 1,000, k = 5,000]

Does anyone see the mistake in the picture?

Read length matters...

Nagarajan, Pop. J. Comp. Biol. 2009, Kingsford et al., BMC Bioinformatics 2010

• Reads (much) longer than repeats – assembly trivial

• Reads roughly equal to repeats – assembly computationally difficult (NP-hard)

• Reads shorter than repeats – assembly undetermined

Number of possible reconstructions exponential in # of repeats
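The ambiguity can be made concrete by counting how many different sentences are consistent with the same set of k-word windows. The sketch below enumerates them by backtracking over a de Bruijn-style graph of words. It assumes error-free, complete coverage and a known starting phrase; `reconstructions` is a hypothetical helper written for this illustration.

```python
from collections import Counter


def reconstructions(words, k):
    """All word sequences that start like `words` and use exactly its k-word windows."""
    windows = Counter(tuple(words[i:i + k]) for i in range(len(words) - k + 1))
    total = sum(windows.values())
    found = set()

    def extend(path, used):
        if used == total:
            found.add(" ".join(path))
            return
        node = tuple(path[-(k - 1):])
        for window in list(windows):
            # Follow every still-unused window that continues the current path.
            if windows[window] > 0 and window[:-1] == node:
                windows[window] -= 1
                extend(path + [window[-1]], used + 1)
                windows[window] += 1

    extend(list(words[:k - 1]), 0)
    return found


text = ("it was the best of times it was the worst of times "
        "it was the age of wisdom it was the age of foolishness")
words = text.split()
short_read_solutions = reconstructions(words, 4)
long_read_solutions = reconstructions(words, 6)
```

With 4-word windows the sentence has 6 consistent reconstructions (the three loops through "it was the" can be taken in any order); with 6-word windows only the original sentence remains.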

What are repeats?

Isolate genome

Metagenome

In metagenomes repeats are approximately genome-sized

Haplotype phasing with unknown number of haplotypes

Metagenomic questions

• What is the relative abundance of organism X versus organisms Y and Z?

• What proportion of organisms of type X have pathogenicity island P?

• Is pathogenicity island P only found in organism X or also in organisms Y and Z?

E. coli ETEC, EPEC, EAEC, EHEC, ...

Shiga toxin in Shigella or E. coli, ...

What happens at low coverage?
• Can you solve the puzzle if you remove one or a few pieces?
• What can you say about the "solution"?
• How about a genome? What happens if you have gaps in coverage?

Lack of coverage leads to errors

[Figure: graph with nodes "it was the", "of times", "best", "worst", "age of", "wisdom", "foolishness"]

it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness

it was the worst of times it, times it was the worst of, times it was the age of, was the age of foolishness it

it was the worst of times it was the age of foolishness

The log puzzle solution is not consistent with the actual correct solution

Mate-pairs don't really help

Assembly is impossible
• Long reads (even 10kbp) insufficient as repeats are as long as genomes (100s of kbps to Mbps)
• Errors impossible to avoid in low coverage genomes
• Mate-pairs don't help
• Computationally, assembly is very very hard

WHY BOTHER?

Assembly as compression

Stool sample SRS049995– in: 11.2 Gbp – out: 174 Mbp + 20 Mbp (unassembled reads)

[Figure labels: Reads; Assembly; ORF calling; Metaphyler; functional profiling; pathway analysis; storage; etc.]

Liu et al. BMC Genomics 2011, Treangen et al. Gen. Biol. 2014

Interesting genomes

• Most microbes are not easily cultured and only known by 16S rDNA signature
  – e.g. RDP grew from ~80,000 (v. 10.4) to 2.1 million (v. 10.28)
  – only ~10,000 sequences from type strains
  – only ~150,000 sequences from isolate genomes
  – metagenomic assembly is only way to get the rest
• Clinical studies reveal interesting 16S patterns – what do the genomes do?

Strain structure matters

Genes are easy...

Metagenomic assembly: technical issues

Main challenges in metagenomic assembly
● Difficult to find repeats
  – coverage vs. over-representation
  – within-genome vs. across-genome repeats
● High genomic variation
  – sequencing experiment has ~10^15 cells, i.e., each read comes from a different cell
  – phages, transposons, etc. affect only a fraction of the population even in 'homogeneous' strains

K-mer size and why it matters
● overlaps versus errors or variants
  – if k is too large, reads cannot be "stitched" together
● repeats
  – if k is too small, unrelated reads get linked to each other
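Both failure modes can be seen by asking which k-mers two reads have in common, since a de Bruijn-style assembler links reads through shared k-mers. The sketch below uses made-up reads chosen for illustration (not real data): `read1` and `read2` truly overlap by 6 bases, while `read3` is unrelated but happens to share a 4-mer with `read1`. The helper name `shared_kmers` is hypothetical.

```python
def kmers(read, k):
    """Set of all k-length substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def shared_kmers(a, b, k):
    """k-mers present in both reads; an empty set means the reads are not linked."""
    return kmers(a, k) & kmers(b, k)


read1 = "ACGTTGCAAGGCT"   # ends with AAGGCT
read2 = "AAGGCTTACCGGA"   # starts with AAGGCT (true 6-base overlap with read1)
read3 = "CCTTGCGG"        # unrelated, but contains TTGC like read1

true_link_k5 = shared_kmers(read1, read2, 5)    # overlap is found
true_link_k7 = shared_kmers(read1, read2, 7)    # k too large: overlap is missed
false_link_k4 = shared_kmers(read1, read3, 4)   # k too small: spurious link
false_link_k5 = shared_kmers(read1, read3, 5)   # spurious link disappears
```

In this toy case k = 5 happens to keep the true overlap and drop the spurious one; in real data no single k does both everywhere, which is part of why the choice of k is a trade-off.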

Repeat-detection/removal
● Bambus 2 scaffolder (local coverage, graph theoretic arguments)
● IDBA-UD (local coverage, mate-pair information)

Polymorphisms in community
● IDBA-UD – 'smooth' out variants
● MetaVelvet – attempt to decompose graph, then assemble haploid genomes based on coverage concordance
● MaryGold – detects some polymorphisms
● Anvi'o

Reference-guided assembly

● The entire read is used to determine placement – repeats are less of an issue

● MetaCompass (https://github.com/marbl/MetaCompass)

Likelihood based assembly
● Find string of letters that maximizes likelihood of reads
● Common 'trick' in speech/language processing
● Main approach for quasi-species assembly
  – Shorah
  – Vispa
● 16S reconstruction
  – Emirge
● Metagenomic assembly
  – Genovo
● Won't cover it here but keep your ears open

New information: correlation across samples

Quince – Concoct

Borenstein – Metagenomic deconvolution

HMP 16S vs MetaHit MGS

Catabacter hongkongiensis

Christensenella minuta

Christensenellaceae

Assembly is just a small piece of the puzzle
● Taxonomic assessment
● Gene Finding
● Motif/variant detection
● etc.
● The individual analyses can feed into each other
  – taxonomic assessment can help define assembly strategy
  – gene finding can highlight errors
  – etc.

metAMOS
● Integrated pipeline for metagenomic assembly (mothur/Qiime for WGS analysis)
  – assembly, scaffolding, gene finding, taxonomic profiling, ...
  – builds upon other open-source tools
  – modular pipeline design using Ruffus
● Specialized metagenomic analyses (through Bambus 2)
  – coverage-independent repeat detection
  – genomic variant detection

Koren et al. Bioinformatics 2011, Treangen et al. Genome Biology 2013

A bit about validation

What are errors?
• Chimeric contigs/scaffolds (due to repeats or mixed organisms)
• Incorrect consensus calls
• Missing information
  – contigs/scaffolds broken up unnecessarily
  – missing variants
• Software errors
  – 15-50 bugs/1000 lines of code
  – Celera Assembler – 300,000 loc

Checking results
• Contiguity is just part of the story
• N50 doesn't make sense
  – better measure: size to xx Mbp, number to xx Mbp
• Errors need to be taken into account
  – hard to do without a reference
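For reference, the sketch below computes N50 alongside one reading of the "size to xx Mbp, number to xx Mbp" measures: sort contigs from longest to shortest, accumulate lengths until a fixed target is reached, and report the length of the contig that reaches it and how many contigs were needed. That interpretation, and the names `n50` and `size_and_number_at`, are assumptions made for this illustration. A plausible motivation is that N50 is normalized by the total assembly size, which is not a stable quantity for a metagenome, whereas a fixed target is comparable across assemblies.

```python
def n50(lengths):
    """Length of the contig at which the sorted cumulative sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def size_and_number_at(lengths, target):
    """(contig length, contig count) at which the cumulative sum reaches `target` bases.

    Returns None when the assembly is smaller than the target.
    """
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return None


contig_lengths = [100, 80, 60, 40, 20]  # toy values, not real data
half_point = n50(contig_lengths)
fixed_target = size_and_number_at(contig_lengths, 200)
```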

Practical modeling - assembly invariants
• Basic principles (modulo errors)
  – overlapping reads must agree
  – mate-pairs must be consistently placed in assembly
  – coverage must match statistical process that generated the data
  – all reads must be used
  – assembly must be as contiguous as possible
• These assumptions mostly break/should be relaxed in metagenomic data

Schatz et al, Genome Biology 2008

Model-based testing

[Figure labels: Unknown genome; "Magic" (biological, biochemical, biophysical, signal processing, etc.); Reads; Assembler (computational magic); Assembly; Model of "Magic"; Same?]

Modeling approach... aside

• Originally proposed by Gene Myers (early-mid 90s)
• Used in several metagenomic assemblers
  – Genovo (Laserson et al.)
  – Vispa (Westbrooks et al.)
  – Shorah (Zagordi et al.)
• Idea of single number reflecting assembly quality

Genovo 'Score denovo'

Note: need to know where every read is placed/could be placed! (information rarely produced by 'modern' assemblers)

$\sum_i \mathrm{SWScore}_i - 2 \cdot \mathrm{length}(\mathrm{contigs}) + 2 \cdot \mathrm{minOvl} \cdot \mathrm{num}(\mathrm{contigs})$
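As a worked instance of the score above, the sketch below evaluates the three terms for toy inputs. It assumes that length(contigs) means the total length of all contigs and that the Smith-Waterman score of each read against the assembly is already available; `score_denovo` and its parameter names are hypothetical and this is not Genovo's code.

```python
def score_denovo(sw_scores, contig_lengths, min_overlap):
    """Sum of read alignment scores, penalized by assembly length,
    with a per-contig correction of 2 * min_overlap."""
    return (sum(sw_scores)
            - 2 * sum(contig_lengths)
            + 2 * min_overlap * len(contig_lengths))


# Toy values: three reads aligned to an assembly of two contigs.
example_score = score_denovo(sw_scores=[50, 40, 30],
                             contig_lengths=[100, 60],
                             min_overlap=20)
```

With these numbers the terms are 120 - 320 + 80. The score rewards reads that align well and penalizes assemblies that are longer than the reads justify.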

ALE: Clark, S. et al. Bioinformatics 29(4): 435-443.
LAP: Ghodsi, M., et al. BMC Res Notes 6(1): 334.
CGAL: Rahman, A. and L. Pachter (2013). Genome Biol 14(1): R8.

https://github.com/marbl/VALET

Misassembly found by VALET

Conclusions
• Genome assembly, in general, is well studied, and very hard
• Key lesson: Garbage In Garbage Out (data more important than algorithm)
• Metagenomics offers valuable problems and information (e.g., scaffolding with cross-sample correlations)
• Key: formalize and tackle sub-problems of interest to biologists
  – gene identification/clustering
  – comparative analysis wrt reference
  – reconstruct specific organism (rather than entire metagenome)
• Validation and standards are critical
