De novo genome assembly - Bioinformaticsbioinformatics.org.au/ws14/wp-content/uploads/ws14/... ·...

De novo genome assembly Dr Torsten Seemann IMB Winter School - Brisbane – Mon 7 July 2014

De novo genome assembly

Dr Torsten Seemann

IMB Winter School - Brisbane – Mon 7 July 2014

Introduction

Ideal world

I would not need to give this talk!

AGTCTAGGATTCGCTACAGATTCAGGCTCTGAAGCTAGATCGCTATGCTATGATCTAGATCTCGAGATTCGTATAAGTCTAGGATTCGCTATAGATTCAGGCTCTGATATAT

Human DNA iSequencer™

46 complete haplotype chromosome sequences

Real world

•  Can’t sequence full-length native DNA
   – no instrument exists (yet)

•  But we can sequence short fragments
   – 100 at a time (Sanger)
   – 100,000 at a time (Roche 454)
   – 1,000,000 at a time (PGM)
   – 10,000,000 at a time (Proton, MiSeq)
   – 100,000,000 at a time (HiSeq)

De novo assembly

The process of reconstructing the original DNA sequence from the fragment reads alone.

•  Instinctively like a jigsaw puzzle
   – Find reads which “fit together” (overlap)
   – Could be missing pieces (sequencing bias)
   – Some pieces will be dirty (sequencing errors)

An example

A small “genome”

Friends, Romans, countrymen, lend me your ears;

I’ll return them tomorrow!

Shakespearomics

•  Reads
   ds, Romans, count
   ns, countrymen, le
   Friends, Rom
   send me your ears;
   crymen, lend me

Oops! I dropped them.

•  Overlaps
   Friends, Rom
        ds, Romans, count
                ns, countrymen, le
                        crymen, lend me
                                send me your ears;

I’m good with words.

•  Majority consensus
   Friends, Romans, countrymen, lend me your ears;

We have a consensus!
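
The overlap-then-vote idea in this example can be sketched in a few lines of Python. This is only an illustration, not how any particular assembler does it: the offsets are read by eye from the overlap layout (an assumption; a real assembler computes them), and `majority_consensus` is a hypothetical helper name.

```python
from collections import Counter


def majority_consensus(placed_reads):
    """Vote column by column over reads placed at known offsets.

    placed_reads is a list of (offset, read) tuples. Columns that no
    read covers are reported as "?".
    """
    length = max(offset + len(read) for offset, read in placed_reads)
    columns = [Counter() for _ in range(length)]
    for offset, read in placed_reads:
        for i, char in enumerate(read):
            columns[offset + i][char] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "?" for col in columns
    )


# Offsets are assumed (taken by eye from the layout), not computed.
placed = [
    (0, "Friends, Rom"),
    (5, "ds, Romans, count"),
    (13, "ns, countrymen, le"),
    (21, "crymen, lend me"),      # contains an error: "c" for "t"
    (29, "send me your ears;"),   # contains an error: "s" for "l"
]
print(majority_consensus(placed))
# Friends, Romans, countrymen, lend me your ears;
```

Both erroneous characters are outvoted because each sits in a column covered by two correct reads. With only two reads in a column, a disagreement would be a tie and the vote could not settle it.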

So far, so good.

The awful truth

“Genome assembly is impossible.”

A/Prof. Mihai Pop
World leader in de novo assembly research.

He wears glasses so he must be smart :-P

Methods

Approaches

•  greedy assembly
•  overlap :: layout :: consensus
•  de Bruijn graphs
•  string graphs
•  seed and extend

… all essentially doing the same thing, but taking different short cuts.
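
Of the approaches listed, the de Bruijn graph is probably the quickest to sketch. The snippet below is a minimal illustration under simplifying assumptions (error-free reads, one strand only, no reverse complements); the function name `de_bruijn` and the toy reads are made up for the example.

```python
from collections import Counter, defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Returns {prefix: Counter({suffix: times_seen})}, so the edge count
    doubles as a crude coverage estimate.
    """
    graph = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


graph = de_bruijn(["ACGT", "CGTA"], k=3)
for prefix, suffixes in graph.items():
    for suffix, count in suffixes.items():
        print(prefix, "->", suffix, "seen", count)
```

The short cut here is that reads are never compared with each other: two reads "overlap" simply because they produce the same k-mers.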

Assembly recipe

•  Find all overlaps between reads
   – hmm, sounds like a lot of work…

•  Build a graph
   – a picture of read connections

•  Simplify the graph
   – sequencing errors will mess it up a lot

•  Traverse the graph
   – trace a sensible path to produce a consensus

Clean graph

Find read overlaps

•  If we have N reads of length L
   – we have to do ½N(N-1) ~ O(N²) comparisons
   – each comparison is an ~ O(L²) alignment
   – use special tricks/heuristics to reduce these!

•  What counts as “overlapping”?
   – minimum overlap length, e.g. 20 bp
   – minimum % identity across overlap, e.g. 95%
   – choice depends on L and expected error rate
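
The two thresholds above can be made concrete with a naive sketch. It is deliberately simplified: it compares a suffix of one read with a prefix of the other position by position (mismatches only, no indels), whereas the O(L²) alignment mentioned above would also handle gaps. The names `n_comparisons` and `overlap_length`, and the thresholds used on the toy reads, are assumptions for illustration.

```python
def n_comparisons(n_reads):
    """Number of unordered read pairs: N(N-1)/2."""
    return n_reads * (n_reads - 1) // 2


def overlap_length(a, b, min_len=3, min_identity=0.85):
    """Longest suffix of a that matches a prefix of b well enough.

    Returns the overlap length, or 0 if no overlap of at least min_len
    characters reaches min_identity. Assumes min_len >= 1.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        matches = sum(x == y for x, y in zip(a[-k:], b[:k]))
        if matches / k >= min_identity:
            return k
    return 0


print(n_comparisons(5))                                          # 10
print(overlap_length("Friends, Rom", "ds, Romans, count"))       # 7
print(overlap_length("ns, countrymen, le", "crymen, lend me"))   # 10
```

The second overlap is accepted despite the read error because 9 of its 10 characters agree. Raising `min_identity` to 1.0 would reject it, which is why the choice depends on the expected error rate.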

What we are up against!

What ruins the graph?

•  Read errors
   – introduce false edges and nodes

•  Non-haploid organisms
   – heterozygosity causes lots of detours

•  Repeats
   – if longer than read length
   – causes nodes to be shared, locality confusion

Graph simplification

•  Squash small bubbles
   – collapse small errors (or minor heterozygosity)

•  Remove spurs
   – short “dead end” hairs on the graph

•  Join unambiguously connected nodes
   – reliable stretches of unique DNA

Graph traversal

•  For each connected component of the graph
   – at least one per replicon in original sample

•  Find a path which visits each node once
   – Hamiltonian path/cycle is NP-hard (this is bad)
   – solution will be a set of paths which terminate at decision points

•  Form consensus sequences from paths
   – use all the overlap alignments
   – each of these collapsed paths is a contig
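
Since the full Hamiltonian path is out of reach, a practical fallback is to report only the stretches where there is no choice to make. The sketch below assumes an already simplified directed graph stored as a plain dictionary; `unambiguous_paths` and the node names are hypothetical, and isolated cycles with no branch point at all (for example a lone circular replicon) are not reported by this version.

```python
def unambiguous_paths(graph):
    """Return maximal paths that start and stop at decision points.

    graph maps each node to a list of successor nodes. A node is
    "simple" when it has exactly one way in and one way out.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    out_edges = {n: list(graph.get(n, [])) for n in nodes}
    in_degree = {n: 0 for n in nodes}
    for targets in out_edges.values():
        for target in targets:
            in_degree[target] += 1

    def is_simple(node):
        return in_degree[node] == 1 and len(out_edges[node]) == 1

    paths = []
    for start in sorted(nodes):
        if is_simple(start):
            continue  # interior of some path, never a starting point
        for nxt in out_edges[start]:
            path = [start, nxt]
            while is_simple(path[-1]):
                path.append(out_edges[path[-1]][0])
            paths.append(path)
    return paths


# A linear stretch A-B-C that forks into D and E at node C.
toy = {"A": ["B"], "B": ["C"], "C": ["D", "E"]}
print(unambiguous_paths(toy))
# [['A', 'B', 'C'], ['C', 'D'], ['C', 'E']]
```

Each returned path would become one contig once its reads are collapsed into a consensus.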

Contigs

Contiguous, unambiguous stretches of assembled DNA sequence

•  Contig ends correspond to
   – Real ends (for linear DNA molecules)
   – Dead ends (missing sequence)
   – Decision points (forks in the road)

Repeats

What is a repeat?

A segment of DNA which occurs more than once in the genome sequence

•  Very common
   – Transposons (self replicating genes)
   – Satellites (repetitive adjacent patterns)
   – Gene duplications (paralogs)

Effect on assembly

The repeated element is collapsed into a single contig

Repeat mis-assembly

[Figure: three kinds of repeat mis-assembly (collapsed tandem, excision, rearrangement), drawn with unique segments labelled a, b, c, … and repeat copies labelled I, II, III, …]

The law of repeats

•  It is impossible to resolve repeats of length S unless you have reads longer than S.

•  It is impossible to resolve repeats of length S unless you have reads longer than S.
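
A toy demonstration of the law, under idealised assumptions (error-free reads, every position sampled exactly once, single strand): two different genomes that share a repeat of length S = 4 yield exactly the same reads until the reads are long enough to span the repeat with a flanking base on each side. The sequences and the helper name `all_reads` are invented for the example.

```python
def all_reads(genome, read_len):
    """Every substring of read_len, sorted: an idealised read set."""
    return sorted(
        genome[i:i + read_len] for i in range(len(genome) - read_len + 1)
    )


REPEAT = "GGGG"  # S = 4, present three times in each genome
genome_1 = "A" + REPEAT + "C" + REPEAT + "T" + REPEAT + "A"
genome_2 = "A" + REPEAT + "T" + REPEAT + "C" + REPEAT + "A"

for read_len in (4, 5, 6):
    same = all_reads(genome_1, read_len) == all_reads(genome_2, read_len)
    print(read_len, "indistinguishable" if same else "distinguishable")
# 4 indistinguishable
# 5 indistinguishable
# 6 distinguishable
```

In this example the two genomes differ only in the order of the segments between the repeat copies, so reads that cannot reach across a copy carry no information about which order is correct.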

Scaffolding

Beyond contigs

Contig sizes are limited by:

•  the length of repeats in your genome
   – can’t change this!

•  the length (or “span”) of the reads
   – wait for new technology
   – use “tricks” with existing technology

Paired reads

•  DNA fragment (200-800 bp)
   ==============================

•  Single end
   -------->=====================

•  Paired end (up to 800 bp span)
   ----->==================<-----

•  Mate pair (up to 20 kbp span)
   ---->========/+/=========<----

Scaffolding

•  Paired-end reads
   – known sequences at either end
   – roughly known distance between ends
   – unknown sequence between ends

•  Most ends will occur in same contig
   – if our contigs are longer than pair distance

•  Some ends will be in different contigs
   – evidence that these contigs are linked!
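
The linking evidence can be tallied with a simple counter. This sketch assumes each read pair has already been mapped so that we know which contig each end landed in; the input format, the `min_links` threshold and the contig names are all hypothetical, and it ignores orientation and distance, which a real scaffolder would also use.

```python
from collections import Counter


def contig_links(pair_hits, min_links=2):
    """Count read pairs whose two ends fall in different contigs.

    pair_hits is a list of (contig_of_end_1, contig_of_end_2) tuples.
    Only contig pairs supported by at least min_links read pairs are
    returned, to avoid trusting a single stray pair.
    """
    links = Counter()
    for first, second in pair_hits:
        if first != second:
            links[tuple(sorted((first, second)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_links}


hits = [
    ("contig1", "contig1"),  # both ends in the same contig: no link
    ("contig1", "contig2"),
    ("contig2", "contig1"),
    ("contig1", "contig2"),
    ("contig2", "contig3"),  # a single pair: below the threshold
]
print(contig_links(hits))
# {('contig1', 'contig2'): 3}
```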

Contigs to scaffolds

[Figure: contigs joined by paired-end reads into a scaffold, with a gap between adjacent contigs]

Assessment

Assessing assemblies

•  We desire
   – Total length similar to genome size
   – Fewer, larger contigs
   – No mistakes (mis-assemblies)

•  Metrics
   – No generally useful objective measure
   – Longest contig, total bp, N50, …

The “N50”

The N50 is the contig length at which that contig, together with all shorter contigs, holds at least 50% of the assembled bases.

•  Imagine we got 7 contigs with lengths:
   – 1, 1, 3, 5, 8, 12, 20

•  Total
   – 1+1+3+5+8+12+20 = 50

•  The “halfway sum” is 50 / 2 = 25
   – 1+1+3+5+8+12 = 30 (≥ 25), so N50 is 12
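
For reference, here is a small sketch of the calculation. Note one assumption: it follows the convention most tools use, which accumulates from the longest contig downwards rather than from the shortest upwards. The two directions give the same answer for the contig lengths in this example, but they can differ when the running total lands exactly on the halfway sum. The function name `n50` is just for illustration.

```python
def n50(lengths):
    """Length of the contig at which the running total, taken from the
    longest contig down, first reaches half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one contig")


print(n50([1, 1, 3, 5, 8, 12, 20]))  # 12  (20 + 12 = 32, which is >= 25)
```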

N50 concerns

•  Optimizing for N50
   – encourages mis-assemblies!

•  An aggressive assembler may over-join:
   – 1, 1, 3, 5, 8, 12, 20 (previous)
   – 1, 1, 3, 5, 20, 20 (now: the 8 and 12 contigs have been joined)
   – 1+1+3+5+20+20 = 50 (unchanged)

•  The “halfway sum” is still 25
   – 1+1+3+5+20 = 30 (≥ 25), so N50 is 20 (was 12)

Validation

•  Self consistency
   – Align reads back to contigs
   – Check for errors or discordant pairs

•  Second opinion
   – Use two complementary sequencing methods
   – Target troublesome areas for PCR
   – Use a genome-wide “optical map”

How can I play?

Considerations

•  Size of genome
   – bacteria, eukaryote, meta-genome

•  Hardware
   – phone, laptop, desktop, server, cloud
   – RAM is more limiting than CPU

•  Operating system
   – Linux, Mac, Windows

•  Software budget
   – commercial, free, open-source

Recommendations

•  SPAdes
   – Unix command-line (Mac, Linux)

•  VAGUE (Velvet)
   – Unix GUI (Mac, Linux)

•  CLC Genomics Workbench
   – Java GUI (Windows, Mac, Linux)
   – Commercial product

Online tutorial

•  The GVL
   – Genomics Virtual Laboratory
   – http://genome.edu.au

•  Protocols
   – Microbial de novo assembly for Illumina data
   – Written by Simon Gladman (VBC/LSCC)
   – https://genome.edu.au/wiki/Protocols

Contact

•  Email
   – [email protected]

•  Blog
   – TheGenomeFactory.blogspot.com

•  Web
   – vicbioinformatics.com
   – vlsci.org.au/lscc
   – genome.edu.au

Torst!

~10!