Pamela Ferretti Laboratory of Computational Metagenomics Centre for Integrative Biology University...

Post on 15-Jan-2016

1

Pamela Ferretti

Laboratory of Computational Metagenomics

Centre for Integrative Biology

University of Trento

Italy

Microbial Genome Assembly

2

Outline-summary

1. QUICK INTRODUCTION

2. GENOME ASSEMBLY

3. ASSEMBLY STRATEGIES

4. CASE STUDY

3

DNA packaging

4

DNA packaging

6

Next Generation Sequencing

TCTTATTGTGACC TAGGCTAGCTTAG

GCAATGCAGTAAC TCCAGCTAGGTTC

ACGTAGGCTAGCGTTAGCGA ........ CTGCAT C

7

Genome Assembly

1. GENOME SEQUENCING

2. PRELIMINARY ANALYSIS

3. ASSEMBLY

4. ADVANCED BIOINFORMATIC ANALYSIS

OVERLAPPING SEQUENCE ALIGNMENT

8

On the feasibility of sequence assembly

Sequencing the human genome with shotgun sequencing + assembly is the only feasible strategy

Weber, James L., and Eugene W. Myers. "Human whole-genome shotgun sequencing." Genome Research 7.5 (1997): 401-409.

Computational assembly of shotgun sequencing data is simply unfeasible, and a bad idea anyway

Green, Philip. "Against a whole-genome shotgun." Genome Research 7.5 (1997): 410-417.

They were both right! (…well, Weber and Myers were a bit more right from the practical viewpoint…)

10

Genome assembly strategies

Greedy approach → SSAKE

De Bruijn graph (DBG) → Velvet, SOAPdenovo

Overlap Layout Consensus (OLC) → MIRA

Mixed approaches → MaSuRCA

11

Genome assembly strategies

DE BRUIJN GRAPH APPROACH (DBG)

Velvet, SOAPdenovo2

Nodes = overlapping subsequences of reads, all of the same length ((k-1)-mers)

Edges = k-mers (each distinct length-k subsequence of the reads, represented once)

EULERIAN PATH
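
The DBG idea above can be sketched in a few lines of Python. This is a toy illustration, not how Velvet or SOAPdenovo2 are implemented: the reads, the value k=4 and all function names are made up for the example, the reads are assumed error-free and single-stranded, and the Eulerian path is found with Hierholzer's algorithm.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the list of nodes it points to.

    Every distinct k-mer contributes exactly one edge: prefix -> suffix.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):  # sorted only to make the result deterministic
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Return a path that uses every edge exactly once (Hierholzer's algorithm).

    Assumes such a path exists; real data with errors and repeats rarely
    guarantees this.
    """
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    nodes = list(graph)
    # Start at a node with one more outgoing than incoming edge, if any.
    start = next((n for n in nodes if len(graph[n]) - in_deg[n] == 1), nodes[0])
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Glue the nodes of a path back into one sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical reads sampled from the toy sequence ATGGCGTGCAA.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = build_de_bruijn(reads, k=4)
contig = spell(eulerian_path(graph))
print(contig)
```

Note that the reads are never compared with each other: they are only chopped into k-mers, and the overlaps are implicit in the graph.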

12

Genome assembly strategies

OVERLAP LAYOUT CONSENSUS (OLC)

MIRA

Nodes = reads

Edges = overlaps between reads

1. OVERLAP

2. LAYOUT

3. CONSENSUS

HAMILTONIAN PATH
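
The three OLC steps can be sketched as follows. This is a minimal Python illustration under strong assumptions (error-free reads, no duplicates, exact suffix/prefix matches, brute-force layout), not MIRA's algorithm; the reads, `min_len` and all function names are hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the longest match is shorter than min_len.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def overlap_graph(reads, min_len=3):
    """OVERLAP: compare every ordered pair of reads and keep the real overlaps."""
    edges = {}
    for a, b in permutations(reads, 2):
        size = overlap(a, b, min_len)
        if size:
            edges[(a, b)] = size
    return edges


def layout(reads, edges):
    """LAYOUT: pick the read order with the largest total overlap.

    Trying every ordering is a Hamiltonian-path search, so this brute force
    only works for a handful of reads.
    """
    best_order, best_score = None, -1
    for order in permutations(reads):
        pairs = list(zip(order, order[1:]))
        if all(pair in edges for pair in pairs):
            score = sum(edges[pair] for pair in pairs)
            if score > best_score:
                best_order, best_score = order, score
    return best_order


def consensus(order, edges):
    """CONSENSUS (simplified): merge consecutive reads, without error correction."""
    sequence = order[0]
    for a, b in zip(order, order[1:]):
        sequence += b[edges[(a, b)]:]
    return sequence


# Hypothetical reads, deliberately given out of order.
reads = ["CGTGCAA", "ATGGCGT", "GGCGTGC"]
edges = overlap_graph(reads)
order = layout(reads, edges)
contig = consensus(order, edges)
print(order)
print(contig)
```

The all-against-all comparison in `overlap_graph` is the part that grows quadratically with the number of reads.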

13

Genome assembly strategies

14

Genome assembly strategies

DBG vs OLC

ADVANTAGES

DBG:
• Very sensitive to repeats
• K-mers stored just once
• Eulerian cycle
• Never explicitly computes pairwise overlaps

OLC:
• Modular algorithmic design
• Flexibility and robustness

DISADVANTAGES

DBG:
• Sensitive to sequencing errors (new k-mers)
• Large memory requirements

OLC:
• Hamiltonian cycle
• Overlap stage is time-consuming
• Genome-size limitations

15

Genome assembly strategies

Greedy approach → SSAKE

De Bruijn graph (DBG) → Velvet, SOAPdenovo

Overlap Layout Consensus (OLC) → MIRA

Mixed approaches → MaSuRCA

16

Genome Assemblers

• Average coverage
• Number of contigs
• Number of contigs > 1 Kb
• N50 contig size
• Fraction of reads assembled
• Total consensus (in nt)
• Number of scaffolds
• N50 scaffold size

Ion Torrent PGM → MIRA 3.9

Illumina → MaSuRCA

MIRA 3.9 also produced good-quality results, but it has a longer execution time and becomes unstable with large amounts of short reads
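
Several of the metrics listed on this slide are simple functions of the contig lengths. A small Python sketch, using the usual definition of N50 (the length of the contig at which the running total, taken from the longest contig down, first reaches half of the assembly size). The contig lengths are invented for the example and are not results from the talk.

```python
def n50(lengths):
    """Contig length at which the cumulative sum (longest first) reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # only reached for an empty list


# Hypothetical contig lengths in nucleotides.
contig_lengths = [8000, 7000, 5000, 4000, 3000, 900, 500]

number_of_contigs = len(contig_lengths)
contigs_over_1kb = sum(1 for length in contig_lengths if length > 1000)
total_consensus = sum(contig_lengths)
n50_contig_size = n50(contig_lengths)

print(number_of_contigs, contigs_over_1kb, total_consensus, n50_contig_size)
```

The same `n50` function applies to scaffold lengths for the N50 scaffold size.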

18

Mycobacteria Assembly: Case Study

Responsible for many animal and human diseases

• M. tuberculosis and M. leprae (TM)
• M. fortuitum (NTM) outbreak (nail salon, 2002)
• M. chelonae (NTM) outbreak (face lifts, 2004)

Illumina HiSeq sequencing (NGS Facility – CIBIO/UNITN)

• Twenty mycobacterial strains
• From 20 different Mycobacteria species

→ MaSuRCA

Novel mycobacteria detection clinical tests

19

Raw data quality assessment and pre-processing

Fastq-mcf tool, used to remove:

• poor quality ends of reads
• Ns, duplicates and sequencing adapters
• reads that are too short

Reduction up to 73%
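
To make the filtering steps on this slide concrete, here is a rough Python sketch of the same kinds of filters. It is not fastq-mcf and does not reproduce its algorithm or options: adapter clipping is left out, reads are plain (sequence, Phred quality list) pairs, and the thresholds `min_qual` and `min_len` are arbitrary values chosen for the example.

```python
def trim_read(seq, quals, min_qual=20):
    """Cut bases off both ends while their Phred quality is below min_qual."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_qual:
        start += 1
    while end > start and quals[end - 1] < min_qual:
        end -= 1
    return seq[start:end], quals[start:end]


def preprocess(reads, min_qual=20, min_len=5):
    """Trim read ends, then drop reads with Ns, duplicates and too-short reads."""
    seen = set()
    kept = []
    for seq, quals in reads:
        seq, quals = trim_read(seq, quals, min_qual)
        if "N" in seq or len(seq) < min_len or seq in seen:
            continue
        seen.add(seq)
        kept.append((seq, quals))
    return kept


# Hypothetical reads: (sequence, per-base Phred qualities).
raw_reads = [
    ("ACGTACGTAC", [5, 30, 30, 30, 30, 30, 30, 30, 10, 8]),  # poor-quality ends
    ("ACGTNCGTAC", [30] * 10),                               # contains an N
    ("CGTACGT", [30] * 7),                                   # duplicate after trimming
    ("ACGTAC", [30, 30, 30, 5, 5, 5]),                       # too short after trimming
    ("GGCATTCA", [30] * 8),                                  # clean read
]
clean_reads = preprocess(raw_reads)
reduction = 1 - len(clean_reads) / len(raw_reads)
print([seq for seq, _ in clean_reads], f"reduction: {reduction:.0%}")
```

Here the reduction is counted in reads; the slide does not say whether its 73% figure refers to reads or to bases.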

20

Assembly parameters setting

K-mers: strings of a fixed length k, shorter than the reads they are taken from

Best empirical k-mer length: 91 bases

High coverage
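
A short Python sketch of what "k-mers" means in practice: a read of length L contains L - k + 1 k-mers, so a long k such as 91 leaves only a few k-mers per read. The 100-base read length below is an assumption for illustration (the slides do not state the read length), and the toy reads are invented.

```python
from collections import Counter


def kmers(read, k):
    """All substrings of length k in a read; a read of length L yields L - k + 1."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_counts(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Assumed read length of 100 bases with k = 91.
kmers_per_read = len(kmers("A" * 100, 91))
print(kmers_per_read)

# Toy reads: the third one differs from the others by a single base.
toy_reads = ["ATGGCGT", "ATGGCGT", "ATGGAGT"]
counts = kmer_counts(toy_reads, k=4)
rare = sorted(kmer for kmer, n in counts.items() if n == 1)
print(counts["ATGG"], rare)
```

In the toy example the single differing base produces three k-mers that occur only once, while k-mers shared by all reads are counted repeatedly.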

21

MaSuRCA results of Mycobacteria

Abnormal GC content

Genome size too high

22

Examples of environmental contamination

GC-content-based quality analysis

Staphylococcus epidermidis
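
A GC-content check of this kind can be sketched in Python as below. The expected GC fraction of 0.65 for mycobacteria is an approximate value assumed for the example, and the tolerance, contig names and sequences are all hypothetical; this is not the analysis pipeline used in the talk.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0


def flag_contigs(contigs, expected_gc, tolerance=0.10):
    """Names of contigs whose GC fraction is further than tolerance from expected_gc."""
    return [
        name
        for name, seq in contigs.items()
        if abs(gc_content(seq) - expected_gc) > tolerance
    ]


# Hypothetical contigs from one assembly.
contigs = {
    "contig_1": "GCGCCGGCATGCCGATATGC",  # GC-rich, as assumed for mycobacteria
    "contig_2": "ATTATAGCATTAATAGCTAA",  # AT-rich, a candidate contaminant
}
suspicious = flag_contigs(contigs, expected_gc=0.65)  # 0.65 is an assumed value
print({name: round(gc_content(seq), 2) for name, seq in contigs.items()}, suspicious)
```

A flagged contig is only a candidate: confirming where it comes from still requires comparing it against reference sequences.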

Thanks

http://gcat.davidson.edu/phast/#methods