Correction of Sequencing Errors
Summer school on bioinformatics data structures

Leena Salmela
University of Helsinki
August 9th, 2016

This lecture was part of the 1st Summer School on Bioinformatics Data Structures, funded by the BIRDS project (www.birdsproject.eu). This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 690941.

Outline

Introduction

Correction of sequencing errors in short read data

Error correction of long read data

Introduction

Sequencing technologies

Technology               Read length    Error rate   Typical errors
Illumina                 150 - 300      < 1%         substitutions
Pacific Biosciences      20000          > 15%        indels
Oxford NanoPore MinION   up to 100000   10-30%       indels

[Figure: reads are obtained from the genome by sequencing]

Error correction problem

- High-throughput sequencing produces large sets of short DNA sequences, i.e. reads, that may contain errors
- Sequencing errors greatly complicate de novo assembly
- Error correction aims at reducing the error rate prior to assembly
- Input of error correction:
  - k reads, i.e. strings usually containing the characters A, C, G, T and N
  - Each read may be from the forward or the reverse strand
  - The length of reads can vary from read to read
  - All (or at least most) reads come from the target genome, but each read may contain a small number of errors

How does error correction work?

- DNA sequencing randomly samples reads from the genome
- The reads may contain errors
- Each position is sampled several times
- Errors can be detected by aligning the reads with each other
- SNPs or errors?

[Figure: overlapping reads aligned with each other; the column offsets are not preserved here]
ACGGTAGATGCTAGGGTAGTAGT...
TAGATGCTAGCTAGGG
...ACGGTACTAACTAGGGTAGT
CGGTAGATGCTAGCTAGAGTAGT...
TAGATGCTAGCTAGGGTA

Correction of sequencing errors in short read data

Illumina: An example data set (E. coli K12)

- Read length: 100
- Number of reads: 2.3 million
- Coverage: 50x

Score = 180 bits (97), Expect = 2e-46

Identities = 99/100 (99%), Gaps = 0/100 (0%)

Strand=Plus/Plus

Query 1 TTATTGTACAGCGCCCAGACAATTAACACGACGGCATTCGCCACTGCCAGCAAAAAATAG 60

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct 1085516 TTATTGTACAGCGCCCAGACAATTAACACGACGGCATTCGCCACTGCCAGCAAAAAATAG 1085575

Query 61 AACTGAAGTCCGCTTCTGGCCTCGCTTTGCCAGTAATAAC 100

|||||||||| |||||||||||||||||||||||||||||

Sbjct 1085576 AACTGAAGTCGGCTTCTGGCCTCGCTTTGCCAGTAATAAC 1085615
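
The stated coverage can be checked with a short calculation. The sketch below assumes the usual definition, coverage = number of reads × read length / genome size, and an E. coli genome size of 4.6 Mbp (the value given in the data set table later in this lecture). The function name `coverage` is only illustrative.

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering a genome position."""
    return num_reads * read_length / genome_size


# 2.3 million reads of length 100 over an assumed 4.6 Mbp genome
illumina_coverage = coverage(2_300_000, 100, 4_600_000)
print(f"{illumina_coverage:.0f}x")
```

With these numbers the calculation gives the 50x quoted for this data set.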

k-mer spectrum based methods

- Consider the set of k-mers occurring in the reads
- Solid k-mer: occurs at least M times
- Otherwise a k-mer is weak
- Find the minimum number of edits that make all k-mers solid in a read
  - Dynamic programming

Read:   ACTGAACGTAAGT
4-mers: ACTG CTGA TGAA GAAC AACG ACGT CGTA GTAA TAAG AAGT

=>

Read:   ACTGAAGGTAAGT
4-mers: ACTG CTGA TGAA GAAG AAGG AGGT GGTA GTAA TAAG AAGT
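
The solid/weak classification can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular tool: the read set is hypothetical (three error-free copies of the corrected read from the example plus the erroneous read), reverse complements are ignored, and a brute-force search over single substitutions stands in for the dynamic programming mentioned above. All function names are made up for the example.

```python
from collections import Counter
from typing import List, Optional


def kmers(read: str, k: int) -> List[str]:
    """All k-mers of a read, in order of occurrence."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads: List[str], k: int) -> Counter:
    """The k-mer spectrum: how often each k-mer occurs in the reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def weak_kmers(read: str, counts: Counter, k: int, m: int) -> List[str]:
    """k-mers of the read that occur fewer than m times (weak k-mers)."""
    return [kmer for kmer in kmers(read, k) if counts[kmer] < m]


def correct_one_substitution(read: str, counts: Counter, k: int, m: int) -> Optional[str]:
    """Try every single substitution; return a read whose k-mers are all solid."""
    if not weak_kmers(read, counts, k, m):
        return read
    for i in range(len(read)):
        for base in "ACGT":
            if base == read[i]:
                continue
            candidate = read[:i] + base + read[i + 1:]
            if not weak_kmers(candidate, counts, k, m):
                return candidate
    return None  # no single substitution makes all k-mers solid


# Hypothetical read set: three correct copies and one read with an error
reads = ["ACTGAAGGTAAGT"] * 3 + ["ACTGAACGTAAGT"]
spectrum = count_kmers(reads, k=4)
weak = weak_kmers("ACTGAACGTAAGT", spectrum, k=4, m=2)
corrected = correct_one_substitution("ACTGAACGTAAGT", spectrum, k=4, m=2)
print(weak, corrected)
```

In this toy data the four weak 4-mers all cover the same read position, so a single substitution there is enough to make every k-mer solid.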

Alignment based methods

Two subproblems:
- Identify clusters of reads originating from the same genomic position
- Form a multiple alignment of a cluster of reads

Identifying clusters of reads

- Typically a k-mer index is used
- Construct a hash table that associates each k-mer to the reads where it occurs either in forward or reverse orientation
- Index only those k-mers that are lexicographically smaller than their reverse complements

Reads:
1: GTCAGAAGTCGTGGTAACCCTTGATA
2: GGTATCAAGGGTTACCACGACTTTCTG
3: CAGAAGTCGTGGTAACCCTTGATACC
4: GTCGTGGTAACCTTGATACCACCGG
5: GAACCGGTGGTATCAAGGGTTACCA
6: GGTACCCCTTGATACCACCGGTTCA

Index:
...
CAGAAGTC: 1f 3f
...
CTTCTGAC: 1r
...
GGTTACCA: 1r 2f 3r 4r 5f
...
TCAGAAGT: 1f
...
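
A k-mer index of this kind might be built as follows. This is a sketch rather than the implementation of any specific tool. It assumes the concatenated example sequence splits into the six reads listed in the code, with read numbers starting at 1, and it stores each k-mer under the lexicographically smaller of itself and its reverse complement. The names `reverse_complement` and `build_kmer_index` are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def build_kmer_index(reads: List[str], k: int) -> Dict[str, List[Tuple[int, str]]]:
    """Map each canonical k-mer to (read number, orientation) pairs.

    Orientation is "f" if the read contains the indexed k-mer itself and
    "r" if the read contains its reverse complement.
    """
    index: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for read_id, read in enumerate(reads, start=1):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            rc = reverse_complement(kmer)
            if kmer <= rc:
                canonical, orientation = kmer, "f"
            else:
                canonical, orientation = rc, "r"
            entry = (read_id, orientation)
            if entry not in index[canonical]:
                index[canonical].append(entry)
    return index


# The example reads, split as assumed above
reads = [
    "GTCAGAAGTCGTGGTAACCCTTGATA",
    "GGTATCAAGGGTTACCACGACTTTCTG",
    "CAGAAGTCGTGGTAACCCTTGATACC",
    "GTCGTGGTAACCTTGATACCACCGG",
    "GAACCGGTGGTATCAAGGGTTACCA",
    "GGTACCCCTTGATACCACCGGTTCA",
]
index = build_kmer_index(reads, k=8)
print(index["GGTTACCA"])
```

Under the strict canonical rule the k-mer TCAGAAGT would be filed under its reverse complement ACTTCTGA, which is lexicographically smaller, so this sketch records that occurrence as ACTTCTGA with orientation r.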

Forming multiple alignments

- Each multiple alignment is based on one read called the base read
- Retrieve from the index all reads that share at least one k-mer with the base read (k-mer neighborhood of the base read)
- Heuristics are needed to speed up construction of the multiple alignment

GTCAGAA-GTCGTGGTAACCCTTGATACCACCGGTTCA

GTCAGAA-GTCGTGGTAACCCTTGATA
  CAGAAAGTCGTGGTAACCCTTGATACC
  CAGAA-GTCGTGGTAACCCTTGATACC
        GTCGTGGTAACC-TTGATACCACCGG
            TGGTAACCCTTGATACCACCGGTTC
             GGTACCCCTTGATACCACCGGTTCA
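
Once the reads of a cluster are placed in a multiple alignment, a simple way to derive a corrected sequence is a column-wise majority vote. The sketch below is a simplification and an assumption, not necessarily the scoring used by alignment based correctors. The row offsets are reconstructed by lining each row up with the first sequence of the example, `-` denotes a gap, and every column is assumed to be covered by at least one row.

```python
from collections import Counter
from typing import List, Tuple


def consensus(rows: List[Tuple[int, str]]) -> str:
    """Majority symbol per column; each row is (start column, aligned sequence)."""
    width = max(offset + len(seq) for offset, seq in rows)
    columns = [Counter() for _ in range(width)]
    for offset, seq in rows:
        for i, symbol in enumerate(seq):
            columns[offset + i][symbol] += 1
    # most_common(1) yields the most frequent symbol of a column
    return "".join(column.most_common(1)[0][0] for column in columns)


# Aligned reads of the example, with reconstructed (assumed) offsets
rows = [
    (0, "GTCAGAA-GTCGTGGTAACCCTTGATA"),
    (2, "CAGAAAGTCGTGGTAACCCTTGATACC"),
    (2, "CAGAA-GTCGTGGTAACCCTTGATACC"),
    (8, "GTCGTGGTAACC-TTGATACCACCGG"),
    (12, "TGGTAACCCTTGATACCACCGGTTC"),
    (13, "GGTACCCCTTGATACCACCGGTTCA"),
]
cons = consensus(rows)
print(cons)
```

Each of the three disagreeing columns has a single outlier (an extra A, a missing C, and a C in place of an A), so the vote is unambiguous in this example.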

Suffix trie/array based methods

- Computing alignments is time consuming
- k-mer spectrum methods do not take full advantage of context
- How about using a suffix trie for representing alignments implicitly?

[Figure: a suffix trie of two reads terminated by $1 and $2]

Generalized suffix trie

- The generalized suffix trie of a set of reads is a tree containing all the suffixes of the reads
- Concatenate a unique symbol $i to each read so that all the suffixes are unique
- The path label of a node is the concatenation of edge labels on the path from the root to the node
- The level of a node is the length of the path from the root to the node
- The weight of a node is the number of leaves in the subtrie rooted at that node

[Figure: generalized suffix trie for GATA$1 and ATGT$2. For the red node: level = 2, weight = 2, path label = AT]
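
The definitions of level and weight can be made concrete with a small trie built from nested dictionaries. This is a minimal sketch: the representation (one dictionary per node, keyed by edge symbol) and the function names are choices made for the example, and a practical implementation would use a compressed structure such as a suffix array.

```python
from typing import Dict, List


def build_generalized_suffix_trie(reads: List[str]) -> Dict:
    """Insert every suffix of every read, each read terminated by a unique $i."""
    root: Dict = {}
    for idx, read in enumerate(reads, start=1):
        symbols = list(read) + [f"${idx}"]
        for start in range(len(symbols)):
            node = root
            for symbol in symbols[start:]:
                node = node.setdefault(symbol, {})
    return root


def find_node(root: Dict, path_label: str) -> Dict:
    """Follow a path label made of single-character edge labels."""
    node = root
    for symbol in path_label:
        node = node[symbol]
    return node


def weight(node: Dict) -> int:
    """Number of leaves in the subtrie rooted at the node."""
    if not node:
        return 1
    return sum(weight(child) for child in node.values())


trie = build_generalized_suffix_trie(["GATA", "ATGT"])
at_node = find_node(trie, "AT")
at_level = len("AT")  # level = number of edges on the path from the root
at_weight = weight(at_node)
print(at_level, at_weight)
```

For the node with path label AT this gives level 2 and weight 2, because exactly two suffixes, ATA$1 and ATGT$2, start with AT.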

Suffix trie for error correction

- Build a suffix trie of the reads and their reverse complements
- Top levels: Almost all nodes have four children
- Intermediate levels: Most nodes have only one child. If there are more children, the branching is likely caused by sequencing errors.
- Lowest levels: The weights are too small to distinguish between erroneous and correct children

[Figure: a branching node at level r with children A and T, one a node with low weight and the other a node with higher weight]

Algorithm for correcting substitutions

- Traverse the nodes at the intermediate level of the trie
- Find a node that has more than one child and some of the children have lower than expected weight.
- Compare the subtries rooted at the low weight node and its sibling nodes.
- If the sibling subtrie contains the low weight subtrie, correct the error by substituting the base of the low weight node with the base of the sibling.
- Transfer the correction to the reads.

[Figure: a branching node at level r with children A and T, one a node with low weight and the other a node with higher weight]
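
The detection part of these steps can be sketched as follows. This is a simplified illustration under several assumptions: the reads are hypothetical, reverse complements are left out, read terminators are replaced by a per-node count of the suffixes passing through the node (which plays the role of the weight), a fixed threshold `min_weight` stands in for the expected weight, and the transfer of corrections back to the reads is omitted.

```python
from typing import Dict, List, Tuple


def new_node() -> Dict:
    return {"weight": 0, "children": {}}


def build_trie(reads: List[str]) -> Dict:
    """Suffix trie in which each node counts the suffixes passing through it."""
    root = new_node()
    for read in reads:
        for start in range(len(read)):
            node = root
            node["weight"] += 1
            for base in read[start:]:
                node = node["children"].setdefault(base, new_node())
                node["weight"] += 1
    return root


def contains(strong: Dict, weak: Dict) -> bool:
    """True if every path of the weak subtrie also exists in the strong one."""
    return all(
        base in strong["children"] and contains(strong["children"][base], child)
        for base, child in weak["children"].items()
    )


def find_substitutions(root: Dict, level: int, min_weight: int) -> List[Tuple[str, str, str]]:
    """Report (path label, erroneous base, suggested base) at the given level."""
    found: List[Tuple[str, str, str]] = []

    def visit(node: Dict, label: str) -> None:
        if len(label) == level:
            children = node["children"]
            for base, weak in children.items():
                if weak["weight"] >= min_weight:
                    continue
                for sibling_base, strong in children.items():
                    if strong["weight"] >= min_weight and contains(strong, weak):
                        found.append((label, base, sibling_base))
            return
        for base, child in node["children"].items():
            visit(child, label + base)

    visit(root, "")
    return found


# Hypothetical data: four correct reads and one with a substitution
reads = ["GATTACA"] * 4 + ["GATAACA"]
suggestions = find_substitutions(build_trie(reads), level=3, min_weight=2)
print(suggestions)
```

Here the node with path label GAT has a child T of weight 4 and a child A of weight 1 whose subtrie is contained in that of T, so the sketch suggests replacing the A that follows GAT with T.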

Algorithm for correcting insertions and deletions

- Insertions and deletions also cause extra branching in the suffix trie.
- These errors can thus be similarly detected and corrected.

[Figure: two trie fragments, labelled Insertion and Deletion, each marking a node with low weight and a node with higher weight]

Summary of short read error correction

- k-mer spectrum based methods
- Alignment based methods
- Suffix trie/array based methods

Error correction of long read data

PacBio: An example data set (E. coli K12)

- Mean read length: 7630
- Median read length: 6280
- Max read length: 35422
- Error rate: ∼15%

[Figure: histogram of read lengths; x-axis: Read length (0 to 30000), y-axis: Count (0 to 30000)]

Score = 2002 bits (1084), Expect = 0.0

Identities = 1977/2341 (84%), Gaps = 330/2341 (14%)

Strand=Plus/Plus

Query 10122 ATCCAGTCCCCGGCAAGCTTGCTGC-AGAACTGCTCCGTGCTAAAATAGAAAGTTGCGGA 10180

||||||||||||||| ||| ||||| ||||||||||||||||||| || ||||||||| |

Sbjct 3817612 ATCCAGTCCCCGGCA-GCT-GCTGCCAGAACTGCTCCGTGCTAAA-TA-AAAGTTGCG-A 3817666

Query 10181 ACCAGGACCCCTTCACCAC-GTTCATTCAATGCATTAGCGCGCCCGG-TTAGCGGTATTC 10238

|||||||| |||| ||| | |||| || || || ||||||| ||||||||||||

Sbjct 3817667 -CCAGGACCG-TTCATCACTGG-CATT-AA-GC----GC-CGCCCGGGTTAGCGGTATTC 3817716

Query 10239 CCATTGCCATCACCCAGCGAGTAAAAGGTGCTGCTTACGAGCCAGAAATAGAAACTGATG 10298

|||||||||||||||||||||||||||| |||||||||||||||| || || ||||||

Sbjct 3817717 CCATTGCCATCACCCAGCGAGTAAAAGG--CTGCTTACGAGCCAGAT-TA-AA-CTGATG 3817771

Long read error correction

- Error rate of PacBio reads is high, ∼15%
- Correcting the errors simplifies further analysis like de novo assembly
- Two approaches:
  - Use PacBio reads only; challenge: the error rate of aligning two PacBio reads is ∼30%
  - Use short accurate reads (typically Illumina) to correct the PacBio reads

Alignment based approaches

[Figure: PacBio read -> Align short reads -> Correct errors -> Corrected PacBio read]

- E.g. PacBioToCA, LSC
- Problem: Aligning short reads to the PacBio reads allowing for a high error rate is slow.

De Bruijn graph (DBG)

- Given a set of reads R1...Rn
- Extract all k-mers that occur in the reads
- Form a graph:
  - the k-mers are the vertices in the graph
  - draw an edge between two k-mers if they overlap by k − 1 bases
- (Alternative representation: (k − 1)-mers are vertices, with an edge between two vertices if the k-mer occurs in the reads)

De Bruijn graph: Example

- Reads: ATGCGT, GCGTGG, GTGGCA
- 3-mers: ATG, TGC, GCG, CGT, GTG, TGG, GGC, GCA
- DBG:

[Figure: De Bruijn graph with vertices ATG, TGC, GTG, TGG, GGC, GCA, GCG, CGT]
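
The graph of this example can be built directly from the definition: k-mers are vertices, and there is an edge from one k-mer to another if the last k − 1 bases of the first equal the first k − 1 bases of the second. The sketch below uses a plain adjacency dictionary, which is an assumption made for readability. Practical tools use far more compact representations.

```python
from typing import Dict, List


def build_dbg(reads: List[str], k: int) -> Dict[str, List[str]]:
    """Node-centric De Bruijn graph as an adjacency dictionary."""
    vertices: List[str] = []
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                vertices.append(kmer)
    graph: Dict[str, List[str]] = {kmer: [] for kmer in vertices}
    for kmer in vertices:
        # A successor shares its first k - 1 bases with our last k - 1 bases
        for base in "ACGT":
            successor = kmer[1:] + base
            if successor in seen:
                graph[kmer].append(successor)
    return graph


dbg = build_dbg(["ATGCGT", "GCGTGG", "GTGGCA"], k=3)
for kmer, successors in dbg.items():
    print(kmer, "->", successors)
```

Note that this definition also creates edges between k-mers that are never adjacent in any read, such as GTG to TGC.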

DBG: Paths spell strings

- Consider the red path in the graph
- The string spelled by this path is

GTG
 TGC
  GCG
   CGT
GTGCGT

[Figure: De Bruijn graph with vertices ATG, TGC, GTG, TGG, GGC, GCA, GCG, CGT; the red path is GTG, TGC, GCG, CGT]
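
Spelling a path amounts to taking the first k-mer in full and then appending the last base of every following k-mer. A minimal sketch, assuming a non-empty path given as a list of k-mers (the function name `spell` is illustrative):

```python
from typing import List


def spell(path: List[str]) -> str:
    """String spelled by a path of k-mers that overlap by k - 1 bases."""
    for left, right in zip(path, path[1:]):
        if left[1:] != right[:-1]:
            raise ValueError(f"{left} and {right} do not overlap by k - 1 bases")
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


red_path = ["GTG", "TGC", "GCG", "CGT"]
spelled = spell(red_path)
print(spelled)
```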

DBG based error correction: overview

- Build a DBG of the short accurate reads
- Align the long PacBio reads against the DBG
- Correct the long read based on the alignment path in the DBG

Solid and weak regions in long reads

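- Classify k-mers in the long reads:
  - solid: in the DBG
  - weak: not in the DBG
- The long read now consists of solid and weak regions

[Figure: a long read with solid and weak regions; a bridge path between solid k-mers s1 and t1, no path found between s2 and t2, and an extension path from s3]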
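
The classification can be sketched by testing every k-mer of the long read for membership in the set of DBG vertices and grouping consecutive k-mers with the same label. The example reuses the 3-mers of the small DBG example as the solid set. The long read `ATGCGAGGCA` is hypothetical, chosen so that it differs from the string ATGCGTGGCA by one substitution.

```python
from itertools import groupby
from typing import List, Set, Tuple


def kmer_runs(read: str, solid: Set[str], k: int) -> List[Tuple[str, int, int]]:
    """Runs of solid/weak k-mers as (label, first k-mer index, end index)."""
    flags = [read[i:i + k] in solid for i in range(len(read) - k + 1)]
    runs: List[Tuple[str, int, int]] = []
    position = 0
    for is_solid, group in groupby(flags):
        length = len(list(group))
        runs.append(("solid" if is_solid else "weak", position, position + length))
        position += length
    return runs


# Vertices of the DBG built from the short reads in the earlier example
solid_kmers = {"ATG", "TGC", "GCG", "CGT", "GTG", "TGG", "GGC", "GCA"}
runs = kmer_runs("ATGCGAGGCA", solid_kmers, k=3)
print(runs)
```

The indices refer to k-mer start positions, with the end index exclusive. The weak run covers the three k-mers that contain the erroneous base.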

Correcting a weak region in a long read

- Find paths in the DBG between the flanking solid k-mers
- Minimize the edit distance between the long read and the string spelled by the path.
- Allow only limited branching
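
A bridge path search along these lines might look like the sketch below. It is an illustration under assumptions, not the algorithm of a specific tool: the graph is the small example DBG written out as an adjacency dictionary, the long read `ATGCGAGGCA` is hypothetical, the search is a plain depth-first enumeration, and a cap on path length (`slack`) plus a step budget (`max_steps`) stand in for the limit on branching.

```python
from typing import Dict, List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def best_bridge(graph: Dict[str, List[str]], source: str, target: str,
                segment: str, slack: int = 2,
                max_steps: int = 10_000) -> Optional[Tuple[int, str]]:
    """Best (edit distance, spelled string) over paths from source to target."""
    k = len(source)
    max_nodes = len(segment) + slack - k + 1  # cap on the path length
    best: Optional[Tuple[int, str]] = None
    stack = [(source, source)]  # (current k-mer, string spelled so far)
    steps = 0
    while stack and steps < max_steps:
        steps += 1
        node, spelled = stack.pop()
        if node == target and len(spelled) > k:
            distance = edit_distance(spelled, segment)
            if best is None or distance < best[0]:
                best = (distance, spelled)
            continue
        if len(spelled) - k + 1 >= max_nodes:
            continue
        for successor in graph.get(node, []):
            stack.append((successor, spelled + successor[-1]))
    return best


# Example DBG (3-mers of the reads ATGCGT, GCGTGG, GTGGCA)
graph = {
    "ATG": ["TGC", "TGG"], "TGC": ["GCA", "GCG"], "GCG": ["CGT"],
    "CGT": ["GTG"], "GTG": ["TGC", "TGG"], "TGG": ["GGC"],
    "GGC": ["GCA", "GCG"], "GCA": [],
}
read = "ATGCGAGGCA"                # hypothetical long read with one error
segment = read[2:9]                # from solid k-mer GCG to solid k-mer GGC
best = best_bridge(graph, "GCG", "GGC", segment)
corrected = read[:2] + best[1] + read[9:]
print(best, corrected)
```

In this example the only path within the length cap spells GCGTGGC, one substitution away from the read segment GCGAGGC, and splicing it into the read yields ATGCGTGGCA.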

Correcting weak heads/tails of long reads

- Find a path in the DBG starting from the extreme solid k-mer
- Maximize the length of the prefix of the end to correct
- Minimize the edit distance between the path and the prefix of the end
- Find the best extension maximizing an alignment score

What if short reads are not available?

- We can use the DBG based approach if the graph can be built on long reads only
- This is possible if
  - k is small
  - coverage of the read set is high
  => k-mers are present in the reads with high enough abundance
- An iterative approach with increasing k is beneficial
- An additional step based on multiple alignments is beneficial to take advantage of the long range information

Data sets

                    E. coli   Yeast
Genome size         4.6 Mbp   12 Mbp
PacBio coverage     208x      129x
Illumina coverage   50x       38x

Error correction tools

- Self-correction:
  - LoRDEC*+LoRMA (DBG + alignment)
  - PBcR (Alignment)
- Hybrid correction:
  - LoRDEC and Jabba (DBG)
  - PBcR (Alignment)
  - proovread (Alignment)

Aligning against reference

- Reads were aligned against the reference with BLASR.
- We measured
  - The proportion of reads that were corrected
  - The proportion of reads that was aligned
  - The genome coverage
  - The error rate

E. coli

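[Figure: bar charts for E. coli. Left panel, in %: Size, Aligned, Genome Coverage (axis 0 to 100). Right panel: Error Rate (axis 0 to 3; a value label of 17 also appears). Series: Original, PBcR (self), LoRDEC*+LoRMA, LoRDEC, proovread, PBcR (hybrid), Jabba]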

Yeast

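[Figure: bar charts for Yeast. Left panel, in %: Size, Aligned, Genome Coverage (axis 0 to 100). Right panel: Error Rate (axis 0 to 3; a value label of 17 also appears). Series: Original, PBcR (self), LoRDEC*+LoRMA, LoRDEC, proovread, PBcR (hybrid), Jabba]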

Resources

[Figure: bar charts of Runtime (h), Memory (GB) and Disk (GB), one panel for E. coli and one for Yeast (axis 0 to 40; value labels of 160 and 158 also appear). Series: PBcR (self), LoRDEC*+LoRMA, LoRDEC, proovread, PBcR (hybrid), Jabba]

Scaling to the parrot data (LoRDEC)

[Figure: CPU time (h), Memory (GB) and Disk (GB) used by LoRDEC, on a logarithmic axis from 0.1 to 1000. Series: E. coli, Yeast, Parrot]

Summary of long read error correction

- Hybrid correction: using also short reads
- Self-correction: using only long reads
- Alignment based approach
- DBG based approach

Acknowledgements

- Eric Rivals, University of Montpellier
- Esko Ukkonen, University of Helsinki
- Riku Walve, University of Helsinki
- Jan Schroder, Walter and Eliza Hall Institute of Medical Research, Melbourne
