
Correction of Sequencing Errors

Summer school on bioinformatics data structures

Leena Salmela

University of Helsinki

August 9th, 2016


This lecture was part of the 1st Summer School on Bioinformatics Data Structures, funded by the BIRDS project (www.birdsproject.eu). This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 690941.

Outline

Introduction

Correction of sequencing errors in short read data

Error correction of long read data



Introduction

Sequencing technologies

Technology                Read length    Error rate   Typical errors
Illumina                  150 - 300      < 1%         substitutions
Pacific Biosciences       20000          > 15%        indels
Oxford Nanopore MinION    up to 100000   10-30%       indels

[Figure: Genome, Sequencing]



Error correction problem

- High-throughput sequencing produces large sets of short DNA sequences, i.e. reads, that may contain errors
- Sequencing errors greatly complicate de novo assembly
- Error correction aims at reducing the error rate prior to assembly
- Input of error correction:
  - k reads, i.e. strings usually containing the characters A, C, G, T, N
  - Each read may be from the forward or the reverse strand
  - The length of reads can vary from read to read
  - All (or at least most) reads come from the target genome, but each read may contain a small number of errors



How does error correction work?

- DNA sequencing randomly samples reads from the genome
- The reads may contain errors
- Each position is sampled several times
- Errors can be detected by aligning the reads with each other
- SNPs or errors?

[Figure: overlapping reads aligned with each other]
A C G G T A G A T G C T A G G G T A G T A G T . . .
T A G A T G C T A G C T A G G G
. . . A C G G T A C T A A C T A G G G T A G T
C G G T A G A T G C T A G C T A G A G T A G T . . .
T A G A T G C T A G C T A G G G T A
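The idea of detecting errors by comparing overlapping reads can be sketched as a column-wise vote. This is only a minimal illustration, assuming the reads are already placed on a common coordinate system; the toy reads, the `(offset, sequence)` representation, and the `min_support` threshold are all hypothetical and not taken from the lecture. A base that disagrees with the column majority and has little support is flagged as a likely error, while a variant supported by several reads (a possible SNP) is left alone.

```python
from collections import Counter

def column_votes(aligned_reads):
    """Count the bases observed in every column of the pileup."""
    columns = {}
    for offset, seq in aligned_reads:
        for i, base in enumerate(seq):
            columns.setdefault(offset + i, Counter())[base] += 1
    return columns

def suspicious_bases(aligned_reads, min_support=2):
    """Return (read_id, position, base, consensus) for weakly supported bases."""
    columns = column_votes(aligned_reads)
    flagged = []
    for read_id, (offset, seq) in enumerate(aligned_reads):
        for i, base in enumerate(seq):
            votes = columns[offset + i]
            consensus, _ = votes.most_common(1)[0]
            # A disagreeing base seen fewer than min_support times looks
            # like an error; better supported variants may be real SNPs.
            if base != consensus and votes[base] < min_support:
                flagged.append((read_id, i, base, consensus))
    return flagged

# Hypothetical toy pileup: (offset on the common coordinate system, read)
toy_reads = [(0, "ACGGTAGATG"), (2, "GGTACATG"), (4, "TAGATGCT")]
flagged = suspicious_bases(toy_reads)
```

Here the second read carries a C where the other two reads agree on G, so it is the only base reported.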



Correction of sequencing errors in short read data





Illumina: An example data set (E. coli K12)

- Read length: 100
- Number of reads: 2.3 million
- Coverage: 50x
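The coverage figure follows from the other two numbers once a genome size is fixed. The sketch below assumes the E. coli genome size of 4.6 Mbp given in the data sets table later in this lecture; the function name is only illustrative.

```python
def coverage(num_reads, read_length, genome_size):
    """Average number of reads covering each genome position."""
    return num_reads * read_length / genome_size

# 2.3 million reads of length 100 on a 4.6 Mbp genome (assumed size)
illumina_coverage = coverage(2_300_000, 100, 4_600_000)
```

With these inputs the result is 50, matching the 50x stated for this data set.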

Score = 180 bits (97), Expect = 2e-46

Identities = 99/100 (99%), Gaps = 0/100 (0%)

Strand=Plus/Plus

Query 1 TTATTGTACAGCGCCCAGACAATTAACACGACGGCATTCGCCACTGCCAGCAAAAAATAG 60

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Sbjct 1085516 TTATTGTACAGCGCCCAGACAATTAACACGACGGCATTCGCCACTGCCAGCAAAAAATAG 1085575

Query 61 AACTGAAGTCCGCTTCTGGCCTCGCTTTGCCAGTAATAAC 100

|||||||||| |||||||||||||||||||||||||||||

Sbjct 1085576 AACTGAAGTCGGCTTCTGGCCTCGCTTTGCCAGTAATAAC 1085615




k-mer spectrum based methods

- Consider the set of k-mers occurring in the reads
- Solid k-mer: occurs at least M times
- Otherwise a k-mer is weak
- Find the minimum number of edits that make all k-mers solid in a read
- Dynamic programming

Read before correction: A C T G A A C G T A A G T

=⇒

Read after correction: A C T G A A G G T A A G T

[Figure: the k-mers of the read before and after the correction]
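The solid/weak classification can be sketched in a few lines. This is a simplified illustration under stated assumptions: instead of the dynamic programming mentioned above, it tries every single substitution by brute force and accepts the first one that leaves no weak k-mer in the read. The function names, the choice k = 4 and M = 2, and the toy read set (four copies of the corrected read from the slide plus one erroneous copy) are hypothetical.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count how many times each k-mer occurs in the read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_positions(read, counts, k, m):
    """Start positions of the k-mers of `read` that occur fewer than m times."""
    return [i for i in range(len(read) - k + 1) if counts[read[i:i + k]] < m]

def correct_one_substitution(read, counts, k, m):
    """Brute-force search for one substitution that makes every k-mer solid."""
    if not weak_positions(read, counts, k, m):
        return read
    for pos in range(len(read)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if not weak_positions(candidate, counts, k, m):
                return candidate
    return read  # no single substitution fixes the read

toy_reads = ["ACTGAAGGTAAGT"] * 4 + ["ACTGAACGTAAGT"]
counts = kmer_counts(toy_reads, 4)
weak = weak_positions("ACTGAACGTAAGT", counts, 4, 2)
corrected = correct_one_substitution("ACTGAACGTAAGT", counts, 4, 2)
```

The four weak 4-mers all overlap the erroneous C, and replacing it with G recovers the read shown after the arrow on the slide.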



Alignment based methods

Two subproblems:
- Identify clusters of reads originating from the same genomic position
- Form a multiple alignment of a cluster of reads



Identifying clusters of reads

- Typically a k-mer index is used
- Construct a hash table that associates each k-mer to the reads where it occurs either in forward or reverse orientation
- Index only those k-mers that are lexicographically smaller than their reverse complements

GTCAGAAGTCGTGGTAACCCTTGATA
GGTATCAAGGGTTACCACGACTTTCTG
CAGAAGTCGTGGTAACCCTTGATACC
GTCGTGGTAACCTTGATACCACCGG
GAACCGGTGGTATCAAGGGTTACCA
GGTACCCCTTGATACCACCGGTTCA

⇒

...
CAGAAGTC: 1f 3f
...
CTTCTGAC: 1r
...
GGTTACCA: 1r 2f 3r 4r 5f
...
TCAGAAGT: 1f
...
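A hash-table index of this kind might look as follows. This is a minimal sketch, not the data structure of any particular tool: the function names are illustrative, reads are numbered from 1 in the order shown on the slide, and k = 8 is assumed because the index entries on the slide are 8-mers. Each k-mer is stored under whichever of itself and its reverse complement is lexicographically smaller, together with the orientation (`f` or `r`) in which the read contains that key.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def build_kmer_index(reads, k):
    """Map each canonical k-mer to a list of (read id, orientation)."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads, start=1):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            rc = reverse_complement(kmer)
            if kmer <= rc:
                index[kmer].append((read_id, "f"))  # read contains the key itself
            else:
                index[rc].append((read_id, "r"))    # read contains its reverse complement
    return index

reads = [
    "GTCAGAAGTCGTGGTAACCCTTGATA",
    "GGTATCAAGGGTTACCACGACTTTCTG",
    "CAGAAGTCGTGGTAACCCTTGATACC",
    "GTCGTGGTAACCTTGATACCACCGG",
    "GAACCGGTGGTATCAAGGGTTACCA",
    "GGTACCCCTTGATACCACCGGTTCA",
]
index = build_kmer_index(reads, 8)
```

Looking up `index["GGTTACCA"]` then returns the five reads listed for that k-mer on the slide, with the same orientations.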



Forming multiple alignments

- Each multiple alignment is based on one read called the base read
- Retrieve from the index all reads that share at least one k-mer with the base read (k-mer neighborhood of the base read)
- Heuristics needed to speed up construction of the multiple alignment

G T C A G A A – G T C G T G G T A A C C C T T G A T A C C A C C G G T T C A

G T C A G A A – G T C G T G G T A A C C C T T G A T A
    C A G A A A G T C G T G G T A A C C C T T G A T A C C
    C A G A A – G T C G T G G T A A C C C T T G A T A C C
                G T C G T G G T A A C C – T T G A T A C C A C C G G
                        T G G T A A C C C T T G A T A C C A C C G G T T C
                          G G T A C C C C T T G A T A C C A C C G G T T C A



Suffix trie/array based methods

- Computing alignments is time consuming
- k-mer spectrum methods do not take full advantage of context
- How about using a suffix trie for representing alignments implicitly?

[Figure: suffix trie]



Generalized suffix trie

- Generalized suffix trie of a set of reads is a tree containing all the suffixes of the reads
- Concatenate a unique symbol $i to each read so that all the suffixes are unique
- Path label of a node is the concatenation of edge labels on the path from the root to the node
- Level of a node is the length of the path from the root to the node
- Weight of a node is the number of leaves in the subtrie rooted at that node

[Figure: generalized suffix trie for GATA$1 and ATGT$2]

For the red node: level=2, weight=2, path label=AT
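The definitions above can be checked on the two example reads with a small sketch. It assumes a naive nested-dictionary trie, which is fine for illustration but far too memory hungry for real read sets; the function names are hypothetical. Each terminator `$i` is treated as a single symbol.

```python
def build_generalized_suffix_trie(reads):
    """Insert every suffix of every read (plus its terminator $i) into a trie."""
    root = {}
    for i, read in enumerate(reads, start=1):
        symbols = list(read) + [f"${i}"]
        for start in range(len(symbols)):
            node = root
            for symbol in symbols[start:]:
                node = node.setdefault(symbol, {})
    return root

def find_node(root, path_label):
    """Follow the path label from the root; its length is the node's level."""
    node = root
    for symbol in path_label:
        node = node[symbol]
    return node

def weight(node):
    """Number of leaves in the subtrie rooted at `node`."""
    if not node:
        return 1
    return sum(weight(child) for child in node.values())

trie = build_generalized_suffix_trie(["GATA", "ATGT"])
node_at = find_node(trie, "AT")
level, node_weight = len("AT"), weight(node_at)
```

For the path label AT this gives level 2 and weight 2, since exactly the suffixes ATA$1 and ATGT$2 pass through that node.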



Suffix trie for error correction

- Build a suffix trie of the reads and their reverse complements
- Top levels: Almost all nodes have four children
- Intermediate levels: Most nodes have only one child. If there are more children, the branching is likely caused by sequencing errors.
- Lowest levels: The weights are too small to distinguish between erroneous and correct children

[Figure: suffix trie branching into children A and T at level=r; legend: node with low weight, node with higher weight]






Algorithm for correcting substitutions

- Traverse the nodes at the intermediate level of the trie
- Find a node that has more than one child and some of the children have lower than expected weight.
- Compare the subtries rooted at the low weight node and its sibling nodes.
- If the sibling subtrie contains the low weight subtrie, correct the error by substituting the base of the low weight node with the base of the sibling.
- Transfer the correction to the reads.

[Figure: suffix trie with a branching node at level=r]
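The detection part of these steps can be sketched without materialising the trie, by storing only node weights keyed by path label. Several simplifications are assumed here and are not from the lecture: reverse complements are left out, "lower than expected weight" is approximated by comparing a child against its heaviest sibling with a hypothetical `ratio` threshold, subtrie containment ignores terminators, and the final transfer of corrections to the reads is omitted.

```python
from collections import Counter

def node_weights(reads, max_depth):
    """Weight of every trie node down to max_depth, keyed by path label.

    A node's weight equals the number of suffixes starting with its label.
    """
    weights = Counter()
    for read in reads:
        for start in range(len(read)):
            suffix = read[start:start + max_depth]
            for end in range(1, len(suffix) + 1):
                weights[suffix[:end]] += 1
    return weights

def subtrie_paths(weights, prefix):
    """Labels of all paths below the node `prefix`, relative to that node."""
    return {label[len(prefix):] for label in weights
            if label.startswith(prefix) and len(label) > len(prefix)}

def find_substitution_errors(reads, level, max_depth, ratio=0.5):
    """Return (path label, erroneous base, suggested base) triples."""
    weights = node_weights(reads, max_depth)
    corrections = []
    for label in sorted(weights):
        if len(label) != level:
            continue
        children = {b: weights[label + b] for b in "ACGT" if label + b in weights}
        if len(children) < 2:
            continue  # no branching at this node
        best = max(children, key=children.get)
        for base, w in sorted(children.items()):
            if base == best or w >= ratio * children[best]:
                continue
            # Correct only if the heavy sibling's subtrie contains the light one.
            if subtrie_paths(weights, label + base) <= subtrie_paths(weights, label + best):
                corrections.append((label, base, best))
    return corrections

toy_reads = ["ACGTAC"] * 5 + ["ACGAAC"]
found = find_substitution_errors(toy_reads, level=3, max_depth=6)
```

In the toy data the node ACG has a heavy child T and a light child A with the same continuation, so the A is reported as a likely substitution error.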



Algorithm for correcting insertions and deletions

- Insertions and deletions also cause extra branching in the suffix trie.
- These errors can thus be similarly detected and corrected.

[Figure: trie branching caused by an insertion and by a deletion; legend: node with low weight, node with higher weight]



Summary of short read error correction

- k-mer spectrum based methods
- Alignment based methods
- Suffix trie/array based methods



Error correction of long read data





PacBio: An example data set (E. coli K12)

- Mean read length: 7630
- Median read length: 6280
- Max read length: 35422
- Error rate: ∼15%

[Figure: histogram of read lengths; x-axis: Read length (0 to 30000), y-axis: Count (0 to 30000)]

Score = 2002 bits (1084), Expect = 0.0

Identities = 1977/2341 (84%), Gaps = 330/2341 (14%)

Strand=Plus/Plus

Query 10122 ATCCAGTCCCCGGCAAGCTTGCTGC-AGAACTGCTCCGTGCTAAAATAGAAAGTTGCGGA 10180

||||||||||||||| ||| ||||| ||||||||||||||||||| || ||||||||| |

Sbjct 3817612 ATCCAGTCCCCGGCA-GCT-GCTGCCAGAACTGCTCCGTGCTAAA-TA-AAAGTTGCG-A 3817666

Query 10181 ACCAGGACCCCTTCACCAC-GTTCATTCAATGCATTAGCGCGCCCGG-TTAGCGGTATTC 10238

|||||||| |||| ||| | |||| || || || ||||||| ||||||||||||

Sbjct 3817667 -CCAGGACCG-TTCATCACTGG-CATT-AA-GC----GC-CGCCCGGGTTAGCGGTATTC 3817716

Query 10239 CCATTGCCATCACCCAGCGAGTAAAAGGTGCTGCTTACGAGCCAGAAATAGAAACTGATG 10298

|||||||||||||||||||||||||||| |||||||||||||||| || || ||||||

Sbjct 3817717 CCATTGCCATCACCCAGCGAGTAAAAGG--CTGCTTACGAGCCAGAT-TA-AA-CTGATG 3817771



Long read error correction

- Error rate of PacBio reads is high, ∼15%
- Correcting the errors simplifies further analysis like de novo assembly
- Two approaches:
  - Use PacBio reads only; challenge: error rate of aligning two PacBio reads is ∼30%
  - Use short accurate reads (typically Illumina) to correct the PacBio reads



Alignment based approaches

PacBio read → Align short reads → Correct errors → Corrected PacBio read

- E.g. PacBioToCA, LSC
- Problem: Aligning short reads to the PacBio reads allowing for high error rate is slow.



De Bruijn graph (DBG)

- Given a set of reads R1...Rn
- Extract all k-mers that occur in the reads
- Form a graph:
  - the k-mers are the vertices in the graph
  - draw an edge between two k-mers if they overlap by k − 1 bases
- (Alternative representation: (k − 1)-mers are vertices, an edge between two vertices if the k-mer occurs in the reads)



De Bruijn graph: Example

- Reads: ATGCGT, GCGTGG, GTGGCA
- 3-mers: ATG, TGC, GCG, CGT, GTG, TGG, GGC, GCA
- DBG:

[Figure: De Bruijn graph with vertices ATG, TGC, GCG, CGT, GTG, TGG, GGC, GCA]
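The example graph can be rebuilt with a short sketch. It assumes the vertex-centric definition given earlier (k-mers are vertices, and there is an edge whenever two k-mers overlap by k − 1 bases), ignores reverse complements, and uses a plain dictionary of successor lists; the function name is illustrative.

```python
def build_dbg(reads, k):
    """Return {k-mer: [successor k-mers]} for all k-mers in the reads."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            graph.setdefault(read[i:i + k], [])
    for u in graph:
        for base in "ACGT":
            v = u[1:] + base  # shift by one base: overlap of k - 1
            if v in graph:
                graph[u].append(v)
    return graph

dbg = build_dbg(["ATGCGT", "GCGTGG", "GTGGCA"], 3)
num_edges = sum(len(successors) for successors in dbg.values())
```

The dictionary keys come out in the order of first appearance, which is the order in which the 3-mers are listed above. Under this definition the graph also contains edges such as GTG → TGC that no single read supports.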








DBG: Paths spell strings

- Consider the red path in the graph
- The string spelled by this path is

G T G
  T G C
    G C G
      C G T

G T G C G T
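Spelling a path amounts to taking the first k-mer in full and then one new base per step. A minimal sketch, with an illustrative function name:

```python
def spell_path(path):
    """Concatenate the first k-mer with the last base of each later k-mer."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])

spelled = spell_path(["GTG", "TGC", "GCG", "CGT"])
```

For the path GTG, TGC, GCG, CGT this yields the string GTGCGT shown above.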




DBG based error correction: overview

- Build a DBG of the short accurate reads
- Align the long PacBio reads against the DBG
- Correct the long read based on the alignment path in the DBG



Solid and weak regions in long reads

- Classify k-mers in the long reads:
  - solid: in the DBG
  - weak: not in the DBG
- The long read now consists of solid and weak regions

[Figure: a long read with a bridge path between s1 and t1, no path found between s2 and t2, and an extension path from s3]
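The classification step can be sketched as a membership test followed by run-length grouping. The sketch below assumes the DBG is represented simply by its set of k-mers, reuses the 3-mers of the earlier DBG example as the solid set, and invents a toy long read; regions are reported as ranges of k-mer start positions.

```python
def solid_mask(long_read, solid_kmers, k):
    """True for each k-mer of the long read that is present in the DBG."""
    return [long_read[i:i + k] in solid_kmers
            for i in range(len(long_read) - k + 1)]

def regions(mask):
    """Group consecutive k-mers into (label, first index, last index) runs."""
    runs = []
    for i, is_solid in enumerate(mask):
        label = "solid" if is_solid else "weak"
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1], i)  # extend the current run
        else:
            runs.append((label, i, i))
    return runs

solid_kmers = {"ATG", "TGC", "GCG", "CGT", "GTG", "TGG", "GGC", "GCA"}
long_read = "ATGCTTGGCA"  # hypothetical noisy long read
read_regions = regions(solid_mask(long_read, solid_kmers, 3))
```

The toy read splits into a solid head, one weak region in the middle, and a solid tail.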



Correcting a weak region in a long read

- Find paths in the DBG between the flanking solid k-mers
- Minimize edit distance between the long read and the string spelled by the path.
- Allow only limited branching

[Figure: bridge path (s1, t1), path not found (s2, t2), extension path (s3)]
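A bridge path search along these lines might be sketched as a bounded depth-first search. The details here are assumptions for illustration, not the procedure of any specific tool: the DBG is a plain set of k-mers with k at least 2, "limited branching" is modelled as a cap `max_explored` on the number of partial paths examined, path length is bounded by `max_len`, and the toy long read and 3-mers are the hypothetical ones used for the DBG example.

```python
def edit_distance(a, b):
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]

def bridge_weak_region(kmers, source, target, region, max_len, max_explored=50):
    """Best (edit distance, spelled string) over paths from source to target."""
    k = len(source)
    best = None
    explored = 0
    stack = [source]  # each entry is the string spelled by a partial path
    while stack and explored < max_explored:
        spelled = stack.pop()
        explored += 1
        if len(spelled) > k and spelled.endswith(target):
            candidate = (edit_distance(spelled, region), spelled)
            if best is None or candidate < best:
                best = candidate
            continue
        if len(spelled) >= max_len:
            continue  # path grew too long without reaching the target
        for base in "ACGT":
            if spelled[-(k - 1):] + base in kmers:
                stack.append(spelled + base)
    return best

kmers = {"ATG", "TGC", "GCG", "CGT", "GTG", "TGG", "GGC", "GCA"}
long_read = "ATGCTTGGCA"
region = long_read[1:8]  # from solid k-mer TGC through solid k-mer TGG
best = bridge_weak_region(kmers, "TGC", "TGG", region, max_len=len(region) + 2)
corrected = long_read[:1] + best[1] + long_read[8:]
```

The weak region TGCTTGG is replaced by the path string TGCGTGG at edit distance 1, giving the corrected read ATGCGTGGCA.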



Correcting weak heads/tails of long reads

- Find a path in DBG starting from the extreme solid k-mer
- Maximize length of the prefix of the end to correct
- Minimize edit distance between the path and the prefix of the end
- Find best extension maximizing an alignment score

[Figure: bridge path (s1, t1), path not found (s2, t2), extension path (s3)]



What if short reads are not available?

- We can use the DBG based approach if the graph can be built on long reads only
- This is possible if
  - k is small
  - coverage of the read set is high
  =⇒ k-mers are present in the reads with high enough abundance
- Iterative approach with increasing k is beneficial
- Additional step based on multiple alignments is beneficial to take advantage of the long range information
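Why a small k and high coverage help can be illustrated with a back-of-the-envelope calculation. It assumes, as a simplification not made in the lecture, that errors hit each base independently at a fixed rate; the values k = 19 and k = 31 are chosen purely for illustration, while the 15% error rate and 208x coverage are the figures quoted for the E. coli PacBio data in this lecture.

```python
def error_free_kmer_probability(error_rate, k):
    """Probability that k consecutive bases are all correct (independent errors)."""
    return (1.0 - error_rate) ** k

def expected_error_free_occurrences(coverage, error_rate, k):
    """Rough expected number of error-free copies of a genomic k-mer.

    Ignores read-end effects, which is reasonable when reads are much
    longer than k.
    """
    return coverage * error_free_kmer_probability(error_rate, k)

p19 = error_free_kmer_probability(0.15, 19)
p31 = error_free_kmer_probability(0.15, 31)
copies19 = expected_error_free_occurrences(208, 0.15, 19)
```

Under these assumptions only a few percent of the 19-mers in a read are error free, yet at 208x coverage each genomic 19-mer is still expected to appear correctly several times, whereas a larger k lowers that abundance quickly.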



Data sets

                    E. coli    Yeast
Genome size         4.6 Mbp    12 Mbp
PacBio coverage     208x       129x
Illumina coverage   50x        38x



Error correction tools

- Self-correction:
  - LoRDEC*+LoRMA (DBG + alignment)
  - PBcR (Alignment)
- Hybrid correction:
  - LoRDEC and Jabba (DBG)
  - PBcR (Alignment)
  - proovread (Alignment)



Aligning against reference

- Reads were aligned against the reference with BLASR.
- We measured
  - The proportion of reads that were corrected
  - The proportion of reads that were aligned
  - The genome coverage
  - The error rate



E. coli

[Figure: bar charts of Size, Aligned and Genome Coverage (%, axis 0 to 100) and of Error Rate (axis 0 to 3) for Original, PBcR (self), LoRDEC*+LoRMA, LoRDEC, proovread, PBcR (hybrid) and Jabba; one bar is labelled 17]



Yeast

[Figure: bar charts of Size, Aligned and Genome Coverage (%, axis 0 to 100) and of Error Rate (axis 0 to 3) for Original, PBcR (self), LoRDEC*+LoRMA, LoRDEC, proovread, PBcR (hybrid) and Jabba; one bar is labelled 17]



Resources

[Figure: bar charts of Runtime (h), Memory (GB) and Disk (GB) on the E. coli and Yeast data sets (axis 0 to 40) for PBcR (self), LoRDEC*+LoRMA, LoRDEC, proovread, PBcR (hybrid) and Jabba; bar labels 160 and 158]



Scaling to the parrot data (LoRDEC)

[Figure: CPU time (h), Memory (GB) and Disk (GB) of LoRDEC on the E. coli, Yeast and Parrot data sets; axis from 0.1 to 1000]



Summary of long read error correction

- Hybrid correction: using also short reads
- Self-correction: using only long reads
- Alignment based approach
- DBG based approach



Acknowledgements

- Eric Rivals, University of Montpellier
- Esko Ukkonen, University of Helsinki
- Riku Walve, University of Helsinki
- Jan Schroder, Walter and Eliza Hall Institute of Medical Research, Melbourne

