Correction of Sequencing Errors
Summer school on bioinformatics data structures
Leena Salmela
University of Helsinki
August 9th, 2016
Leena Salmela Correction of Sequencing Errors August 9th, 2016 1 / 39
This lecture was part of the 1st Summer School on Bioinformatics Data Structures, funded by the BIRDS project (www.birdsproject.eu). This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 690941.
Outline
Introduction
Correction of sequencing errors in short read data
Error correction of long read data
Introduction
Sequencing technologies
Technology               Read length    Error rate   Typical errors
Illumina                 150 - 300      < 1%         substitutions
Pacific Biosciences      20000          > 15%        indels
Oxford Nanopore MinION   up to 100000   10-30%       indels
Error correction problem
- High-throughput sequencing produces large sets of short DNA sequences, i.e. reads, that may contain errors
- Sequencing errors greatly complicate de novo assembly
- Error correction aims at reducing the error rate prior to assembly
- Input of error correction:
  - k reads, i.e. strings usually containing the characters A, C, G, T, N
  - Each read may be from the forward or the reverse strand
  - The length of reads can vary from read to read
  - All (or at least most) reads come from the target genome, but each read may contain a small number of errors
How does error correction work?
- DNA sequencing randomly samples reads from the genome
- The reads may contain errors
- Each position is sampled several times
- Errors can be detected by aligning the reads with each other
- SNPs or errors?
[Figure: reads sampled from the same region, stacked for comparison]
ACGGTAGATGCTAGGGTAGTAGT...
TAGATGCTAGCTAGGG
...ACGGTACTAACTAGGGTAGT
CGGTAGATGCTAGCTAGAGTAGT...
TAGATGCTAGCTAGGGTA
Correction of sequencing errors in short read data
Illumina: An example data set (E. coli K12)
- Read length: 100
- Number of reads: 2.3 million
- Coverage: 50x
Score = 180 bits (97), Expect = 2e-46
Identities = 99/100 (99%), Gaps = 0/100 (0%)
Strand=Plus/Plus
Query 1 TTATTGTACAGCGCCCAGACAATTAACACGACGGCATTCGCCACTGCCAGCAAAAAATAG 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 1085516 TTATTGTACAGCGCCCAGACAATTAACACGACGGCATTCGCCACTGCCAGCAAAAAATAG 1085575
Query 61 AACTGAAGTCCGCTTCTGGCCTCGCTTTGCCAGTAATAAC 100
|||||||||| |||||||||||||||||||||||||||||
Sbjct 1085576 AACTGAAGTCGGCTTCTGGCCTCGCTTTGCCAGTAATAAC 1085615
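As a quick check of the coverage figure for this data set: coverage is the total number of sequenced bases divided by the genome length. The sketch below assumes a genome size of 4.6 Mbp, the E. coli value quoted in the data-set table later in this lecture; the function name `coverage` is chosen here for illustration only.

```python
def coverage(num_reads, read_length, genome_size):
    """Average number of reads covering a genome position."""
    return num_reads * read_length / genome_size

# 2.3 million reads of length 100 on an assumed 4.6 Mbp genome
print(coverage(2_300_000, 100, 4_600_000))  # 50.0
```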
k-mer spectrum based methods
- Consider the set of k-mers occurring in the reads
- Solid k-mer: occurs at least M times
- Otherwise a k-mer is weak
- Find the minimum number of edits that make all k-mers solid in a read
- Dynamic programming
[Figure: a read and its 4-mers, before and after correction]
ACTGAACGTAAGT
ACTG CTGA TGAA GAAC AACG ACGT CGTA GTAA TAAG AAGT
⇒
ACTGAAGGTAAGT
ACTG CTGA TGAA GAAG AAGG AGGT GGTA GTAA TAAG AAGT
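The solid/weak classification can be sketched in a few lines. The example below is a simplification under stated assumptions: it tries only a single substitution by brute force, whereas the slide asks for the minimum number of edits found by dynamic programming. The function names, the read set (three copies of the corrected read plus the erroneous one) and the threshold M = 2 are made up for illustration.

```python
from collections import Counter

def kmers(seq, k):
    """All k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def solid_kmers(reads, k, m):
    """k-mers that occur at least m times in the read set."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, c in counts.items() if c >= m}

def correct_one_substitution(read, solid, k):
    """Return a read whose k-mers are all solid, using at most one substitution.

    Simplification: brute force over single substitutions only; indels and
    multiple edits are not considered.
    """
    if all(km in solid for km in kmers(read, k)):
        return read
    for i in range(len(read)):
        for base in "ACGT":
            if base == read[i]:
                continue
            candidate = read[:i] + base + read[i + 1:]
            if all(km in solid for km in kmers(candidate, k)):
                return candidate
    return read  # no single substitution suffices; leave the read unchanged

# Hypothetical read set: the error-free read three times, the erroneous read once
reads = ["ACTGAAGGTAAGT"] * 3 + ["ACTGAACGTAAGT"]
solid = solid_kmers(reads, k=4, m=2)
print(correct_one_substitution("ACTGAACGTAAGT", solid, k=4))  # ACTGAAGGTAAGT
```

With these inputs the four 4-mers that cover the erroneous C (GAAC, AACG, ACGT, CGTA) occur only once and are therefore weak.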
Alignment based methods
Two subproblems:
- Identify clusters of reads originating from the same genomic position
- Form a multiple alignment of a cluster of reads
Identifying clusters of reads
- Typically a k-mer index is used
- Construct a hash table that associates each k-mer to the reads where it occurs either in forward or reverse orientation
- Index only those k-mers that are lexicographically smaller than their reverse complements
Reads:
1: GTCAGAAGTCGTGGTAACCCTTGATA
2: GGTATCAAGGGTTACCACGACTTTCTG
3: CAGAAGTCGTGGTAACCCTTGATACC
4: GTCGTGGTAACCTTGATACCACCGG
5: GAACCGGTGGTATCAAGGGTTACCA
6: GGTACCCCTTGATACCACCGGTTCA
⇒
...
CAGAAGTC: 1f 3f
...
CTTCTGAC: 1r
...
GGTTACCA: 1r 2f 3r 4r 5f
...
TCAGAAGT: 1f
...
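A minimal sketch of such an index, assuming Python dictionaries as the hash table and k = 8 as in the example. It follows the lexicographic rule stated above: each k-mer is stored under whichever of itself and its reverse complement is smaller, with `f` or `r` recording the orientation seen in the read. The function names and the tuple representation are choices made here, not part of any particular tool.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def build_kmer_index(reads, k):
    """Map each indexed k-mer to a list of (read number, orientation).

    Orientation is 'f' if the indexed k-mer itself occurs in the read and
    'r' if its reverse complement does. Read numbers start at 1.
    """
    index = defaultdict(list)
    for number, read in enumerate(reads, start=1):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            rc = reverse_complement(kmer)
            if kmer <= rc:
                key, occurrence = kmer, (number, "f")
            else:
                key, occurrence = rc, (number, "r")
            if occurrence not in index[key]:
                index[key].append(occurrence)
    return index

reads = [
    "GTCAGAAGTCGTGGTAACCCTTGATA",
    "GGTATCAAGGGTTACCACGACTTTCTG",
    "CAGAAGTCGTGGTAACCCTTGATACC",
    "GTCGTGGTAACCTTGATACCACCGG",
    "GAACCGGTGGTATCAAGGGTTACCA",
    "GGTACCCCTTGATACCACCGGTTCA",
]
index = build_kmer_index(reads, k=8)
print(index["GGTTACCA"])  # [(1, 'r'), (2, 'f'), (3, 'r'), (4, 'r'), (5, 'f')]
```

Looking up every k-mer of one read in this index yields the reads that share a k-mer with it, in either orientation.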
Forming multiple alignments
- Each multiple alignment is based on one read called the base read
- Retrieve from the index all reads that share at least one k-mer with the base read (the k-mer neighborhood of the base read)
- Heuristics are needed to speed up construction of the multiple alignment
GTCAGAA-GTCGTGGTAACCCTTGATACCACCGGTTCA
GTCAGAA-GTCGTGGTAACCCTTGATA
  CAGAAAGTCGTGGTAACCCTTGATACC
  CAGAA-GTCGTGGTAACCCTTGATACC
        GTCGTGGTAACC-TTGATACCACCGG
            TGGTAACCCTTGATACCACCGGTTC
             GGTACCCCTTGATACCACCGGTTCA
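Once the reads are placed in a multiple alignment, a column-wise majority vote is one simple way to find the consensus and spot the errors. The sketch below assumes the alignment is already given as (offset, gapped sequence) rows, here the six reads of the example with their column offsets; computing the alignment itself, which is the expensive part, is not shown. The function name `consensus` is hypothetical.

```python
from collections import Counter

def consensus(rows):
    """Column-wise majority vote over (offset, gapped_sequence) rows.

    Returns the consensus with gap symbols and the same string without them.
    """
    columns = {}
    for offset, seq in rows:
        for i, symbol in enumerate(seq):
            columns.setdefault(offset + i, Counter())[symbol] += 1
    gapped = "".join(columns[c].most_common(1)[0][0] for c in sorted(columns))
    return gapped, gapped.replace("-", "")

rows = [
    (0, "GTCAGAA-GTCGTGGTAACCCTTGATA"),
    (2, "CAGAAAGTCGTGGTAACCCTTGATACC"),   # extra A in column 7 (insertion)
    (2, "CAGAA-GTCGTGGTAACCCTTGATACC"),
    (8, "GTCGTGGTAACC-TTGATACCACCGG"),    # gap in column 20 (deletion)
    (12, "TGGTAACCCTTGATACCACCGGTTC"),
    (13, "GGTACCCCTTGATACCACCGGTTCA"),    # C instead of A in column 17
]
gapped, ungapped = consensus(rows)
print(gapped)    # GTCAGAA-GTCGTGGTAACCCTTGATACCACCGGTTCA
print(ungapped)  # GTCAGAAGTCGTGGTAACCCTTGATACCACCGGTTCA
```

A read symbol that disagrees with the consensus in its column is a candidate error; a plain majority vote is an assumption here, and a practical corrector would also weigh coverage and base quality.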
Suffix trie/array based methods
- Computing alignments is time-consuming
- k-mer spectrum methods do not take full advantage of context
- How about using a suffix trie for representing alignments implicitly?
[Figure: suffix trie of two reads, with leaves labelled $1 and $2]
Generalized suffix trie
- The generalized suffix trie of a set of reads is a tree containing all the suffixes of the reads
- Concatenate a unique symbol $i to each read so that all the suffixes are unique
- The path label of a node is the concatenation of edge labels on the path from the root to the node
- The level of a node is the length of the path from the root to the node
- The weight of a node is the number of leaves in the subtrie rooted at that node
[Figure: generalized suffix trie for GATA$1 and ATGT$2. For the red node: level=2, weight=2, path label=AT]
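The definitions above can be made concrete with a small, uncompressed trie. This is a teaching sketch only: it inserts every suffix explicitly, which takes quadratic space per read, and the class and function names are invented here. The terminators `$1`, `$2` are treated as single symbols.

```python
class TrieNode:
    def __init__(self, level=0):
        self.children = {}   # edge label (one symbol) -> TrieNode
        self.level = level   # length of the path from the root
        self.weight = 0      # number of leaves in the subtrie (set later)

def compute_weights(node):
    """Set node.weight to the number of leaves below node and return it."""
    if not node.children:
        node.weight = 1
    else:
        node.weight = sum(compute_weights(c) for c in node.children.values())
    return node.weight

def build_generalized_suffix_trie(reads):
    """Insert every suffix of every read, each ending in a unique symbol $i."""
    root = TrieNode()
    for number, read in enumerate(reads, start=1):
        symbols = list(read) + [f"${number}"]
        for start in range(len(symbols)):
            node = root
            for symbol in symbols[start:]:
                if symbol not in node.children:
                    node.children[symbol] = TrieNode(node.level + 1)
                node = node.children[symbol]
    compute_weights(root)
    return root

def find(root, path_label):
    """Follow path_label (a sequence of symbols) down from the root."""
    node = root
    for symbol in path_label:
        node = node.children[symbol]
    return node

root = build_generalized_suffix_trie(["GATA", "ATGT"])
node = find(root, "AT")
print(node.level, node.weight)  # 2 2
```

The node with path label AT has weight 2 because exactly two suffixes start with AT: ATA$1 and ATGT$2.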
Suffix trie for error correction
- Build a suffix trie of the reads and their reverse complements
- Top levels: Almost all nodes have four children
- Intermediate levels: Most nodes have only one child. If there are more children, the branching is likely caused by sequencing errors.
- Lowest levels: The weights are too small to distinguish between erroneous and correct children
[Figure: branching at level r into children A and T; legend: node with low weight, node with higher weight]
Algorithm for correcting substitutions
- Traverse the nodes at the intermediate level of the trie
- Find a node that has more than one child and some of the children have lower than expected weight.
- Compare the subtries rooted at the low weight node and its sibling nodes.
- If the sibling subtrie contains the low weight subtrie, correct the error by substituting the base of the low weight node with the base of the sibling.
- Transfer the correction to the reads.
[Figure: branching at level r into children A and T; legend: node with low weight, node with higher weight]
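The detection step can be sketched without building the trie, because the weight of the node with path label w + c is the number of times the string w + c occurs in the reads. The code below is a rough simplification under several assumptions: it uses a fixed weight threshold `min_weight` in place of an expected-weight model, it skips the comparison of the low weight subtrie with its sibling subtrie, and it ignores reverse complements. All names and the toy reads are hypothetical.

```python
from collections import Counter, defaultdict

def child_weights(reads, r):
    """For each level-r path label, count the occurrences of each next base."""
    weights = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - r):
            weights[read[i:i + r]][read[i + r]] += 1
    return weights

def find_substitutions(reads, r, min_weight):
    """Return (parent, low_base, high_base) for suspicious branches at level r."""
    found = []
    for parent, kids in child_weights(reads, r).items():
        if len(kids) < 2:
            continue
        strong = [base for base, w in kids.items() if w >= min_weight]
        if len(strong) != 1:
            continue  # no unique well-supported sibling; leave the branch alone
        for base, w in kids.items():
            if w < min_weight:
                found.append((parent, base, strong[0]))
    return found

def apply_corrections(reads, corrections):
    """Transfer the corrections to the reads by rewriting parent + low base."""
    fixed = []
    for read in reads:
        for parent, low, high in corrections:
            read = read.replace(parent + low, parent + high)
        fixed.append(read)
    return fixed

reads = ["ACTGAAGGTAAGT"] * 3 + ["ACTGAACGTAAGT"]
corrections = find_substitutions(reads, r=4, min_weight=2)
print(corrections)                               # [('TGAA', 'C', 'G')]
print(apply_corrections(reads, corrections)[3])  # ACTGAAGGTAAGT
```

Here the node TGAA is the only one with two children: G with weight 3 and C with weight 1.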
Algorithm for correcting insertions and deletions
- Insertions and deletions also cause extra branching in the suffix trie.
- These errors can thus be similarly detected and corrected.
[Figure: two trie fragments, labelled Insertion and Deletion, each marking a node with low weight and a node with higher weight]
Summary of short read error correction
- k-mer spectrum based methods
- Alignment based methods
- Suffix trie/array based methods
Error correction of long read data
PacBio: An example data set (E. coli K12)
- Mean read length: 7630
- Median read length: 6280
- Max read length: 35422
- Error rate: ∼15%
[Figure: histogram of read lengths; x-axis: Read length (0 to 30000), y-axis: Count (0 to 30000)]
Score = 2002 bits (1084), Expect = 0.0
Identities = 1977/2341 (84%), Gaps = 330/2341 (14%)
Strand=Plus/Plus
Query 10122 ATCCAGTCCCCGGCAAGCTTGCTGC-AGAACTGCTCCGTGCTAAAATAGAAAGTTGCGGA 10180
||||||||||||||| ||| ||||| ||||||||||||||||||| || ||||||||| |
Sbjct 3817612 ATCCAGTCCCCGGCA-GCT-GCTGCCAGAACTGCTCCGTGCTAAA-TA-AAAGTTGCG-A 3817666
Query 10181 ACCAGGACCCCTTCACCAC-GTTCATTCAATGCATTAGCGCGCCCGG-TTAGCGGTATTC 10238
|||||||| |||| ||| | |||| || || || ||||||| ||||||||||||
Sbjct 3817667 -CCAGGACCG-TTCATCACTGG-CATT-AA-GC----GC-CGCCCGGGTTAGCGGTATTC 3817716
Query 10239 CCATTGCCATCACCCAGCGAGTAAAAGGTGCTGCTTACGAGCCAGAAATAGAAACTGATG 10298
|||||||||||||||||||||||||||| |||||||||||||||| || || ||||||
Sbjct 3817717 CCATTGCCATCACCCAGCGAGTAAAAGG--CTGCTTACGAGCCAGAT-TA-AA-CTGATG 3817771
Long read error correction
- The error rate of PacBio reads is high, ∼15%
- Correcting the errors simplifies further analysis such as de novo assembly
- Two approaches:
  - Use PacBio reads only; challenge: the error rate of aligning two PacBio reads is ∼30%
  - Use short accurate reads (typically Illumina) to correct the PacBio reads
Alignment based approaches
[Figure: PacBio read → Align short reads → Correct errors → Corrected PacBio read]
- E.g. PacBioToCA, LSC
- Problem: Aligning short reads to the PacBio reads while allowing for a high error rate is slow.
De Bruijn graph (DBG)
- Given a set of reads R1...Rn
- Extract all k-mers that occur in the reads
- Form a graph:
  - the k-mers are the vertices in the graph
  - draw an edge between two k-mers if they overlap by k − 1 bases
- (Alternative representation: (k − 1)-mers are vertices, with an edge between two vertices if the corresponding k-mer occurs in the reads)
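A minimal sketch of the first representation, with k-mers as vertices and an edge wherever the last k − 1 bases of one k-mer equal the first k − 1 bases of another. The adjacency-list dictionary and the function name `build_dbg` are choices made for this illustration; practical tools use far more compact structures. The reads are those of the three-read example in this lecture.

```python
def build_dbg(reads, k):
    """Return {k-mer: sorted list of successor k-mers} for the read set."""
    nodes = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    by_prefix = {}
    for kmer in nodes:
        by_prefix.setdefault(kmer[:-1], []).append(kmer)
    # successors of a k-mer are the k-mers whose prefix equals its suffix
    return {kmer: sorted(by_prefix.get(kmer[1:], [])) for kmer in sorted(nodes)}

graph = build_dbg(["ATGCGT", "GCGTGG", "GTGGCA"], k=3)
for kmer, successors in graph.items():
    print(kmer, "->", successors)
```

Note that this overlap-based definition also creates edges such as ATG → TGG that no single read supports; the alternative (k − 1)-mer representation avoids this by requiring the connecting k-mer to occur in the reads.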
De Bruijn graph: Example
- Reads: ATGCGT, GCGTGG, GTGGCA
- 3-mers: ATG, TGC, GCG, CGT, GTG, TGG, GGC, GCA
- DBG:
[Figure: de Bruijn graph with vertices ATG, TGC, GTG, TGG, GGC, GCA, GCG, CGT]
DBG: Paths spell strings
- Consider the red path in the graph
- The string spelled by this path is
GTG
 TGC
  GCG
   CGT
GTGCGT
[Figure: de Bruijn graph with vertices ATG, TGC, GTG, TGG, GGC, GCA, GCG, CGT; the path GTG, TGC, GCG, CGT is highlighted in red]
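Spelling a path amounts to taking the first k-mer in full and then appending the last base of every following k-mer. A small sketch, with the function name `spell` chosen here:

```python
def spell(path):
    """String spelled by a path of k-mers that overlap by k - 1 bases."""
    if not path:
        return ""
    for a, b in zip(path, path[1:]):
        if a[1:] != b[:-1]:
            raise ValueError(f"{a} and {b} do not overlap by k - 1 bases")
    return path[0] + "".join(kmer[-1] for kmer in path[1:])

print(spell(["GTG", "TGC", "GCG", "CGT"]))  # GTGCGT
```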
DBG based error correction: overview
- Build a DBG of the short accurate reads
- Align the long PacBio reads against the DBG
- Correct the long read based on the alignment path in the DBG
Solid and weak regions in long reads
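- Classify k-mers in the long reads:
  - solid: in the DBG
  - weak: not in the DBG
- The long read now consists of solid and weak regions
[Figure: a long read with weak regions: a bridge path connects solid k-mers s1 and t1, no path is found between s2 and t2, and an extension path starts from s3]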
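The classification can be sketched as a membership test per k-mer followed by grouping consecutive equal labels into regions. The function names, the toy long read, and the choice to report regions as ranges of k-mer start positions are assumptions of this example; the set of 3-mers is the one from the three-read DBG example.

```python
def solid_positions(long_read, dbg_kmers, k):
    """One flag per k-mer start position: True if that k-mer is in the DBG."""
    return [long_read[i:i + k] in dbg_kmers
            for i in range(len(long_read) - k + 1)]

def regions(flags):
    """Group consecutive equal flags into (label, start, end), end exclusive."""
    runs = []
    for i, flag in enumerate(flags):
        label = "solid" if flag else "weak"
        if runs and runs[-1][0] == label:
            runs[-1] = (label, runs[-1][1], i + 1)
        else:
            runs.append((label, i, i + 1))
    return runs

dbg_kmers = {"ATG", "TGC", "GCG", "CGT", "GTG", "TGG", "GGC", "GCA"}
long_read = "ATGCTTGGCA"   # hypothetical long read with one erroneous base
flags = solid_positions(long_read, dbg_kmers, k=3)
print(regions(flags))  # [('solid', 0, 2), ('weak', 2, 5), ('solid', 5, 8)]
```

The weak region in the middle is flanked by the solid k-mers TGC and TGG, which is the situation a bridge path is meant to repair.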
Correcting a weak region in a long read
- Find paths in the DBG between the flanking solid k-mers
- Minimize the edit distance between the long read and the string spelled by the path
- Allow only limited branching
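The search for a bridge path can be sketched as a bounded depth-first traversal that scores every path reaching the target k-mer by edit distance to the corresponding stretch of the long read. This is an illustration under stated assumptions, not the implementation of any particular tool: `max_len` bounds the spelled string, `max_paths` is a crude stand-in for limited branching, and the function names and the toy long-read region are invented here.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def successors(kmer, dbg_kmers):
    """k-mers of the DBG that overlap kmer by k - 1 bases."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in dbg_kmers]

def best_bridge(source, target, region, dbg_kmers, max_len, max_paths=1000):
    """Best (distance, spelled string) over paths from source to target.

    region is the part of the long read from the start of source to the
    end of target. Returns None if no path is found within the limits.
    """
    best = None
    stack = [(source, source)]   # (current k-mer, string spelled so far)
    explored = 0
    while stack and explored < max_paths:
        kmer, spelled = stack.pop()
        explored += 1
        if kmer == target and len(spelled) > len(source):
            d = edit_distance(spelled, region)
            if best is None or d < best[0]:
                best = (d, spelled)
        if len(spelled) < max_len:
            for nxt in successors(kmer, dbg_kmers):
                stack.append((nxt, spelled + nxt[-1]))
    return best

dbg_kmers = {"ATG", "TGC", "GCG", "CGT", "GTG", "TGG", "GGC", "GCA"}
print(best_bridge("ATG", "GCA", "ATGCTTGGCA", dbg_kmers, max_len=12))
# (1, 'ATGCGTGGCA')
```

In this toy graph several paths lead from ATG to GCA, for example the short ones spelling ATGCA and ATGGCA; the path spelling ATGCGTGGCA wins because it differs from the long-read region by a single substitution.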
Correcting weak heads/tails of long reads
- Find a path in the DBG starting from the extreme solid k-mer
- Maximize the length of the prefix of the end to correct
- Minimize the edit distance between the path and the prefix of the end
- Find the best extension maximizing an alignment score
What if short reads are not available?
- We can use the DBG based approach if the graph can be built on long reads only
- This is possible if
  - k is small
  - coverage of the read set is high
  ⇒ k-mers are present in the reads with high enough abundance
- An iterative approach with increasing k is beneficial
- An additional step based on multiple alignments is beneficial to take advantage of the long range information
Data sets
                    E. coli    Yeast
Genome size         4.6 Mbp    12 Mbp
PacBio coverage     208x       129x
Illumina coverage   50x        38x
Error correction tools
- Self-correction:
  - LoRDEC*+LoRMA (DBG + alignment)
  - PBcR (Alignment)
- Hybrid correction:
  - LoRDEC and Jabba (DBG)
  - PBcR (Alignment)
  - proovread (Alignment)
Aligning against reference
- Reads were aligned against the reference with BLASR.
- We measured
  - The proportion of reads that were corrected
  - The proportion of reads that were aligned
  - The genome coverage
  - The error rate
E. coli
[Figure: bar charts of Size, Aligned and Genome coverage (axis 0 to 100) and of Error rate (axis 0 to 3, one off-scale bar labelled 17); axis label: (%). Series: Original, PBcR (self), LoRDEC*+LoRMA, LoRDEC, proovread, PBcR (hybrid), Jabba]
Yeast
[Figure: bar charts of Size, Aligned and Genome coverage (axis 0 to 100) and of Error rate (axis 0 to 3, one off-scale bar labelled 17); axis label: (%). Series: Original, PBcR (self), LoRDEC*+LoRMA, LoRDEC, proovread, PBcR (hybrid), Jabba]
Resources
[Figure: bar charts of Runtime (h), Memory (GB) and Disk (GB), axis 0 to 40, in two panels: E. coli and Yeast; off-scale bars are labelled 160 and 158. Series: PBcR (self), LoRDEC*+LoRMA, LoRDEC, proovread, PBcR (hybrid), Jabba]
Scaling to the parrot data (LoRDEC)
[Figure: CPU time (h), Memory (GB) and Disk (GB) on a logarithmic axis from 0.1 to 1000, for E. coli, Yeast and Parrot]
Summary of long read error correction
- Hybrid correction: using also short reads
- Self-correction: using only long reads
- Alignment based approach
- DBG based approach
Acknowledgements
- Eric Rivals, University of Montpellier
- Esko Ukkonen, University of Helsinki
- Riku Walve, University of Helsinki
- Jan Schroder, Walter and Eliza Hall Institute of Medical Research, Melbourne