AutoEditor Automated base caller error correction tool Slides courtesy of Pawel Gajer, Ph.D.
AutoEditor
Base-calling in the context of a single chromatogram is hard…
but finding base-calling “mistakes” in a multiple alignment is easy.
• Principal and secondary aims of AutoEditor
• AutoEditor as a higher level base caller
• Tiling discrepancy types
• Base caller error types
• Resolving discrepancies of the form B…B*
• Resolving discrepancies of the form *…*B
• AutoEditor statistics
A principal goal of AutoEditor is to automatically correct a majority of tiling discrepancies, reducing human editing effort to the most problematic discrepancy types.
A tiling discrepancy is any deviation from the homogeneous coverage of a consensus base.
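The definition above can be illustrated with a minimal sketch. It assumes a simplified tiling in which every read is already padded to the same length and uses `*` for a gap; the consensus base is taken by majority vote, which is an assumption for illustration and not necessarily how AutoEditor or the assembler computes it. The function name `column_discrepancies` is hypothetical.

```python
from collections import Counter

GAP = "*"  # gap symbol, as used in the slides


def column_discrepancies(reads):
    """List the columns of a tiling whose coverage is not homogeneous.

    `reads` is a list of equal-length strings (a simplification: real
    tilings place reads at different offsets). Returns a list of
    (column index, majority base, per-base counts) tuples.
    """
    found = []
    for index, column in enumerate(zip(*reads)):
        counts = Counter(column)
        if len(counts) > 1:  # more than one symbol covers this column
            majority_base = counts.most_common(1)[0][0]  # assumed consensus
            found.append((index, majority_base, counts))
    return found


tiling = [
    "ACGT*A",
    "ACGTTA",
    "ACGT*A",
    "AC*T*A",
]
for index, base, counts in column_discrepancies(tiling):
    print(index, base, dict(counts))
```

In this toy tiling, column 2 is a deletion-type discrepancy (one read has `*` under consensus G) and column 4 is an insertion-type discrepancy (one read has T where the consensus is a gap).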
AutoEditor as a higher level base caller
single read trace data → base caller → nucleotide sequence
tiling of reads → tiling discrepancies
tiling discrepancies + multiple read trace data → autoEditor → list of corrected discrepancies
Other applications:
• Clear range editing (read expansion)
• SNP detection
Clear range editing
single read quality values data → trimming algorithm → trimmed read
less stringently trimmed reads → assembler → tiling of reads → autoEditor
SNP detection
Alignment data of genome 1 + alignment data of genome 2 → combined genomes alignment data → list of putative SNPs
List of putative SNPs → autoEditor → list of putative SNPs that pass autoEditor error screening
Tiling discrepancy types
Single deletion:
Single insertion:
Single insertion and single deletion are extreme cases of insertion/deletion discrepancies
A A A A
A A A *
A A * *
A * * *
* * * *
The above sequence of discrepancies can be represented schematically as an edge in a two-vertex graph:
A *
The configuration space of all tiling discrepancy types can be schematically represented as a 4-dimensional simplex with vertices A, T, C, G, and *.
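One way to read the simplex picture is to map each tiling column to the fractions of reads showing A, T, C, G, and `*`; five non-negative fractions that sum to 1 are exactly a point of a 4-dimensional simplex. The sketch below is an illustration under that assumption, and the names `simplex_point` and `face_of` are hypothetical.

```python
from collections import Counter

VERTICES = ("A", "T", "C", "G", "*")  # the five vertices of the simplex


def simplex_point(column):
    """Barycentric coordinates of a column: the fraction of each symbol."""
    counts = Counter(column)
    total = len(column)
    return tuple(counts[v] / total for v in VERTICES)


def face_of(column):
    """Vertices with non-zero weight, i.e. the face the column lies on."""
    return tuple(v for v, w in zip(VERTICES, simplex_point(column)) if w > 0)


print(simplex_point("AAAA"), face_of("AAAA"))  # homogeneous column: a vertex
print(simplex_point("AA**"), face_of("AA**"))  # lies on the edge joining A and *
```

Under this reading, the sequence of columns from all A to all `*` shown earlier traces the edge between the A and `*` vertices.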
[Figure: signal peak parameters: amplitude (a), support (b), minimum difference between amplitude and local minimum (c)]
Open dots on the signal curve indicate local maxima and open circles indicate local minima.
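The three peak parameters can be sketched on a sampled signal as follows. The slides name the parameters but do not define them precisely, so this sketch makes assumptions: the support (b) is taken as the distance between the local minima flanking a peak, the ends of the trace stand in for a missing flanking minimum, and (c) is the smaller of the two drops from the peak to its flanking minima. The function names are hypothetical.

```python
def local_extrema(signal):
    """Return (maxima, minima) as lists of sample indices."""
    maxima, minima = [], []
    for i in range(1, len(signal) - 1):
        prev, cur, nxt = signal[i - 1], signal[i], signal[i + 1]
        if prev < cur >= nxt:
            maxima.append(i)
        elif prev > cur <= nxt:
            minima.append(i)
    return maxima, minima


def peak_parameters(signal):
    """Compute a, b, c for every local maximum of one trace channel."""
    maxima, minima = local_extrema(signal)
    peaks = []
    for m in maxima:
        # Nearest flanking minima; fall back to the trace ends (assumption).
        left = max((i for i in minima if i < m), default=0)
        right = min((i for i in minima if i > m), default=len(signal) - 1)
        a = signal[m]                                  # amplitude
        b = right - left                               # support (assumed definition)
        c = min(a - signal[left], a - signal[right])   # smallest drop to a minimum
        peaks.append({"position": m, "a": a, "b": b, "c": c})
    return peaks


trace = [0, 2, 5, 2, 1, 4, 8, 3, 0]
for peak in peak_parameters(trace):
    print(peak)
```

A re-calling step could then accept or reject a peak by comparing a, b, and c with allowable ranges; the slides do not give those ranges, so none are encoded here.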
Re-calling individual bases
Base caller error types
• Missed signal
• Signal shift
• Unresolved peaks
Resolving a single deletion discrepancy
compute the discrepancy’s read multiplicity: mult
if mult = 0 then check for a missed signal error
if |mult| > 0 then check for a signal shift error
if it is not a signal shift error then it is an unresolved peaks error
To resolve it, find two other reads with well-resolved peaks over the unresolved-peak bases
A discrepancy’s read multiplicity is the number of bases to the right, or to the left (negative sign), of the discrepancy position that are equal to the consensus base covering the discrepancy.
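The multiplicity definition and the first branch of the deletion procedure can be sketched as below. The slides do not say what happens when the consensus base occurs on both sides of the discrepancy, so this sketch assumes the right-hand run takes precedence; that tie-break, and the function names, are assumptions. The checks themselves need trace data and are only named here, not implemented.

```python
def read_multiplicity(read, pos, consensus_base):
    """Signed run length of `consensus_base` adjacent to position `pos`.

    Positive: run immediately to the right of the discrepancy.
    Negative: run immediately to the left (used only if the right run is empty).
    """
    right = 0
    i = pos + 1
    while i < len(read) and read[i] == consensus_base:
        right += 1
        i += 1
    if right:
        return right
    left = 0
    i = pos - 1
    while i >= 0 and read[i] == consensus_base:
        left += 1
        i -= 1
    return -left


def deletion_checks(mult):
    """Order in which error types are considered for a single deletion."""
    if mult == 0:
        return ["missed signal"]
    return ["signal shift", "unresolved peaks"]


print(read_multiplicity("GAA*G", 3, "A"), deletion_checks(-2))
print(read_multiplicity("GC*TG", 2, "A"), deletion_checks(0))
```

For the read `GAA*G` with consensus base A at the gap, two A bases sit directly to the left, so the multiplicity is -2 and the signal shift check is tried first.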
Resolving a single insertion discrepancy
compute the discrepancy’s read multiplicity: mult
if mult = 0 then check if the signal parameters are within allowable ranges
if |mult| > 0 then check if the insertion base is a part of |mult|+1 well-resolved signal peaks
if not, find two other reads whose traces have exactly |mult| well-resolved signal peaks between the bases flanking the discrepancy position
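The |mult| > 0 branch can be sketched as a comparison of peak counts. This is an interpretation, not the tool's code: it assumes the number of well-resolved peaks between the flanking bases has already been counted for each read, that |mult|+1 peaks in the discrepant read support the inserted base, and that two other reads with exactly |mult| peaks contradict it. The function name and the returned labels are hypothetical.

```python
def classify_insertion(mult, peaks_in_read, peaks_in_other_reads, needed=2):
    """Decide a single-insertion discrepancy from well-resolved peak counts.

    mult                  signed read multiplicity of the discrepancy
    peaks_in_read         peaks between the flanking bases in the discrepant read
    peaks_in_other_reads  the same count for each other read covering the site
    needed                how many other reads must agree (two, per the slides)
    """
    if mult == 0:
        # The slides decide this case from signal parameter ranges instead.
        raise ValueError("mult = 0 is not decided from peak counts")
    expected = abs(mult)
    if peaks_in_read == expected + 1:
        return "insertion supported by its own trace"
    agreeing = sum(1 for n in peaks_in_other_reads if n == expected)
    if agreeing >= needed:
        return "insertion contradicted by other reads"
    return "left for manual editing"


print(classify_insertion(-2, 2, [2, 2, 3]))
```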
mult = 0, weak signal error
mult = -2, unresolved peaks error with two other reads with exactly 2 signal peaks between Gs flanking AA*
from Nov 12, 2002
Test set: the first 10 contigs of Mycoplasma arthritidis

asmbl_id  size(kb)  # corrections  # autoEdit errors  # errors in newer autoEdit
1         132       124            3                  0
2         64        78             4                  1
3         40        55             3                  0
4         53        45             2                  1
5         16        15             0                  0
6         22        29             1                  0
7         23        19             0                  0
8         51        48             1                  0
9         26        33             1                  0
10        15        15             0                  0
----------------------------------------------------------------------
Total:    442       461            15                 2
                                   ~3.25%             ~0.43%
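The two percentages follow from the table totals: errors divided by corrections. A short check, with the per-contig values copied from the table (the variable names are illustrative):

```python
# Per-contig values copied from the table above (contigs 1 to 10).
corrections = [124, 78, 55, 45, 15, 29, 19, 48, 33, 15]
errors_old = [3, 4, 3, 2, 0, 1, 0, 1, 1, 0]   # "# autoEdit errors"
errors_new = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]   # "# errors in newer autoEdit"


def error_rate_percent(errors, corrections):
    """Errors as a percentage of all corrections made."""
    return 100.0 * sum(errors) / sum(corrections)


print(sum(corrections), sum(errors_old), sum(errors_new))
print(round(error_rate_percent(errors_old, corrections), 2))
print(round(error_rate_percent(errors_new, corrections), 2))
```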
Missed-signal (MS) and signal shift (SS) correction errors AutoEditor version 1.1
Test set: the first 10 contigs of Mycoplasma arthritidis
asmbl_id  size(in kb)  #disc  #corr  %corr
1         132          3390   3266   96%
2         64           2195   2142   98%
3         40           1344   1325   99%
4         53           1304   1242   95%
5         16           508    487    96%
6         22           777    757    97%
7         23           624    613    98%
8         51           1303   1232   95%
9         26           783    760    97%
10        15           437    423    97%
--------------------------------------------------------------------
Total:    442          12665  12065  95%

where
#disc is the total number of discrepancies in the given contig
#corr is the number of corrected discrepancies
%corr is the percentage of corrected discrepancies
AutoEditor version 1.2 correcting all single deletion errors
Organism                         Discrep’s  Corrected      %  Contig Discrep’s  Corrected      %
Acidobacterium capsulatum           103539      93729  90.5%             99555      89977  90.4%
Neorickettsia sennetsu Miyayama      41408      37425  90.4%             38355      34579  90.2%
Bacillus anthracis Kruger B         317745     284503  89.5%            296222     264646  89.3%
Coxiella burnetii                   131183     117232  89.4%            118723     105562  88.9%
Dichelobacter nodosus                83804      73547  87.8%             76766      67900  88.5%
Clostridium perfringens              71928      62822  87.3%             66546      59929  90.1%
Mycoplasma capricolum                17805      15444  86.7%             16574      14584  88.0%
Brucella suis                       129870     112359  86.5%            120799     105250  87.1%
Plasmodium vivax                    783495     655642  83.7%            734298     618268  84.2%
Pseudomonas fluorescens             234264     194771  83.1%            224049     186276  83.1%
Campylobacter jejuni                 96231      79237  82.3%             88800      73940  83.3%
Fibrobacter succinogenes            243270     196150  80.6%            208790     175294  84.0%
Erwinia chrysanthemi                219370     176354  80.4%            205161     165070  80.5%
Mycobacterium smegmatis             433105     346503  80.0%            363017     309774  85.3%
Prevotella intermedia               118857      94162  79.2%            110750      87931  79.4%
Pseudomonas syringae                227887     177897  78.1%            200223     164561  82.2%
Silicibacter pomeroyi               156130     116907  74.9%            148006     112093  75.7%
Chlamydophila caviae                 50137      36972  73.7%             47875      35103  73.3%
Wolbachia sp.                        70782      51163  72.3%             57357      45401  79.2%
Burkholderia mallei                 139359      99711  71.6%            130158      94540  72.6%
Streptococcus agalactiae            152330     105878  69.5%            109821      92153  83.9%
Streptococcus pneumoniae             53566      36557  68.3%             43093      33432  77.6%
Myxococcus xanthus                   33525      21789  65.0%             33254      21699  65.3%
Dehalococcoides ethenogenes          71587      46416  64.8%             61878      42649  68.9%
Listeria monocytogenes              229172     145274  63.4%            148177     123268  83.2%
Streptococcus mitis                 157348      92377  58.7%            106172      74203  69.9%
Total                              4367697    3470821  79.5%           3854419    3198082  83.0%
AutoEditor accuracy

Organism                         Read length  Corrections  AE Errors
Listeria monocytogenes              37420828       145274          4
Wolbachia sp.                       11446011        51163          0
Burkholderia mallei                 47407080        99711         28
Brucella suis                       26629877       112359          2
Streptococcus agalactiae            23485615       105878          3
Coxiella burnetii                   29135115       117232         30
Campylobacter jejuni                15013845        79237         11
Chlamydophila caviae                10286694        36972          6
Dehalococcoides ethenogenes         10724521        46416         12
Neorickettsia sennetsu Miyayama      8805232        37425          0
Fibrobacter succinogenes            46463268       196150          4
Mycoplasma capricolum                9353819        15444          0
Prevotella intermedia               20084365        94162          3
Pseudomonas syringae                50369232       177897         46
Total                              346625502      1315320        149

Table 2. Comparison of AutoEditor corrections on 14 genomes to the finished sequence of those genomes.
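To put the Table 2 totals in perspective, the number of corrections made per AutoEditor error can be derived directly from the two columns. The sketch below copies the (corrections, errors) pairs from the table; the variable names are illustrative and no value is added that is not in the table.

```python
# (corrections, AE errors) per genome, copied from Table 2 in row order.
table2 = [
    (145274, 4), (51163, 0), (99711, 28), (112359, 2), (105878, 3),
    (117232, 30), (79237, 11), (36972, 6), (46416, 12), (37425, 0),
    (196150, 4), (15444, 0), (94162, 3), (177897, 46),
]

total_corrections = sum(c for c, _ in table2)
total_errors = sum(e for _, e in table2)

# Average number of corrections made for every erroneous one.
corrections_per_error = total_corrections / total_errors
# The same ratio expressed as a percentage of corrections that were wrong.
error_percent = 100.0 * total_errors / total_corrections

print(total_corrections, total_errors)
print(round(corrections_per_error), round(error_percent, 4))
```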
AutoEditor accuracy