Base Calling Error Toleration in Reference Based Assembly
Hadi Gharibi
Email: [email protected]
University of Technology
Max Planck Institute for Molecular Genetics
May 2015
How Base Calling Error Can Be Tolerated in Next Generation Sequencing (NGS)
Importance
Challenges
Our Hypothesis
Our Approach
• Deal with Large Amount of Data
• Impact on Sequencing Data Analysis Time and Accuracy
Researchers have developed many base calling algorithms; however, these algorithms have not resolved the tradeoff between accuracy and time complexity.
• Required Accuracy
• Sequencing Data Analysis Execution Time
Base Calling Error Is Compensated in Downstream Sequencing Steps
• Massive Data
• Diverse Algorithms
Importance: Base Calling Translates Noisy Intensity Data Into Reads
Figure: Image Processing → Intensity → Base Calling → Read → Assembling → Genome
© EMBO Conference, 2014 [1]; © Illumina Inc., 2011 [2]
Challenge: Base Calling Errors Are Always Compared
Figure: Error rate of base callers per sequencing cycle on the PhiX174 test data. The more accurate callers are slower than the others. [3]
© C. Ye, 2014 [3]
Fundamental Question:
How Much Accuracy Is Required?
Our Approach: Analytical Assumptions and Method
Assumptions
• Random Genome
• Single Variations
• Mismatches << Read Length
• Uniform Substitution Error
• Equally Likely Base Errors
Method
• Variant Calling for Re-sequencing
• Derive Variant Calling Errors
Analytical Results: Base Calling Error Is Tolerated by Mapping Mismatch
Figure: Variant Calling Error Vs. Base Calling Error
• Random Genome
• Mismatches = {2, 5, 7, 9}
• Genome Size ~ 4 Mbp
• Read Length = 30 bp
• Variation Rate = 0.01
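One way to see why a mapping mismatch allowance can absorb base calling error is a simple binomial model. The sketch below is not the derivation behind the slide's figure; it assumes independent positions, single-base variants, and uniform substitution errors with equally likely replacement bases, and the function names are hypothetical. It estimates the chance that a 30 bp read stays within the allowed number of mismatches.

```python
from math import comb


def per_base_mismatch_prob(error_rate, variation_rate):
    """Chance that a called base differs from the reference base.

    Assumption: an error replaces the true base with one of the three
    other bases, each equally likely, so an error on a variant position
    restores the reference base one time in three.
    """
    e, v = error_rate, variation_rate
    return v * (1 - e) + (1 - v) * e + v * e * (2.0 / 3.0)


def prob_within_mismatch_limit(read_length, mismatch_prob, max_mismatches):
    """Binomial probability of at most max_mismatches mismatches in a read."""
    p = mismatch_prob
    return sum(
        comb(read_length, k) * p**k * (1 - p) ** (read_length - k)
        for k in range(max_mismatches + 1)
    )


if __name__ == "__main__":
    # Read length, variation rate, and mismatch limits follow the slide;
    # the base calling error rates are illustrative values only.
    for error_rate in (0.01, 0.05, 0.10):
        p = per_base_mismatch_prob(error_rate, variation_rate=0.01)
        for limit in (2, 5, 7, 9):
            print(error_rate, limit, prob_within_mismatch_limit(30, p, limit))
```

Under this model a larger mismatch limit keeps more error-carrying reads mappable, which is the intuition the analytical result relies on.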
Simulation Method and Setup
Method
• Generate Target Genome
• Simulate Reads [4]
• Add Base Calling Error
• Call Variants
• Calculate Variant Calling Error
Setup
© GemSIM, 2013 [4]
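The five method steps can be sketched as a toy pipeline. This is a minimal illustration, not the GemSIM-based setup used in the talk: reads keep their true positions (so the mapping step is skipped), variants are called by a per-position majority vote, and all function names and default parameters are hypothetical.

```python
import random
from collections import Counter

BASES = "ACGT"


def random_genome(length, rng):
    """Draw each base independently and uniformly (random genome)."""
    return "".join(rng.choice(BASES) for _ in range(length))


def substitute(seq, rate, rng):
    """Replace each base with probability `rate` by one of the other three."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_reads(target, read_length, n_reads, error_rate, rng):
    """Sample single-end reads and add uniform base calling errors."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(target) - read_length + 1)
        clean = target[start:start + read_length]
        reads.append((start, substitute(clean, error_rate, rng)))
    return reads


def call_consensus(reads, genome_length):
    """Majority vote per position; None where no read covers the position."""
    piles = [Counter() for _ in range(genome_length)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            piles[start + offset][base] += 1
    return [p.most_common(1)[0][0] if p else None for p in piles]


def variant_calling_error(consensus, target):
    """Fraction of covered positions where the call differs from the target."""
    covered = [(c, t) for c, t in zip(consensus, target) if c is not None]
    if not covered:
        return 0.0
    return sum(c != t for c, t in covered) / len(covered)


def run_simulation(genome_length=2000, read_length=30, n_reads=1000,
                   variation_rate=0.01, error_rate=0.05, seed=0):
    rng = random.Random(seed)
    reference = random_genome(genome_length, rng)
    target = substitute(reference, variation_rate, rng)  # target genome
    reads = simulate_reads(target, read_length, n_reads, error_rate, rng)
    consensus = call_consensus(reads, genome_length)
    return variant_calling_error(consensus, target)
```

Sweeping `error_rate` in `run_simulation` gives a curve of variant calling error against base calling error for this simplified setting.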
Simulation Results: Simulation Verifies Analysis Predictions
• E. coli Genome [5]
• Mismatches = {3, 4, 5}
• Genome Size ~ 4 Mbp
• Read Length = 30 bp
• Variation Rate ~ 0.01
• Single-end Shotgun Run
• Map with SOAP [6]
Figure: Variant Calling Error Vs. Base Calling Error
© NCBI, 2014 [5]; © G. BGI, 2008 [6]
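The mismatch setting in the mapping step can be illustrated with a brute-force mapper. This sketch is not SOAP and does not reflect its algorithm or options; it is a hypothetical exhaustive search that reports every genome position where a read aligns with at most a given number of substitutions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, genome, max_mismatches):
    """Return all start positions where the read fits within the limit."""
    n = len(read)
    return [
        start
        for start in range(len(genome) - n + 1)
        if hamming(read, genome[start:start + n]) <= max_mismatches
    ]


if __name__ == "__main__":
    genome = "ACGTACGTTTGACC"
    read_with_error = "TAGA"  # "TTGA" from position 8 with one miscalled base
    for limit in (0, 1, 2):
        print(limit, map_read(read_with_error, genome, limit))
```

In this small example the erroneous read is lost with no mismatches allowed, recovered uniquely with one, and becomes ambiguous with two, which hints at why repeat-rich genomes and higher mismatch limits need separate study.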
Simulation Results: Random Genome Obviates Repeat Region Effect
• Genome Sizes ~ 4 Mbp
• Mismatches = 3
• Read Length = 30 bp
• Variation Rate ~ 0.01
• Single-end Shotgun Run
• Map with SOAP [6]
Figure: Random Genome Vs. E. coli Genome
© G. BGI, 2008[6]
Conclusion
Simulation Results
• Confirm the Hypothesis
• Genome Repeat Regions Impair Accuracy
Analytical Results
• Confirm the Hypothesis
• Higher Mismatches May Not Obey
Next Steps
Simulation Steps
• Genome Having More Repeat Regions
• Develop Mapper with Higher Mismatches
Analytical Steps
• Genome Structure
• Paired-end Shotgun Sequencing
• Erasure Base Calling Error
• Other Variant Types
References
[1] EMBO Conference, “Human Evolution in the Genomic Era: Origins, Populations, and Phenotypes,” 2014. [Online]. Available: events.embo.org/14-human-evo
[2] Illumina Inc., “Theory of Operation, HCS 1.4/RTA 1.12,” 2011.
[3] C. Ye, C. Hsiao, and H. Corrada Bravo, “BlindCall: ultra-fast base-calling of high-throughput sequencing data by blind deconvolution,” Bioinformatics, 30(9), 1214–1219, 2014.
[4] GemSIM, “Gemsim,” 2013. [Online]. Available: http://sourceforge.net/projects/gemsim
[5] NCBI, “Escherichia coli o157:h7 str. sakai dna, complete genome - nucleotide - ncbi,” 2014. [Online]. Available: http://www.ncbi.nlm.nih.gov/nuccore/47118301?report=fasta
[6] G. BGI, “Soap: Short oligonucleotide analysis package,” 2008. [Online]. Available: http://soap.genomics.org.cn
[7] C. Ledergerber and C. Dessimoz, “Base-calling for next-generation sequencing platforms,” Briefings in Bioinformatics, 2011.
Acknowledgement
Thank You for Your Patience, Time and Attention.
Thank You Very Much