Evaluation of the impact of error correction algorithms on SNP calling.
Single Nucleotide Polymorphism (SNP) validation for food-borne pathogen strain level identification
Nathan Olson
National Institute of Standards and Technology
DOC/NIST Leadership and Mission
• National Oceanic and Atmospheric Administration
• International Trade Administration
• U.S. Patent and Trademark Office
• National Institute of Standards & Technology
• Economics and Statistics Administration
• other agencies...
Rebecca Blank, Acting Secretary of Commerce
Pat Gallagher, Director. Confirmed as the 14th Director of NIST on Nov. 5, 2009
Mission: To promote U.S. innovation and industrial competitiveness by advancing measurement science, standards, and technology in ways that enhance economic security and improve our quality of life.
NIST products and services
• Measurement research: ~ 2,200 publications per year
• Standard Reference Data: ~ 100 different types; ~ 6,000 units sold per year; ~ 226 million data downloads per year
• Standard Reference Materials: ~ 1,300 products available; ~ 30,000 units sold per year
• Calibration tests: ~ 18,000 tests per year
• Laboratory accreditation: ~ 800 accreditations of testing and calibration laboratories per year
© Robert Rathe
NIST
NIST and Microbial Identification
Accurate identification of microorganisms is critical to public health and safety and attribution. Stakeholders need:
• Metrics and standards to support rapid identification and characterization
• Guidance that defines metrology for the field of microbiology
• Methods validation for microbial identification
www.in.gov http://www.mizozo.com
http://globaldefencemedia.com www.rosenkranz-lab.com
Sequence Based Microbial Identification
Adapted from Welker and Moore 2011 Systematic and Applied Microbiology
Whole Genome Sequencing
– Whole genome sequencing is emerging as a rapid means to identify an organism with strain-level, and sometimes clone-level, resolution.
– High resolution strain differentiation requires whole genome single nucleotide polymorphism (SNP) analysis.
Single Nucleotide Polymorphisms (SNPs)
[Figure: a microbial genome compared against a reference, with a deletion, a substitution, and an insertion marked.]
Importance of Accurate SNP Calling
• Phylogeny of isolates from the 2011 Europe E. coli O104:H4 outbreak
• Based on 20 SNPs
  – Red and grey branches represent French and German isolates respectively, with outgroup strains indicated in blue.
Adapted from Grad YH et al. (2012) PNAS
SNP Calling Procedure
• Strain isolation
• DNA extraction
• Library preparation
• Sequencing
• Base calling
• Read processing
• Mapping
• SNP calling
Rothberg et al 2008 Nature
Rothberg et al 2011
Base calling: base calls and quality scores
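The base calling step attaches a quality score to every base call. The slides do not define the score, so the following is only a minimal sketch assuming the standard Phred definition, Q = -10 log10(p), and the Phred+33 ASCII encoding commonly used in FASTQ files. All function names are illustrative.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_fastq_quality(qual: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    for q in decode_fastq_quality("I5+"):
        print(q, phred_to_error_prob(q))
```

Under this definition a score of 20 corresponds to a 1 in 100 chance of a wrong call, and a score of 30 to 1 in 1,000.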
SNP Calling Procedure: Read Processing (focus of this study)
• Quality trimming: remove low quality bases and short reads; trim long reads
• Error correction: correct predicted errors
SNP Calling Procedure: Mapping and SNP Calling
• Mapping and SNP calling workflow: Kalari et al 2012 Bioinformatics
• SNP calling: Bayes' theorem, GATK variant caller (McKenna et al 2010 Genome Research)
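The slides name Bayes' theorem as the basis of the variant caller but give no model. The sketch below shows how a Bayesian call at one site of a haploid genome could work. It is not GATK's actual model. It assumes independent read errors, a Phred-scaled error probability per base, and a small hypothetical prior probability (`prior_variant`) that the site differs from the reference.

```python
import math

BASES = "ACGT"


def genotype_posteriors(pileup, ref_base, prior_variant=1e-3):
    """Posterior probability of each base at one haploid site.

    pileup: list of (observed_base, phred_quality) pairs from reads covering the site.
    prior_variant: assumed prior probability that the site differs from the reference.
    """
    log_post = {}
    for g in BASES:
        prior = 1 - prior_variant if g == ref_base else prior_variant / 3
        total = math.log(prior)
        for base, q in pileup:
            # Cap the error at 0.75 so a Q0 base is simply uninformative.
            e = min(10 ** (-q / 10), 0.75)
            total += math.log(1 - e) if base == g else math.log(e / 3)
        log_post[g] = total
    # Normalise in log space to avoid underflow at high coverage.
    top = max(log_post.values())
    unnorm = {g: math.exp(v - top) for g, v in log_post.items()}
    norm = sum(unnorm.values())
    return {g: v / norm for g, v in unnorm.items()}


if __name__ == "__main__":
    site = [("G", 30)] * 10  # ten reads supporting G where the reference is C
    print(genotype_posteriors(site, ref_base="C"))
```

With ten high-quality reads supporting a non-reference base, the posterior overwhelms the prior and the site would be called as a SNP. A single low-quality mismatch among matching reads would not be.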
Evaluation of Read Processing
• Perform whole genome sequencing
  – S. typhimurium LT2
  – using Ion Torrent PGM
• Read datasets
  – Raw reads
  – Trimmed based on quality score and length
  – Error corrected
• Compare read mapping and SNP calls
  – True reference genome
  – Two genomes with in silico mutations
Food Outbreak Relevant Strain
• Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 (S. typhimurium LT2)
  – Gammaproteobacteria in the family Enterobacteriaceae
  – Causes gastroenteritis and food poisoning
Outbreaks 2012-2009
Year  Source                       Infections*
2012  Hedgehogs                    14 / 6 / 0
2011  Ground Beef                  20 / 8 / 0
2011  African Dwarf Frogs          241 / 72 / 0
2010  Teaching and Clinical Labs   109 / 13 / 1
2009  Water Frogs                  85 / 29 / 0
* Cases / Hospitalizations / Deaths. Data from www.cdc.gov
http://www.lookfordiagnosis.com/
Salmonella typhimurium LT2 Genome
• First sequenced in 2001
• 4.8 Mb chromosome
• 94 kb plasmid
McClelland et al 2001 Science
Quality Trimming Short Reads
Error Correction
Salmela and Schröder 2011 Bioinformatics
In silico Mutated Genomes
• Two artificially mutated genomes
  – 480 variants and 5044 variants
• Used to evaluate the variant calling sensitivity

True Ref:     ACGTGCACGTACGTCAC-GGACTTTACGACC
Mutated Ref:  ACGTGGACGTA-GTCACTGGACTTTACGACC

The example shows a substitution (column 6), a deletion (column 12), and an insertion (column 18).
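The aligned True Ref / Mutated Ref pair on this slide can be read column by column to list the introduced variants. The following is a minimal sketch of that comparison, assuming both sequences are already aligned to equal length with "-" marking gaps. The function name and output format are illustrative, not the tool used in the study.

```python
def classify_variants(true_ref: str, mutated_ref: str):
    """List differences between two gapped, equal-length aligned sequences.

    Returns (column, kind, true_base, mutated_base) tuples with 1-based alignment columns.
    """
    if len(true_ref) != len(mutated_ref):
        raise ValueError("sequences must be aligned to the same length")
    variants = []
    for col, (a, b) in enumerate(zip(true_ref, mutated_ref), start=1):
        if a == b:
            continue
        if a == "-":
            kind = "insertion"  # base present only in the mutated genome
        elif b == "-":
            kind = "deletion"  # base missing from the mutated genome
        else:
            kind = "substitution"
        variants.append((col, kind, a, b))
    return variants


if __name__ == "__main__":
    true_ref = "ACGTGCACGTACGTCAC-GGACTTTACGACC"
    mutated = "ACGTGGACGTA-GTCACTGGACTTTACGACC"
    for variant in classify_variants(true_ref, mutated):
        print(variant)
```

Applied to the slide's example, this reports one variant of each type, which matches the three labels on the figure.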
Mapping and SNP calling
Read Dataset      Read Processing   Reference   480 known SNPs   5044 known SNPs
Raw               Torrent Suite     x           x                x
Raw               TMAP + GATK       x           x                x
Quality Trimmed   TMAP + GATK       x           x                x
Error Corrected   TMAP + GATK       x           x                x
TMAP – Torrent mapping algorithm https://github.com/iontorrent/TMAP
GATK – Genome Analysis Toolkit (McKenna et al 2010 Genome Research)
Read Datasets
• Raw reads
  – 2.2 M reads
  – Read length: mean 181 bp, range 6 bp – 386 bp
• Error correction using Coral
  – CORrection using ALignment (Coral)
  – Parameters: similar gap and mismatch penalties
  – 2.2 M reads
  – Read length: mean 181 bp, range 6 bp – 386 bp
• Read trimming
  – Parameters
    • Size cutoffs: 100 – 350 bp
    • Quality trimming: mean 99% accuracy as defined by the instrument
  – 1.7 M reads
  – Read length: mean 205 bp, range 100 bp – 349 bp
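The trimming parameters above (size cutoffs of 100 to 350 bp and a mean accuracy of 99%) can be pictured as a per-read filter. The instrument's own accuracy definition is not given in the slides, so this sketch makes two assumptions: accuracy is one minus the mean Phred-derived error probability, and long reads are truncated rather than discarded. Names and defaults are illustrative.

```python
def mean_accuracy(quals):
    """Mean per-base accuracy, assuming Phred-scaled quality scores."""
    return 1 - sum(10 ** (-q / 10) for q in quals) / len(quals)


def trim_read(seq, quals, min_len=100, max_len=350, min_accuracy=0.99):
    """Return the trimmed (seq, quals) pair, or None if the read is rejected.

    Assumed behaviour: truncate reads longer than max_len, then drop reads
    that are shorter than min_len or whose mean accuracy is below min_accuracy.
    """
    seq, quals = seq[:max_len], quals[:max_len]
    if len(seq) < min_len:
        return None
    if mean_accuracy(quals) < min_accuracy:
        return None
    return seq, quals


if __name__ == "__main__":
    long_read = ("A" * 400, [30] * 400)
    short_read = ("A" * 50, [30] * 50)
    noisy_read = ("A" * 150, [10] * 150)
    for name, (s, q) in [("long", long_read), ("short", short_read), ("noisy", noisy_read)]:
        kept = trim_read(s, q)
        print(name, None if kept is None else len(kept[0]))
```

A filter of this kind discards reads outright, which fits the drop from 2.2 M to 1.7 M reads and the higher mean read length reported for the trimmed dataset.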
Mapping Coverage
[Figure: coverage versus genome position (Kb), in separate chromosome and plasmid panels, for the Error Corrected, Quality Trim, and Raw data sets.]
• Comparable coverage pattern for all data sets
• Similar coverage pattern for mutated and true reference genomes and mapping algorithms
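As a rough cross-check on mapping coverage, the expected mean depth can be estimated from the read counts, mean read lengths, and genome size reported in the slides, as reads × mean length / genome length. This sketch assumes every read maps and that coverage is uniform. It also ignores plasmid copy number, so it says nothing about the plasmid panel.

```python
def expected_coverage(n_reads: float, mean_read_len: float, genome_len: float) -> float:
    """Expected mean depth of coverage: total sequenced bases / genome length."""
    return n_reads * mean_read_len / genome_len


# Genome size from the slides: 4.8 Mb chromosome plus 94 kb plasmid.
GENOME_LEN = 4_800_000 + 94_000

# Read counts and mean lengths are the rounded values reported in the slides.
raw_cov = expected_coverage(2_200_000, 181, GENOME_LEN)
trimmed_cov = expected_coverage(1_700_000, 205, GENOME_LEN)

if __name__ == "__main__":
    print(f"raw / error corrected: {raw_cov:.1f}x")
    print(f"quality trimmed:       {trimmed_cov:.1f}x")
```

With these rounded inputs the estimate is about 81x for the raw and error corrected reads and about 71x for the quality trimmed reads.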
Comparison of SNP Call Abundance
[Figure: SNP counts by read dataset (Raw, Qual, Error) in three panels (Reference, 480 SNPs, 5044 SNPs), grouped by mapping algorithm and SNP caller (Torrent Suite, TMAP + GATK).]
Distribution of SNP Call Quality Scores
[Figure: SNP count versus SNP quality score (1000×) for the Raw, Quality Trim, and Error Corrected datasets, with one row of panels each for Reference, 480 SNPs, and 5044 SNPs.]
Error Corrected SNPs have lower quality scores than introduced SNPs
SNP call sensitivity for introduced variants

480 known SNPs    Raw     Quality Trim   Error Correction   Torrent Suite
True Positive     411     410            454                464
False Negative    69      70             26                 16

5044 known SNPs   Raw     Quality Trim   Error Correction   Torrent Suite
True Positive     4403*   4403*          4814               4809
False Negative    641     641            230                234
* Not the same variants

Both Error Corrected and the Torrent Suite have the highest sensitivity
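The true positive and false negative counts above convert directly into sensitivity, TP / (TP + FN). The sketch below performs that arithmetic on the numbers as reported. The dictionary layout and names are illustrative.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of introduced variants that were recovered."""
    return true_pos / (true_pos + false_neg)


# (true positives, false negatives) copied from the tables above.
COUNTS = {
    "480 known SNPs": {
        "Raw": (411, 69),
        "Quality Trim": (410, 70),
        "Error Correction": (454, 26),
        "Torrent Suite": (464, 16),
    },
    "5044 known SNPs": {
        "Raw": (4403, 641),
        "Quality Trim": (4403, 641),
        "Error Correction": (4814, 230),
        "Torrent Suite": (4809, 234),
    },
}

if __name__ == "__main__":
    for genome, datasets in COUNTS.items():
        for name, (tp, fn) in datasets.items():
            print(f"{genome:16s} {name:17s} {sensitivity(tp, fn):.3f}")
```

On these counts sensitivity is roughly 0.85 to 0.87 for the raw and quality trimmed reads and roughly 0.95 to 0.97 for the error corrected reads and the Torrent Suite.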
Conclusion
• Use of an error correction method increases coverage but may adversely affect SNP calling specificity
• Future work: develop reference datasets and analysis tools for use in algorithm testing and pipeline validation
Acknowledgements
• Team: Dr. Jayne Morrow (leader), Sandra Da Silva, and Lindsay Vang
• Jenny McDaniel, Justin Zook and Steve Lund
Contact Information: Nathan Olson [email protected]
Disclaimer
Material Measurement Laboratory
Certain commercial equipment, instruments, or materials are identified in this presentation in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, or affiliated venues, nor is it intended to imply that the materials or equipment identified are necessarily the best available for the purpose. All opinions expressed in this presentation are the authors' and do not necessarily reflect the policies and views of NIST or affiliated venues.