Evaluation of the impact of error correction algorithms on SNP calling.

Page 1: Evaluation of the impact of error correction algorithms on SNP calling.

Single Nucleotide Polymorphism (SNP) validation for food-borne pathogen strain-level identification

Nathan Olson

National Institute of Standards and Technology

Page 2: Evaluation of the impact of error correction algorithms on SNP calling.

DOC/NIST Leadership and Mission

Rebecca Blank, Acting Secretary of Commerce

•  National Oceanic and Atmospheric Administration
•  International Trade Administration
•  U.S. Patent and Trademark Office
•  National Institute of Standards & Technology
•  Economics and Statistics Administration
•  other agencies...

Pat Gallagher, Director. Confirmed as the 14th Director of NIST on Nov. 5, 2009.

To promote U.S. innovation and industrial competitiveness by advancing measurement science, standards, and technology in ways that enhance economic security and improve our quality of life.

Page 3: Evaluation of the impact of error correction algorithms on SNP calling.

NIST products and services

•  Measurement research
    –  ~2,200 publications per year

•  Standard Reference Data
    –  ~100 different types
    –  ~6,000 units sold per year
    –  ~226 million data downloads per year

•  Standard Reference Materials
    –  ~1,300 products available
    –  ~30,000 units sold per year

•  Calibration tests
    –  ~18,000 tests per year

•  Laboratory accreditation
    –  ~800 accreditations of testing and calibration laboratories per year

© Robert Rathe

Page 4: Evaluation of the impact of error correction algorithms on SNP calling.

NIST and Microbial Identification

Accurate identification of microorganisms is critical to public health and safety and attribution.

Stakeholders need:

•  Metrics and standards to support rapid identification and characterization
•  Guidance that defines metrology for the field of microbiology
•  Methods validation for microbial identification

www.in.gov
http://www.mizozo.com
http://globaldefencemedia.com
www.rosenkranz-lab.com

Page 5: Evaluation of the impact of error correction algorithms on SNP calling.

Sequence Based Microbial Identification

Adapted from Welker and Moore 2011 Systematic and Applied Microbiology

Page 6: Evaluation of the impact of error correction algorithms on SNP calling.

Whole Genome Sequencing

–  Whole genome sequencing is emerging as a rapid means to identify an organism with strain-level, sometimes clone-level, resolution.

–  High resolution strain differentiation requires whole genome single nucleotide polymorphism (SNP) analysis.

Page 7: Evaluation of the impact of error correction algorithms on SNP calling.

Single Nucleotide Polymorphisms (SNPs)

[Figure: a microbial genome compared with a reference; labeled differences: Deletion, Substitution, Insertion]

Page 8: Evaluation of the impact of error correction algorithms on SNP calling.

Importance of Accurate SNP Calling

•  Phylogeny of isolates from the 2011 Europe E. coli O104:H4 outbreak

•  Based on 20 SNPs
    –  Red and grey branches represent French and German isolates, respectively, with outgroup strains indicated in blue.

Adapted from Grad YH et al. (2012) PNAS

Page 9: Evaluation of the impact of error correction algorithms on SNP calling.

SNP Calling Procedure

•  Strain Isolation
•  DNA extraction
•  Library Preparation
•  Sequencing
•  Base calling
•  Read processing
•  Mapping
•  SNP calling

Page 10: Evaluation of the impact of error correction algorithms on SNP calling.

SNP Calling Procedure

•  Strain Isolation
•  DNA extraction
•  Library Preparation
•  Sequencing
•  Base calling
•  Read processing
•  Mapping
•  SNP calling

Page 11: Evaluation of the impact of error correction algorithms on SNP calling.

SNP Calling Procedure

•  Strain Isolation
•  DNA extraction
•  Library Preparation
•  Sequencing
•  Base calling
•  Read processing
•  Mapping
•  SNP calling

Rothberg et al 2008 Nature

Page 12: Evaluation of the impact of error correction algorithms on SNP calling.

SNP Calling Procedure

•  Strain Isolation
•  DNA extraction
•  Library Preparation
•  Sequencing
•  Base calling
•  Read processing
•  Mapping
•  SNP calling

Rothberg et al 2011

Page 13: Evaluation of the impact of error correction algorithms on SNP calling.

SNP Calling Procedure

•  Strain Isolation
•  DNA extraction
•  Library Preparation
•  Sequencing
•  Base calling
•  Read processing
•  Mapping
•  SNP calling

[Figure labels: Base Calls, Quality Score]

Page 14: Evaluation of the impact of error correction algorithms on SNP calling.

SNP Calling Procedure

•  Strain Isolation
•  DNA extraction
•  Library Preparation
•  Sequencing
•  Base calling
•  Read processing
•  Mapping
•  SNP calling

Focus of this study: read processing

Quality Trimming
    –  Remove low quality bases
    –  Remove short reads
    –  Trim long reads

Error correction
    –  Correct predicted errors
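To make "correct predicted errors" concrete, here is a minimal sketch of consensus-based correction over reads that are already aligned to each other. It is a toy illustration only and not the algorithm of any particular tool; the function name and the `min_fraction` and `min_depth` thresholds are hypothetical.

```python
from collections import Counter


def correct_by_consensus(aligned_reads, min_fraction=0.8, min_depth=4):
    """Replace minority bases in well-supported alignment columns.

    aligned_reads: equal-length strings that are already aligned to each
    other, with '-' marking a gap. Thresholds are illustrative assumptions.
    """
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    corrected = [list(read) for read in aligned_reads]
    for col in range(length):
        column = [read[col] for read in aligned_reads if read[col] != "-"]
        if len(column) < min_depth:
            continue  # too little evidence to predict an error
        base, count = Counter(column).most_common(1)[0]
        if count / len(column) < min_fraction:
            continue  # no clear consensus; may be real variation
        for row in corrected:
            if row[col] != "-" and row[col] != base:
                row[col] = base  # overwrite the predicted error
    return ["".join(row) for row in corrected]


reads = ["ACGT"] * 5 + ["ACTT"]
print(correct_by_consensus(reads))
```

The thresholds control the trade-off: a permissive setting corrects more sequencing errors but can also overwrite bases that reflect real differences between reads.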

Page 15: Evaluation of the impact of error correction algorithms on SNP calling.

SNP Calling Procedure

•  Strain Isolation
•  DNA extraction
•  Library Preparation
•  Sequencing
•  Base calling
•  Read processing
•  Mapping
•  SNP calling

Kalari et al 2012 Bioinformatics

Page 16: Evaluation of the impact of error correction algorithms on SNP calling.

SNP Calling Procedure

•  Strain Isolation
•  DNA extraction
•  Library Preparation
•  Sequencing
•  Base calling
•  Read processing
•  Mapping
•  SNP calling

Bayes Theorem

GATK Variant Caller

McKenna et al 2010 Genome Research

Kalari et al 2012 Bioinformatics
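As an illustration of how Bayes' theorem applies to calling a SNP at one site, the sketch below computes posterior probabilities for the four possible alleles of a haploid genome from the bases and quality scores piled up at that site. It is a simplified toy model and not GATK's actual implementation; the function name, the uniform error model, and the `prior_variant` value are assumptions made for the example.

```python
import math

BASES = "ACGT"


def allele_posteriors(pileup, ref_base, prior_variant=1e-3):
    """Posterior probability of each allele at one haploid site.

    pileup: list of (observed_base, phred_quality) pairs covering the site.
    prior_variant: assumed prior probability that the site differs from
    the reference (hypothetical value, split evenly over the 3 other bases).
    """
    log_post = {}
    for allele in BASES:
        prior = 1 - prior_variant if allele == ref_base else prior_variant / 3
        log_p = math.log(prior)
        for base, qual in pileup:
            # Phred scale: Q = -10 * log10(P(error)); clamp so that a
            # quality of 0 is treated as an uninformative observation.
            err = min(10 ** (-qual / 10), 0.75)
            log_p += math.log(1 - err if base == allele else err / 3)
        log_post[allele] = log_p

    # Normalize in log space to avoid underflow at high coverage.
    max_lp = max(log_post.values())
    total = sum(math.exp(lp - max_lp) for lp in log_post.values())
    return {a: math.exp(lp - max_lp) / total for a, lp in log_post.items()}


# Ten Q30 reads all showing G at a site where the reference is A.
post = allele_posteriors([("G", 30)] * 10, ref_base="A")
print(max(post, key=post.get))
```

With consistent high-quality evidence the likelihood overwhelms the prior, so the non-reference allele receives nearly all of the posterior mass; a single discordant read among many concordant ones does not.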

Page 17: Evaluation of the impact of error correction algorithms on SNP calling.

Evaluation of Read Processing

•  Perform whole genome sequencing
    –  S. typhimurium LT2
    –  using Ion Torrent PGM

•  Read datasets
    –  Raw reads
    –  Trimmed based on quality score and length
    –  Error corrected

•  Compare read mapping and SNP calls
    –  True reference genome
    –  Two genomes with in silico mutations

Page 18: Evaluation of the impact of error correction algorithms on SNP calling.

Food Outbreak Relevant Strain

•  Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 (S. typhimurium LT2)
    –  Gammaproteobacteria in the family Enterobacteriaceae
    –  Causes gastroenteritis and food poisoning

Outbreaks 2012-2009

Year   Source                       Infections*
2012   Hedgehogs                    14 / 6 / 0
2011   Ground Beef                  20 / 8 / 0
2011   African Dwarf Frogs          241 / 72 / 0
2010   Teaching and Clinical Labs   109 / 13 / 1
2009   Water Frogs                  85 / 29 / 0

*Cases / Hospitalizations / Deaths. Data from www.cdc.gov

http://www.lookfordiagnosis.com/

Page 19: Evaluation of the impact of error correction algorithms on SNP calling.

Salmonella typhimurium LT2 Genome

•  First sequenced in 2001
•  4.8 Mb chromosome
•  94 kb plasmid

McClelland et al 2001 Nature

Page 20: Evaluation of the impact of error correction algorithms on SNP calling.

Evaluation of Read Processing

•  Perform whole genome sequencing
    –  S. typhimurium LT2
    –  using Ion Torrent PGM

•  Read datasets
    –  Raw reads
    –  Trimmed based on quality score and length
    –  Error corrected

•  Compare read mapping and SNP calls
    –  True reference genome
    –  Two genomes with in silico mutations

Page 21: Evaluation of the impact of error correction algorithms on SNP calling.

Quality Trimming Short Reads

Page 22: Evaluation of the impact of error correction algorithms on SNP calling.

Error Correction

Salmela and Schröder 2011 Bioinformatics

Page 23: Evaluation of the impact of error correction algorithms on SNP calling.

Evaluation of Read Processing

•  Perform whole genome sequencing
    –  S. typhimurium LT2
    –  using Ion Torrent PGM

•  Read datasets
    –  Raw reads
    –  Trimmed based on quality score and length
    –  Error corrected

•  Compare read mapping and SNP calls
    –  True reference genome
    –  Two genomes with in silico mutations

Page 24: Evaluation of the impact of error correction algorithms on SNP calling.

In silico Mutated Genomes

•  Two artificially mutated genomes
    –  480 variants and 5044 variants

•  Used to evaluate the variant calling sensitivity

True Ref      ACGTGCACGTACGTCAC-GGACTTTACGACC
Mutated Ref   ACGTGGACGTA-GTCACTGGACTTTACGACC

Variant types shown: Substitution, Insertion, Deletion
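The aligned pair above can be read mechanically. The following sketch walks the two aligned sequences column by column and labels each difference, which is one simple way to build a truth set of introduced variants. The function name is hypothetical, and positions are alignment columns (1-based), not genome coordinates.

```python
def classify_variants(true_ref, mutated_ref):
    """List differences between two aligned sequences ('-' marks a gap).

    Returns (alignment_column, kind, true_base, mutated_base) tuples,
    with the variant kind expressed relative to the true reference.
    """
    if len(true_ref) != len(mutated_ref):
        raise ValueError("aligned sequences must have the same length")
    variants = []
    for col, (ref, alt) in enumerate(zip(true_ref, mutated_ref), start=1):
        if ref == alt:
            continue
        if ref == "-":
            kind = "insertion"   # base present only in the mutated genome
        elif alt == "-":
            kind = "deletion"    # base missing from the mutated genome
        else:
            kind = "substitution"
        variants.append((col, kind, ref, alt))
    return variants


true_ref = "ACGTGCACGTACGTCAC-GGACTTTACGACC"
mutated_ref = "ACGTGGACGTA-GTCACTGGACTTTACGACC"
for variant in classify_variants(true_ref, mutated_ref):
    print(variant)
# (6, 'substitution', 'C', 'G')
# (12, 'deletion', 'C', '-')
# (18, 'insertion', '-', 'T')
```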

Page 25: Evaluation of the impact of error correction algorithms on SNP calling.

Mapping and SNP calling

Read Dataset      Read Processing   Reference   480 known SNPs   5044 known SNPs
Raw               Torrent Suite     x           x                x
Raw               TMAP + GATK       x           x                x
Quality Trimmed   TMAP + GATK       x           x                x
Error Corrected   TMAP + GATK       x           x                x

TMAP – Torrent mapping algorithm, https://github.com/iontorrent/TMAP
GATK – Genome Analysis Toolkit (McKenna et al 2010 Genome Research)

Page 26: Evaluation of the impact of error correction algorithms on SNP calling.

Read Datasets

•  Raw reads
    –  2.2 M reads
    –  Read length: mean 181 bp, range 6 bp – 386 bp

•  Error correction using Coral
    –  CORrection using ALignment (Coral)
    –  Parameters: similar gap and mismatch penalties
    –  2.2 M reads
    –  Read length: mean 181 bp, range 6 bp – 386 bp

•  Read Trimming
    –  Parameters
        •  Size cutoffs: 100 – 350 bp
        •  Quality trimming: mean 99% accuracy as defined by the instrument
    –  1.7 M reads
    –  Read length: mean 205 bp, range 100 bp – 349 bp
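The trimming parameters above can be sketched as a simple filter. This is a hedged illustration, not the tool used in the study: it assumes that "mean 99% accuracy" means the mean per-base accuracy implied by Phred quality scores, and that long reads are truncated to 349 bases (chosen only because that is the longest trimmed read reported). The function names are hypothetical.

```python
def mean_accuracy(quals):
    """Mean per-base accuracy implied by Phred scores.

    Phred scale: Q = -10 * log10(P(error)), so P(error) = 10 ** (-Q / 10).
    """
    return 1 - sum(10 ** (-q / 10) for q in quals) / len(quals)


def trim_reads(reads, min_len=100, max_len=349, min_accuracy=0.99):
    """Filter and trim reads given as (sequence, phred_quality_list) pairs.

    Cutoff values follow the slide; their exact semantics are assumptions.
    """
    kept = []
    for seq, quals in reads:
        if len(seq) != len(quals):
            raise ValueError("sequence and quality lengths differ")
        # Trim long reads to the maximum length.
        seq, quals = seq[:max_len], quals[:max_len]
        # Remove short reads.
        if len(seq) < min_len:
            continue
        # Remove reads whose mean accuracy is below the threshold.
        if mean_accuracy(quals) < min_accuracy:
            continue
        kept.append((seq, quals))
    return kept


reads = [
    ("A" * 50, [30] * 50),     # too short
    ("A" * 150, [30] * 150),   # kept
    ("A" * 150, [10] * 150),   # mean accuracy 0.90, removed
    ("A" * 386, [30] * 386),   # truncated to 349 bases
]
print([len(seq) for seq, _ in trim_reads(reads)])
```

Filtering on length before computing accuracy avoids dividing by zero for empty reads and mirrors the order in which the cutoffs are listed.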

 

Page 27: Evaluation of the impact of error correction algorithms on SNP calling.

Mapping Coverage

[Figure: coverage versus genome position (Kb), with separate panels for the chromosome and the plasmid; data sets: Error Corrected, Quality Trim, Raw]

Comparable coverage pattern for all data sets

Similar coverage pattern for mutated and true reference genomes and mapping algorithms

Page 28: Evaluation of the impact of error correction algorithms on SNP calling.

Comparison of SNP Call Abundance

[Figure: SNP counts by read dataset (Raw, Qual, Error), with panels for the Reference, 480 SNPs, and 5044 SNPs genomes; mapping algorithm and SNP caller: Torrent Suite, TMAP + GATK]

Page 29: Evaluation of the impact of error correction algorithms on SNP calling.

Distribution of SNP Call Quality Scores

[Figure: SNP count versus SNP quality score (1000×), with columns for Raw, Quality Trim, and Error Corrected reads and rows for the Reference, 480 SNPs, and 5044 SNPs genomes]

Error Corrected SNPs have lower quality scores than introduced SNPs

Page 30: Evaluation of the impact of error correction algorithms on SNP calling.

SNP call sensitivity for introduced variants

480 known SNPs    Raw     Quality Trim   Error Correction   Torrent Suite
True Positive     411     410            454                464
False Negative    69      70             26                 16

5044 known SNPs   Raw     Quality Trim   Error Correction   Torrent Suite
True Positive     4403*   4403*          4814               4809
False Negative    641     641            230                234

*  Not the same variants

Both Error Corrected and the Torrent Suite have the highest sensitivity
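For reference, sensitivity is the fraction of introduced variants that were called, TP / (TP + FN). The sketch below applies that formula to the counts in the tables above; the counts are copied from the slide, while the dictionary layout and function name are illustrative.

```python
def sensitivity(true_positive, false_negative):
    """Fraction of known variants that were recovered: TP / (TP + FN)."""
    return true_positive / (true_positive + false_negative)


# (true positives, false negatives) as reported on the slide.
# Note: the Torrent Suite column for the larger set sums to 5043, not 5044.
counts = {
    "480 known SNPs": {
        "Raw": (411, 69),
        "Quality Trim": (410, 70),
        "Error Correction": (454, 26),
        "Torrent Suite": (464, 16),
    },
    "5044 known SNPs": {
        "Raw": (4403, 641),
        "Quality Trim": (4403, 641),
        "Error Correction": (4814, 230),
        "Torrent Suite": (4809, 234),
    },
}

for genome, methods in counts.items():
    print(genome)
    for method, (tp, fn) in methods.items():
        print(f"  {method:<17} {sensitivity(tp, fn):.3f}")
```

On both mutated genomes this gives roughly 0.85 to 0.87 for raw and quality-trimmed reads and roughly 0.95 to 0.97 for error-corrected reads and the Torrent Suite.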

Page 31: Evaluation of the impact of error correction algorithms on SNP calling.

Conclusion

•  Use of an error correction method increases coverage but may adversely affect SNP calling specificity

•  Future work: develop reference datasets and analysis tools for use in algorithm testing and pipeline validation

Page 32: Evaluation of the impact of error correction algorithms on SNP calling.

Acknowledgements

•  Team: Dr. Jayne Morrow (leader), Sandra Da Silva, and Lindsay Vang

•  Jenny McDaniel, Justin Zook and Steve Lund

Contact Information: Nathan Olson, [email protected]

 

Page 33: Evaluation of the impact of error correction algorithms on SNP calling.

Disclaimer

Material Measurement Laboratory

Certain commercial equipment, instruments, or materials are identified in this presentation in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, or affiliated venues, nor is it intended to imply that the materials or equipment identified are necessarily the best available for the purpose. All opinions expressed in this presentation are the authors' and do not necessarily reflect the policies and views of NIST or affiliated venues.