

Improving the accuracy of taxonomic classification for identifying taxa in microbiome samples

Adithya Murali (1), Aniruddha Bhargava (2), & Erik S. Wright (3)
(1) Department of Computer Sciences, University of Wisconsin–Madison
(2) Robotics, Amazon, Inc.
(3) Department of Biomedical Informatics, University of Pittsburgh

Contact info: [email protected], WrightLabScience.com

Introduction

It has become increasingly clear that the microbiome is an essential component of human and ecosystem health. Microbiome studies frequently involve sequencing a taxonomic marker, such as the 16S rRNA gene or ITS, to identify the microorganisms that are present in a sample of interest. We have developed a new method, named IDTAXA, for taxonomic classification of marker gene sequences that exhibits a substantially lower error rate than previous approaches.

Results

IDTAXA avoids misclassifying sequences belonging to novel taxonomic groups that are not represented in existing taxonomic databases, which is the predominant type of error made by current classifiers. For example, the popular RDP Classifier incorrectly assigns 26.0% of novel 16S rRNA sequences to an existing taxonomic group when the organism actually belongs to a novel taxonomic group. In contrast, IDTAXA incorrectly classifies only 13.6% of such sequences, while also improving the fraction of sequences correctly classified to known taxonomic groups.

Conclusions

IDTAXA exhibits substantially lower error rates when classifying novel sequences that are unrepresented in the reference taxonomy. This matters because many microbial communities contain a large fraction of previously unknown microorganisms that are not yet represented in taxonomic databases. Collectively, these improvements often lead to substantially different classifications on real microbiome data, which may considerably alter its interpretation.

1. Classifying marker gene (e.g., 16S) sequences into a taxonomy

a. Given a reference taxonomy with sequence representatives

Root
    Archaea, Bacteria
    Phylum: Actinobacteria, Bacteroidetes, Proteobacteria, ...
    Class: α, β, γ, δ, ε
    Order: Aeromonadales, Enterobacteriales, Legionellales, ...
    Family: Enterobacteriaceae
    Genus: Citrobacter, Erwinia, Salmonella, Escherichia, ...
    Species: E. coli, E. albertii, E. fergusonii, E. marmotae, E. vulneris, ...

b. The goal is to predict the taxon of a microbiome sequence

New marker gene sequence (e.g., 16S rRNA), to be assigned to a taxon:
AGCGGCAGCACAGAGGAACTTGTTCCTTGG...

Taxonomic assignment, with confidences at each rank level:
Root (97.8%); Bacteria (97.8%); Proteobacteria (97.8%); Gammaproteobacteria (97.8%); Enterobacteriales (97.8%); Enterobacteriaceae (97.8%); Escherichia (95%); E. coli (82%)

2. How well does the IDTAXA algorithm work for taxonomic assignment?

a. Comparing performance with different reference taxonomies

b. IDTAXA's confidence is more correlated with percent identity

3. IDTAXA is available as a part of the R package DECIPHER or online

a. How to classify new sequences with the IDTAXA algorithm

b. Visit http://DECIPHER.codes
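The taxonomic assignment in section 1 pairs each taxon with a confidence in parentheses. As a minimal sketch of how such a string might be consumed downstream, the Python below parses it into (taxon, confidence) pairs and keeps the leading ranks that meet a cutoff. The function names and the cutoff of 90 are illustrative assumptions; they are not part of DECIPHER, which is an R package.

```python
import re

# One "Taxon (NN.N%)" entry; the format is copied from the poster's example string.
ENTRY = re.compile(r"^(?P<taxon>.+?)\s*\((?P<conf>\d+(?:\.\d+)?)%\)$")


def parse_assignment(text):
    """Split 'Root (97.8%); Bacteria (97.8%); ...' into (taxon, confidence) pairs."""
    pairs = []
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        match = ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unrecognized entry: {entry!r}")
        pairs.append((match.group("taxon"), float(match.group("conf"))))
    return pairs


def truncate(pairs, threshold):
    """Keep the leading ranks whose confidence is at least `threshold`.

    `threshold` is a hypothetical cutoff chosen by the caller.
    """
    kept = []
    for taxon, conf in pairs:
        if conf < threshold:
            break
        kept.append((taxon, conf))
    return kept


example = (
    "Root (97.8%); Bacteria (97.8%); Proteobacteria (97.8%); "
    "Gammaproteobacteria (97.8%); Enterobacteriales (97.8%); "
    "Enterobacteriaceae (97.8%); Escherichia (95%); E. coli (82%)"
)

pairs = parse_assignment(example)
# Deepest taxon with confidence of at least 90: the genus, not the species.
print(truncate(pairs, 90.0)[-1])
```

With a cutoff of 90, the example stops at Escherichia (95%) because E. coli falls below it at 82%.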

[Figure, section 2a: four panels plotting error rate (y-axis, 0.0 to 1.0) against the fraction of classifiable sequences classified (x-axis, 0.0 to 1.0) for BLAST, BLAST-GLOBAL, RDP, SINTAX, SPINGO, and IDTAXA, with separate OC and MC error curves. Panels: RDP training set (genus-level 16S); Contax training set (genus-level 16S); Warcup training set (species-level ITS); V4 region of RDP (genus-level 16S).]
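The quantities on the axes of the section 2a comparison can be made concrete with a toy calculation. The sketch below assumes that OC stands for over-classification (a sequence from a group absent from the reference is assigned to an existing group) and MC for misclassification (a sequence from a represented group is assigned to the wrong group); the poster does not spell these out. It also assumes that "classified" means the sequence received any assignment. The record layout and the taxon names marked as novel are hypothetical.

```python
def error_rates(records):
    """Compute OC rate, MC rate, and fraction classified.

    records: iterable of (true_taxon, predicted_taxon_or_None, in_reference),
    where predicted is None when the classifier declined to assign a taxon.
    """
    records = list(records)
    novel = [r for r in records if not r[2]]   # true group absent from reference
    known = [r for r in records if r[2]]       # "classifiable" sequences

    oc = sum(1 for _, pred, _ in novel if pred is not None)
    mc = sum(1 for true, pred, _ in known if pred is not None and pred != true)
    classified = sum(1 for _, pred, _ in known if pred is not None)

    return {
        "oc_error_rate": oc / len(novel) if novel else 0.0,
        "mc_error_rate": mc / len(known) if known else 0.0,
        "fraction_classified": classified / len(known) if known else 0.0,
    }


# Toy test set: four classifiable sequences and two from novel groups.
toy = [
    ("Escherichia", "Escherichia", True),   # correct
    ("Salmonella", "Escherichia", True),    # misclassified
    ("Citrobacter", None, True),            # left unclassified
    ("Erwinia", "Erwinia", True),           # correct
    ("NovelGenusA", "Escherichia", False),  # over-classified
    ("NovelGenusB", None, False),           # correctly left unclassified
]

rates = error_rates(toy)
print(rates)
```

Raising a classifier's confidence cutoff moves a point leftward on such a plot (fewer sequences classified) and, ideally, downward (fewer errors), which is why the curves are drawn across the full range of the x-axis.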

[Diagram, section 3a: a training set is passed to LearnTaxa, with iterative refinement of the training set; the result plus test sequences are passed to IdTaxa, which returns classifications.]

[Figure, section 2b: scatter plots relating confidence of classification (0 to 100) to the distance to the nearest sequence in the classified group (0.0 to 0.4) for IDTAXA, SPINGO, RDP, and SINTAX.]

Example output from classifying a microbiome sample

[Figure: classifications of a microbiome sample shown at the rank levels domain, phylum, class, order, family, and genus. Labels: Lawsonella, Mycobacterium, Paenarthrobacter, Dysgonomonas, Sediminibacterium, Algoriphagus, Flavobacterium, Vampirovibrionales, Staphylococcus, Erysipelothrix, Nitrospira, Brevundimonas, Methylobacterium, Hyphomicrobium, Phreatobacter, Afipia, Ferrovibrio, DSSF69, Novosphingobium, Porphyrobacter, Sphingomonas, Sphingopyxis, unclassified_Sphi..., Limnobacter, unclassified_Burkhol..., Nitrosomonas, Azonexus, Acinetobacter, Pseudomonas, Stenotrophomonas, unclassified_Root.]
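The example output mixes genus names with labels such as unclassified_Root. One plausible reading, assumed here rather than stated on the poster, is that a sequence not confidently classified to genus is labelled with "unclassified_" plus the deepest taxon it was confidently assigned to. The sketch below tallies a sample under that assumption; the rank list, function name, and toy lineages are illustrative, and the lineages are taken to be already truncated at whatever confidence cutoff was used.

```python
from collections import Counter

# Rank levels from Root down to genus, matching those named in the example output.
RANKS = ["root", "domain", "phylum", "class", "order", "family", "genus"]


def genus_label(lineage):
    """Return the genus, or 'unclassified_<deepest taxon>' if genus was not reached.

    lineage: taxa from Root downward, truncated at the confidence cutoff.
    """
    if not lineage:
        raise ValueError("lineage must contain at least the root")
    if len(lineage) >= len(RANKS):
        return lineage[len(RANKS) - 1]
    return "unclassified_" + lineage[-1]


escherichia = ["Root", "Bacteria", "Proteobacteria", "Gammaproteobacteria",
               "Enterobacteriales", "Enterobacteriaceae", "Escherichia"]

# Toy sample of four sequences (hypothetical data).
sample = [
    escherichia,
    escherichia,
    escherichia[:6],   # confident only down to family
    ["Root"],          # confident only at the root
]

counts = Counter(genus_label(lineage) for lineage in sample)
for label, n in counts.most_common():
    print(f"{label}\t{n}")
```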