DNA Barcode sequence identification incorporating taxonomic hierarchy and within taxon variability
Damon P. Little
Cullman Program for Molecular Systematics Studies
The New York Botanical Garden, Bronx, New York
test data sets (Little and Stevenson 2007)
gymnosperm nuclear ribosomal internal transcribed spacer 2 (nrITS 2)
1,037 sequences, 413 species, 71 genera
gymnosperm plastid encoded maturase K (matK)
522 sequences, 334 species, 75 genera
…alignment
locus     sequences         median unaligned length (IQR)   aligned length
matK      one per species   1,601 (1,530–1,661) bp          3,906 bp
matK      all               1,561 (1,412–1,661) bp          3,975 bp
nrITS 2   one per species   196 (115–260) bp                6,778 bp
nrITS 2   all               137 (108–250) bp                8,733 bp
pairwise divergence
locus     sequences         median    interquartile range   zero comparisons
matK      one per species   21.38%    8.13–23.89%           0.42%
matK      all               20.39%    5.95–23.30%           0.54%
nrITS 2   one per species   29.39%    25.75–33.30%          0.21%
nrITS 2   all               30.99%    26.53–34.48%          0.09%
precision
method                    matK          nrITS 2
ATIM                      100% (67%)    100% (83%)
DOME ID                   60% (60%)     80% (80%)
DNA–BAR                   100% (79%)    98% (89%)
BLAST/neighbor joining    95% (56%)     93% (71%)
BLAST/SPR                 76% (53%)     87% (73%)
BLAST/parsimony ratchet   77% (55%)     86% (74%)
megaBLAST                 99% (61%)     94% (80%)
BLAT                      99% (69%)     94% (82%)
BLAST                     99% (67%)     94% (81%)
neighbor joining          44% (23%)     65% (8%)
SPR search                70% (41%)     60% (11%)
parsimony ratchet         71% (41%)     58% (13%)
accuracy to species
method                    matK          nrITS 2
ATIM                      87% (53%)     83% (71%)
DOME ID                   50% (50%)     67% (66%)
DNA–BAR                   73% (62%)     65% (62%)
BLAST/neighbor joining    86% (56%)     80% (64%)
BLAST/SPR                 78% (61%)     79% (67%)
BLAST/parsimony ratchet   80% (60%)     78% (67%)
megaBLAST                 84% (64%)     72% (68%)
BLAT                      82% (67%)     66% (62%)
BLAST                     84% (68%)     67% (63%)
neighbor joining          75% (52%)     68% (42%)
SPR search                78% (58%)     69% (47%)
parsimony ratchet         77% (60%)     67% (46%)
precision
method                    matK          nrITS 2
DOME ID*                  100% (100%)   100% (100%)
ATIM                      100% (67%)    100% (83%)
DOME ID                   60% (60%)     80% (80%)
DNA–BAR                   100% (79%)    98% (89%)
BLAST/neighbor joining    95% (56%)     93% (71%)
BLAST/SPR                 76% (53%)     87% (73%)
BLAST/parsimony ratchet   77% (55%)     86% (74%)
megaBLAST                 99% (61%)     94% (80%)
BLAT                      99% (69%)     94% (82%)
BLAST                     99% (67%)     94% (81%)
neighbor joining          44% (23%)     65% (8%)
SPR search                70% (41%)     60% (11%)
parsimony ratchet         71% (41%)     58% (13%)
accuracy to species
method                    matK          nrITS 2
DOME ID*                  90% (90%)     76% (75%)
ATIM                      87% (53%)     83% (71%)
DOME ID                   50% (50%)     67% (66%)
DNA–BAR                   73% (62%)     65% (62%)
BLAST/neighbor joining    86% (56%)     80% (64%)
BLAST/SPR                 78% (61%)     79% (67%)
BLAST/parsimony ratchet   80% (60%)     78% (67%)
megaBLAST                 84% (64%)     72% (68%)
BLAT                      82% (67%)     66% (62%)
BLAST                     84% (68%)     67% (63%)
neighbor joining          75% (52%)     68% (42%)
SPR search                78% (58%)     69% (47%)
parsimony ratchet         77% (60%)     67% (46%)
...remaining (insoluble) problems
identical sequences for multiple terminals
shared alleles between terminals
use allele frequency as a predictor?
Sequence IDentification Engines (SIDEs)
avoid global alignment by comparing short segments: pseudo–alignment
use exact matches
use autoapomorphies where possible...but allow the use of other characters too
context/text DNA recoding
characters are defined by flanking context => pretext and postext
permit “alignment–free” comparisons
size and separation between pretext and postext must be arbitrarily delimited
states (text) are limited by the proximity of context
the number of possible states (text) is limited by the length of the text
terminals can be individual sequences or composites representing taxa
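The recoding described on this slide can be sketched roughly as follows. This is an illustration only, not the BRONX implementation: the function names (`recode`, `build_reference`) and the fixed window sizes (`context_len`, `text_len`) are assumptions made for the example, since the slides say only that the sizes must be arbitrarily delimited.

```python
from collections import defaultdict


def recode(seq, context_len=4, text_len=3):
    """Recode one DNA sequence into context/text characters.

    Each window of pretext + text + postext gives one observation:
    the character is named by the (pretext, postext) pair and its
    state is the text lying between them. The window sizes are
    hypothetical fixed values chosen for illustration.
    """
    characters = defaultdict(set)
    window = 2 * context_len + text_len
    for i in range(len(seq) - window + 1):
        pretext = seq[i:i + context_len]
        text = seq[i + context_len:i + context_len + text_len]
        postext = seq[i + context_len + text_len:i + window]
        characters[(pretext, postext)].add(text)
    return dict(characters)


def build_reference(terminals, context_len=4, text_len=3):
    """Build a context/text database from named terminals.

    `terminals` maps a terminal name to a list of sequences. A list
    with several sequences acts as a composite terminal: the states
    observed in any member sequence are pooled per character.
    """
    reference = {}
    for name, seqs in terminals.items():
        merged = defaultdict(set)
        for seq in seqs:
            for char, texts in recode(seq, context_len, text_len).items():
                merged[char] |= texts
        reference[name] = dict(merged)
    return reference
```

No alignment is needed at any point: two sequences are comparable wherever they share a (pretext, postext) pair, regardless of where that pair sits in each sequence.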
querying text/context database
find pretext/text/postext in the query sequence and match to references
score terminals based on the number of matches
final score can be raw or based on a weighting function
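A minimal sketch of the query step, assuming exact matching of pretext/text/postext triples and a raw (unweighted) score. The names `triples`, `score_query`, and `best_matches`, and the fixed window sizes, are hypothetical and chosen only for this example.

```python
def triples(seq, context_len=4, text_len=3):
    """Return the set of (pretext, text, postext) triples in a sequence."""
    window = 2 * context_len + text_len
    found = set()
    for i in range(len(seq) - window + 1):
        pretext = seq[i:i + context_len]
        text = seq[i + context_len:i + context_len + text_len]
        postext = seq[i + context_len + text_len:i + window]
        found.add((pretext, text, postext))
    return found


def score_query(query, references, context_len=4, text_len=3):
    """Raw score per terminal: number of query triples found exactly.

    `references` maps a terminal name to a list of sequences
    (one sequence, or several for a composite terminal).
    """
    query_triples = triples(query, context_len, text_len)
    scores = {}
    for name, seqs in references.items():
        ref_triples = set()
        for seq in seqs:
            ref_triples |= triples(seq, context_len, text_len)
        scores[name] = len(query_triples & ref_triples)
    return scores


def best_matches(scores):
    """Return every terminal tied for the highest score, sorted by name."""
    if not scores:
        return []
    top = max(scores.values())
    return sorted(name for name, s in scores.items() if s == top)
```

Returning all tied terminals rather than a single winner keeps the ambiguity visible when several terminals share identical sequences.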
possible weighting functions
equal weights (raw score)
number of distinct texts => upweights more variable characters
1/(number of distinct texts) => downweights more variable characters
(number of texts)/(number of scores)
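The first three weighting functions can be illustrated with a short sketch. It assumes that "number of distinct texts" is counted per character, that is, per (pretext, postext) pair across all reference terminals; the slides do not spell this out. The fourth function is omitted because its terms are not defined here. All names (`WEIGHTS`, `weighted_scores`) are hypothetical.

```python
from collections import defaultdict

# Weight applied to one matching character, given n = the number of
# distinct texts observed for that character in the reference database.
WEIGHTS = {
    "equal": lambda n: 1.0,          # raw score
    "distinct": lambda n: float(n),  # upweights more variable characters
    "inverse": lambda n: 1.0 / n,    # downweights more variable characters
}


def weighted_scores(query_triples, reference, scheme="equal"):
    """Score each terminal as the weighted sum of its exact matches.

    `query_triples` is a set of (pretext, text, postext) tuples and
    `reference` maps a terminal name to such a set.
    """
    # Count distinct texts per (pretext, postext) character (assumed meaning).
    distinct = defaultdict(set)
    for ref_triples in reference.values():
        for pretext, text, postext in ref_triples:
            distinct[(pretext, postext)].add(text)

    weight = WEIGHTS[scheme]
    scores = {}
    for name, ref_triples in reference.items():
        matches = query_triples & ref_triples
        scores[name] = sum(
            weight(len(distinct[(pretext, postext)]))
            for pretext, _text, postext in matches
        )
    return scores
```

Under "distinct" a match on a character with many alternative texts counts for more than a match on a nearly invariant one; "inverse" reverses that preference.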
precision
method                    matK          nrITS 2
BRONX 1                   98% (79%)     96% (86%)
BRONX 0                   88% (84%)     91% (90%)
ATIM                      100% (67%)    100% (83%)
DOME ID                   60% (60%)     80% (80%)
DNA–BAR                   100% (79%)    98% (89%)
BLAST/neighbor joining    95% (56%)     93% (71%)
BLAST/SPR                 76% (53%)     87% (73%)
BLAST/parsimony ratchet   77% (55%)     86% (74%)
megaBLAST                 99% (61%)     94% (80%)
BLAT                      99% (69%)     94% (82%)
BLAST                     99% (67%)     94% (81%)
neighbor joining          44% (23%)     65% (8%)
SPR search                70% (41%)     60% (11%)
parsimony ratchet         71% (41%)     58% (13%)
accuracy to species
method                    matK          nrITS 2
BRONX 1                   92% (75%)     72% (67%)
BRONX 0                   76% (71%)     59% (58%)
ATIM                      87% (53%)     83% (71%)
DOME ID                   50% (50%)     67% (66%)
DNA–BAR                   73% (62%)     65% (62%)
BLAST/neighbor joining    86% (56%)     80% (64%)
BLAST/SPR                 78% (61%)     79% (67%)
BLAST/parsimony ratchet   80% (60%)     78% (67%)
megaBLAST                 84% (64%)     72% (68%)
BLAT                      82% (67%)     66% (62%)
BLAST                     84% (68%)     67% (63%)
neighbor joining          75% (52%)     68% (42%)
SPR search                78% (58%)     69% (47%)
parsimony ratchet         77% (60%)     67% (46%)
BRONX conclusions
BRONX is more precise than existing algorithms
BRONX is sometimes more accurate than existing algorithms
BRONX is an incremental improvement
future directions
improve the scoring function in BRONX
dynamically size context/text
benchmark additional datasets for all methods
incorporate context/text recoding into a scalable version of the ATIM algorithm
acknowledgments
Kenneth Cameron, Santiago Madriñán, Christian Schulz, Dennis Stevenson