
DNA Barcode sequence identification incorporating taxonomic hierarchy and within taxon variability

Damon P. Little

Cullman Program for Molecular Systematics Studies, The New York Botanical Garden, Bronx, New York

test data sets (Little and Stevenson 2007)

gymnosperm nuclear ribosomal internal transcribed spacer 2 (nrITS 2)

1,037 sequences, 413 species, 71 genera

gymnosperm plastid encoded maturase K (matK)

522 sequences, 334 species, 75 genera

…alignment

locus     sequences         median unaligned length (IQR)   aligned length
nrITS 2   all               137 (108–250) bp                8,733 bp
nrITS 2   one per species   196 (115–260) bp                6,778 bp
matK      all               1,561 (1,412–1,661) bp          3,975 bp
matK      one per species   1,601 (1,530–1,661) bp          3,906 bp

pairwise divergence

locus     sequences         median    interquartile range   zero comparisons
nrITS 2   all               30.99%    26.53–34.48%          0.09%
nrITS 2   one per species   29.39%    25.75–33.30%          0.21%
matK      all               20.39%    5.95–23.30%           0.54%
matK      one per species   21.38%    8.13–23.89%           0.42%
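The slides do not say how divergence was computed. As a rough sketch only, the summary columns above could be produced from aligned sequences as follows, assuming an uncorrected p-distance that skips gapped columns and assuming "zero comparisons" means the fraction of pairwise comparisons with no differences. The function names are hypothetical.

```python
from itertools import combinations
from statistics import median, quantiles


def p_distance(a, b, gap="-"):
    """Uncorrected p-distance over columns where neither sequence has a gap.

    Assumption: this is only one possible divergence measure; the slides
    do not state which one was used.
    """
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        return None  # no shared columns to compare
    return sum(x != y for x, y in compared) / len(compared)


def divergence_summary(aligned):
    """Median, quartiles, and fraction of zero-distance pairs (needs >= 2 usable pairs)."""
    dists = []
    for a, b in combinations(aligned, 2):
        d = p_distance(a, b)
        if d is not None:
            dists.append(d)
    q1, _, q3 = quantiles(dists, n=4)
    return {
        "median": median(dists),
        "iqr": (q1, q3),
        "zero_fraction": sum(d == 0 for d in dists) / len(dists),
    }


toy = ["ACGT", "ACGT", "ACGA", "AC-A"]  # made-up alignment, not from the test data
summary = divergence_summary(toy)
```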

measuring precision and accuracy

precision

method                    nrITS 2       matK
parsimony ratchet         58% (13%)     71% (41%)
SPR search                60% (11%)     70% (41%)
neighbor joining          65% (8%)      44% (23%)
BLAST                     94% (81%)     99% (67%)
BLAT                      94% (82%)     99% (69%)
megaBLAST                 94% (80%)     99% (61%)
BLAST/parsimony ratchet   86% (74%)     77% (55%)
BLAST/SPR                 87% (73%)     76% (53%)
BLAST/neighbor joining    93% (71%)     95% (56%)
DNA–BAR                   98% (89%)     100% (79%)
DOME ID                   80% (80%)     60% (60%)
ATIM                      100% (83%)    100% (67%)

accuracy to species

method       nrITS 2      matK
SPR search   69% (47%)    78% (58%)
BLAST        67% (63%)    84% (68%)
BLAT         66% (62%)    82% (67%)
megaBLAST    72% (68%)    84% (64%)
BLAST/SPR    79% (67%)    78% (61%)
DNA–BAR      65% (62%)    73% (62%)
DOME ID      67% (66%)    50% (50%)
ATIM         83% (71%)    87% (53%)


lessons learned

“global” alignments do not work

“fuzzy” matches are not precise

autoapomorphies (unique characters) work... but not always present

precision

method       nrITS 2       matK
SPR search   60% (11%)     70% (41%)
BLAST        94% (81%)     99% (67%)
BLAT         94% (82%)     99% (69%)
megaBLAST    94% (80%)     99% (61%)
BLAST/SPR    87% (73%)     76% (53%)
DNA–BAR      98% (89%)     100% (79%)
DOME ID      80% (80%)     60% (60%)
ATIM         100% (83%)    100% (67%)
DOME ID*     100% (100%)   100% (100%)

accuracy to species

method       nrITS 2      matK
SPR search   69% (47%)    78% (58%)
BLAST        67% (63%)    84% (68%)
BLAT         66% (62%)    82% (67%)
megaBLAST    72% (68%)    84% (64%)
BLAST/SPR    79% (67%)    78% (61%)
DNA–BAR      65% (62%)    73% (62%)
DOME ID      67% (66%)    50% (50%)
ATIM         83% (71%)    87% (53%)
DOME ID*     76% (75%)    90% (90%)


some sequences are simply unidentifiable

...remaining (insoluble) problems

identical sequences for multiple terminals

shared alleles between terminals: use allele frequency as a predictor?

desirable methodologies and properties of Sequence IDentification Engines (SIDEs)

avoid global alignment by comparing short segments: pseudo–alignment

use exact matches

use autoapomorphies where possible...but allow the use of other characters too
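As an illustration of why autoapomorphies help but cannot be relied on alone, here is a minimal sketch that finds character states seen in exactly one terminal. The input layout (terminal mapped to a set of character/state pairs) and the function name are assumptions made for this example, not part of any of the tools named in these slides.

```python
from collections import defaultdict


def autoapomorphies(observations):
    """Return, for each terminal, the (character, state) pairs unique to it.

    observations: dict mapping terminal name -> set of (character, state).
    Hypothetical layout chosen for illustration.
    """
    owners = defaultdict(set)
    for terminal, pairs in observations.items():
        for pair in pairs:
            owners[pair].add(terminal)

    unique = {terminal: set() for terminal in observations}
    for pair, terminals in owners.items():
        if len(terminals) == 1:          # state seen in one terminal only
            (terminal,) = terminals
            unique[terminal].add(pair)
    return unique


# Made-up toy data: terminals B and C are identical at every character.
toy = {
    "A": {(1, "G"), (2, "T")},
    "B": {(1, "G"), (2, "C")},
    "C": {(1, "G"), (2, "C")},
}
result = autoapomorphies(toy)
```

In the toy data, terminal A has one unique state while B and C have none, so an engine limited to autoapomorphies could not separate B from C and would need other characters.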

context/text DNA recoding

characters are defined by flanking context => pretext and postext

context-defined characters permit “alignment–free” comparisons

the size of, and separation between, pretext and postext must be arbitrarily delimited

states (text) are limited by the proximity of the context: the number of possible states is bounded by the length of the text

terminals can be individual sequences or composites representing taxa
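The recoding can be sketched as a sliding window split into pretext, text, and postext. This is a minimal illustration, not the BRONX implementation: the fixed `context_len` and `text_len` values are arbitrary choices (the slides note these sizes must be delimited arbitrarily), and the function names are hypothetical.

```python
from collections import defaultdict


def recode(seq, context_len=4, text_len=2):
    """Map each (pretext, postext) character to the set of texts seen between them."""
    window = 2 * context_len + text_len
    characters = defaultdict(set)
    for i in range(len(seq) - window + 1):
        pretext = seq[i:i + context_len]
        text = seq[i + context_len:i + context_len + text_len]
        postext = seq[i + context_len + text_len:i + window]
        characters[(pretext, postext)].add(text)
    return dict(characters)


def composite(seqs, context_len=4, text_len=2):
    """Merge several sequences of one taxon into a single composite terminal."""
    merged = defaultdict(set)
    for seq in seqs:
        for context, texts in recode(seq, context_len, text_len).items():
            merged[context] |= texts
    return dict(merged)


# Two made-up sequences that differ only between the same flanking context.
single = recode("ACGTTTGGCA")
taxon = composite(["ACGTTTGGCA", "ACGTCAGGCA"])
```

Both sequences yield the character ("ACGT", "GGCA"), with states "TT" and "CA", without any alignment step. With a text of length 2 a character can have at most 16 states.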

querying text/context database

find pretext/text/postext in the query sequence and match to references

score terminals based on the number of matches

final score can be raw or based on a weighting function
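A raw-score query could look roughly like this. It is a sketch under assumptions: fixed window sizes, an in-memory dictionary standing in for the text/context database, and hypothetical function and terminal names.

```python
from collections import Counter


def triples(seq, context_len=4, text_len=2):
    """Yield every (pretext, text, postext) triple in a sequence."""
    window = 2 * context_len + text_len
    for i in range(len(seq) - window + 1):
        yield (
            seq[i:i + context_len],
            seq[i + context_len:i + context_len + text_len],
            seq[i + context_len + text_len:i + window],
        )


def build_index(references):
    """references: dict terminal -> list of sequences. Returns triple -> terminals."""
    index = {}
    for terminal, seqs in references.items():
        for seq in seqs:
            for triple in triples(seq):
                index.setdefault(triple, set()).add(terminal)
    return index


def raw_scores(query, index):
    """Count, per terminal, how many distinct query triples match exactly."""
    scores = Counter()
    for triple in set(triples(query)):
        for terminal in index.get(triple, ()):
            scores[terminal] += 1
    return scores


# Made-up reference sequences for two hypothetical terminals.
refs = {"taxon_a": ["ACGTTTGGCAAT"], "taxon_b": ["ACGTCAGGCAAT"]}
scores = raw_scores("ACGTTTGGCAAT", build_index(refs))
```

With fixed sizes this amounts to exact matching of whole windows. The split into context and text starts to matter once characters are weighted by how many distinct texts they show.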

possible weighting functions

equal weights (raw score)

number of distinct texts => upweights more variable characters

1/(number of distinct texts) => downweights more variable characters

(number of texts)/(number of scores)
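The first three weighting functions can be written down directly. The sketch below assumes each matched character contributes its weight once to a terminal's score, and that the number of distinct texts per character is already known. The names are hypothetical. The fourth function is left out because the slide does not define its terms.

```python
def weight_equal(n_texts):
    """Equal weights: every match counts as 1 (raw score)."""
    return 1.0


def weight_up(n_texts):
    """Number of distinct texts: upweights more variable characters."""
    return float(n_texts)


def weight_down(n_texts):
    """1/(number of distinct texts): downweights more variable characters."""
    return 1.0 / n_texts


def weighted_score(matched, distinct_texts, weight):
    """Sum the weight of every matched character for one terminal.

    matched: iterable of character keys that matched the query.
    distinct_texts: dict character key -> number of distinct texts among references.
    """
    return sum(weight(distinct_texts[c]) for c in matched)


# Made-up example: one invariant character and one with four distinct texts.
distinct = {"c1": 1, "c2": 4}
matched = ["c1", "c2"]
raw = weighted_score(matched, distinct, weight_equal)
up = weighted_score(matched, distinct, weight_up)
down = weighted_score(matched, distinct, weight_down)
```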

precision

method       nrITS 2       matK
SPR search   60% (11%)     70% (41%)
BLAST        94% (81%)     99% (67%)
BLAT         94% (82%)     99% (69%)
megaBLAST    94% (80%)     99% (61%)
BLAST/SPR    87% (73%)     76% (53%)
DNA–BAR      98% (89%)     100% (79%)
DOME ID      80% (80%)     60% (60%)
ATIM         100% (83%)    100% (67%)
BRONX 0      91% (90%)     88% (84%)
BRONX 1      96% (86%)     98% (79%)

accuracy to species

method       nrITS 2      matK
SPR search   69% (47%)    78% (58%)
BLAST        67% (63%)    84% (68%)
BLAT         66% (62%)    82% (67%)
megaBLAST    72% (68%)    84% (64%)
BLAST/SPR    79% (67%)    78% (61%)
DNA–BAR      65% (62%)    73% (62%)
DOME ID      67% (66%)    50% (50%)
ATIM         83% (71%)    87% (53%)
BRONX 0      59% (58%)    76% (71%)
BRONX 1      72% (67%)    92% (75%)


BRONX conclusions

BRONX is more precise than existing algorithms

BRONX is sometimes more accurate than existing algorithms

BRONX is an incremental improvement

future directions

improve the scoring function in BRONX

dynamically size context/text

benchmark additional datasets for all methods

incorporate context/text recoding into a scalable version of the ATIM algorithm

acknowledgments

Kenneth Cameron, Santiago Madriñán, Christian Schulz, Dennis Stevenson

http://barcoding.si.edu/
