Tony Rees: TAXAMATCH poster May 2009
Tony Rees, CSIRO Marine and Atmospheric Research, Australia
Contact: Tony Rees; phone: +61 3 6232 5318; email: [email protected]; web: www.cmar.csiro.au/datacentre/
Taxon scientific names are key identifiers in the world of biodiversity, yet for informatics applications they often fail to provide the required cross linkages on account of minor (or not so minor) differences in spelling arising from keying
or phonetic errors, OCR (optical character recognition) and transcription errors, emendations, gender endings of species epithets, differences in diacritical marks, and more.
For example, data on the fish genus Coelorinchus (present “correct” spelling) might be stored under variant spellings Caelorinchus (previously considered correct), Coelorhinchus, Coelorhynchus, Caelorhynchus, and so on, while the potential for random or semi-random keystroke, OCR or transcription errors is almost limitless. If such potential variant spellings cannot be reconciled, some or even all of the desired data may not be retrieved.
This poster introduces TAXAMATCH, a “fuzzy” or near match algorithm developed at CSIRO Marine and Atmospheric Research (Australia), with the specific purpose of providing optimal fuzzy matching for genus and species scientific names in real world situations, and capable of deployment over a remote reference database of spellings deemed correct, or incorporation into any local system to suit a user’s particular needs.
TAXAMATCH comprises a suite of custom filters and tests used in succession on genus, species epithet, plus authority where supplied, to return candidate near or “fuzzy” matches in a reference set of taxon names to any supplied input name. The actual tests employed include the following:
• An exact match test, both before and after minor normalisation
• A phonetic match test, using a custom algorithm “tuned” to the characteristics of taxon scientific names
• A custom “Modified Damerau-Levenshtein Distance” (MDLD) algorithm which looks for possible omitted, inserted, substituted and transposed characters and character blocks
• A modified n-gram comparison of author names and dates where supplied, including expansion of selected known abbreviations of author names as appropriate.
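The edit-distance and n-gram tests listed above can be illustrated with a minimal Python sketch. This is not the MDLD algorithm itself: it uses the standard "optimal string alignment" form of Damerau-Levenshtein distance (single-character insertions, deletions, substitutions and adjacent transpositions only, with no handling of character blocks), and a plain bigram Dice coefficient as a stand-in for the modified n-gram author comparison, with no abbreviation expansion. The function names are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Standard optimal-string-alignment distance (NOT the poster's MDLD)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # omitted character
                d[i][j - 1] + 1,         # inserted character
                d[i - 1][j - 1] + cost,  # substituted (or equal) character
            )
            # adjacent transposition, e.g. "cr" <> "rc"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over sets of character bigrams (illustrative only)."""
    def bigrams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


if __name__ == "__main__":
    # Name pairs taken from the examples shown on this poster
    for x, y in [("Acropaginula", "Arcopaginula"), ("Peneus", "Penaeus"),
                 ("abrohlensis", "abrolhensis")]:
        print(x, y, osa_distance(x, y))
    print(bigram_similarity("Linnaeus", "Linne"))
```

Each of the three printed name pairs differs by a single transposition or insertion, so a distance threshold of 1 would already link them; the authority comparison returns a score between 0 and 1 rather than an edit count.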
TAXAMATCH operating principles
The custom filtering that has been developed for TAXAMATCH at both genus and species epithet levels comprises:
• Genus and species pre-filters, which serve to speed up the algorithm execution by excluding names deemed to be almost certain not to match from being tested
• Genus and species post-filters, which apply a set of rules to assist in the discrimination of likely “true” from “false” near matches
• A genus cosmetic filter, which presents only a subset of “genus near match” search results to the human web interface, while passing a wide range of genera through to the species stage for further testing
• A final result shaping stage (which can be switched out if desired), which masks more distant near matches in the presence of closer ones, but opens automatically to show them when the latter are absent.
A schematic of overall TAXAMATCH operation is shown in Fig. 1, below.
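The pre-filter and result shaping ideas described above can be sketched in a few lines of Python. The rules used here are assumptions chosen for illustration, not the actual TAXAMATCH filters: the pre-filter simply discards names whose length differs from the input by more than the distance threshold (such names cannot be within that many edits), the test is plain Levenshtein distance, and result shaping keeps only the closest group of matches. The reference list is a toy list, not IRMNG data, and not every string in it is claimed to be a valid genus name.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def near_matches(query, reference, max_dist=2, shape=True):
    """Return (hits, number_of_names_tested) for a hypothetical mini pipeline."""
    q = query.strip().lower()
    # Pre-filter (assumed rule): a length gap above max_dist rules out a match,
    # so those names are never passed to the more expensive distance test.
    candidates = [n for n in reference if abs(len(n) - len(q)) <= max_dist]
    # Test stage: score each surviving candidate.
    scored = [(edit_distance(q, n.lower()), n) for n in candidates]
    hits = sorted((d, n) for d, n in scored if d <= max_dist)
    # Result shaping: mask more distant matches when closer ones exist;
    # if only distant matches exist, they are shown.
    if shape and hits:
        best = hits[0][0]
        hits = [(d, n) for d, n in hits if d == best]
    return hits, len(candidates)


# Toy reference list for illustration only
REFERENCE = ["Homo", "Hoplo", "Homola", "Bombus", "Pan", "Coelorinchus"]

if __name__ == "__main__":
    print(near_matches("Hombo", REFERENCE))               # shaped result
    print(near_matches("Hombo", REFERENCE, shape=False))  # shaping switched out
```

With shaping on, only the closest candidate is reported; with shaping switched out, the more distant candidate is listed as well. In both cases the long name is excluded by the pre-filter before any distance is computed.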
TAXAMATCH reference implementation
The reference installation of TAXAMATCH currently runs over the IRMNG (Interim Register of Marine and Nonmarine Genera) database hosted at CSIRO Marine and Atmospheric Research, available via the access point www.cmar.csiro.au/datacentre/irmng/, which (at mid 2009) contains over 1.4 million species names from the Catalogue of Life and other sources, together with over 400,000 genus names. TAXAMATCH is automatically invoked when single genus + species, or genus queries are made, so as to display not only exact, but also any near matches in the IRMNG database, to any user-supplied input name.
Figs. 2 and 3 illustrate how TAXAMATCH will return a match of the correctly spelled name “Homo sapiens” in response to an incorrectly spelled input name “Hombo sapient”. Note that in this instance, operation of the genus and species pre-filters means that only 325 of the 445,004 genera, and 31 of the 1,459,171 species presently in the reference database are actually required to be tested, which contributes significantly to the relatively short execution time for the query (around 1 to a few seconds per input name, or less when conducted without the web interface and ancillary information presented).
Figure 2: Web accessible IRMNG / TAXAMATCH search entry point www.cmar.csiro.au/datacentre/irmng/
Figure 3: Result of above search for the entered term “Hombo sapient” against the IRMNG database
Figure 4: Sample IRMNG search result for a batch of multiple species names to be checked, showing option presented for “fuzzy search” on names that do not have an exact match to any current target name in the IRMNG database at this time.
TAXAMATCH use cases
A range of use cases can be envisaged for TAXAMATCH, including the following:
• Matching a (web or other) user’s entered text against stored biodiversity information, where either the input or stored name may be misspelled or a variant spelling
• Checking of names on a “List A” that do not match entries on an equivalent “List B” (but may potentially include the same entities under variant spellings)
• Query expansion – for distributed data searches (where all name variants can be indexed in advance), as would be applicable to (e.g.) OBIS, GBIF, etc.
• Deduplication of stored lists – especially those constructed by aggregation of names from multiple sources
• “As you type” spell correction
• Application in taxonomic name recognition software, e.g. via OCR of scanned specimen labels, or detection of taxonomic names in mixed text streams (biological publications, etc.)
The web accessible IRMNG / TAXAMATCH search entry point also currently supports the input of batches of up to approximately 2,500 genus names or 1,200 genus + species names for automated checking, as shown in Fig. 4; larger batches of names can be checked via alternative mechanisms as desired.
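The “List A” versus “List B” use case can be sketched as follows. This example does not call TAXAMATCH; it uses Python's standard-library difflib.get_close_matches as a generic stand-in for the near match step, and the function name cross_check and the cutoff value are assumptions made for illustration.

```python
import difflib


def cross_check(list_a, list_b, cutoff=0.85):
    """For each List A name with no exact List B match, suggest near matches.

    difflib is a generic similarity matcher, used here only as a stand-in
    for a taxon-aware algorithm such as TAXAMATCH.
    """
    exact = set(list_b)
    report = {}
    for name in list_a:
        if name in exact:
            continue  # exact matches need no fuzzy checking
        report[name] = difflib.get_close_matches(name, list_b, n=3, cutoff=cutoff)
    return report


if __name__ == "__main__":
    list_a = ["Penaeus monodon", "Peneus monodon",
              "Coelorhynchus australis", "Homo sapiens"]
    list_b = ["Penaeus monodon", "Coelorinchus australis"]
    for name, candidates in cross_check(list_a, list_b).items():
        print(name, "->", candidates or "no near match")
```

Names with an empty candidate list are those that appear to be genuinely absent from List B, as opposed to present under a variant spelling.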
Conclusion
TAXAMATCH appears to offer a good solution to the problems of near matching genus and/or species scientific names, whether for matching users’ misspelled query terms to correctly stored target data (or vice versa), list cross-matching or internal deduplication, or as a prototype web accessible taxonomic spell checking service. Several development areas for TAXAMATCH are currently under active consideration, and interested potential users or developers are encouraged to contact the author at the address shown below or to visit the TAXAMATCH web page www.cmar.csiro.au/datacentre/taxamatch.htm.
References
Rees, T. (2008). TAXAMATCH, a “fuzzy” matching algorithm for taxon names, and potential applications in taxonomic databases. TDWG 2008 Annual Conference, Perth, Australia, abstract and presentation available via www.tdwg.org/conference2008/program/.
Rees, T. (2009 in press). TAXAMATCH, an algorithm for near (‘fuzzy’) matching of species scientific names in taxonomic databases. Biodiversity Informatics (submitted).
Acknowledgements
I thank Miroslaw Ryba, CSIRO Marine and Atmospheric Research, for programming and database assistance, and Barbara Boehmer, USA, for assistance with modifying her original Oracle® Levenshtein Distance implementation for TAXAMATCH use.
Photographs courtesy of Karen Gowlett-Holmes.
Fuzzy matching of taxon names for biodiversity informatics applications
Acropaginula <> Arcopaginula
Meosarmatium <> Neosarmatium
Peneus <> Penaeus
faveolata <> flaveolata
capricornicus <> capricornensis
abrohlensis <> abrolhensis
Figure 1: Schematic of TAXAMATCH operation. Diagram labels: input genus + species (+ auth.); parsing and normalisation; normalised input genus, normalised input species, normalised input authority; available genus names, available species, available genus + species names (+ auth’s); genus pre-filter, species pre-filter; genus names tested, species tested; genus test, species test; genus post-filter, species post-filter; genus near matches, species near matches; species authorities, auth. comparator; genus cosmetic filter; ranking + result shaping; genus near matches displayed, species near matches displayed.
Poster design by Lea Crosswell – Communication Group, CSIRO Marine and Atmospheric Research – May 2009