BiOnym

iME4d - BiOnym A concept-mapping workflow for taxon names reconciliation Friday 7 March 2014 – Rome Fabio Fiorellato, Edward Vanden Berghe, Gianpaolo Coro, Nicolas Bailly


Page 1: BiOnym

iME4d - BiOnym

A concept-mapping workflow for taxon names reconciliation

Friday 7 March 2014 – Rome

Fabio Fiorellato, Edward Vanden Berghe, Gianpaolo Coro, Nicolas Bailly

Page 2: BiOnym

Big Data makes its way into biology

• Data volumes increase dramatically

– Management of large databases (millions of records) is easier

• No longer the realm of professional IT people

– Biologists wake up to the advantages of

• Good data management, including preservation

• Data sharing

• Makes it possible to do science in a different way

Page 3: BiOnym

‘Big Data’: Need for data integration

• Becoming a very realistic possibility

– Management of DBs of millions of records

• Needs integration of small, restricted-scope datasets into massive databases

– Intra-discipline integration (homogeneous)

– Inter-discipline integration (heterogeneous)

• Individual studies are too small to inform on a scale commensurate with the problems humankind faces

– Evidence-based management of living resources

– Climate change, global warming…

Page 4: BiOnym

iMarine biodiversity ‘ecosystem’

[Diagram. Labels: Taxon name access; Taxon name enrichment; Taxon name reconciliation; Occurrence data access; Occurrence data enrichment; Occurrence data reconciliation; Environmental data access; Distribution modelling; openModeller; AquaMaps]

Page 5: BiOnym

Central role of taxon name reconciliation

[Diagram. Labels: Taxon name access; Taxon name enrichment; Taxon name reconciliation; Occurrence data access; Occurrence data enrichment; Occurrence data reconciliation; Environmental data access; Distribution modelling; openModeller; AquaMaps]

Page 6: BiOnym

Taxonomic names are the keys…

• … Keys to bind together information on the same taxon from different sources

• But there are problems

– Different research groups use different spellings

– Accidental misspellings

– Synonym and homonym reconciliation (but outside the scope of BiOnym)

Page 7: BiOnym

Some people can’t type

• Asthenognathas inaefaipes

• Asthenognathus inaeqipes

• Asthenognathus maefaipes

• Astheognathus inaequipes

• Asthenognathus inaeguipes

• Astheognathus inaeqinipes

• Asthenognathus inaequipes
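One way to quantify how far each variant on this slide sits from the intended name is an edit distance. The sketch below is a plain Levenshtein implementation, not BiOnym's own scoring; it assumes that the last spelling in the list, Asthenognathus inaequipes, is the intended one.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Assumption: the last variant on the slide is the intended spelling.
TARGET = "Asthenognathus inaequipes"

VARIANTS = [
    "Asthenognathas inaefaipes",
    "Asthenognathus inaeqipes",
    "Asthenognathus maefaipes",
    "Astheognathus inaequipes",
    "Asthenognathus inaeguipes",
    "Astheognathus inaeqinipes",
]

for variant in VARIANTS:
    print(f"{variant!r}: distance {levenshtein(variant, TARGET)}")
```

A small distance relative to the length of the name suggests a typing error rather than a different taxon, which is the kind of signal a fuzzy matcher can threshold on.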

Page 8: BiOnym

Things can go wrong with Excel…

• Clupea harengus Linnaeus, 1758

• Clupea harengus Linnaeus, 1759

• Clupea harengus Linnaeus, 1760

• Clupea harengus Linnaeus, 1761

• Clupea harengus Linnaeus, 1762

• …

Page 9: BiOnym

… very wrong

• Clupea harengus Linnaeus, 1758

• Clupea harengus Linnaeus, 1759

• Clupea harengus Linnaeus, 1760

• …

• Clupea harengus Linnaeus, 2254

• Clupea harengus Linnaeus, 2255
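The spreadsheet fill-down pattern on these two slides (same name and author, year increasing by one on each row) can be flagged mechanically before matching. The following is a minimal sketch of such a check; the function name `find_autofill_runs` and the `min_run` parameter are hypothetical and not part of BiOnym.

```python
import re

# "Name Author, 1758" -> name-with-author and a four-digit year.
ROW_RE = re.compile(r"^(?P<name>.+?),\s*(?P<year>\d{4})$")


def find_autofill_runs(rows, min_run=3):
    """Return (name, first_year, last_year) for each run of consecutive rows
    that share a name and whose years increase by exactly one per row."""
    runs = []
    current = []  # (name, year) entries of the run being built
    for row in rows:
        match = ROW_RE.match(row.strip())
        entry = (match.group("name"), int(match.group("year"))) if match else None
        continues_run = (
            entry is not None
            and current
            and entry[0] == current[-1][0]
            and entry[1] == current[-1][1] + 1
        )
        if continues_run:
            current.append(entry)
        else:
            if len(current) >= min_run:
                runs.append(current)
            current = [entry] if entry is not None else []
    if len(current) >= min_run:
        runs.append(current)
    return [(run[0][0], run[0][1], run[-1][1]) for run in runs]


rows = [
    "Clupea harengus Linnaeus, 1758",
    "Clupea harengus Linnaeus, 1759",
    "Clupea harengus Linnaeus, 1760",
    "Gadus morhua Linnaeus, 1758",
]
print(find_autofill_runs(rows))
```

A flagged run only says that the years look auto-incremented; deciding which year is the real one still needs the reference list.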

Page 10: BiOnym

Taxonomic names are the keys…

• … Keys to bind together information on the same taxon from different sources

• But there are problems

– Different research groups use different spellings

– Accidental misspellings

• Reconciliation is a necessity, not a luxury!

Page 11: BiOnym

Existing systems…

• … Are not flexible

– We need flexibility, as our use case will dictate what the ‘optimal’ behaviour of the system is

• E.g. manual vs automatic systems

• … Are often coupled to a single ‘reference list’

– Using a different taxonomic scope for the test and reference lists only increases false positives

• E.g. TaxaMatch with IRMNG…

• … Don’t always have the throughput needed for large-scale projects

– Largest database has approx. 20M names: too many pairs!
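The "too many pairs" point is simple arithmetic: without any pre-filtering, every input name is compared with every reference name. The sketch below works this out for the 20M-name reference list mentioned on the slide; the input list sizes are hypothetical and chosen only for illustration.

```python
def candidate_pairs(input_names: int, reference_names: int) -> int:
    """Number of name pairs to score when every input is compared with every reference."""
    return input_names * reference_names


REFERENCE_NAMES = 20_000_000  # approximate size quoted on the slide

# Hypothetical input list sizes, for illustration only.
for input_size in (1_000, 100_000, 1_000_000):
    pairs = candidate_pairs(input_size, REFERENCE_NAMES)
    print(f"{input_size:>9,} input names -> {pairs:,} candidate pairs")
```

The count grows linearly with each list, so the product quickly reaches sizes that call for parallel execution.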

Page 12: BiOnym

Our need

• A flexible, highly customisable, workflow-based approach to taxon name matching

– User controls input

– Output can be used as input in other processes

– Running on high performance computing infrastructure

BiOnym!

Page 13: BiOnym

Introduction to BiOnym

• As a workflow for taxon name mapping and reconciliation, it is a real-world application of the concept-mapping principles

• It is focused on the domain of taxonomy, with an initial restriction to marine species only

• Provides a full workflow (not only the concept mapping part)

• Tries to address - and possibly solve - many issues common to the taxonomic community

• Its key concept is “species taxonomy”, where concept properties are the taxonomic atoms

• Is open to the integration of third-party components

• Takes advantage of the iMarine distributed infrastructure

Page 14: BiOnym

The iMarine solution: existing state-of-the-art

• A general purpose concept mapping framework (COMET) was already available in FAO:

– based on an existing FAO product (limited to the fishing vessels domain) initially developed with the support of the Japanese trust fund

– domain independent (can be tailored to any custom domain with little effort)

– provided with all the necessary building blocks and components for general purpose usage

Page 15: BiOnym

The iMarine solution: the quest for integration

• The integration of COMET inside iMarine was welcomed and expected.

• Its main challenges:

– Identify and define the custom domain (biological taxonomy)

– Design and implement:

• custom COMET matchlets (engines assigning similarity scores to pairs of names)

• additional, reusable tools for data interchange and data preparation (DwCA converter, input parser, pre- and post-processors)

– Enable components to be easily distributed among worker nodes inside the infrastructure

– Integration in the iMarine Statistical Manager
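To make the idea of a matchlet concrete, here is a minimal sketch of a component that assigns a similarity score between 0 and 1 to a pair of names. It uses character-trigram overlap (Jaccard similarity) as a stand-in technique; the function names are hypothetical and this is not the COMET matchlet interface.

```python
def trigrams(name: str) -> set:
    """Return the set of character trigrams of a lower-cased, padded name."""
    padded = f"  {name.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_matchlet(name_a: str, name_b: str) -> float:
    """Score a pair of names: 1.0 for identical trigram sets, 0.0 for disjoint ones."""
    grams_a, grams_b = trigrams(name_a), trigrams(name_b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


for candidate in ("Asthenognathus inaequipes", "Clupea harengus"):
    score = trigram_matchlet("Astheognathus inaequipes", candidate)
    print(f"{candidate}: {score:.2f}")
```

Because a matchlet in this sketch is just a function from a pair of names to a score, several of them can be swapped or combined without changing the rest of the workflow.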

Page 16: BiOnym

The iMarine solution: a success story

• The COMET integration inside iMarine, as part of the BiOnym workflow, is an example of a success story:

– Solving the integration challenges required limited effort

• Harvest names for input through iMarine tools

• Send output from BiOnym/COMET on to further tools

– The core matching capabilities of BiOnym were first made available in June 2013

• Pre- and post-processing; parsing

• Matching through (a series of) matchlets, assigning a similarity score to pairs of names

– The modular architecture enabled developers to add new functionalities or improve existing ones with ease

Page 17: BiOnym

BiOnym key concepts and features

• Its modular architecture is open to contribution and alternatives

– Workflow stages can be plugged in with custom business implementations

– Can leverage third-party components (e.g. the input data parsing is available both as an in-house component and as a wrapper of the GNI parser from globalnames.org)

• Based on standard and open formats

– Reference data are synthesized from DwCA files

– Input data and matching results are expected and produced as CSV files

– Matching results can also be emitted as XML files in the COMET format

• High flexibility

– Multiple chained matchers, each with its own configuration and thresholds

– Third-party matchers (e.g. Tony Rees’ TaxaMatch) can be seamlessly ‘wrapped’ and plugged into the workflow

– Support for collaborative matching results evaluation (expected soon)
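The idea of multiple chained matchers, each with its own threshold, can be sketched as a sequence of filters over the reference list. This is an assumption about how a chain might behave (each stage only sees candidates kept by the previous one), not a description of the BiOnym implementation; the `Matcher` class, the two scorers and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


@dataclass
class Matcher:
    name: str
    score: Callable[[str, str], float]
    threshold: float  # candidates scoring below this are dropped


def genus_similarity(a: str, b: str) -> float:
    """Compare only the first word (the genus) of each name."""
    return SequenceMatcher(None, a.split()[0].lower(), b.split()[0].lower()).ratio()


def full_similarity(a: str, b: str) -> float:
    """Compare the complete name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def run_chain(name: str, reference: List[str], chain: List[Matcher]) -> List[Tuple[str, float]]:
    """Pass the reference list through each matcher in turn and return the
    surviving candidates with the score given by the last matcher."""
    candidates = list(reference)
    kept: List[Tuple[str, float]] = []
    for matcher in chain:
        scored = [(ref, matcher.score(name, ref)) for ref in candidates]
        kept = [(ref, s) for ref, s in scored if s >= matcher.threshold]
        candidates = [ref for ref, _ in kept]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


REFERENCE = ["Asthenognathus inaequipes", "Clupea harengus", "Gadus morhua"]
CHAIN = [
    Matcher("genus", genus_similarity, threshold=0.8),
    Matcher("full name", full_similarity, threshold=0.9),
]
print(run_chain("Astheognathus inaequipes", REFERENCE, CHAIN))
```

Putting a cheap, coarse matcher first keeps the more expensive comparisons to a short list of candidates, which is one reason to make both the order and the thresholds configurable.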

Page 18: BiOnym

BiOnym System: Overview

Page 19: BiOnym

BiOnym Workflow

Page 20: BiOnym

Where are we?

• Infrastructure has largely been built

• User-friendly GUI is under development

• Evaluation

– Efficiency: speed of computations

• Parallel system, compares well with others

– Effectiveness: are the results OK?

• Ran experiments on different test datasets

– Deliberately introducing misspellings in known lists

– ‘Real’ misspellings manually corrected for other purposes
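The first kind of experiment (deliberately introducing misspellings in a known list) can be sketched as follows: corrupt each reference name with one random edit, run a matcher, and count how often the original name comes back on top. The matcher here is `difflib.get_close_matches` from the Python standard library, used as a stand-in; the reference list, seed and cutoff are illustrative assumptions, and no BiOnym results are reproduced.

```python
import random
from difflib import get_close_matches

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def misspell(name: str, rng: random.Random) -> str:
    """Apply one random edit (delete, substitute or transpose) away from the ends.
    A substitution or transposition may occasionally leave the name unchanged."""
    i = rng.randrange(1, len(name) - 1)
    operation = rng.choice(["delete", "substitute", "transpose"])
    if operation == "delete":
        return name[:i] + name[i + 1:]
    if operation == "substitute":
        return name[:i] + rng.choice(ALPHABET) + name[i + 1:]
    return name[:i] + name[i + 1] + name[i] + name[i + 2:]


def recall_on_synthetic_errors(reference, seed=0, cutoff=0.8):
    """Fraction of corrupted names whose best match is the name they came from."""
    rng = random.Random(seed)
    hits = 0
    for name in reference:
        corrupted = misspell(name, rng)
        best = get_close_matches(corrupted, reference, n=1, cutoff=cutoff)
        if best and best[0] == name:
            hits += 1
    return hits / len(reference)


# Illustrative reference list; a real experiment would use a full checklist.
REFERENCE = [
    "Asthenognathus inaequipes",
    "Clupea harengus",
    "Gadus morhua",
    "Salmo salar",
    "Thunnus albacares",
]
print(f"recall: {recall_on_synthetic_errors(REFERENCE, seed=1):.2f}")
```

Fixing the random seed makes a run repeatable, so different matcher configurations can be compared on exactly the same set of synthetic errors.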

Page 21: BiOnym

The BiOnym Interface

Never mind the small print.

Step 1: Select your data

Step 2: Compose the matching process. This relies on infrastructure resources

Step 3: Review results. This can be private and ‘for your eyes only’, or public.

Page 22: BiOnym

The BiOnym Workflow

Page 23: BiOnym

Visualising quality assessment of the results of BiOnym

Page 24: BiOnym

Where to from here?

• Validation

– Not in terms of quality of output but…

– Uptake by the biodiversity community

• Sustainability

– Who will take over maintenance after iMarine ends?

• BiOnym is a tool, it is the means to an end

– Support Ecosystem Approach to Fisheries

Page 25: BiOnym

iMarine biodiversity ‘ecosystem’

[Diagram. Labels: Taxon name access; Taxon name enrichment; Taxon name reconciliation; Occurrence data access; Occurrence data enrichment; Occurrence data reconciliation; Environmental data access; Distribution modelling; openModeller; AquaMaps]

Page 26: BiOnym

BiOnym in its environment

Ecological modelling – Rich data management

[Diagram. Labels: Taxa Authority File; Vernacular Names Authority File; Darwin Core Archive]

Based on the COMET Framework developed by Fabio Fiorellato (FAO)

Page 27: BiOnym

Ecological modelling – Processing

[Diagram. Labels: Biodiversity Maps Generation; Retrieve via any GeoNetwork]