Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity...

Addressing the name:meaning drift challenge in open biodiversity information environments. Please @taxonbytes. Nico M. Franz 1, Salvatore A. Anzaldo 1, Edward E. Gilbert 1, M. Andrew Jansen 1, M. Andrew Johnston 1 & Bertram Ludäscher 2. 1 School of Life Sciences, Arizona State University; 2 iSchool, University of Illinois at Urbana-Champaign. Symposium: Building the Biodiversity Knowledge Graph for Insects – Components, Progress, and Challenges. 2016 XXV International Congress of Entomology, Orlando, FL – September 26, 2016 (#ICE2016). Presentation available @ SlideShare: http://tinyurl.com/franz-et-al-ice-2016

Transcript of Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity...

Page 1: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Addressing the name:meaning drift challenge in open biodiversity information environments

Please

@taxonbytes

Nico M. Franz 1, Salvatore A. Anzaldo 1, Edward E. Gilbert 1, M. Andrew Jansen 1, M. Andrew Johnston 1 & Bertram Ludäscher 2

1 School of Life Sciences, Arizona State University

2 iSchool, University of Illinois at Urbana-Champaign

Symposium: Building the Biodiversity Knowledge Graph for Insects – Components, Progress, and Challenges

2016 XXV International Congress of Entomology, Orlando, FL – September 26, 2016 (#ICE2016)

Presentation available @ SlideShare: http://tinyurl.com/franz-et-al-ice-2016

Page 2: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Our biodiversity informatics research program, summarized

• We are no longer just putting articles and monographs on library shelves.


Page 3: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Our biodiversity informatics research program, summarized

• We are no longer just putting articles and monographs on library shelves.

• This is more than 'just technology'; we must develop new systematic theory

to deal with inherently dynamic, open data systems.


Page 4: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Our biodiversity informatics research program, summarized

• We are no longer just putting articles and monographs on library shelves.

• This is more than 'just technology'; we must develop new systematic theory

to deal with inherently dynamic, open data systems.

• The concept taxonomy approach has practical implications for strengthening

the roles that individual experts play in big biodiversity data environments.


Page 5: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Products – concept taxonomy in theory and in practice

ZooKeys. doi:10.3897/zookeys.528.6001

Semantic Web. doi:10.3233/SW-160220

Biological Theory (in review). doi:10.1101/022145

PLoS ONE. doi:10.1371/journal.pone.0118247

Systematics Biodiv. doi:10.1080/14772000.2013.806371

Systematic Biology. doi:10.1093/sysbio/syw023

Biodiversity Data Journal (in review). #6093

Research Ideas and Outcomes (in review). #6302

Page 6: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Premise: We're lucky that insect revisions are not so frequent

"In biology, there are many taxa that are so under-studied that they are only known from their original description and none or very few subsequent references […].

The name alone, so long as it is a unique name, is sufficient to locate all related material."

– David Remsen 2016: 213

Source: Remsen. 2016. The use and limits of scientific names […]. ZooKeys 550: 207–223. doi:10.3897/zookeys.550.9546

Page 7: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Diagnosis:

What happens in dynamic, open systems?

Page 8: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Snapshot of a more frequently revised organismal lineage

Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)

• 9 schemata for the NA Cleistes/Cleistesiopsis complex (orchids)

Page 9: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Snapshot of a more frequently revised organismal lineage

Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)

• 9 schemata for the NA Cleistes/Cleistesiopsis complex (orchids)

• Vertical sections identify taxonomic concept regions

Page 10: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Snapshot of a more frequently revised organismal lineage

Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)

• 9 schemata for the NA Cleistes/Cleistesiopsis complex (orchids)

• Vertical sections identify taxonomic concept regions

• Colors identify lineages of taxonomic names (epithets) in use

Page 11: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Snapshot of a more frequently revised organismal lineage

Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)

• 9 schemata for the NA Cleistes/Cleistesiopsis complex (orchids)

• Vertical sections identify taxonomic concept regions

• Colors identify lineages of taxonomic names (epithets) in use

• There is no consensus! Five incongruent schemata are used concurrently

Page 12: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Premise:

If incongruent taxonomies are endorsed – locally, provisionally, and democratically –

then what is the impact for aggregated biodiversity data?

Page 13: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Conclusion:

Taxonomy becomes a variable that we need to represent,

and control for

Page 14: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)

The 'consensus'

• Query: "Where do these orchid species occur?"

• Same set of 250 orchid specimens, according to 4 taxonomies.

"Controlling the taxonomic variable"

Example: the Cleistes use case

Page 15: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)

The 'consensus' The 'bible'

"Controlling the taxonomic variable"

• Query: "Where do these orchid species occur?"

• Same set of 250 orchid specimens, according to 4 taxonomies.

Example: the Cleistes use case

Page 16: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)

The 'consensus' The 'bible'

The (formerly) federal 'standard'

"Controlling the taxonomic variable"

Page 17: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)

The 'consensus' The 'bible'

The (formerly) federal 'standard'

The 'best', latest regional flora

"Controlling the taxonomic variable"

Page 18: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)

The 'consensus' The 'bible'

The (formerly) federal 'standard'

The 'best', latest regional flora

"Controlling the taxonomic variable"

Expert views are in conflict

Page 19: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)

The 'consensus' The 'bible'

The (formerly) federal 'standard'

The 'best', latest regional flora

"Controlling the taxonomic variable"

Expert views are in conflict

"Just bad"

Page 20: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)

The 'consensus' The 'bible'

The (formerly) federal 'standard'

The 'best', latest regional flora

Impact: Name-based aggregation has created a novel synthesis that nobody believes in

"Controlling the taxonomic variable"

"Just bad"

Page 21: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)

The 'consensus' The 'bible'

The (formerly) federal 'standard'

The 'best', latest regional flora

"Controlling the taxonomic variable"

"Just bad"

Expert views are in conflict

Solution: Instead of aggregating an artificial 'consensus', …

Page 22: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)

The 'consensus' The 'bible'

The (formerly) federal 'standard'

The 'best', latest regional flora

"Controlling the taxonomic variable"

"Just bad"

Expert views are reconciled

Solution: Instead of aggregating an artificial 'consensus', build translation services

Page 23: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Challenges:

How can we redesign aggregation to yield high-quality biodiversity data packages?

Page 24: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Challenges:

How can we redesign aggregation to yield high-quality biodiversity data packages?

What does this mean for Darwin Core 1

and how we use this aggregation standard?

1 Wieczorek et al. 2012. Darwin Core: an evolving […]. PLoS ONE 7(1): e29715. doi:10.1371/journal.pone.0029715

Page 25: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Preview of solution with 8 steps

• DwC is insufficient, and part of the problem

Step 7:

Page 26: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

# 1: Represent only taxonomic concept labels (TCLs) 1

• Syntax (TCL): taxonomic name [author, year, page] sec. source

1 Multi-taxonomy input/alignment visualizations generated with Euler/X toolkit: https://github.com/EulerProject/EulerX

Cleistes divaricata sec. Gregg & Catling 1993

Pogonia sec. Brown & Wunderlin 1997
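The TCL syntax above can be sketched as a small data structure. This is only an illustration, not the Euler/X input format: the class name `TCL`, its fields, and the helper `parse_tcl` are assumptions made for this example, and the parser only splits at the " sec. " separator without handling name authorship.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TCL:
    """A taxonomic concept label: name [author, year, page] sec. source."""
    name: str                          # taxonomic name, e.g. "Cleistes divaricata"
    source: str                        # the 'sec.' source, e.g. "Gregg & Catling 1993"
    authorship: Optional[str] = None   # optional [author, year, page] of the name

    def label(self) -> str:
        middle = f" {self.authorship}" if self.authorship else ""
        return f"{self.name}{middle} sec. {self.source}"


def parse_tcl(text: str) -> TCL:
    """Split a label at ' sec. ' (hypothetical helper; authorship is not parsed)."""
    name, sep, source = text.partition(" sec. ")
    if not sep or not name.strip() or not source.strip():
        raise ValueError(f"not a TCL: {text!r}")
    return TCL(name=name.strip(), source=source.strip())


a = parse_tcl("Cleistes divaricata sec. Gregg & Catling 1993")
b = parse_tcl("Pogonia sec. Brown & Wunderlin 1997")
# Same name, different source: a different label, hence a different concept.
c = TCL("Cleistes divaricata", "Smith et al. 2004")
```

Because equality is defined over name and source together, `a` and `c` share a name yet compare as unequal, which is the point of labeling concepts rather than names.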

Page 27: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

# 1: DwC score keeping – TCLs are optional; < 1% realized?

• TCL ~ DwC: nameAccordingTo

• SCAN: 19,722 of nearly 9 million records have TCLs (0.2%)

• Because TCL use is not enforced, the standard is less big data-ready
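A coverage figure like the one above can be computed by counting records with a non-empty nameAccordingTo value. The sketch below is an assumption about how such a tally might be done, not the SCAN portal's actual method: the function name `tcl_coverage` is hypothetical and the sample records are made up. Only the field names (catalogNumber, scientificName, nameAccordingTo) are Darwin Core terms.

```python
from typing import Iterable, Mapping


def tcl_coverage(records: Iterable[Mapping[str, str]]) -> float:
    """Fraction of records whose nameAccordingTo field is present and non-empty."""
    total = 0
    with_tcl = 0
    for rec in records:
        total += 1
        if rec.get("nameAccordingTo", "").strip():
            with_tcl += 1
    return with_tcl / total if total else 0.0


# Made-up records for illustration only.
sample = [
    {"catalogNumber": "X-1", "scientificName": "Aus bus", "nameAccordingTo": "Old 1968"},
    {"catalogNumber": "X-2", "scientificName": "Aus bus", "nameAccordingTo": ""},
    {"catalogNumber": "X-3", "scientificName": "Aus cus"},
    {"catalogNumber": "X-4", "scientificName": "Aus cus"},
]
coverage = tcl_coverage(sample)  # 1 of 4 sample records carries a TCL
```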

DwC record with nameAccordingTo (TCL) (BDJ)

"Who authors GBIF's Backbone?" https://storify.com/taxonbytes/who-authors-gbif-s-backbone

Page 28: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

# 2: Represent each source coherently (Parent-Child relationships)

• Syntax (PC): TCL1 is a child/parent of TCL2 [where TCL1/2 = same source]

Cleistesiopsis bifaria sec. Pans. & de Barr. 2008

is a child of

Cleistesiopsis sec. Pans. & de Barr. 2008
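The same-source constraint on parent-child (PC) relations can be sketched as follows. The class `SourceHierarchy` and its methods are hypothetical names for this illustration. TCLs are modeled as plain (name, source) pairs, and the lineage walk assumes the hierarchy contains no cycles.

```python
from typing import Dict, List, Tuple

TCL = Tuple[str, str]  # (taxonomic name, 'sec.' source)


class SourceHierarchy:
    """Parent-child (PC) relations among the TCLs of one source."""

    def __init__(self, source: str) -> None:
        self.source = source
        self.parent_of: Dict[TCL, TCL] = {}

    def add_child(self, child: TCL, parent: TCL) -> None:
        # A PC relation is only valid when both TCLs come from this source.
        if child[1] != self.source or parent[1] != self.source:
            raise ValueError("child and parent must share this hierarchy's source")
        self.parent_of[child] = parent

    def lineage(self, tcl: TCL) -> List[TCL]:
        """Chain of parents from tcl up to the root (assumes no cycles)."""
        chain: List[TCL] = []
        while tcl in self.parent_of:
            tcl = self.parent_of[tcl]
            chain.append(tcl)
        return chain


src = "Pans. & de Barr. 2008"
hierarchy = SourceHierarchy(src)
hierarchy.add_child(("Cleistesiopsis bifaria", src), ("Cleistesiopsis", src))
parents = hierarchy.lineage(("Cleistesiopsis bifaria", src))
```

Keeping one hierarchy object per source is one way to preserve each source's coherence; relations between sources are left to the RCC–5 articulations.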

Page 29: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

# 2: DwC score keeping – Not (adequately) represented

• PC ~ DwC: genus, family, order (etc.; higherClassification)

• However, higher-level names in DwC are not modeled as TCLs

• Taxonomic coherence of sources cannot be preserved with DwC alone

DwC record with higherClassification (BDJ)

Page 30: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

# 3: Do not force a single hierarchy onto all tip-level TCLs

• Syntax (PC): Tip-level TCL1 , TCL2 , etc. [where TCL1/2 = different sources]

Page 31: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

# 3: DwC score keeping – Optional; not (ever?) practiced

• No PC ~ DwC: infra-/specificEpithet only

• Typically, a single, 'unitary' higher-level classification is represented

• Combinations of algorithmic and social practices achieve the single hierarchy

"Who authors GBIF's Backbone?" https://storify.com/taxonbytes/who-authors-gbif-s-backbone

Page 32: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

# 4: Link TCLs via expert-provided RCC–5 articulations

• Syntax (RCC–5): TCL1 {==, >, <, ><, !} TCL2 [where TCL1/2 = diff. sources]

• RCC–5 = Region Connection Calculus

• 14 articulations provided by: http://tinyurl.com/Weakley-Flora-2015

Cleistes bifaria "Coastal Populations" sec. Smith et al. 2004

== (is congruent with)

Cleistesiopsis oricamporum sec. Brown & Pans. 2009

Page 33: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Source: Thau, D.M. 2010. Reasoning about taxonomies. Thesis, UC Davis. http://gradworks.proquest.com/3422778.pdf

Region Connection Calculus (semantics: set constraints)

== < > >< !

• Two regions N, M are either:

• congruent (N == M)
• properly inclusive (N < M)
• inversely properly inclusive (N > M)
• overlapping (N >< M)
• exclusive of each other (N ! M)

Page 34: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Source: Thau, D.M. 2010. Reasoning about taxonomies. Thesis, UC Davis. http://gradworks.proquest.com/3422778.pdf

Region Connection Calculus (semantics: set constraints)

== < > >< !

• Two regions N, M are either:

• congruent (N == M)
• properly inclusive (N < M)
• inversely properly inclusive (N > M)
• overlapping (N >< M)
• exclusive of each other (N ! M)

• RCC–5 articulations answer the query: "can we join regions N and M?"

• Taxonomies have multiple RCC–5 alignable components: nodes (parents, children), node-associated traits, even node-anchoring specimens
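The set-constraint semantics of the five relations can be illustrated with ordinary Python sets. This is a minimal sketch, assuming each region is given as an explicit, non-empty set of members (for example, specimen identifiers). The function name `rcc5` is hypothetical. A logic reasoner such as Euler/X works from constraints rather than from enumerated members, so this only demonstrates what the symbols mean.

```python
from typing import AbstractSet, Hashable


def rcc5(n: AbstractSet[Hashable], m: AbstractSet[Hashable]) -> str:
    """Return the RCC-5 relation of region n to region m, read left to right."""
    if not n or not m:
        raise ValueError("regions must be non-empty")
    if n == m:
        return "=="   # congruent
    if n < m:
        return "<"    # n is properly included in m
    if n > m:
        return ">"    # n properly includes m
    if n & m:
        return "><"   # overlapping: shared and unshared members on both sides
    return "!"        # exclusive of each other


# Toy regions with made-up members.
relations = [
    rcc5({"s1", "s2"}, {"s1", "s2"}),
    rcc5({"s1"}, {"s1", "s2"}),
    rcc5({"s1", "s2"}, {"s1"}),
    rcc5({"s1", "s2"}, {"s2", "s3"}),
    rcc5({"s1"}, {"s2"}),
]
```

For non-empty regions exactly one of the five cases applies, which is why a single articulation per TCL pair is maximally informative.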

Page 35: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

# 4: DwC score keeping Not (adequately) represented

• RCC–5 ~ DwC: accepted(Scientific)Name(Usage), relationshipOfResource, taxonomicStatus (etc.; nomenclatural relationships)

• Nomenclatural relationships are type-focused, not region-focused

• "Taxonomic Concept Schema" yes! (however: http://www.tdwg.org/standards/117)

Source: Vane-Wright. 2003. Indifferent philosophy versus […]. Syst. Biodiv. 1: 3–11. doi:10.1017/S1477200003001063

Example: Milkweed butterflies

Page 36: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Oscillating meanings of the epithet hyalites – 1911 to 2003

Phenotypic diversity

Type-anchored name identity relations

Source: Vane-Wright. 2003. Indifferent philosophy versus […]. Syst. Biodiv. 1: 3–11. doi:10.1017/S1477200003001063

Page 37: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

# 5: Identify occurrence records only to TCLs

Records: EKY39235 MTSU003611 NCSC00040204 …

Records: BOON8098 CLEMS0061133 WILLI39399 …

Records: GMUF-0039355 IBE006808 USCH58399 …

Records: CONV0006268 MDKY00006482 NCU00038930 …

Records: BRYV0023582, BRYV0023584 KHD00032030, MISS0016604 MMNS000227, NCSC00040206 USMS_000002923, USMS_000002924 VSC0053223, VSC0065528 …

Records: ARIZ393087 DBG39049 USCH51217 …

Records: NCU00040710 USCH96248 VSC0053218 …

Records: CLEMS0012881 FUGR0003293 GA023130 …

Records: BOON8100 NCSC00040210 SJNM45487 …

Records: GA023144 LSU00012494 MISS0016608 …

Records: IBE006810, IND-0012374, MMNS000227

Records: NY8654

• Syntax (ID): Occurrence / organism is identified to TCL

"CLEMS0012881" is identified to

Cleistes divaricata sec. Smith et al. 2004

[additional ID metadata]

Page 38: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

DwC record with Identification metadata (BDJ)

# 5: DwC score keeping – ID metadata optional; > 50% realized

• ID ~ DwC: Identification, (date)identified(By), identificationReference

• SCAN: 4,715,277 of nearly 9 million records have ID metadata (52.5%)

• Enforcement…still also require use of TCLs

Page 39: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

# 6: Generate comprehensive, consistent RCC–5 alignments

• Euler/X is a toolkit that infers logically consistent RCC–5 alignments

Page 40: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

# 6: Generate comprehensive, consistent RCC–5 alignments

• Value-added: MIR – set of Maximally Informative Relations containing the RCC–5 articulation for every possible TCL pair → scalability

Reasoner inference

Page 41: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

# 7: Joining occurrence-to-TCL identifications & RCC–5 alignments

Records: BOON8098, CLEMS0061133, CONV0006268, EKY39235 GMUF-0039355, IBE006808, IBE006810, IND-0012374 MDKY00006482, MMNS000227, MTSU003611, NCSC00040204 NCU00038930, NY8654, USCH58399, WILLI39399 …

Records: ARIZ393087, BRYV0023582, BRYV0023584, DBG39049 KHD00032030, MISS0016604, MMNS00022, NCSC00040206 USMS_000002923, USMS_000002924, VSC0053223, VSC0065528 …

Records: BOON8100, CLEMS0012881, FUGR0003293 GA023130, GA023144, LSU00012494 MISS0016608, NCSC00040210, NCU00040710 SJNM45487, USCH96248, VSC0053218 …

• Specimen integration is fully driven by TCL-to-TCL RCC–5 signals
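The join between occurrence-to-TCL identifications and RCC–5 articulations can be sketched as below. All record IDs, TCLs, and articulations here are made up, and the function names are hypothetical; this is not the Euler/X pipeline. The assumed rule is that a record identified to TCL A belongs with certainty to a target TCL B only when A == B or A < B. When A > B or A >< B, the record is only a candidate and would need re-identification.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

INVERSE = {"==": "==", "<": ">", ">": "<", "><": "><", "!": "!"}

# Made-up identifications: record ID -> TCL.
identifications: Dict[str, str] = {
    "REC001": "Aus bus sec. Old 1968",
    "REC002": "Aus bus sec. Old 1968",
    "REC003": "Aus bus sec. New 2015",
    "REC004": "Aus cus sec. New 2015",
}

# Made-up articulations (left TCL, relation, right TCL), read left to right.
articulations: List[Tuple[str, str, str]] = [
    ("Aus bus sec. Old 1968", ">", "Aus bus sec. New 2015"),
    ("Aus bus sec. Old 1968", ">", "Aus cus sec. New 2015"),
]


def relations_from(tcl: str) -> Iterator[Tuple[str, str]]:
    """Yield (relation, other TCL), reading each articulation in both directions."""
    for left, rel, right in articulations:
        if left == tcl:
            yield rel, right
        elif right == tcl:
            yield INVERSE[rel], left


def translate(tcl: str, target_source: str) -> Tuple[List[str], List[str]]:
    """Return (certain, candidate) target TCLs for a record identified to tcl."""
    suffix = f" sec. {target_source}"
    if tcl.endswith(suffix):
        return [tcl], []
    certain, candidates = [], []
    for rel, other in relations_from(tcl):
        if not other.endswith(suffix):
            continue
        if rel in ("==", "<"):
            certain.append(other)      # concept lies entirely within the target
        elif rel in (">", "><"):
            candidates.append(other)   # only part of the concept maps to the target
    return certain, candidates


def integrate(target_source: str):
    """Group all records under the TCLs of one chosen target taxonomy."""
    resolved, ambiguous = defaultdict(list), defaultdict(list)
    for record_id, tcl in sorted(identifications.items()):
        certain, candidates = translate(tcl, target_source)
        for target in certain:
            resolved[target].append(record_id)
        for target in candidates:
            ambiguous[target].append(record_id)
    return dict(resolved), dict(ambiguous)


resolved_new, ambiguous_new = integrate("New 2015")
resolved_old, ambiguous_old = integrate("Old 1968")
```

Translating into the coarser toy taxonomy resolves every record, while translating into the finer one leaves the two older identifications ambiguous between the two narrower concepts.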

Page 42: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Source: Franz et al. 2016. Controlling the taxonomic variable […]. Research Ideas and Outcomes (RIO). (In Review)

The 'consensus' The 'bible'

The (formerly) federal 'standard'

The 'best', latest regional flora

"Controlling the taxonomic variable"

Impact: "Please select your preference (A – D); we can perform all translations"

Page 43: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

• We can now respond to queries such as:

• "Show all specimens identified to the taxonomic name Cleistes divaricata"

• Returns many records, resolving an incongruent lineage of name usages

# 8: "Do you trust us now?" Aggregation as a translational service

Page 44: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

• We can now respond to queries such as:

• "Show all specimens identified to the taxonomic name Cleistes divaricata"

• Returns many records, resolving an incongruent lineage of name usages

• "Now show specimens with the TCL Cleistesiopsis divaricata sec. Weakley 2015"

• Returns record subset resolving only one narrowly circumscribed concept

# 8: "Do you trust us now?" Aggregation as a translational service

Page 45: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

# 8: "Do you trust us now?" Aggregation as a translational service

• We can now respond to queries such as:

• "Show all specimens identified to the taxonomic name Cleistes divaricata"

• Returns many records, resolving an incongruent lineage of name usages

• "Now show specimens with the TCL Cleistesiopsis divaricata sec. Weakley 2015"

• Returns record subset resolving only one narrowly circumscribed concept

• "Now show specimens identified to the TCL Cleistes divaricata sec. RAB 1968,

yet translated into the more granular TCLs sec. Weakley 2015"

• Returns (again) many records, yet represents and contrasts two treatments,

as opposed to providing the ambiguous lineage view (above)

• "Show all specimens with ambiguous 2010/2015 TCL identifications…" (etc.)
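The difference between the first two query types can be sketched with a toy record table. The record IDs, names, and sources are made up, and `by_name` and `by_tcl` are hypothetical helpers. A name query matches on the name alone, whereas a TCL query also fixes the 'sec.' source.

```python
from typing import List, Tuple

# Made-up records: (record ID, taxonomic name, 'sec.' source of the identification).
records: List[Tuple[str, str, str]] = [
    ("REC001", "Aus bus", "Old 1968"),
    ("REC002", "Aus bus", "Mid 2004"),
    ("REC003", "Aus bus", "New 2015"),
    ("REC004", "Aus cus", "New 2015"),
]


def by_name(name: str) -> List[str]:
    """Name-based query: pools every usage of the name, whatever its source."""
    return [rec_id for rec_id, rec_name, _ in records if rec_name == name]


def by_tcl(name: str, source: str) -> List[str]:
    """TCL-based query: only records identified to this name sec. this source."""
    return [
        rec_id
        for rec_id, rec_name, rec_source in records
        if rec_name == name and rec_source == source
    ]


name_hits = by_name("Aus bus")
tcl_hits = by_tcl("Aus bus", "New 2015")
```

The name query pools three records identified under three different treatments, while the TCL query returns the single record tied to one circumscription.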

Page 46: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Conclusions – designing trusted biodiversity data services

• The Darwin Core standard for aggregating biodiversity data:

(1) Has under-utilized options for better representing taxonomic expertise

(2) Is part of a design paradigm that undermines the plurality of expertise

Page 47: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

• The Darwin Core standard for aggregating biodiversity data:

(1) Has under-utilized options for better representing taxonomic expertise

(2) Is part of a design paradigm that undermines the plurality of expertise

• We are developing new solutions – including TCLs, PC relations, RCC–5,

and scalable logic applications – that realize data aggregation via

translational services, without disrupting the formation of expert-licensed,

high-quality biodiversity data packages

Conclusions – designing trusted biodiversity data services

Page 48: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

• The Darwin Core standard for aggregating biodiversity data:

(1) Has under-utilized options for better representing taxonomic expertise

(2) Is part of a design paradigm that undermines the plurality of expertise

• We are developing new solutions – including TCLs, PC relations, RCC–5,

and scalable logic applications – that realize data aggregation via

translational services, without disrupting the formation of expert-licensed,

high-quality biodiversity data packages

• All of us – not just aggregators – "own" the responsibility of designing

systems where the plurality of taxonomic expertise is fairly accommodated

Conclusions – designing trusted biodiversity data services

Page 49: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Acknowledgments & links to products

• Cleistes use case: Alan Weakley (UNC)

• Euler/X toolkit: Shizhuo Yu (UC Davis)

• Data trajectories: Beckett Sterner (ASU)

• OBKMS design: Viktor Senderov (Pensoft)

• NSF DEB–1155984, DBI–1342595 (PI Franz)

• NSF IIS–118088, DBI–1147273 (PI Ludäscher)

• Euler/X code @ https://github.com/EulerProject/EulerX

• Franz et al. 2016. Two influential primate classifications logically aligned. Systematic Biology 65(4): 561–582. doi:10.1093/sysbio/syw023

Page 50: Franz et al ice 2016 addressing the name meaning drift challenge in open ended biodiversity information environments

Interested in exploring multi-taxonomy and/or multi-phylogeny alignments?

Please contact me.

[email protected]

@taxonbytes

https://biokic.asu.edu/