NISO Forum, Denver, Sept. 24, 2012: Opening Keynote: The Many and the One: BCE themes in 21st...


Page 1: NISO Forum, Denver, Sept. 24, 2012: Opening Keynote: The Many and the One: BCE themes in 21st century data curation

The Many and the One

BCE problems in 21st c. data curation

Tracking it Back to the Source: Managing and Citing Research Data

NISO Forum, Denver, Sept 24, 2012

Allen H. Renear

Graduate School of Library and Information Science

University of Illinois at Urbana-Champaign

Principal researchers of material presented: David Dubin, Karen M. Wickett, Simone Sacchi, Richard Urban, Allen H. Renear

Center for Informatics Research in Science and Scholarship

Graduate School of Library and Information Science

University of Illinois at Urbana-Champaign

NSF/OCI-ITR DataNet Award #0830976

IMLS/LB Award #RE-05-08-0062-08

Page 2

Problems, Problems, Problems

Identity problems: – Is this the data we think it is? Is it the same data as that data?

(involves issues of authenticity, integrity, encoding)

Meaning problems: – What is this data supposed to be telling us?

(involves interpreting the semantics of the data)

Relationship problems: – How is this data related to that data?

(involves issues of data provenance)

Integration problems: – How can I combine this data with other data?

(involves harmonizing conflicts at multiple levels)

Interoperation problems: – How can I get this data to work with my software?

(involves conversion to equivalent formats)

An issue underlying all of these is representation… how do digital files represent facts about the world?

Page 3

Identity Problems

Two scientists, Jill and John, used the same data.

What does that mean?

And how can we tell?

Page 4

Identity Problems

Compare:

Two scientists, Jill and John, used the same statistician.

Page 5

Identity Problems

Compare:

Two scientists, Jill and John, used the same centrifuge.

Page 6

Identity and Representation Levels

Consider two files with the

… same data,

but relational tables in one case

and RDF triples in another

… with the same data and the same RDF triples,

but an XML serialization in one case,

an N3 serialization in another

… with the same data, the same RDF triples, the same N3 serialization,

but UTF-8 character encoding in one case

and UTF-16 encoding in another

How many levels do we need? How do we define and manage them?

How can they be identified and re-identified?

Which identifier schemes for which level?
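The last contrast on this slide (same serialization, different character encoding) can be made concrete with a minimal Python sketch. It assumes one hypothetical N3 statement (the `ex:` names are invented for illustration) and uses a SHA-256 digest as a stand-in for a byte-level identifier; the helper `fingerprint` is an assumption of this sketch, not something proposed in the talk.

```python
import hashlib

# Hypothetical N3 statement; the ex: names are invented for illustration.
n3_text = 'ex:sample_31 ex:U22376 "408" .'


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest, used here as a byte-level identifier."""
    return hashlib.sha256(data).hexdigest()


# Two encodings of the same character sequence.
utf8_bytes = n3_text.encode("utf-8")
utf16_bytes = n3_text.encode("utf-16")

# Bit-stream level: the two encodings are different byte sequences,
# so a checksum-style identifier treats them as different objects.
same_bit_stream = fingerprint(utf8_bytes) == fingerprint(utf16_bytes)

# Character level: decoding each stream recovers the same N3 text.
same_characters = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16")

print("same bit stream:", same_bit_stream)
print("same characters:", same_characters)
```

An identifier computed from bytes answers the identity question at the lowest level only, which is one reason the slide asks which identifier scheme belongs to which level.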

Page 7

What is a dataset anyway?!

Maybe we should ask a scientist

They’ll have an answer, right?

Page 8

There are almost as many answers as scientists

Page 9

Cries from the heart

“the terms ‘Data Product’, ‘Data Set,’ and ‘Version’ are overlaid with multiple meanings between communities.”

(Barkstrom, 2009)

“There is ambiguity in what type of object a dataset is; with different groups of users applying different connotations. There needs to be an explicit statement of what the intended preservation of a dataset will imply.”

(Pepler, 2008)

Page 10

Forcing us to conclude…

No single object can possibly have all those attributes

Therefore it is impossible to give the common colloquial notion of dataset a precise definition

It must instead be replaced by a family of new more specific concepts

Sound familiar?

Page 11

FRBR

Page 12

A FRBR inspired solution

FRBR eliminates the ordinary “book” from our world

The ordinary “book” can be simultaneously about chordata, in French, typeset in neo-Bauhaus, mustard-stained

but FRBR replaces the book with four objects

the work is about chordata, the expression is in French, the manifestation is typeset in neo-Bauhaus, the item is mustard-stained

Page 13

FRBR entities and attributes

Work: “an … intellectual or artistic creation”

Expression: “the … realization of a work … notation … etc.”

Manifestation: “the physical embodiment of an expression of a work”.

Item: “a single exemplar of a manifestation”

Attribute assignments characteristically disjoint

A work may have a subject.

It does not have a language, typeface, or condition.

An expression may have a language;

It does not have a subject (or a typeface or a condition).

A manifestation may have a typeface.

It does not have a subject or a language (or a condition).

An item may have a condition.

It does not have a subject, language, or typeface.
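The disjoint attribute assignments above can be sketched as four Python dataclasses, each carrying only the attribute the slide allows it. This is a minimal illustration, not a FRBR implementation: the class and field names are assumptions, and each entity links to exactly one entity on the level above, which simplifies the real FRBR relationships.

```python
from dataclasses import dataclass, fields


@dataclass
class Work:
    subject: str  # only a work has a subject


@dataclass
class Expression:
    work: Work
    language: str  # only an expression has a language


@dataclass
class Manifestation:
    expression: Expression
    typeface: str  # only a manifestation has a typeface


@dataclass
class Item:
    manifestation: Manifestation
    condition: str  # only an item has a condition


# Field names that are links to another entity rather than descriptive attributes.
LINKS = {"work", "expression", "manifestation"}


def own_attributes(entity) -> set:
    """Return the descriptive attribute names defined directly on an entity."""
    return {f.name for f in fields(entity)} - LINKS


# The slide's example: the one ordinary "book" becomes four objects.
work = Work(subject="chordata")
expression = Expression(work=work, language="French")
manifestation = Manifestation(expression=expression, typeface="neo-Bauhaus")
item = Item(manifestation=manifestation, condition="mustard-stained")

for entity in (work, expression, manifestation, item):
    print(type(entity).__name__, sorted(own_attributes(entity)))
```

Asking the item for its subject then requires an explicit walk up the chain (`item.manifestation.expression.work.subject`), which makes visible the level at which each attribute actually lives.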

Page 14

Entities? Really?

Aren’t some of those rectangles just nominalized relationships?

Page 15

Ambiguities

Is

<object name="sample_31"><feature name="U22376" value="408" /><feature name="X59417" value="1784" />

An expression?

Is “00001011” an expression?
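A short Python sketch shows why the bit-string question is hard to answer: the same eight symbols support several readings, and nothing in the string itself selects one. The three readings chosen here (unsigned integer, ASCII character, row of flags) are assumptions made for illustration.

```python
bits = "00001011"

# Reading 1: an unsigned binary integer.
as_integer = int(bits, 2)

# Reading 2: a single byte decoded as an ASCII character (a control character here).
as_character = bytes([as_integer]).decode("ascii")

# Reading 3: eight independent true/false flags.
as_flags = [bit == "1" for bit in bits]

print(as_integer)
print(repr(as_character))
print(as_flags)
```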

Page 16

FRBR Refactored

[Diagram: Story; Symbol Structure; Symbol Structure; Matter & Energy; each adjacent pair linked by a many-to-many (M:M) relationship]

Page 17

FRBR refactored and applied to datasets

Instantiation level

[Semantic Level]

[Syntax Level] [Encoding levels]

Based on the Systematic Assertion Model (SAM) for modeling datasets, developed by David Dubin et al.

C1: observations

expressed by…

S1: RDF triples

encoded by…

S2: N3 statements

encoded by …

S3: Unicode characters

encoded by…

S4: UTF-8 bit streams

inscribed in…

M1: RAID array state

All M:M
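One way to read “All M:M” is that neither side of a link between levels determines the other. The Python sketch below illustrates that reading at two of the links. It is not the Systematic Assertion Model itself: the `ex:` prefix, the `serialize` and `parse` helpers, and the simplified one-statement-per-line N3 are all assumptions made for illustration.

```python
# S1: a set of triples (feature names borrowed from the sample on the Ambiguities slide).
triples = {
    ("ex:sample_31", "ex:U22376", "408"),
    ("ex:sample_31", "ex:X59417", "1784"),
}


def serialize(ts, reverse=False):
    """S1 -> S2: write each triple as one simplified N3 statement per line."""
    ordered = sorted(ts, reverse=reverse)
    return "\n".join(f'{s} {p} "{o}" .' for s, p, o in ordered)


def parse(text):
    """S2 -> S1: recover the triple set from the simplified N3 text."""
    result = set()
    for line in text.splitlines():
        s, p, o, _dot = line.split()
        result.add((s, p, o.strip('"')))
    return result


# One triple set, many N3 serializations: statement order does not matter.
n3_a = serialize(triples)
n3_b = serialize(triples, reverse=True)
print(n3_a != n3_b, parse(n3_a) == parse(n3_b) == triples)

# One bit stream, many character strings: the reading depends on the
# encoding assumed when the bytes are decoded.
raw = "é".encode("latin-1")
print(raw.decode("latin-1") != raw.decode("cp437"))
```

Because each link is many-to-many, an identifier assigned at one level cannot be derived mechanically from an identifier at another.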

Page 18

Identifiers

What do we identify with identifiers?

An entity?

Content

Symbol structures

Patterned matter and energy

A nominalized relationship?

How do we confirm identification?

Page 19

Identification

How do we identify an expression?

How do we identify an encoding?

How do we identify the data?

On the practical side we do this every day

On the theoretical side it is very difficult to usefully formalize.

Page 20

Identity and change problems in Planets

From the Planets Conceptual Data Model, Sharpe et al. (2006)

Page 21

Identity and change problems in Planets

• A file is a bitstream

• A file can be modified

• But a bitstream cannot be modified.
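The tension among these three statements can be reproduced in a few lines of Python, where a `bytes` value plays the bitstream. The `File` class below is hypothetical and is not taken from the Planets model; it only illustrates one possible reading, in which a file has a bitstream at a given time rather than being identical to one.

```python
from dataclasses import dataclass, field


@dataclass
class File:
    """Hypothetical file: a named thing that persists while its content changes."""
    name: str
    history: list = field(default_factory=list)

    def write(self, bitstream: bytes) -> None:
        # "Modifying" the file associates it with a new bitstream.
        self.history.append(bitstream)

    @property
    def current(self) -> bytes:
        return self.history[-1]


f = File("observations.n3")
f.write(b"\x0b")
first = f.current

f.write(b"\x0b\x0c")

# The file changed, but the earlier bitstream is exactly what it was.
print(first, f.current)

# A bitstream cannot be altered in place: Python bytes are immutable.
try:
    first[0] = 0
except TypeError as err:
    print("cannot modify a bitstream:", err)
```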

Credits to Dave Dubin, Simone Sacchi, Karen Wickett. Data Concepts Group, Data Conservancy (NSF/OCI-ITR DataNet Award #0830976)

Page 22

Center for Informatics Research in Science and Scholarship (CIRSS)

Graduate School of Library and Information Science

University of Illinois at Urbana-Champaign

Director: Carole L Palmer

Associate Director: Cathy Blake

c. 12 affiliated GSLIS faculty; 8 PhD students.

CIRSS research groups:

Data Practices: social science of information work

Socio-Technical Data Analytics: algorithms + people

*Data Concepts: modeling for integration/computation

Professional Education:

Data curation specialization within an ALA-accredited LIS program

Other options are being planned

Page 23

CIRSS Data Concepts Group

Rationale

Integration and interoperability require robust formal conceptual models for scientific data

Especially if semantic technologies are going to be exploited.

Our current models aren’t good enough

Mission

The Data Concepts Group takes a logic-based approach to solving conceptual modeling problems in scientific data curation

Page 24

This research is being carried out by the Data Concepts Group at the Center for Informatics Research in Science and Scholarship (CIRSS) at the University of Illinois at Urbana-Champaign, Carole L. Palmer, Director.

Principal contributors include David Dubin, Karen M. Wickett, Simone Sacchi, Richard Urban, Allen H. Renear

Questions?

NSF/OCI-ITR DataNet Award #0830976

IMLS/LB Award #RE-05-08-0062-08