NISO Forum, Denver, Sept. 24, 2012: Opening Keynote: The Many and the One: BCE themes in 21st...
The Many and the One
BCE problems in 21st c. data curation
Tracking it Back to the Source: Managing and Citing Research Data
NISO Forum, Denver, Sept 24, 2012
Allen H. Renear
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
Principal researchers of material presented: David Dubin, Karen M. Wickett, Simone Sacchi, Richard Urban, Allen H. Renear
Center for Informatics Research in Science and Scholarship
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
NSF/OCI-ITR DataNet Award #0830976
IMLS/LB Award #RE-05-08-0062-08
Problems, Problems, Problems
Identity problems: – Is this the data we think it is? Is it the same data as that data?
(involves issues of authenticity, integrity, encoding)
Meaning problems: – What is this data supposed to be telling us?
(involves interpreting the semantics of the data)
Relationship problems: – How is this data related to that data?
(involves issues of data provenance)
Integration problems: – How can I combine this data with other data?
(involves harmonizing conflicts at multiple levels)
Interoperation problems: – How can I get this data to work with my software?
(involves conversion to equivalent formats)
An issue underlying all of these is representation: how do digital files represent facts about the world?
Identity Problems
Two scientists, Jill and John, used the same data.
What does that mean?
And how can we tell?
Identity Problems
Compare:
Two scientists, Jill and John, used the same statistician.
Identity Problems
Compare:
Two scientists, Jill and John, used the same centrifuge.
Identity and Representation Levels
Consider two files with the
… same data,
but relational tables in one case
and RDF triples in another
… with the same data and the same RDF triples,
but an XML serialization in one case,
an N3 serialization in another
… with the same data, the same RDF triples, the same N3 serialization,
but UTF-8 character encoding in one case
and UTF-16 encoding in another
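The last comparison can be illustrated with a minimal Python sketch. The sample statement and the helper name `fingerprint` are assumptions made for illustration and do not come from the talk. The point is only that a byte-level checksum tells apart two files that agree at every higher level.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest, i.e. an identifier at the bit-stream level."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical N3 statement; both files hold exactly these characters.
statement = '<http://example.org/jill> <http://example.org/used> "dataset-1" .\n'

file_a = statement.encode("utf-8")   # UTF-8 character encoding
file_b = statement.encode("utf-16")  # UTF-16 character encoding (with byte-order mark)

# The bit streams differ, so a byte-level fingerprint treats them as different.
same_bits = fingerprint(file_a) == fingerprint(file_b)

# Decoding recovers the same character sequence from both files.
same_characters = file_a.decode("utf-8") == file_b.decode("utf-16")

print("same bits:", same_bits)
print("same characters:", same_characters)
```

A checksum is therefore an identifier for one level only. Asking whether two files hold "the same data" needs a comparison made at the level the question is about.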
How many levels do we need? How do we define and manage them?
How can they be identified and re-identified?
Which identifier schemes for which level?
What is a dataset anyway?!
Maybe we should ask a scientist
They’ll have an answer, right?
There are almost as many answers as scientists
Cries from the heart
“the terms ‘Data Product’, ‘Data Set,’ and ‘Version’ are overlaid with multiple meanings between communities.”
(Barkstrom, 2009)
“There is ambiguity in what type of object a dataset is; with different groups of users applying different connotations. There needs to be an explicit statement of what the intended preservation of a dataset will imply.”
(Pepler, 2008)
Forcing us to conclude…
No single object can possibly have all those attributes
Therefore it is impossible to give the common colloquial notion of dataset a precise definition
It must instead be replaced by a family of new more specific concepts
Sound familiar?
FRBR
A FRBR inspired solution
FRBR eliminates the ordinary “book” from our world
The ordinary “book” can be simultaneously about chordata, in French, typeset in neo-Bauhaus, and mustard-stained
but FRBR replaces the book with four objects
the work is about chordata, the expression is in French, the manifestation is typeset in neo-Bauhaus, the item is mustard-stained
FRBR entities and attributes
Work: “an … intellectual or artistic creation”
Expression: “the … realization of a work … notation … etc.”
Manifestation: “the physical embodiment of an expression of a work”.
Item: “a single exemplar of a manifestation”
Attribute assignments characteristically disjoint
A work may have a subject.
It does not have a language, typeface, or condition.
An expression may have a language;
It does not have a subject (or a typeface or a condition).
A manifestation may have a typeface.
It does not have a subject or a language (or a condition).
An item may have a condition.
It does not have a subject, language, or typeface.
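The disjoint attribute assignments above can be sketched as four Python dataclasses. This is an illustrative reading, not a FRBR implementation. The class layout, the helper `own_attributes`, and the single-valued links between levels are simplifying assumptions.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Work:
    subject: Optional[str] = None        # only a work has a subject

@dataclass
class Expression:
    work: Work                           # link to the level above (simplified to one)
    language: Optional[str] = None       # only an expression has a language

@dataclass
class Manifestation:
    expression: Expression
    typeface: Optional[str] = None       # only a manifestation has a typeface

@dataclass
class Item:
    manifestation: Manifestation
    condition: Optional[str] = None      # only an item has a condition

def own_attributes(cls) -> set:
    """Names of the descriptive attributes of a level, excluding links to other levels."""
    links = {"work", "expression", "manifestation"}
    return {f.name for f in fields(cls)} - links

# The ordinary "book" from the earlier slide, split into four objects.
item = Item(
    Manifestation(Expression(Work("chordata"), "French"), "neo-Bauhaus"),
    "mustard-stained",
)

# Check that no attribute is assigned at more than one level.
attribute_sets = [own_attributes(c) for c in (Work, Expression, Manifestation, Item)]
disjoint = all(
    a.isdisjoint(b)
    for i, a in enumerate(attribute_sets)
    for b in attribute_sets[i + 1:]
)
print("attribute sets disjoint:", disjoint)
print("item has a language of its own:", hasattr(item, "language"))
```

In this sketch the language of the mustard-stained copy is reachable only by following links, as `item.manifestation.expression.language`. The item itself has none.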
Entities? Really?
Aren’t some of those rectangles just nominalized relationships?
Ambiguities
Is
<object name="sample_31"><feature name="U22376" value="408" /><feature name="X59417" value="1784" />
An expression?
Is “00001011” an expression?
FRBR Refactored
[Diagram: four levels (Story; Symbol Structure; Symbol Structure; Matter & Energy), with three many-to-many (M:M) relationships]
FRBR refactored and applied to datasets
Instantiation level
[Semantic Level]
[Syntax Level] [Encoding levels]
Based on the Systematic Assertion Model (SAM) for modeling datasets, developed by David Dubin et al.
C1: observations
expressed by…
S1: RDF triples
encoded by…
S2: N3 statements
encoded by …
S3: Unicode characters
encoded by…
S4: UTF-8 bit streams
inscribed in…
M1: RAID array state
All M:M
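The chain of levels above can be traced in a short Python sketch. It is a toy illustration under stated assumptions, not the Systematic Assertion Model itself. The example.org URIs, the helpers `to_n3` and `from_n3`, and the restricted statement syntax are hypothetical. The feature names are borrowed from the XML sample on the earlier slide.

```python
import re

def to_n3(triples):
    """Serialize triples as simple N3 statements, one per line."""
    return "".join(f'<{s}> <{p}> "{o}" .\n' for s, p, o in triples)

def from_n3(text):
    """Parse the restricted statement form written by to_n3 back into a set of triples."""
    pattern = re.compile(r'<([^>]*)> <([^>]*)> "([^"]*)" \.')
    return set(pattern.findall(text))

# S1: RDF triples expressing two (hypothetical) observations.
triples = [
    ("http://example.org/sample_31", "http://example.org/U22376", "408"),
    ("http://example.org/sample_31", "http://example.org/X59417", "1784"),
]

# S2: two different N3 texts for the same triples (statement order differs).
n3_a = to_n3(triples)
n3_b = to_n3(list(reversed(triples)))

# S3 to S4: the Unicode characters of each text encoded as UTF-8 bit streams.
bits_a = n3_a.encode("utf-8")
bits_b = n3_b.encode("utf-8")

# Different at the statement and bit-stream levels, identical at the triple level.
same_statements = n3_a == n3_b
same_triples = from_n3(bits_a.decode("utf-8")) == from_n3(bits_b.decode("utf-8"))

print("same N3 statements:", same_statements)
print("same RDF triples:", same_triples)
```

The sketch shows only one direction of the many-to-many relationships: one set of triples with several serializations. The other direction, one symbol structure expressing different content under different interpretations, is not modeled here.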
Identifiers
What do we identify with identifiers?
An entity?
Content
Symbol structures
Patterned matter and energy
A nominalized relationship?
How do we confirm identification?
Identification
How do we identify an expression?
How do we identify an encoding?
How do we identify the data?
On the practical side we do this every day
On the theoretical side it is very difficult to usefully formalize.
Identity and change problems in Planets
From the Planets Conceptual Data Model, Sharpe et al. (2006)
Identity and change problems in Planets
• A file is a bitstream
• A file can be modified
• But a bitstream cannot be modified.
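The tension among these three statements can be shown by analogy in Python. This is a sketch only and is not drawn from the Planets model. It assumes that an immutable `bytes` value stands in for a bitstream and a filesystem path stands in for a file. The file name `dataset.bin` is hypothetical.

```python
import os
import tempfile

# A bitstream, modeled as an immutable bytes value.
bitstream = b"\x0b"

try:
    bitstream[0] = 0x0C          # attempt to modify the bitstream in place
    mutated = True
except TypeError:
    mutated = False              # bytes objects do not support item assignment

# A file, modeled as a path whose contents can be replaced over time.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "dataset.bin")

    with open(path, "wb") as f:
        f.write(b"\x0b")
    with open(path, "rb") as f:
        before = f.read()

    with open(path, "wb") as f:  # "modify" the file
        f.write(b"\x0c")
    with open(path, "rb") as f:
        after = f.read()

print("bitstream modified in place:", mutated)
print("same file, same bitstream:", before == after)
```

On this reading, modifying the file replaces one bitstream with another under the same name. That suggests the file and the bitstream cannot simply be identified with each other.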
Credits to Dave Dubin, Simone Sacchi, Karen Wickett. Data Concepts Group, Data Conservancy (NSF/OCI-ITR DataNet Award #0830976)
Center for Informatics Research in Science and Scholarship (CIRSS)
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
Director: Carole L Palmer
Associate Director: Cathy Blake
c. 12 affiliated GSLIS faculty; 8 PhD students.
CIRSS research groups:
Data Practices: social science of information work
Socio-Technical Data Analytics: algorithms + people
*Data Concepts: modeling for integration/computation
Professional Education:
Data curation specialization within an ALA-accredited LIS program
Other options are being planned
CIRSS Data Concepts Group
Rationale
Integration and interoperability require robust formal conceptual models for scientific data
Especially if semantic technologies are going to be exploited.
Our current models aren’t good enough
Mission
The Data Concepts Group takes a logic-based approach to solving conceptual modeling problems in scientific data curation
This research is being carried out by the Data Concepts Group at the Center for Informatics Research in Science and Scholarship (CIRSS) at the University of Illinois at Urbana-Champaign, Carole L. Palmer, Director.
Principal contributors include David Dubin, Karen M. Wickett, Simone Sacchi, Richard Urban, Allen H. Renear
Questions?
NSF/OCI-ITR DataNet Award #0830976
IMLS/LB Award #RE-05-08-0062-08