Data Curation and Provenance for the EUDAT Collaborative Data … · 2018-11-12 · Data Curation...

1
Data Curation and Provenance for the EUDAT Collaborative Data Infrastructure Alexander Atamas 1 , Rene van Horik 1 , Linda Reijnhoudt 1 , Javier Quinteros 2 , Vasily Bunakov 3 1 Data Archiving and Networked Services (DANS), The Hague, the Netherlands 2 GFZ German Research Centre for Geoscience, Potsdam, Germany 3 Science and Technology Facilities Council (STFC), Didcot, the United Kingdom Figure 1. Data workflow of the HERBADROP data pilot. Figure 2. Data workflow from the GEOFON pilot. HERBADROP offers an archival service for long-term preservation of herbarium specimen images and extracts information from those images. GEOFON is a seismological data centre providing long- term access for seismic waveforms as well as a data provision centre. B2SAFE service is used in the first step of the ingestion process into the EDTR B2SAFE transfer service is engaged to transmit existing images of herbarium specimens along with the associated metadata to the CINES repository B2HANDLE is used to manage and store Persistent Identifiers B2SAFE is employed to accomplish data management tasks B2SAFE actions are executed after new data is detected B2HANDLE is used to manage and store Persistent Identifiers T wo Use Cases as a Proof-of-Concept for Data Curation Activities Versioning of Research Data Provenance of Research Data Figure 4. Data workflow of the B2SAFE versioning. Versioning is required because research data often undergo rather intensive lifecycle. Individual data sets are modified, updated or recalculated. Hence, in the scientific community, there is a need for versioning functionality for proper data curation and provenance of research data. A version of digital object is understood as a timestamped copy of the object. test.txt__v1 PID URL EUDAT/CHECKSUM_TIMESTAMP EUDAT/PROFILE_VERSION EUDAT/CHECKSUM EUDAT/FIXED_CONTENT EUDAT/LATEST_VERSION EUDAT/WAS_DERIVED_FROM PID URL EUDAT/DO_VERSION_NUMBER EUDAT/FIXED_CONTENT EUDAT/CHECKSUM EUDAT/PROFILE_VERSION EUDAT/CHECKSUM_TIMESTAMP EUDAT/FIO EUDAT/VERSION_SOURCE EUDAT/REVISION_OF EUDAT/ROR EUDAT/WAS_DERIVED_FROM PID URL EUDAT/DO_VERSION_NUMBER EUDAT/FIXED_CONTENT EUDAT/CHECKSUM EUDAT/PROFILE_VERSION EUDAT/CHECKSUM_TIMESTAMP EUDAT/FIO EUDAT/VERSION_SOURCE EUDAT/REVISION_OF EUDAT/ROR test.txt test.txt__v2 Source file Version #1 Version #2 Versions are stored in a separate directory created only for storage of versions. Persistent Identifiers (PID) are employed to identify and access versions independently of physical storage location PIDs are generated and used as references within the framework B2HANDLE based on the handle system (http://www.handle.net) to reveal actual URL of a version. Every version is made cross-linked with the previous version for the sake of easy navigation between versions Figure 3. The template of B2SAFE replication provenance. It depicts entities as yellow ovals, activity as blue rectangle and agent as orange trapezium. The curly brackets denote the parameters used by the template. Provenance of research data is important for tracking origins, ownerships and modifications of data over the data lifecycle. The concept of provenance guarantees that the data sets made available for sharing and exchange are reliable and hence all data transformations and results obtained using the data sets could be reproduced. An HTTP template-based service collecting provenance information is developed to be used by any service from the EUDAT CDI portfolio The service defines an API for the clients to generate provenance data based on particular templates (Notation3 format) based on the PROV Ontology (PROV -O), which are made available by its operator The gathered provenance data then can be queried by any interested EUDAT service by using HTTP protocol

Transcript of Data Curation and Provenance for the EUDAT Collaborative Data … · 2018-11-12 · Data Curation...

Page 1: Data Curation and Provenance for the EUDAT Collaborative Data … · 2018-11-12 · Data Curation and Provenance for the EUDAT Collaborative Data Infrastructure Alexander Atamas 1,

Data Curation and Provenance for the EUDAT CollaborativeData Infrastructure

Alexander Atamas1, Rene van Horik1, Linda Reijnhoudt1, Javier Quinteros2, Vasily Bunakov3

1 Data Archiving and Networked Services (DANS), The Hague, the Netherlands2 GFZ German Research Centre for Geoscience, Potsdam, Germany

3 Science and Technology Facilities Council (STFC), Didcot, the United Kingdom

Figure 1. Data workflow of the HERBADROP data pilot. Figure 2. Data workflow from the GEOFON pilot.

HERBADROP offers an archival service for long-termpreservation of herbarium specimen images and extracts

information from those images.

GEOFON is a seismological data centre providing long-term access for seismic waveforms as well as a data

provision centre.

• B2SAFE service is used in the first step of the ingestion process into the EDTR• B2SAFE transfer service is engaged to transmit existing images of herbarium

specimens along with the associated metadata to the CINES repository• B2HANDLE is used to manage and store Persistent Identifiers

• B2SAFE is employed to accomplish data management tasks• B2SAFE actions are executed after new data is detected• B2HANDLE is used to manage and store Persistent Identifiers

Two Use Cases as a Proof-of-Concept for Data Curation Activities

Versioning of Research Data

Provenance of Research Data

Figure 4. Data workflow of the B2SAFE versioning.

Versioning is required because research data often undergo rather intensive lifecycle. Individual data sets are modified, updated or recalculated. Hence,in the scientific community, there is a need for versioning functionality for proper data curation and provenance of research data. A version of digitalobject is understood as a timestamped copy of the object.

test.txt__v1

PIDURLEUDAT/CHECKSUM_TIMESTAMPEUDAT/PROFILE_VERSIONEUDAT/CHECKSUMEUDAT/FIXED_CONTENTEUDAT/LATEST_VERSIONEUDAT/WAS_DERIVED_FROM

PIDURLEUDAT/DO_VERSION_NUMBEREUDAT/FIXED_CONTENTEUDAT/CHECKSUMEUDAT/PROFILE_VERSIONEUDAT/CHECKSUM_TIMESTAMPEUDAT/FIOEUDAT/VERSION_SOURCEEUDAT/REVISION_OFEUDAT/ROREUDAT/WAS_DERIVED_FROM

PIDURLEUDAT/DO_VERSION_NUMBEREUDAT/FIXED_CONTENTEUDAT/CHECKSUMEUDAT/PROFILE_VERSIONEUDAT/CHECKSUM_TIMESTAMPEUDAT/FIOEUDAT/VERSION_SOURCEEUDAT/REVISION_OFEUDAT/ROR

test.txt test.txt__v2

Source file Version #1 Version #2• Versions are stored in a separate directory created

only for storage of versions.• Persistent Identifiers (PID) are employed to

identify and access versions independently ofphysical storage location

• PIDs are generated and used as referenceswithin the framework B2HANDLE based onthe handle system (http://www.handle.net) toreveal actual URL of a version.

• Every version is made cross-linked with theprevious version for the sake of easynavigation between versions

Figure 3. The template of B2SAFE replication provenance. It depicts entities as yellow ovals, activity as blue rectangle and agent as orangetrapezium. The curly brackets denote the parameters used by the template.

Provenance of research data is important for tracking origins, ownerships andmodifications of data over the data lifecycle. The concept of provenanceguarantees that the data sets made available for sharing and exchange arereliable and hence all data transformations and results obtained using the datasets could be reproduced.

• An HTTP template-based service collecting provenance information isdeveloped to be used by any service from the EUDAT CDI portfolio

• The service defines an API for the clients to generate provenance databased on particular templates (Notation3 format) based on the PROVOntology (PROV-O), which are made available by its operator

• The gathered provenance data then can be queried by any interestedEUDAT service by using HTTP protocol