University Library.
Report to the University of Sheffield Research Data Management Service Delivery Group
Briefing Paper on RDM Technical Infrastructure
Date: 05/11/2014
Author: John A. Lewis
The provision of a technical infrastructure for RDM is intended to satisfy the researchers’ RDM needs
and to make the work of the researcher and institution easier. An appropriately designed technical
infrastructure will help researchers to achieve good research practice simply by utilising the facilities
(without thinking about RDM), to fulfil their obligations to the institution and funders without extra
work, and to have all processes involved work together seamlessly.
The Research Data Life-cycle
The Research Data Life-cycle is a means of visualising the flow of research data, and the processes
involved, during a generic research project (see Fig.1). At various points during the life-cycle, provision of
appropriate technical infrastructure may greatly benefit researchers’ workflows:
1. Research Planning
It is widely accepted that RDM represents good research practice, and as such the creation of a Data
Management Plan (DMP) offers the researcher an opportunity to determine the most appropriate
data management procedures to apply during and after a research project. Creating a DMP will help
the researcher avoid errors, particularly research data loss disasters, and time-consuming
retrospective data management. A DMP will benefit the overall planning of the research project.
2. Active Research Data Management
‘Active’ or ‘live’ data are the research data created or collected, and processed or derived, during the
active phase of the research project. Once underway, a research project will collect or create raw
data which, during the project, will usually be processed to create derived or processed data. There
may be many iterations of processing, resulting in many sets of derived data and,
eventually, a set of ‘results’ data selected as the basis of the research publication(s) output by the
project. All these sets of data can be considered active data, which need to be quickly accessible
and easily shared between collaborators, under stringent security arrangements.
3. Data Documentation
In order to identify, organise, find and retrieve active data, appropriate documentation is essential.
Data documentation also indicates the conditions and processes involved in the creation or
collection of the data, the processing of the data and the context of the research. Detailed
documentation is essential for verification and reuse. Metadata are a highly structured subset of
core data documentation. Metadata are structured so that they may be indexed and stored within a
database, thereby facilitating data organisation and discovery, and machine-to-machine
interoperability. Metadata are necessary to determine provenance, licensing and access
arrangements, and preservation requirements, and for discovery and citation.
Ideally, the ‘metadata capture’ process (‘data cataloguing’) should be automated where possible to
reduce the amount of manual annotation required of researchers. This reduces ‘double-keying’,
which is frustrating for researchers, and also the number of errors inevitably introduced through
manual input.
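As a minimal sketch of how structured metadata can be indexed and stored within a database for discovery, the example below loads a few records into an in-memory SQLite table and searches them by keyword. The table layout, field names and sample records are hypothetical, chosen only for illustration, and are not taken from any particular metadata standard.

```python
import sqlite3

# Hypothetical sample records; the field names are illustrative only.
SAMPLE_RECORDS = [
    {
        "identifier": "ds-001",
        "title": "Electron microscopy images, sample set A",
        "creator": "Example Researcher",
        "licence": "CC-BY",
        "keywords": "microscopy;imaging",
    },
    {
        "identifier": "ds-002",
        "title": "Survey responses, wave 1",
        "creator": "Another Researcher",
        "licence": "restricted",
        "keywords": "survey;social science",
    },
]


def build_index(records):
    """Load metadata records (dicts) into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dataset ("
        " identifier TEXT PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " creator TEXT NOT NULL,"
        " licence TEXT,"
        " keywords TEXT)"
    )
    conn.executemany(
        "INSERT INTO dataset VALUES "
        "(:identifier, :title, :creator, :licence, :keywords)",
        records,
    )
    return conn


def find_by_keyword(conn, keyword):
    """Return identifiers of datasets whose keyword field contains the term."""
    rows = conn.execute(
        "SELECT identifier FROM dataset WHERE keywords LIKE ? ORDER BY identifier",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    index = build_index(SAMPLE_RECORDS)
    print(find_by_keyword(index, "microscopy"))
```

Because every record carries the same named fields, the same table could also be queried by creator or licence without reading the data files themselves.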
4. Data Selection
Because it is not feasible to preserve all the research data produced during a project, for reasons of
cost and discoverability, a process of data selection / appraisal needs to be carried out.
Preservation of some research data may be a condition of funding. The responsibility for this process
will lie with a Data Librarian or Archivist. Data may need ‘cleaning’, by editing corrupt or incorrect
elements, to ensure integrity. Data not selected for preservation will need to be deleted in an appropriate
manner.
5. Data Archiving
Data selected for preservation will usually be ingested into a Digital Repository where they may be
actively preserved (or curated), ensuring they remain immutable (they must never change) but accessible
(in usable formats) in the long term (beyond 10 years). ‘Read only’ access is required and slower
access times will be acceptable. There may be no request for access for long periods of time, if ever. A
copy of the dataset may be held in a local cache for quick access; otherwise an access copy will be
requested from archive storage.
6. Data Publishing
Datasets are published by making the associated metadata records available for discovery through a
catalogue and making the files available for download from appropriate storage. Whereas the
metadata will probably be openly accessible, there may be access conditions in place to control
access to the data themselves. Further processing of the data, such as anonymisation or redaction,
may be necessary to make them publicly accessible.
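The redaction step mentioned above can be sketched very simply for tabular data: fields judged unsuitable for release are dropped from each row before the public copy is made. The function name and the sample field names are hypothetical, and removing direct identifiers in this way is only one part of anonymisation; it does not by itself guarantee that individuals cannot be re-identified.

```python
def redact_rows(rows, restricted_fields):
    """Return copies of the rows with the restricted fields removed.

    The input rows are left unchanged, so the unredacted version can
    stay in controlled-access storage.
    """
    restricted = set(restricted_fields)
    return [
        {key: value for key, value in row.items() if key not in restricted}
        for row in rows
    ]


if __name__ == "__main__":
    # Hypothetical survey rows; 'name' and 'postcode' are treated as restricted.
    original = [
        {"name": "A. Person", "postcode": "S10", "age_band": "30-39", "score": 7},
        {"name": "B. Person", "postcode": "S11", "age_band": "40-49", "score": 5},
    ]
    print(redact_rows(original, ["name", "postcode"]))
```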
7. Data Discovery and Reuse
In many disciplines, research data may now be considered a primary research output, to be
discovered, reused and cited, and to achieve impact. Such published data may be processed and
manipulated in ways different from those of the original creator / collector, or combined with other
datasets to derive further results and conclusions. Thus the funder achieves more return on their
original funding, the researcher achieves more impact from their research, and researchers
waste less time and effort duplicating research.
Institutional RDM Policy & Funder Requirements
In line with the RCUK common principles on data policy [1], the EPSRC Expectations [2] of organisations
receiving EPSRC funding include the requirements that the organisation will:
- Publish appropriately structured metadata describing the research data they hold;
  therefore the institution must create a public data catalogue.
- Ensure that EPSRC-funded data are securely preserved for a minimum of ten years;
  therefore the institution must create a data archive.
- Ensure that effective data curation is provided throughout the full data lifecycle;
  therefore the institution must provide the necessary human and technical infrastructure.
Institutions in receipt of EPSRC funding are expected to be compliant with these expectations by 1st
May 2015. The University of Sheffield Research Data Management Policy [3] was developed in
response to the RCUK principles and EPSRC expectations. It states that the University will develop
infrastructure and services to support research data management in consultation with researchers.
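The ten-year minimum preservation period lends itself to a small worked calculation. The sketch below assumes that the period is counted from a single start date supplied by the caller; the function name is hypothetical, and the funder's own rules on when the period begins should be checked before relying on any such date.

```python
from datetime import date


def earliest_disposal_date(preserved_from, years=10):
    """Return the first date on which the minimum retention period has elapsed.

    Assumes the period is counted from 'preserved_from'.
    """
    try:
        return preserved_from.replace(year=preserved_from.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year.
        return preserved_from.replace(year=preserved_from.year + years, day=28)


if __name__ == "__main__":
    print(earliest_disposal_date(date(2015, 5, 1)))
```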
Functional Components of the Technical Infrastructure
Many implementations of RDM technical infrastructure have involved an overlap in the functions
provided by the different components (see Fig.2). Many components have been designed around the
functional requirements determined by researcher workflows. However, most implementations have
had to take into account existing systems, and modifications have been engineered to ensure
interoperability. The major functional requirement of every component is interoperability,
achieved chiefly by adherence to data and metadata standards.
1. CRIS
Depending upon the particular system and configuration, the Current Research Information System (CRIS)
holds information about the researcher, project, grant and funder. The CRIS will be interoperable
with the HR system, the Grants / Awards management system and possibly other systems such as
research costing, facilities management and the institutional CMS. The CRIS provides a register
(an inward-facing catalogue) of the researcher’s published outputs and, if interoperable, may act as a
means to deposit publications into the Institutional Repository. The CRIS may feasibly be used to push
metadata-only records of research data to the Institutional Repository or, in itself, function as a
catalogue of the institution’s published research data outputs. The ‘Pure’ CRIS offers a public-facing
catalogue facility, the ‘Pure Portal’, which can function as an Institutional Repository.
2. Data Management Planning Tools
These are tools that help researchers compile the DMP that some funding organisations require
to be submitted as part of the grant application process. Such tools (for example the DCC’s DMPonline
tool) provide templates for the different major funders and may be customised for the institution.
The tools may be accessed via institutional login, and the DMPs created may be stored in the institutional
CRIS, in the Research Management System or by a DMP service.
3. Data and Metadata Capture
Metadata capture (or data cataloguing) may be accomplished simply by providing an interface for
researchers to fill out online forms. Automatic metadata capture, concurrent with data capture, may
be facilitated by using appropriate instruments and equipment that save data to the laboratory,
departmental or facility file store or to the institutional network. Electronic lab books, electron
microscopes and other imaging instruments, and genetic sequencing and analysis instruments usually
have attached local storage or may feed data to a project-based Laboratory Information
Management System (LIMS). Ideally, data and metadata need to be transferred to central active
data storage.
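A minimal sketch of automatic metadata capture at the file level is given below: when an instrument or researcher saves a file, basic technical metadata can be read from the file system rather than typed in by hand. The function name and the choice of fields are assumptions made for illustration; a real capture service would add project and instrument details from other sources.

```python
import hashlib
import mimetypes
import os
from datetime import datetime, timezone


def capture_file_metadata(path):
    """Collect basic technical metadata for one file without manual input."""
    stat = os.stat(path)

    # Read the file in chunks so that large data files need not fit in memory.
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            sha256.update(chunk)

    # The MIME type is only a guess based on the file extension.
    mime_type, _encoding = mimetypes.guess_type(path)

    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": stat.st_size,
        "modified_utc": modified.isoformat(),
        "mime_type": mime_type or "application/octet-stream",
        "sha256": sha256.hexdigest(),
    }
```

The resulting dictionary could be sent to the data registry alongside the file itself, so that the researcher only has to supply the descriptive fields that cannot be derived automatically.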
4. Active Data Storage (including HPC Grid storage)
Active research data need to be rapidly accessible and easily shared between collaborators, with access
controlled through stringent security arrangements. It may be necessary to differentiate
between data requiring ‘Read-Write’ access and data requiring ‘Read-only’ access, in order to provide
cost-effective storage. Working datasets, which change constantly (as they are being created, added to,
processed and edited), will require Read-Write access and frequent back-up, and may require large
computational resources. However, much ‘active data’ will be immutable and require only Read-only access,
and so may be moved to ‘Archive data’ storage.
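One way to picture the Read-Write / Read-only distinction is as a simple rule applied to each dataset. The sketch below suggests a storage class from the time since the data were last modified; the 90-day threshold and the function name are hypothetical, and an institution would set its own criteria.

```python
from datetime import datetime, timedelta


def choose_storage_tier(last_modified, now, idle_threshold=timedelta(days=90)):
    """Suggest 'read-write' for recently changed data, otherwise 'read-only'.

    The default threshold of 90 days is an illustrative assumption.
    """
    if now - last_modified < idle_threshold:
        return "read-write"
    return "read-only"


if __name__ == "__main__":
    today = datetime(2014, 11, 5)
    print(choose_storage_tier(datetime(2014, 10, 1), today))
    print(choose_storage_tier(datetime(2014, 1, 1), today))
```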
5. Active Data Management / Collaboration Management
At the University of Sheffield, collaborative computing is provided through the shared
(departmental) networked drives, the HPC Grid service and Google Drive (Google Apps for Education
cloud storage). Collaborative computing systems, such as HPC Grids, Dropbox-like cloud storage
services, Virtual Research Environments (VRE) and Laboratory Information Management Systems
(LIMS), have been developed to accommodate the need for secure, read-write access to shared
storage. Active data management systems may be considered to comprise three functional
components: a storage layer, a data registry (or metadata store or asset registry) database layer and
a user interface layer. In some cases these components will be integrated into a single system; in
other cases, the metadata may be handled by the CRIS and storage by the institutional network.
6. Research Data Selection / Deposit Facility
The institution may provide a service to help researchers appraise their data, assess the preservation
requirements, help with submission to the institutional repository and help with submission to
external repositories.
7. Archive Data Storage and Digital Preservation
Many archive data storage arrangements distinguish between permanent archival storage, possibly
provided through an external service, and operational storage on a local server, which holds ingested
files for processing and access copies of the data. This distinction is due to the slow access speed and higher
cost of retrieval from archival storage. Control of the system will be mediated by a storage resource
broker.
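The split between archival and operational storage can be sketched as a broker that serves access copies from a small local cache and goes to the slower archive only when it must. The class below is an illustration under stated assumptions: a dictionary stands in for the archival service, the cache evicts the least recently used dataset, and none of the names correspond to a real storage resource broker product.

```python
from collections import OrderedDict


class StorageBroker:
    """Serve access copies from a local cache, falling back to the archive."""

    def __init__(self, archive, cache_capacity=2):
        self._archive = archive          # stand-in for slow, costly archival storage
        self._capacity = cache_capacity  # number of datasets kept on the local server
        self._cache = OrderedDict()      # operational storage holding access copies
        self.archive_reads = 0           # counts the expensive retrievals

    def get_access_copy(self, dataset_id):
        if dataset_id in self._cache:
            # Cache hit: mark as most recently used and return at once.
            self._cache.move_to_end(dataset_id)
            return self._cache[dataset_id]

        # Cache miss: retrieve from the archive and keep a local access copy.
        data = self._archive[dataset_id]
        self.archive_reads += 1
        self._cache[dataset_id] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used copy
        return data


if __name__ == "__main__":
    broker = StorageBroker({"ds-001": b"raw", "ds-002": b"derived"})
    broker.get_access_copy("ds-001")
    broker.get_access_copy("ds-001")
    print(broker.archive_reads)
```

Counting archive reads separately makes the cost argument visible: repeated requests for the same dataset incur only one retrieval from archival storage while the copy remains cached.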
Data selected for long-term preservation will require storage that ensures the files remain
immutable. Such ‘bit preservation’ requires constant management (regular checksum verification) and
back-up to a variety of media, including tape storage and off-site or cloud storage. A number of vendors
offer a digital archiving service (for example Arkivum and AWS); because of the high costs involved in
retrieval, such archival storage services are most appropriate for back-up copies of ‘canonical’ data. Digital
preservation involves ensuring that the material will remain accessible in perpetuity (through format
migration), as well as ensuring the data immutability provided by ‘bit preservation’.
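The regular checksum management that bit preservation relies on can be sketched as two steps: record a checksum for every file at ingest, then recompute and compare at intervals. The example below uses SHA-256 from the Python standard library; the function names and the manifest layout are illustrative assumptions rather than the method of any particular archiving service.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Record a checksum for every file under the directory (done at ingest)."""
    root = Path(directory)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(directory, manifest):
    """Return (file name, problem) pairs for files that are missing or changed."""
    root = Path(directory)
    problems = []
    for name, expected in sorted(manifest.items()):
        path = root / name
        if not path.is_file():
            problems.append((name, "missing"))
        elif sha256_of(path) != expected:
            problems.append((name, "changed"))
    return problems
```

An empty list from `verify_manifest` indicates that every recorded file is still present and unchanged; any entry would prompt restoration from one of the back-up copies.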
A Research Data Archive preserves data not, or not yet, submitted to discipline-based data
repositories. The associated metadata records are held in the research data registry or catalogue.
8. Research Data Registry
The Registry is defined here as an inward-facing catalogue that holds the metadata records of
unpublished research data. The data themselves will be held in the institutional data archive. The
data and metadata may eventually be published by ingest into a discipline-based data repository
outside the institution or into the Institutional Data Repository. The CRIS may function as a research
data registry, providing researchers with an interface to record metadata in order to register a
dataset.
9. Research Data Repository
A repository may be defined as a Digital Asset Management System (DAMS), consisting of three
layers: a storage layer back-end (which may be a hybrid of local, external and cloud storage); a ‘metadata
store’ database layer; and a user interface or access platform front-end. A wide range of repository
systems is available, from proprietary, externally managed services to open source software systems.
Many have been designed to manage specific media, or to provide digital preservation or
research data management specific functions. A Digital Preservation System is a DAMS / repository
that manages the active preservation of the content as well as the storage, metadata
management and access interface.
Archive data are possibly best managed in a discipline-based repository or data centre, whilst the
Institutional repository may be considered the ‘repository of last resort’. The Institutional Research
Data Repository (or the institutional repository, if it has been modified to accommodate data) is an
appropriate home for datasets for which there is no discipline-based repository or data centre, or for
temporary storage before datasets are submitted to a data centre. The Institutional Research Data
Repository may provide a public catalogue for all published data created at the institution, covering data
held by data centres / discipline-based repositories as well as data held by the institution.
The catalogue and archive storage functions of the repository may be separated. In such an
arrangement, the archive storage function may be achieved using an external service or in an
institutional data archive, with access and deposit managed seamlessly through the repository
platform. Where the repository holds only metadata records, it may be considered a Research Data
Catalogue.
10. Research Data Catalogue
The Catalogue is defined here as a publicly-accessible catalogue, holding the metadata records of
published research data. The data themselves may be held in a discipline-based data repository
outside the institution or in an institutional data archive. The selection of the underlying metadata
schema is fundamental, and consideration must be given to the schema used by the proposed
National Data Registry. Many institutions favour the DataCite metadata schema [4]; subscription to
the DataCite service provides the means to mint DOIs and assurance of a standard level of preservation. The
Research Data Catalogue may be provided by a number of repository platforms, access platforms or
catalogue software systems.
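The relationship between the inward-facing Registry and the public Catalogue can be sketched as two views over one metadata store: a record is registered as unpublished, and publication moves it from the registry view to the catalogue view. The class and field names below are hypothetical and do not follow any particular metadata schema or repository platform.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    identifier: str
    title: str
    location: str            # where the data files are held
    published: bool = False  # new records start as unpublished


class MetadataStore:
    """One store of metadata records with registry and catalogue views."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        """Add a new record; duplicate identifiers are rejected."""
        if record.identifier in self._records:
            raise ValueError(f"{record.identifier} is already registered")
        self._records[record.identifier] = record

    def publish(self, identifier):
        """Mark a registered dataset as published."""
        self._records[identifier].published = True

    def registry_view(self):
        """Inward-facing: identifiers of unpublished datasets."""
        return sorted(i for i, r in self._records.items() if not r.published)

    def catalogue_view(self):
        """Public: identifiers of published datasets."""
        return sorted(i for i, r in self._records.items() if r.published)


if __name__ == "__main__":
    store = MetadataStore()
    store.register(DatasetRecord("ds-001", "Imaging set A", "institutional archive"))
    store.register(DatasetRecord("ds-002", "Survey wave 1", "external data centre"))
    store.publish("ds-002")
    print(store.registry_view(), store.catalogue_view())
```

Keeping the file location as a metadata field reflects the point made above: the catalogue can list datasets whether the data themselves sit in the institutional archive or in an external repository.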
John A. Lewis 05/11/14
[1] RCUK common principles on data policy: http://www.rcuk.ac.uk/research/datapolicy/
[2] EPSRC Expectations: http://www.epsrc.ac.uk/about/standards/researchdata/Pages/expectations.aspx
[3] University of Sheffield Research Data Management Policy: http://www.shef.ac.uk/ris/other/gov-ethics/grippolicy/practices/all/rdmpolicy
[4] DataCite metadata schema: http://schema.datacite.org/
Fig.1 Some representations of the Research Data Lifecycle. Stages labelled in the figure: Planning Data Management; Organising & Documenting Your Data; Storing & Securing Your Data; Preserving & Sharing Your Data.
Image sources: http://ukdataservice.ac.uk/media/132177/data_lifecycle_recolour.png and http://datalib.edina.ac.uk/mantra/datamanagementplans/media/RDMcycle.png
Fig.2 RDM Technical Infrastructure Dataflow (diagram not reproduced).
Components labelled in the figure: Researcher; HR & Research Management; DMP Service (DM Plans); CRIS (researcher and project metadata); Instrument (data capture); Active Data Storage; Active Data Management System; Active Data Registry; Institutional Research Data Registry; Research Data Archive (Digital Preservation); Archive Data Storage (Digital Preservation); Archive Data Storage (for Public Access); Institutional Research Data Repository; Institutional Research Data Catalogue; Data Centre / Disciplinary Repository.
Flows and interfaces labelled in the figure: Research Data Flow; Metadata Flow; manual data documentation; automatic metadata capture; file metadata; dataset deposit at project end; dataset metadata and description at project end (manual upload?); dataset metadata and description on dataset publication; dataset publication; dataset preservation; dataset access and reuse; SWORD2; OAI-PMH.