EUDAT & OpenAIRE Webinar: How to write a Data Management Plan - July 7, 2016


How to write a Data Management Plan

Sarah Jones (DCC)
Marjan Grootveld (DANS)

both involved in EUDAT and OpenAIRE

This work is licensed under the Creative Commons CC-BY 4.0 licence

Open Access Infrastructure for Research in Europe

www.openaire.eu

Who we are

Research Data Services, Expertise & Technology https://www.eudat.eu

Joint webinar held on 26 May 2016 covering:
• Reasons to manage data
• Horizon 2020 Open Research Data Pilot
• How to manage and share data
• EUDAT & OpenAIRE services

Slides, webinar recording and Q&A document online

www.openaire.eu/research-data-management-an-introductory-webinar-from-openaire-and-eudat

Introduction to RDM

• What is a DMP and why write one?

• Requirements under Horizon 2020

• Example plans

• Lessons and guidance

Overview

WHAT IS A DMP & WHY WRITE ONE?

Image CC-BY-NC-SA by Leo Reynolds www.flickr.com/photos/lwr/13442910354

A DMP is a brief plan to define:
• how the data will be created
• how it will be documented
• who will be able to access it
• where it will be stored
• who will back it up
• whether (and how) it will be shared & preserved

DMPs are often submitted as part of grant applications, but are useful whenever researchers are creating data.

Data Management Plans

Why manage data?

NON PECUNIAE INVESTIGATIONIS CURATORE SED VITAE FACIMUS PROGRAMMAS DATORUM PROCURATIONIS
(Not for the research funder, but for life we make data management plans)

• Make your research easier
• Stop yourself drowning in irrelevant stuff
• Save data for later
• Avoid accusations of fraud or bad science
• Write a data paper
• Share your data for re-use
• Get credit for it

Research data lifecycle

• CREATING DATA: designing research, DMPs, planning consent, locate existing data, data collection and management, capturing and creating metadata
• PROCESSING DATA: entering, transcribing, checking, validating and cleaning data, anonymising data, describing data, manage and store data
• ANALYSING DATA: interpreting & deriving data, producing outputs, authoring publications, preparing for sharing
• PRESERVING DATA: data storage, back-up & archiving, migrating to best format & medium, creating metadata and documentation
• GIVING ACCESS TO DATA: distributing data, sharing data, controlling access, establishing copyright, promoting data
• RE-USING DATA: follow-up research, new research, undertake research reviews, scrutinising findings, teaching & learning

Ref: UK Data Archive: http://www.data-archive.ac.uk/create-manage/life-cycle

What data organisation would a re-user like?

Planning trick 1: think backwards

[Figure: research data lifecycle stages: creating data, processing data, analysing data, preserving data, giving access to data, re-using data]

DMP and data organisation exercises

Design a data organisation for the project (folder structure, file naming convention, …)

Research Data Netherlands data support training: http://datasupport.researchdata.nl/en/start-de-cursus/iii-onderzoeksfase/organising-data/
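As one possible starting point for this exercise, the sketch below creates a folder layout and builds a file name from a few fields. The folder names, the date_subject_version naming pattern and the helper names `make_project_tree` and `make_file_name` are illustrative assumptions, not something prescribed by the Research Data Netherlands course.

```python
from datetime import date
from pathlib import Path
import re
import tempfile

# Hypothetical folder layout; adapt the names to your own project.
FOLDERS = ["data/raw", "data/processed", "docs", "scripts", "results"]


def make_project_tree(root: Path) -> list[Path]:
    """Create the example folders under root and return their paths."""
    created = []
    for name in FOLDERS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


def make_file_name(day: date, subject: str, version: int, ext: str) -> str:
    """Build a name such as 20160707_interview-notes_v02.csv (assumed convention)."""
    # Keep only lowercase letters, digits and hyphens so names sort and travel well.
    slug = re.sub(r"[^a-z0-9]+", "-", subject.lower()).strip("-")
    return f"{day:%Y%m%d}_{slug}_v{version:02d}.{ext}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for folder in make_project_tree(Path(tmp)):
            print(folder.relative_to(tmp))
    print(make_file_name(date(2016, 7, 7), "Interview notes", 2, "csv"))
```

Putting the date first in year-month-day order means an alphabetical listing is also chronological, and an explicit version suffix avoids names like "final_final2".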

Data organisation

http://datasupport.researchdata.nl/en/start-de-cursus/iii-onderzoeksfase/organising-data

Planning trick 2: include RDM stakeholders

• Institution: RDM policy, facilities
• Research funders
• Publishers: data availability policy
• Commercial partners

https://www.openaire.eu/briefpaper-rdm-infonoads

Responsibilities in RDM

https://www.openaire.eu/briefpaper-rdm-infonoads

A DMP is about ‘keeping’ data

• Storing data ≠ archiving data
• Archived data ≠ findable data
• Findable ≠ accessible
• Accessible ≠ understandable
• Understandable ≠ usable

• A USB stick is not safe
• A persistent ID is essential but no guarantee of usability
• Data in a proprietary format is not sustainable

• Findable
  – Assign persistent IDs, provide rich metadata, register in a searchable resource,...
• Accessible
  – Retrievable by their ID using a standard protocol, metadata remain accessible even if data aren’t...
• Interoperable
  – Use formal, broadly applicable languages, use standard vocabularies, qualified references...
• Reusable
  – Rich, accurate metadata, clear licences, provenance, use of community standards...

www.force11.org/group/fairgroup/fairprinciples

Making data FAIR

How to deal with data and context?

• Versioning, back-up, storage and archiving
  – During the project and in the long term
• Ethics, consent forms, legal access
• Security and technical access
• Usage licences

What should be preserved and shared?

• The data needed to validate results in scientific publications (minimally!).
• The associated metadata: the dataset’s creator, title, year of publication, repository, identifier etc.
  – Follow a metadata standard in your line of work, or a generic standard, e.g. Dublin Core or DataCite, and be FAIR.
  – The repository will assign a persistent ID to the dataset: important for discovering and citing the data.
• Documentation: code books, lab journals, informed consent forms – domain-dependent, and important for understanding the data and combining them with other data sources.
• Software, hardware, tools, syntax queries, machine configurations – domain-dependent, and important for using the data. (Alternative: information about the software etc.)

Basically, everything that is needed to replicate a study should be available. Plus everything that is potentially useful for others.

Research Data Alliance (RDA): http://rd-alliance.github.io/metadata-directory/standards/
FAIR Guiding Principles for scientific data management & stewardship: http://www.nature.com/articles/sdata201618
How to select and appraise research data: www.dcc.ac.uk/resources/how-guides/appraise-select-research-data
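To make the metadata point concrete, here is a minimal sketch of a record holding the five elements named on the slide (creator, title, year of publication, repository, identifier), with a check for anything still missing. The key names, the placeholder values and the `missing_fields` helper are assumptions for illustration; this is not the Dublin Core or DataCite schema itself, so map the fields to whichever standard your repository uses.

```python
import json

# Fields named on the slide above; treating all five as required is an assumption.
REQUIRED_FIELDS = ("creator", "title", "publication_year", "repository", "identifier")


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Placeholder values only; this is not a real dataset or identifier.
example_record = {
    "creator": "Example, Researcher",
    "title": "Example survey dataset",
    "publication_year": 2016,
    "repository": "Example Repository",
    "identifier": "",  # filled in once the repository assigns a persistent ID
}

if __name__ == "__main__":
    print(json.dumps(example_record, indent=2))
    print("Missing:", missing_fields(example_record))
```

A check like this can be run before deposit, so that gaps in the metadata show up while the people who know the answers are still on the project.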

DMPS IN HORIZON 2020

Image “Open Data” CC BY 2.0 by http://www.descrier.co.uk

Some funders that require DMPs

Common themes in DMPs

1. Description of data to be collected / created (i.e. content, type, format, volume...)
2. Standards / methodologies for data collection & management
3. Ethics and Intellectual Property (highlight restrictions on data sharing e.g. embargoes, confidentiality)
4. Plans for data sharing and access (i.e. how, when, to whom)
5. Strategy for long-term preservation

Start planning and communicating early

Horizon 2020: Open Research Data Pilot

http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf

• Open access to research data refers to the right to access and re-use digital research data. Openly accessible research data can typically be accessed, mined, exploited, reproduced and disseminated free of charge for the user.

• The use of a Data Management Plan (DMP) is required for projects participating in the Open Research Data Pilot, detailing what data the project will generate, whether and how they will be exploited or made accessible for verification and re-use, and how they will be curated and preserved.

H2020 - Open Data by Default from 2017

The RDM basics, tuned to Horizon 2020

• The EC’s goal is Open Access to research data: as open as possible, as closed as necessary.

• In H2020 the Data Management Plan (DMP) is a regular project deliverable, due by month 6.

• A DMP is a living document: to be used, updated and shared.

• You can use the H2020 template in DMPonline.

• Deposit the data in a research data repository.

Look early for a research data repository for sharing and preserving the data long term.

• If (part of your) data cannot be shared with everyone, you may (partially) opt out of the pilot.

Timing the DMP

• Note that the Commission does NOT require applicants to submit a DMP at the proposal stage.

• A DMP is therefore NOT part of the evaluation.

• DMPs are a deliverable for those in the pilot.

• Note that the Commission requires updates. A DMP is a living or “active” document.

Initial DMP (at 6 months)

The DMP should address the points below on a dataset by dataset basis:

• Dataset reference and name

• Data set description

• Standards and metadata

• Data sharing

• Archiving and preservation (including storage and backup)

More elaborate DMP

Scientific research data should be easily:

1. Discoverable
Are the data discoverable and identifiable by a standard mechanism e.g. DOIs?

2. Accessible
Are the data accessible and under what conditions e.g. licenses, embargoes?

3. Assessable and intelligible
Are the data and software assessable and intelligible to third parties for peer-review? E.g. can judgements be made about their reliability and the competence of those who created them?

4. Useable beyond the original purpose for which it was collected
Are the data properly curated and stored together with the minimum software and documentation to be useful by third parties in the long-term?

5. Interoperable to specific quality standards
Are the data and software interoperable, allowing data exchange? E.g. were common formats and standards for metadata used?

DMPonline

A web-based tool to help researchers write DMPs

Includes a template for Horizon 2020

Guidance from EUDAT and OpenAIRE being added

https://dmponline.dcc.ac.uk

How the tool works

Click to write a generic DMP

Or choose your funder to get their specific template

Pick your uni to add local guidance and to get their template if no funder applies

Choose any additional optional guidance

EUDAT guidance

OpenAIRE support

• Summary on the Open Research Data pilot
  https://www.openaire.eu/opendatapilot
• Brief guide on developing a DMP
  https://www.openaire.eu/opendatapilot-dmp
• Selecting a data repository
  https://www.openaire.eu/opendatapilot-repository
• Developing guidance to add to DMPonline
• Will be adding an ‘export to Zenodo’ feature in early 2017 to allow DMPs to be published and assigned a DOI

Deliver the DMP and keep it up to date

• EC: “Since DMPs are expected to mature during the project, more developed versions of the plan can be included as additional deliverables at later stages. (…) New versions of the DMP should be created whenever important changes to the project occur due to inclusion of new data sets, changes in consortium policies or external factors.”

Focus on how you will ensure your data are “FAIR”

Active DMPs

• Interested in ways to support this active quality, where “active” is understood as “able to evolve and be monitored”?

• Join the RDA’s Active Data Management Plans interest group https://rd-alliance.org/groups/active-data-management-plans.html

• And see recordings, slides and notes of the international and interdisciplinary ADMP Workshop 28-30 June 2016 https://indico.cern.ch/event/520120

Option: add SSI template for software projects

Two templates available for Software Management Plans in DMPonline courtesy of SSI

www.software.ac.uk/resources/guides/software-management-plans

EXAMPLE PLANS

Example plans

• 108 DMPs from the National Endowment for the Humanities
  www.neh.gov/divisions/odh/grant-news/data-management-plans-successful-grant-applications-2011-2014-now-available
• 20+ scientific DMPs submitted to the NSF (USA) provided by UCSD
  http://libraries.ucsd.edu/services/data-curation/data-management/dmp-samples.html
• Example DMP collection from Leeds University
  https://library.leeds.ac.uk/research-data-tools
• Further examples:
  www.dcc.ac.uk/resources/data-management-plans/guidance-examples

Example: OpenMinTeD

OpenMinTeD aims to create an infrastructure for Text and Data Mining (TDM) of scientific and scholarly content

The project has adopted its own structure to create a ‘Data and Software Management Plan’

http://openminted.eu

Example: OpenMinTeD – Data chapter

Six high-level datasets identified:
1. Scholarly publications
2. Language and knowledge resources
3. Services and workflows
4. Automatically and manually generated annotations
5. Consortium publications
6. Metadata

Described in a table per dataset (see illustration)

OpenMinTeD – Software examples

Example: CAPSELLA

CAPSELLA aims to develop ICT solutions for farmers and other actors engaged in agrobiodiversity

Devised a questionnaire to collate dataset information from project partners

Identified 13 datasets, 6 of which are imported as is, 3 aggregated, 3 transformed and 1 generated

www.capsella.eu

4 types of data

• Core Datasets - datasets related to the main project activities. The majority pre-exist CAPSELLA and are publicly available.

• Produced Datasets - datasets resulting from CAPSELLA’s pilot applications. These include sensor data, field data and user related datasets.

• Project Related Data - datasets resulting from the operation of the project. They are collections of standard material e.g. deliverables, dissemination material, training material, scientific publications.

• Software - datasets resulting from the software developed in the frame of CAPSELLA. These datasets are mainly software artefacts and source code and can be used for various purposes including research tasks or the development of new software components.

Example dataset record

Differing priorities?

Data description examples

The final dataset will include self-reported demographic and behavioural data from interviews with the subjects and laboratory data from urine specimens provided.

From NIH data sharing statements

Every two days, we will subsample E. affinis populations growing under our treatment conditions. We will use a microscope to identify the life stage and sex of the subsampled individuals. We will document the information first in a laboratory notebook and then copy the data into an Excel spreadsheet. The Excel spreadsheet will be saved as a comma separated value (.csv) file.

From DataOne – E. affinis DMP example
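The last step of the E. affinis description, ending up with a .csv file, can also be done directly from a script rather than via a spreadsheet export. The sketch below uses Python's standard `csv` module; the column names and the two rows are made up for illustration (the plan only says that life stage and sex are recorded), and `write_observations` is a hypothetical helper name.

```python
import csv
import io

# Column names are assumptions; the plan only says life stage and sex are recorded.
COLUMNS = ["sample_date", "treatment", "life_stage", "sex"]


def write_observations(rows: list[dict], handle) -> None:
    """Write observation rows as comma-separated values with a header line."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Made-up rows for illustration, not real measurements.
example_rows = [
    {"sample_date": "2016-07-07", "treatment": "A", "life_stage": "adult", "sex": "F"},
    {"sample_date": "2016-07-09", "treatment": "A", "life_stage": "nauplius", "sex": "NA"},
]

if __name__ == "__main__":
    buffer = io.StringIO()
    write_observations(example_rows, buffer)
    print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")` as the handle.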

Metadata examples

Metadata will be tagged in XML using the Data Documentation Initiative (DDI) format. The codebook will contain information on study design, sampling methodology, fieldwork, variable-level detail, and all information necessary for a secondary analyst to use the data accurately and effectively.

From ICPSR Framework for Creating a DMP

We will first document our metadata by taking careful notes in the laboratory notebook that refer to specific data files and describe all columns, units, abbreviations, and missing value identifiers. These notes will be transcribed into a .txt document that will be stored with the data file. After all of the data are collected, we will then use EML (Ecological Metadata Language) to digitize our metadata. EML is one of the accepted formats used in ecology, and works well for the types of data we will be producing. We will create these metadata using Morpho software, available through KNB. The metadata will fully describe the data files and the context of the measurements.

From DataOne – E. affinis DMP example
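The intermediate step in the second example, a .txt document stored next to the data file that describes columns, units and missing value identifiers, might look like the output of the sketch below. The column descriptions and the `build_readme` helper are hypothetical, and the result is plain text only; it does not produce EML, which the plan creates later with Morpho.

```python
# Hypothetical column descriptions; replace them with the ones from your notebook.
COLUMN_NOTES = [
    ("sample_date", "ISO 8601 date", "day the subsample was taken"),
    ("life_stage", "category", "life stage identified under the microscope"),
    ("sex", "category", "sex of the individual"),
]


def build_readme(data_file: str, missing_value: str) -> str:
    """Return plain-text documentation for one data file."""
    lines = [
        f"Data file: {data_file}",
        f"Missing value identifier: {missing_value}",
        "",
        "Columns:",
    ]
    for name, unit, description in COLUMN_NOTES:
        lines.append(f"- {name} [{unit}]: {description}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_readme("observations.csv", "NA"))
```

Keeping the description in the same folder and under the same base name as the data file makes it less likely that the two get separated when files are copied or archived.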

Data sharing examples

We will make the data and associated documentation available to users under a data-sharing agreement that provides for: (1) a commitment to using the data only for research purposes and not to identify any individual participant; (2) a commitment to securing the data using appropriate computer technology; and (3) a commitment to destroying or returning the data after analyses are completed.

From NIH data sharing statements

The videos will be made available via the bristol.ac.uk website (both as streaming media and downloads). HD and SD versions will be provided to accommodate those with lower bandwidth. Videos will also be made available via Vimeo, a platform that is already well used by research students at Bristol. Appropriate metadata will also be provided to the existing Vimeo standard.

All video will also be available for download and re-editing by third parties. To facilitate this, Creative Commons licenses will be assigned to each item. In order to ensure this usage is possible, the required permissions will be gathered from participants (using a suitable release form) before recording commences.

From University of Bristol Kitchen Cosmology DMP

Examples of restrictions

Because the STDs being studied are reportable diseases, we will be collecting identifying information. Even though the final dataset will be stripped of identifiers prior to release for sharing, we believe that there remains the possibility of deductive disclosure of subjects with unusual characteristics. Thus, we will make the data and associated documentation available to users only under a data-sharing agreement.

From NIH data sharing statements

1. Share data privately within 1 year. Data will be held in a private repository, but metadata will be public.
2. Release data to public within 2 years. Encouraged after one year to release data for public access.
3. Request, in writing, data privacy up to 4 years. Extensions beyond 3 years will only be granted for compelling cases.
4. Consult with creators of private CZO datasets prior to use. PIs are required to seek consent before using private data they can access.

From Boulder Creek Critical Zone Observatory DMP

Archiving examples

The investigators will work with staff at the UKDA to determine what to archive and how long the deposited data should be retained. Future long-term use of the data will be ensured by placing a copy of the data into the repository.

From ICPSR Framework for Creating a DMP

Data will be provided in file formats considered appropriate for long-term access, as recommended by the UK Data Service. For example, SPSS Portable format and tab-delimited text for qualitative tabular data and RTF and PDF/A for interview transcripts. Appropriate documentation necessary to understand the data will also be provided. Anonymised data will be held for a minimum of 10 years following project completion, in compliance with LSHTM’s Records Retention and Disposal Schedule. Biological samples (output 3) will be deposited with the UK BioBank for future use.

From Writing a Wellcome Trust Data Management and Sharing Plan

Share your example DMPs!

Send us links to your DMPs

We will add them to the DCC list

Aim to cover wide range of disciplines and funders

www.dcc.ac.uk/share-DMPs

LESSONS AND RESOURCES

Image ‘Energy Resources | Energie Quelle’ CC-BY-NC by K. H. Reichert www.flickr.com/photos/reupa/19502634575

Tips for writing DMPs

• Seek advice - consult and collaborate

• Consider good practice for your field

• Base plans on available skills & support

• Make sure implementation is feasible

• Think about things early…

Plan to share data from the outset

• Licence negotiations and consent agreements may preclude later sharing if not handled carefully

• Costings can’t be included retrospectively

• Useful to consider data issues at the consortium negotiation stage to make sure potential issues are identified and sorted asap

Decisions made early on affect what you can do later

Sharing data: what is meant?

With collaborators while research is active

Data are mutable

(Open) data sharing

Data are stable, searchable, citable, clearly licensed

Storing data: what is meant?

Storing and backing up files while research is active

Likely to be on a networked filestore or hard drive

Easy to change or delete

Archiving or preserving data in the long-term

Likely to be deposited in a digital repository

Safeguarded and preserved

Archiving, repositories, ehm?

• Horizon 2020 ORD pilot participants are asked to “deposit your data in a research data repository”: a digital archive collecting and displaying datasets and their metadata.

• Select a data repository that will preserve your data, metadata and possibly tools in the long term.

• It is advisable to contact the repository of your choice when writing the first version of your DMP.

• Repositories may offer guidelines for sustainable data formats and metadata standards, as well as support for dealing with sensitive data and licensing.

Where to find a repository?

• More information: https://www.openaire.eu/opendatapilot-repository
• Zenodo: http://www.zenodo.org
• Re3data.org: http://www.re3data.org

Searching with Re3data.org

www.fosteropenscience.eu/content/re3data-demo

How to select a repository? 1/2

• Main criteria for choosing a data repository:
  Certification as a ‘Trustworthy Digital Repository’, with an explicit ambition to keep the data available in the long term.

• Three common certification standards for TDRs:
  Data Seal of Approval: http://datasealofapproval.org/en
  nestor seal: http://www.langzeitarchivierung.de/Subsites/nestor/EN/nestor-Siegel/siegel_node.html
  ISO 16363: http://www.iso16363.org

How to select a repository? 2/2

• Main criteria for choosing a data repository:
  Certification as a ‘Trustworthy Digital Repository’, with an explicit ambition to keep the data available in the long term.

• Matches your particular data needs: e.g. formats accepted; mixture of Open and Restricted Access.

• Provides guidance on how to cite the data that has been deposited.

• Gives your submitted dataset a persistent and globally unique identifier: for sustainable citations – both for data and publications – and to link back to particular researchers and grants.

www.openaire.eu/opendatapilot-repository

Licensing research data

• Horizon 2020 guidelines point to CC-BY or CC-0

• EUDAT licensing wizard helps you pick a licence for data & software
  http://ufal.github.io/public-license-selector

• DCC How-to guide helps you to license data
  www.dcc.ac.uk/resources/how-guides/license-research-data

• How to develop a DMP www.dcc.ac.uk/resources/how-guides/develop-data-plan

• RDM brochure and template https://dans.knaw.nl/en/about/organisation-and-policy/information-material?set_language=en

• OpenAIRE guidelines www.openaire.eu/opendatapilot-dmp

• ICPSR framework for a DMP www.icpsr.umich.edu/icpsrweb/content/datamanagement/dmp/framework.html

Guidelines on DMPs

KEY MESSAGES

Image “Fishbone” CC BY-NC-ND 2.0 by https://www.flickr.com/photos/mrjnl/

Key messages

• The principles of good research conduct hold for all of us, across disciplinary boundaries.

• Data management is all in a day’s work.

• Planning and reflection are more important than the plan – but write the DMP and keep it up to date.

• Planning data management is team work.

• Think about the desired end result and plan for this.

• Decisions made early affect what you can do later.

www.eudat.eu www.openaire.eu

Thanks – any questions?

Contact us:

Marjan Grootveld: marjan.grootveld@dans.knaw.nl
Sarah Jones: sarah.jones@glasgow.ac.uk

Acknowledgements:

Thanks to DANS and DCC for reuse of slides, and to the OpenMinTeD and CAPSELLA projects for sharing their Data Management Plans