@chochrphd

Topics Of Discussion

• Overview of Research Data Management

• Data Lifecycle

• Data Management Plan (DMP)

• Websites-Resources

Definition

Importance

Research Data Management

What is research data?

Research Data Management Defined

• Statistical Records

• Video & Audio recordings

• Images

• Measurements

• Software & Code

• Algorithms

• Lab notebooks

• Biospecimens

Research Data Management Defined

Research Data Management

• The organization, storage, preservation, and sharing of data collected and used in a research project.

✓ Everyday management of research data during the lifetime of a project

✓ Decisions about how data will be preserved and shared after the project is completed

Research Data Management Defined

Importance of research data management

• Verify the integrity of your data

• Make your data findable and reusable

• Help others understand your data

• Encourage other researchers to reuse and cite your data

• It is required by some funding agencies

Data Life Cycle

Best Practices

Components of Data Management

Data Lifecycle

Plan

Discover

Collect & Organize

Quality

Describe

Store

Share

PLAN

Stage 1 Things to consider:

• Policies

• Type of data

• Versions

• Backup

• Describing and labeling

• Access and Sharing

• Rights and Permissions

• Roles and Responsibilities

• Budget

Stage 1: Plan

Data Management Plan (DMP)

• A document that describes how you will treat your data throughout a project and what happens with the data after the project ends.

• Some funding agencies require a Data Management Plan

Stage 1: Plan

• Data Management Plans address:

1. Data Type

2. Data Format

3. Data Sharing Plan

4. Data Archiving/Preserving Plan

https://old.dataone.org/data-management-planning

https://dmptool.org/

EXAMPLE: National Science Foundation (NSF) DMP

Length

• The data management plan is a supplementary document.

• Plans should be no longer than two pages.

Components

• Types of data produced

• Data and metadata standards

• Data access and sharing

• Data reuse and redistribution

• Archiving and preservation

EXAMPLE: National Science Foundation (NSF) DMP

Types of data produced:

The types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project.

EXAMPLE: National Science Foundation (NSF) DMP

Types of data produced:

Questions To Ask

1. What types of data will be produced for your project?

2. How will the data be created or captured?

3. What software programs will be used to generate your data?

4. How much data will be produced?

5. How big will your digital files be and how many will there be?

6. Will you be using existing data? If so, what is the source of that data?

EXAMPLE: National Science Foundation (NSF) DMP

Data and Metadata Standards

The standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies).

EXAMPLE: National Science Foundation (NSF) DMP

Data and Metadata Standards

Questions to Ask

1. How will you document your data and project?

2. What file formats will you be using in your project and why?

3. How will you organize your files into directories, and what naming conventions will you use?

4. How often will your data change or be updated, and will versions need to be tracked?

5. What types of metadata do you need to collect in order for someone else to fully understand your data?

EXAMPLE: National Science Foundation (NSF) DMP

Data Access, Sharing, Reuse, and Redistribution

Policies for access and sharing, including provisions for the appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements. Policies and provisions for reuse, redistribution, and the production of derivatives.

EXAMPLE: National Science Foundation (NSF) DMP

Data Access, Sharing, Reuse, and Redistribution Questions to Ask

1. Who is responsible for managing and controlling your data?

2. Who is likely to be interested in your data and what are the foreseeable future uses of the data?

3. When and where do you intend to publish or distribute your data?

4. How will the data be made available?

5. Will there be an embargo period before the data is made available for wider distribution? If so, explain why.

6. Are there issues regarding privacy or restricted, confidential, or sensitive data?

7. How have you addressed any institutional review board (IRB) protocols that may apply to your research?

8. Are there intellectual property issues or agreements with industry or government agencies that affect sharing?

9. If you are using data from other sources, do you have the right to share that data?

EXAMPLE: National Science Foundation (NSF) DMP

Storage and Preservation

Plans for archiving data, samples, and other research products, and for preservation of access to them.

EXAMPLE: National Science Foundation (NSF) DMP

Storage and Preservation

Questions to Ask

1. What is your strategy for data storage and backup?

2. What data will be preserved for the long term?

3. Are extra steps required to prepare the data for preservation?

4. What related information or metadata will be preserved along with the data?

5. Where and how will the data be preserved?

6. What procedures does the archive have in place to ensure preservation and backup?

7. How long will the data be kept after the project is completed?

Discover

Stage 2 Things to consider:

• Locate existing data

• Cite data

Stage 2: Discover

Locate existing data

• Data Directories (e.g. re3data, OpenAccessDirectory)

• General Repositories (e.g. figshare)

• Discipline-related repositories (e.g. DRYAD for life sciences)

• Data Journals (e.g. https://www.nature.com/sdata/)

DataCite is an international organization that helps researchers to find, access, and use data.

Citing Data

Stage 2: Discover

Provide proper recognition.

Cite datasets

Follow standards
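As a rough illustration of citing a dataset consistently, the sketch below assembles a citation string in a "Creator (Year). Title. Version. Publisher. Identifier" pattern, which is one common layout for dataset citations. The function name, the dataset, the repository name, and the DOI are all hypothetical; check the style required by your journal or repository before relying on this layout.

```python
def format_data_citation(creators, year, title, publisher, identifier, version=None):
    """Build a dataset citation string from its parts.

    Assumed pattern: Creator (Year). Title. Version. Publisher. Identifier
    """
    parts = [f"{'; '.join(creators)} ({year})", title]
    if version:
        parts.append(f"Version {version}")
    parts.extend([publisher, identifier])
    return ". ".join(parts)


# Hypothetical dataset, used only to show the output shape.
citation = format_data_citation(
    creators=["Doe, J.", "Roe, R."],
    year=2016,
    title="Soil pH survey",
    publisher="Example Data Repository",
    identifier="https://doi.org/10.0000/example",
    version="1.0",
)
print(citation)
```

Keeping the parts in separate fields makes it easier to reformat the same record for a different citation style later.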

Collect & Organize

Stage 3 Things to consider:

• Finding and reusing data

• Choosing a file format

• Naming data files

• Data versioning

Organize your data

Stage 3: Collect and Organize

• Name

• Format

• Version

File name

Stage 3: Collect and Organize

Before starting your project, decide on a naming convention for your files.

1. Meaningful

2. Length

3. Underscores & Hyphens

4. YYYYMMDD

5. Zeros

6. No special characters

7. Versions

• Stanford University Libraries - Data Management Services

• University of Wisconsin Research Data Services

• Purdue University Libraries - Data Management for Graduate Researchers

• Cornell University Research Data Management Service Group
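The naming guidelines above (meaningful names, underscores and hyphens, YYYYMMDD dates, leading zeros, no special characters, versions) can be sketched as a small helper. This is one possible convention, not a prescribed one: the function name, the field order, the three-digit sequence number, and the sample values are assumptions made for illustration.

```python
import re
from datetime import date


def build_file_name(project, description, day, sequence, version, ext):
    """Assemble a file name as project_description_YYYYMMDD_NNN_vNN.ext."""

    def clean(part):
        # Replace spaces and special characters with hyphens.
        return re.sub(r"[^A-Za-z0-9]+", "-", part).strip("-")

    stem = "_".join([
        clean(project),
        clean(description),
        day.strftime("%Y%m%d"),   # YYYYMMDD sorts chronologically
        f"{sequence:03d}",        # leading zeros keep 007 before 010
        f"v{version:02d}",        # version at the end of the name
    ])
    return stem + "." + ext.lstrip(".").lower()


name = build_file_name("soil survey", "pH readings (site A)", date(2016, 4, 2), 7, 1, "csv")
print(name)  # soil-survey_pH-readings-site-A_20160402_007_v01.csv
```

Generating names from a function rather than typing them by hand keeps every collaborator on the same pattern.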

Stage 3: Collect and Organize

File Format

• Choose one and stick to it

• Consider the software that will be used to access data

• Repository requirements

• Lost features during conversion

• Stanford University Libraries - Data Management Services

• Cornell University Research Data Management Service Group

• Cambridge University Libraries - Data Management

Stage 3: Collect and Organize

Data Versioning

Saving new copies of your files when you make changes so that you can go back and retrieve specific versions of your files later.

DataFileName_1.0 = original document

DataFileName_1.1 = original document with minor revisions

DataFileName_2.0 = document with substantial revisions

Data Versioning

Style 1: Version number at the end of the file name

image1_v1.jpg

image1_v2.jpg

image2_v1.jpg

image2_v2.jpg

Data Versioning

Style 2: Dated

image1_20151021

image1_20151214

image1_20160123

Data Versioning

Style 3: Incorporate names or initials of collaborators

dataset1_20160402_KES

dataset1_20160301_WTC

dataset1_20160814_GSC
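The major/minor scheme shown earlier (DataFileName_1.0, DataFileName_1.1, DataFileName_2.0) can be automated so that version numbers are never typed by hand. The sketch below is a minimal illustration; the function name `bump_version` and the assumption that the version is the last `_major.minor` part of the name are ours, not part of the slides.

```python
import re

# Matches "<stem>_<major>.<minor>" with an optional extension such as ".csv".
VERSION_PATTERN = re.compile(
    r"(?P<stem>.+)_(?P<major>\d+)\.(?P<minor>\d+)(?P<ext>\.[A-Za-z][A-Za-z0-9]*)?"
)


def bump_version(file_name, major=False):
    """Return the next file name: a minor bump by default, a major bump on request."""
    match = VERSION_PATTERN.fullmatch(file_name)
    if match is None:
        raise ValueError(f"No _major.minor version found in {file_name!r}")
    stem = match.group("stem")
    ext = match.group("ext") or ""
    if major:
        # Substantial revisions: increase the major number and reset the minor one.
        return f"{stem}_{int(match.group('major')) + 1}.0{ext}"
    # Minor revisions: increase only the minor number.
    return f"{stem}_{match.group('major')}.{int(match.group('minor')) + 1}{ext}"


print(bump_version("DataFileName_1.0"))              # DataFileName_1.1
print(bump_version("DataFileName_1.1", major=True))  # DataFileName_2.0
```

The new name is meant for a saved copy; the earlier version stays in place so it can be retrieved later.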

Data Quality

Stage 4 Things to consider:

• Assurance (QA)

• Control (QC)

• QA/QC Plan

Stage 4: Data Quality

Quality Assurance vs. Quality Control

• Assurance: Process oriented and focuses on defect prevention.

• Control: Product oriented and focuses on defect identification.

Stage 4: Data Quality

Importance of QA/QC plan

• Help others understand how to use data

• Avoid mistakes due to poor data quality

• Track errors and conflicts

Stage 4: Data Quality

QA/QC plans should include

• Methods to deal with erroneous data (Assurance)

• Methods to identify erroneous data (Control)

• Methods to mark erroneous data (Control)

Stage 4: Data Quality

Methods

• Consistent techniques, processes, and environments

• Mechanisms to compare data sets

• Scripts or macros
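As an example of the "scripts or macros" method, the sketch below identifies and marks erroneous values (the Control side of a QA/QC plan) by adding a flag column instead of deleting rows. The column names, the sample readings, and the valid range of 0 to 14 for pH are assumptions chosen for illustration.

```python
import csv
import io


def flag_out_of_range(rows, field, low, high):
    """Return copies of the rows with a 'qc_flag' column marking suspect values."""
    flagged = []
    for row in rows:
        checked = dict(row)  # copy, so the original records stay untouched
        try:
            value = float(row[field])
            checked["qc_flag"] = "ok" if low <= value <= high else "out_of_range"
        except (ValueError, TypeError, KeyError):
            # Empty, non-numeric, or absent values are marked rather than dropped.
            checked["qc_flag"] = "missing_or_invalid"
        flagged.append(checked)
    return flagged


# Hypothetical measurements: site B is outside the pH scale, site C is blank.
SAMPLE = "site,ph\nA,6.8\nB,15.2\nC,\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
result = flag_out_of_range(rows, "ph", 0.0, 14.0)
for row in result:
    print(row)
```

Marking rather than removing suspect values keeps a record of errors that others can review.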

Data Description

Stage 5 Things to consider:

• Metadata

• Data Dictionary

Components of data description

Stage 5: Data Description

• Describe scientific context

• Include critical information

• Identifiers within datasets

• Create a data dictionary

Metadata

Stage 5: Data Description

• “Data about data” (context)

• Description of your research data

Stage 5: Data Description

What does metadata do?

• Makes your data easier to find.

• Increases understanding and reusability of data.

• Makes your data and associated research verifiable.

Stage 5: Data Description

What to include in metadata

• General Information

• Data and File Overview

• Methodological Information

• Data-specific information

who created the data

what the data file contains

when the data were generated

where the data were generated

why the data were generated

how the data were generated.

Stage 5: Data Description

Where can metadata be collected?

• Lab notebooks

• Plain text README files

• Within data file

• Web forms
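One way to capture the who, what, when, where, why, and how of a dataset in a plain text README file is to generate it from a fixed template, so no question is skipped. The sketch below is a minimal example; the section headings follow the six questions above, while the function name and every sample value are hypothetical.

```python
def build_readme(who, what, when, where, why, how):
    """Return README text with one section per metadata question."""
    sections = [
        ("Who created the data", who),
        ("What the data file contains", what),
        ("When the data were generated", when),
        ("Where the data were generated", where),
        ("Why the data were generated", why),
        ("How the data were generated", how),
    ]
    lines = []
    for heading, text in sections:
        lines.append(heading.upper())
        lines.append(text)
        lines.append("")  # blank line between sections
    return "\n".join(lines)


# Hypothetical project details, for illustration only.
readme_text = build_readme(
    who="J. Doe, Example University",
    what="Soil pH readings, one row per sampling site",
    when="2016-04-02 to 2016-08-14",
    where="Field sites A to C (hypothetical)",
    why="Baseline survey for a soil quality study",
    how="Handheld pH meter, three readings averaged per site",
)
print(readme_text)
```

The returned text can be saved next to the data files, for example as README.txt.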

Data Dictionary

Stage 5: Data Description

• Describes all the data stored in a data set or used by a database

• Describes the data, does not contain the data

Components of data dictionary

Stage 5: Data Description

• List of all files

• Type of data included

• List of field and variable names

• Description of information contained in each field

Examples:

• Ag Data Commons

• National Renewable Energy Laboratory (NREL)

• Protein Data Bank Exchange Data Dictionary (PDBx/mmCIF V4.0)
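A first draft of a data dictionary can be produced from the data file itself: list each field, guess its type, and attach a description. The sketch below is a minimal illustration using a small CSV; the sample columns, the type labels, and the descriptions are assumptions, and the guessed types should always be reviewed by someone who knows the data.

```python
import csv
import io


def infer_type(values):
    """Guess a simple type label from the non-empty values of one column."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for caster, label in ((int, "integer"), (float, "decimal")):
        try:
            for value in non_empty:
                caster(value)
            return label
        except ValueError:
            continue
    return "text"


def build_data_dictionary(csv_text, descriptions):
    """Return one entry per field: name, guessed type, and description."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return [
        {
            "field": name,
            "type": infer_type([(row[name] or "") for row in rows]),
            "description": descriptions.get(name, "TODO: describe this field"),
        }
        for name in (reader.fieldnames or [])
    ]


# Hypothetical data file and descriptions.
SAMPLE = "site,ph,depth_cm\nA,6.8,10\nB,7.1,20\n"
dictionary = build_data_dictionary(SAMPLE, {"site": "Sampling site code", "ph": "Soil pH"})
for entry in dictionary:
    print(entry)
```

Fields without a supplied description are marked with a TODO so that gaps are visible. The dictionary describes the data and does not contain it.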

Data Storage

Stage 6 Things to consider:

• Size of dataset

• Computational requirements

• Backup

• Security

Stage 6: Data Storage

Backup Rule of Three

• Keep an original copy

• Second local copy

• Remote copy
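The rule of three can be scripted so that each copy is verified against the original. The sketch below copies a file to two destinations and compares SHA-256 checksums. It is an illustration only: the "remote" location is simulated by a local folder, and the folder and file names are hypothetical. A real remote copy would live on separate infrastructure such as institutional or cloud storage.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def back_up(original, destinations):
    """Copy the original into each destination folder and verify every copy."""
    expected = sha256_of(original)
    copies = []
    for folder in destinations:
        Path(folder).mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(original, folder))
        if sha256_of(copy) != expected:
            raise OSError(f"Checksum mismatch for {copy}")
        copies.append(copy)
    return copies


def run_demo():
    """Create one original plus two copies in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        original = root / "original" / "dataset1_20160402.csv"
        original.parent.mkdir()
        original.write_text("site,ph\nA,6.8\n")
        # "remote_stand_in" only simulates off-site storage for this demo.
        copies = back_up(original, [root / "second_local", root / "remote_stand_in"])
        return [sha256_of(p) for p in [original, *copies]]


demo_checksums = run_demo()
print(len(demo_checksums), "copies, all identical:", len(set(demo_checksums)) == 1)
```

Comparing checksums after copying catches silent corruption that a file-size check would miss.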

Data Share

Stage 7 Things to consider:

• Benefits

• Location

• Preparation

Stage 7: Data Sharing

Benefits of Data Sharing

• Promote new discoveries

• Enhance Impact

• Support Validation

• Encourage Collaboration

• Increase Public investment

• Reduce redundancy

Stage 7: Data Sharing

Locations

• Disciplinary repository

• Data journal

• Supplementary File

• Web-based tools

Stage 7: Data Sharing

Preparation for sharing

• Use consistent and meaningful file names

• Use self-explanatory variable names and abbreviations

• Remove redundant variables and labels

• Apply anonymization as needed

• Check copyright and privacy permissions
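For the "apply anonymization as needed" step, one simple technique is to drop direct identifiers and replace the participant ID with a keyed hash. Note that this is pseudonymization, which is weaker than full anonymization, and whether it is sufficient depends on your IRB protocol and data policies. The field names, the key, and the 12-character code length below are assumptions for illustration.

```python
import hashlib
import hmac


def pseudonymize(rows, id_field, drop_fields, secret_key):
    """Drop direct identifiers and replace the ID with a stable keyed-hash code."""
    cleaned = []
    for row in rows:
        digest = hmac.new(
            secret_key, str(row[id_field]).encode("utf-8"), hashlib.sha256
        ).hexdigest()
        kept = {
            key: value
            for key, value in row.items()
            if key != id_field and key not in drop_fields
        }
        # The same ID always maps to the same code, so records stay linkable.
        kept["participant_code"] = digest[:12]
        cleaned.append(kept)
    return cleaned


# Hypothetical records; keep the real key out of the shared dataset.
records = [
    {"participant_id": "P001", "name": "Jane Doe", "email": "jane@example.org", "score": "42"},
    {"participant_id": "P002", "name": "Rick Roe", "email": "rick@example.org", "score": "37"},
    {"participant_id": "P001", "name": "Jane Doe", "email": "jane@example.org", "score": "45"},
]
shared = pseudonymize(records, "participant_id", {"name", "email"}, b"hypothetical-secret-key")
for row in shared:
    print(row)
```

A keyed hash is used instead of a plain hash so that codes cannot be reproduced by someone who guesses the original IDs but does not hold the key.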

Summary

• Data management is the organization, storage, preservation, and sharing of data collected and used in a research project.

• Data management is critical in every stage of the data lifecycle

• Things to always remember:

• RECORD and TRACK

• NAME FILES

• STORE and BACKUP

• GUIDELINES and REQUIREMENTS

Free Data Management software

Service Description

Adobe Bridge

Adobe Bridge is free software for locally organizing images.

Figshare

Figshare is a multidisciplinary repository where users can make all of their research outputs available in a citable, shareable, and discoverable manner. Figshare allows users to upload any file format and view it in the browser, so that figures, datasets, media, papers, posters, presentations, and filesets can be disseminated. Figshare uses DataCite DOIs for persistent data citation.

Open Science Framework

The Open Science Framework (OSF) is a free, open source web application that connects and supports the research workflow, enabling scientists to increase the efficiency and effectiveness of their research. Researchers use the OSF to collaborate, document, archive, share, and register research projects, materials, and data.

XSEDE Bridges computing and storage

Bridges is an XSEDE national infrastructure facility hosted at the Pittsburgh Supercomputing Center (PA). The campus XSEDE champion is Aaron Culich (as of 2016). XSEDE offers free computing and storage to qualified researchers through a competitive application process.

XSEDE Storage Services

XSEDE is a set of national facilities that scientists can use to interactively share computing resources, data and expertise. People around the world use these resources and services — things like supercomputers, collections of data and new tools — to improve our planet. XSEDE resources include several services for storing research data.

References

https://guides.library.yale.edu/rdm_healthsci/home

https://pitt.libguides.com/managedata/understanding#s-lg-box-4890536

https://data.research.cornell.edu/content/readme

https://ukdataservice.ac.uk/deposit-data/preparing-data.aspx

https://dmptool.org/

https://www.dataone.org/

https://datadryad.org/stash

Questions?

Thank you

Online resources

Pre-submitted questions

Qualitative Data Management

• Create a data dictionary that contains:

• Dates

• Locations

• Individual or group characteristics

• Interview characteristics

• Other defining features

• Ensure fidelity of analyzed data

• Ethics requirements

• Version control

Mack N, Woodsong C, MacQueen KM, Guest G, Namey E. Qualitative Research Methods: A Data Collector's Field Guide.

Managing Geospatial Data

Review Matrix (for literature/references)

• Review Matrix: Using a spreadsheet/table to organize key elements

https://guides.library.vcu.edu/health-sciences-lit-review/organize