Successful Digital Data Management and Data Management Plans

Dee Ann Allison
Address: 317 Love Library
Telephone: 402-472-3944
Email: [email protected]

Leslie Delserone
Address: 210 C.Y. Thompson Library
Telephone: 402-472-6297
Email: [email protected]


Introduction to Data Management

– Data management is required for responsible research.

– Data management starts at the beginning of a project and continues through the life of the project.

– When a project is completed, the data should be stored in such a manner that they can be retrieved by the researcher or shared with a broader community (depending on the data management plan).

What is going to be collected? What are the formats? How will it be stored and protected? What will I do when the project is completed?

So how long do researchers have to keep records?

We had that question while reading the retraction notice for a 2007 paper published in the Journal of Clinical Investigation:

“It has come to our attention that some of the panels in Figure 5 appear to be duplications of each other. The authors are unable to provide the original source files that were used to generate these data. In the absence of the original data that verify the integrity of the images, the JCI Editorial Board has decided to withdraw Figure 5 from the scientific record. No issues have been raised in regard to any of the other data in this manuscript. The authors concur with this course of action.”

• Source: http://retractionwatch.wordpress.com/2013/07/19/jci-paper-retracted-for-duplicated-panels-after-authors-cant-provide-original-data/

Introduction to Data Management

Many federal grantors now require a Data Management Plan (DMP) as part of the proposal. You don’t want your proposal to be rejected because you didn’t do an adequate job with the DMP. While preparing your proposal, check the funding agency’s requirements more than once, because the requirements may be updated before the submission deadline.

NSF-BIO’s Requirements

What is a DMP? Data management plans typically focus on the data management process, during and after the grant period.

• Describe the data to be collected (including formats, data and metadata standards).

• Describe how the data will be stored and preserved.

• Describe the dissemination methods planned for the data and metadata.

• Describe policies for data sharing and public access (privacy, security).

• Describe roles/responsibilities of all parties involved in data management.

Source: http://www.nsf.gov/bio/pubs/BIODMP061511.pdf

These descriptions will depend on the field of research and nature of the study.

File Formats and Transformations

• Data exist in a variety of formats.
– Ideally a format should be nonproprietary.

• It is important to consider data normalization so data can be interpreted correctly.

• File compression can be useful but you must understand the consequences.

• De-identifying data is critical.

Formats

Select a format (or formats) that will be most usable for the life cycle of the project. The PRONOM registry (http://www.nationalarchives.gov.uk/PRONOM/Default.aspx#) is a useful site about formats. Domain registries may have required file formats for deposits. Librarians can answer your questions and make suggestions.

A file format is a way of encoding information within a digital file. A program or application must recognize the file format in order to access the file.

• Consider the following when deciding on a format that will work for your immediate needs and can be sustained for the long term:
– Verify that the technical specification for the format is available.
– Use a well-established format that is in wide use.
– Select a stable format that isn’t undergoing rapid change.
– Formats that support metadata are a plus because they can record provenance and technical characteristics.
– Avoid formats with built-in features and functionality that can become obsolete or be lost when files are migrated.
– Select formats with error detection so that corruption does not go unnoticed. For example, PNG stores a CRC checksum with each chunk of the file.
– Text files can be human readable and use a character encoding standard, while binary files can only be read by software that handles that format.
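The error-detection point can be made concrete with PNG. The sketch below assumes the standard PNG chunk layout (4-byte length, 4-byte type, data, CRC-32 over type and data). The helper names `make_chunk` and `png_chunks_ok` are hypothetical, and the check covers chunk integrity only, not whether the image is otherwise valid.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG-style chunk: length, type, data, CRC-32 of type+data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def png_chunks_ok(blob: bytes) -> bool:
    """Return True if the signature is present and every chunk CRC matches."""
    if not blob.startswith(PNG_SIGNATURE):
        return False
    pos = len(PNG_SIGNATURE)
    while pos < len(blob):
        if pos + 8 > len(blob):
            return False  # truncated chunk header
        (length,) = struct.unpack(">I", blob[pos:pos + 4])
        chunk_type = blob[pos + 4:pos + 8]
        end = pos + 8 + length + 4
        if end > len(blob):
            return False  # truncated chunk body or CRC
        data = blob[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", blob[end - 4:end])
        if zlib.crc32(chunk_type + data) & 0xFFFFFFFF != stored_crc:
            return False  # stored checksum does not match the content
        pos = end
    return True
```

A single flipped byte inside a chunk changes the computed CRC, so the file is reported as damaged instead of being read silently as if it were intact.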

Format challenges

Potential risks for loss or corruption on conversion or migration to new media:

• Word files: fonts, text formatting, headers, footers, footnotes, links to other documents, any embedded functionality.

• Numeric files: special characters (such as quotation marks), end of line returns, last characters in rows (due to row size limitations), last rows (due to row number limitations), formulas.

• Database files: as above but also relationships between tables.

• Multimedia files: artifacts or distortion, loss of layers, color, resolution, and changes in codecs.

• File sizes may change and even become much larger.
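One way to catch the numeric-file risks above (lost rows, mangled quotation marks, broken line returns) is to compare the data before and after a conversion. The sketch below uses Python's standard `csv` module as the stand-in conversion; the function names `csv_round_trip` and `conversion_report` are hypothetical.

```python
import csv
import io


def csv_round_trip(rows):
    """Write rows to CSV text and read them back, as a stand-in conversion."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    buffer.seek(0)
    return [row for row in csv.reader(buffer)]


def conversion_report(original, converted):
    """List the differences between the rows before and after a conversion."""
    problems = []
    if len(original) != len(converted):
        problems.append(f"row count changed: {len(original)} -> {len(converted)}")
    for index, (before, after) in enumerate(zip(original, converted)):
        if list(before) != list(after):
            problems.append(f"row {index} differs")
    return problems
```

An empty report means the row count and every cell survived. The same comparison can be pointed at the output of any other conversion tool, as long as both sides are read back as lists of strings.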

Normalization

• Interoperability requires normalized data.

• Normalized data are data that conform to common characteristics that can be easily re-used by different users and/or understood by different applications.

Some tips

• Agree on scale and other characteristics that support comparisons and interpretation.

• Qualitative (non-numeric) data should be coded as standardized numeric values. Statistical normalization then applies a formula to transform values onto a common scale so they can be better compared.

• Database normalization is a set of design rules that eliminate duplication and inconsistency.
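The coding and statistical normalization tips can be sketched as follows. Min-max scaling and z-scores are two common formulas, chosen here for illustration. The codebook passed to `encode_responses` is a hypothetical example, and all function names are assumptions.

```python
import statistics


def encode_responses(responses, codebook):
    """Code qualitative answers as numbers using an agreed codebook."""
    return [codebook[answer.strip().lower()] for answer in responses]


def min_max_scale(values):
    """Rescale values linearly onto the range 0..1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # no spread, nothing to rescale
    return [(value - low) / (high - low) for value in values]


def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return [0.0 for _ in values]
    return [(value - mean) / spread for value in values]
```

Whichever formula a project agrees on, recording it in the documentation lets later users reverse or repeat the transformation.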

File Compression

• File compression saves space on storage devices and reduces transmission time when files are transferred over networks.

• However, there are some issues with compression:
– Compressed files are subject to corruption and are more difficult to recover.
– Some files don’t save much space when compressed (e.g., multimedia files, which are usually already compressed), while text files can save much space (up to 90%).
– Avoid “lossy” file compression types.
– Compressed files can take a long time to decompress, so compression is best used for files that are no longer in active use.
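The difference in space savings can be checked directly. The sketch below uses lossless `zlib` compression from the Python standard library. Random bytes stand in for already-compressed multimedia content, which is an assumption made for illustration, and `compression_ratio` is a hypothetical helper name.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more savings)."""
    return len(zlib.compress(data, level=9)) / len(data)


# Repetitive text, similar to a simple data table, compresses very well.
text = b"site,temperature_c\n" + b"A1,21.5\n" * 2000

# Random bytes behave like already-compressed content: almost no savings.
noise = os.urandom(20000)

# Lossless compression returns exactly the original bytes on decompression.
restored = zlib.decompress(zlib.compress(text))
```

Comparing `compression_ratio(text)` with `compression_ratio(noise)` shows why compressing a multimedia archive again rarely pays off.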

Privacy & Confidentiality

• Privacy – individual is known/identified

• Privacy concern – what data may be collected and/or shared about this individual?

• Confidentiality – identity of individual is not to be disclosed

• Confidentiality concern – are there data that identify, or could be used to identify, a research participant?

De-identification of data

• Anonymization is the removal of any identifiers, such as name and address, that could be used to link data with an individual.

• Replace sensitive information with machine generated codes.

• Aggregate the data into related categories to provide analysis at a higher level than the level of the data collection. Census data is often aggregated.

• Perturbation: “Techniques for the release of data that change the data before the dissemination in such a way that the disclosure risk for the confidential data is decreased but the information content is retained as far as possible.” (Source: http://stats.oecd.org/glossary/detail.asp?ID=6950)
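Two of the techniques above, machine-generated codes and aggregation, can be sketched in a few lines. The field names (`name`, `participant_code`), the `P-` code prefix, and the ten-year age bands are hypothetical choices. Removing a name does not by itself guarantee that a record cannot be re-identified.

```python
import secrets


def pseudonymize(records, id_field="name"):
    """Replace the identifying field with a random code.

    Returns the cleaned records and the key (identifier -> code), which
    should be stored separately from the data under stricter access control.
    """
    key = {}
    cleaned = []
    for record in records:
        identifier = record[id_field]
        if identifier not in key:
            code = "P-" + secrets.token_hex(4)
            while code in key.values():  # regenerate on the rare collision
                code = "P-" + secrets.token_hex(4)
            key[identifier] = code
        row = {field: value for field, value in record.items() if field != id_field}
        row["participant_code"] = key[identifier]
        cleaned.append(row)
    return cleaned, key


def age_band(age, width=10):
    """Aggregate an exact age into a band such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"
```

The codes are random, not derived from the names, so they cannot be reversed without the key.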

Documentation and Metadata

• Documentation – careful information-gathering on the who, what, where, when, and why of your research

• Documentation – often an idiosyncratic, human-powered process

• Documentation formats – paper field or laboratory notebooks, digital photos, field laptop Word documents

Documentation and Metadata

• Metadata – “data about data”

• Metadata – often standards-based

• Metadata – describes the research data itself, but also captures other information (administrative, technical)

• Metadata – generated manually and/or automatically

Importance

• Required by funding agencies as part of your data management plan

• Establishes provenance for original research data

• Accurately describes the research data and the context of its collection and analysis

• Supports future re-use of the data; may enhance reproducibility

Application to Your Research

• What are the relevant data and metadata standards in your area of research/discipline?

• How will you capture the metadata (manually, automatically, both)?

• How will you merge documentation with metadata?

• Do you have a data dictionary, controlled vocabulary, or code book that should be shared?
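Capturing metadata both manually and automatically might look like the sketch below. Technical details (size, checksum, modification time) are read from the file, and descriptive fields are supplied by the researcher. The field names and the `.metadata.json` sidecar convention are hypothetical and do not follow any particular metadata standard.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def build_metadata(data_file: Path, manual_fields: dict) -> dict:
    """Combine researcher-supplied fields with automatically captured ones."""
    content = data_file.read_bytes()
    modified = datetime.fromtimestamp(data_file.stat().st_mtime, tz=timezone.utc)
    automatic = {
        "file_name": data_file.name,
        "size_bytes": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),
        "modified_utc": modified.isoformat(),
    }
    # Automatic values win if a manual field uses the same name.
    return {**manual_fields, **automatic}


def write_sidecar(data_file: Path, manual_fields: dict) -> Path:
    """Save the metadata next to the data file as JSON."""
    sidecar = data_file.with_name(data_file.name + ".metadata.json")
    record = build_metadata(data_file, manual_fields)
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

A discipline-specific standard would replace the ad hoc field names here; the split between captured and supplied values stays the same.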

Assistance

• Professional societies

• University Libraries Data Curation Task Force

• Librarian assigned to your department or program

Storage & Security

• Keeping data safe and secure is your responsibility.

• Follow the policies and practices set up by your college and the University, as well as by the funding agency.

• Information Technology Services can assist with security planning.

• Never accept or store Social Security Numbers (http://is.unl.edu/ssn).

Effective January 1, 2007, Social Security Numbers (SSNs), including any portion of the full nine digits, shall not be electronically collected, transmitted, or stored via University-sponsored services or using University-owned computing equipment, information systems or networks unless specifically authorized in writing by officials designated by the Chancellor. Individuals or departments that collect, transmit or store SSNs will take steps necessary to secure this data using best practices identified by the Chief Information Officer. Failure to comply with this policy after January 1, 2007 may result in disciplinary actions taken by the University.
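A coarse screen for SSN-like values before data are stored could be sketched as below. The pattern catches only the hyphenated nine-digit form (for example 123-45-6789), while the policy also covers partial numbers, so this is an illustration and not a compliance check. The function names are hypothetical.

```python
import re

# Matches three digits, two digits, four digits separated by hyphens.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(text: str) -> list[str]:
    """Return every SSN-like string found in a piece of text."""
    return SSN_LIKE.findall(text)


def rows_with_ssn_like(rows):
    """Return the indexes of rows in which any cell contains an SSN-like value."""
    flagged = []
    for index, row in enumerate(rows):
        if any(SSN_LIKE.search(str(cell)) for cell in row):
            flagged.append(index)
    return flagged
```

Flagged rows still need a human decision; the screen narrows down where to look.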

Questions

• Who will be responsible for securing sensitive data?

• Are there privacy issues?

• Are there data relating to human subjects?

• What are the policies/procedures/ethical requirements regarding these data?

• How will privacy and/or confidentiality requirements be enforced? By whom?

Keys to data security

• Know where your data are located (secure) and who has access.

• Know the number of copies and where they are stored.

• Safeguard all local copies (laptops, USB drives, etc.).

• Sanitize drives you no longer need to ensure data are not recoverable.

• Use encryption and firewalls when appropriate.

• Understand your rights when you deposit in a domain depository.

• Back up your data.

• Never share passwords or leave them out for someone to see.

• Keep your virus protection up to date.

• Take precautions when you send data via e-mail.

• Ensure that remote access (off campus) connections are made securely using SSH or VPN.

• Use a centrally managed server when possible.
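Backing up is only useful if the copy matches the original. The sketch below copies a file and confirms the two SHA-256 checksums agree. The helper names `sha256_of` and `back_up` are hypothetical, and a real backup routine would also need scheduling and off-site copies.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a file's SHA-256 checksum, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy a file into backup_dir and verify the copy against the original."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup of {source} does not match the original")
    return target
```

Re-running `sha256_of` on stored copies at intervals also detects silent corruption long after the backup was made.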

Data Rights and Access

• Are any of the data copyrighted? Who holds copyright?

• Third-party ownership of the data?

• Licensing/repository – policy on re-use of the data, its citation, production of derivatives?

• Metadata requirements to ensure accessibility?

Data Rights and Access

• What are the funder’s expectations regarding data dissemination and sharing? Are the data subject to open-access requirements? FOIA requests?

• Licensing (Creative Commons, Open Knowledge Foundation)

• Access to data and metadata through trusted repositories (e.g., DRYAD, ICPSR)

Domain Registries/Repositories

• There are a variety of registries where you can store your files.

http://oad.simmons.edu/oadwiki/Data_repositories

1 Archaeology
2 Astronomy
3 Biology
4 Chemistry
5 Computer Science
6 Energy
7 Environmental sciences
8 Geology
9 Geosciences and geospatial data
10 Linguistics
11 Marine sciences
12 Medicine
13 Multidisciplinary repositories
14 Physics
15 Social sciences

More Registries/Repositories

• DataCite – to find, access, and reuse data. It has a listing, DataBIB (http://www.datacite.org/repolist), which is a tool for helping people identify and locate online repositories of research data. Users and bibliographers create and curate records that describe data repositories that users can search.

• Digging into Data (http://www.diggingintodata.org/Home/Repositories/tabid/167/Default.aspx) has a listing of digital libraries, data archives and repositories.

• re3data.org registry (http://service.re3data.org/search/results?term) covers research data repositories from all academic disciplines.

• UNL Data Registry (https://dataregistry.unl.edu/), where you can deposit and register your data, even when they aren’t stored in our registry, to improve their discovery.

Example-Life Sciences

• Biodiversity research (not of at-risk species) – physical samples, digital data

• Data organized in Excel spreadsheets

• Pictures annotated in Photoshop

• Stored in multiple locations and in both paper and digital formats

Life Sciences DMP

DMP Requirements

• Data description, metadata, formats

• Storage, preservation, dissemination, retention

• Sharing

• Responsibilities

DMP Descriptions

• Soil/plant samples, animal specimens, digital data (measurements, pictures)

• Metadata: Darwin Core

• Excel to .csv; JPEGs

• .csv batchload into open-source database, JPEGs included; basic security (password-protection, variable access to database)

• UNL Data Registry; DRYAD; Global Biodiversity Information Facility

• Long-term retention of all digital data & specimens; limited retention of physical samples

• Licensing considerations, so that attribution is assured

• Data retained within program; backup & data integrity checks assigned to specific personnel
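The ".csv batchload into open-source database" step might be sketched as follows, with SQLite standing in for whichever database the project selects. The columns `scientificName`, `eventDate`, `decimalLatitude`, and `decimalLongitude` are Darwin Core terms. The `imageFile` column, the table name, and the function name are hypothetical. Storing the JPEG file name in place of the image itself is an assumption.

```python
import csv
import io
import sqlite3


def load_occurrences(csv_text: str, connection: sqlite3.Connection) -> int:
    """Batch-load CSV rows into an 'occurrence' table; return the row count."""
    connection.execute(
        """CREATE TABLE IF NOT EXISTS occurrence (
               scientificName   TEXT NOT NULL,
               eventDate        TEXT,
               decimalLatitude  REAL,
               decimalLongitude REAL,
               imageFile        TEXT
           )"""
    )
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    rows = [
        (
            record["scientificName"],
            record["eventDate"],
            float(record["decimalLatitude"]),
            float(record["decimalLongitude"]),
            record["imageFile"],
        )
        for record in reader
    ]
    with connection:  # commit all rows together, or none if an insert fails
        connection.executemany("INSERT INTO occurrence VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)
```

Comparing the returned row count with the number of rows in the spreadsheet export gives a quick check that nothing was dropped during the load.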

Example-Social Sciences

• Educational research – college students

• Personal data involved (e.g., names, grades)

• Survey data, testing data, personal interviews

• Data organization will involve coding & de-identification

Social Sciences DMP

DMP Requirements

• Data description, metadata, formats

• Storage, preservation, dissemination, retention

• Sharing

• Responsibilities

DMP Descriptions

• Interview questions and answers (print and audio), names, grades, test scores

• Metadata: Data Documentation Initiative (DDI)

• Excel to .csv; .mp3 files

• Additional security (print, locked cabinet), hard drives with additional password protection and encryption

• UNL Data Registry

• Limited retention; data not useful after 5 years

• Data retained within program; backup & data integrity checks assigned to specific personnel; limited personnel access

Additional Resources

• Subject librarians can assist with data management plans given some advance notice. There is a list of librarians at http://libraries.unl.edu/subjspecialists

• There is a guide to data management at http://unl.libguides.com/datamanagement

• The Libraries’ website about data management includes a link to the UNL Data Registry at http://libraries.unl.edu/data-management