Data Management and Data Management Planning
Boston College OSP Briefing -- Nov. 21, 2017
Enid Karr, Sr. Bibliographer for Biology, Earth & Environmental Sciences, Environmental Studies
Barbara Mento, Data/GIS Librarian, Sr. Bibliographer for Computer Science, Economics, Mathematics
Sally Wyman, Sr. Bibliographer for Chemistry, Physics, Environmental Studies
Fits into “responsible conduct of research” compliance
Risk of data loss to researcher and to University
Facilitates fulfillment of requests from others to see individual researcher data
Preserves understanding of details for later
Shared data (“open access”) is associated with a higher citation rate!
Why Have a Data Management Plan (“DMP”)?
NSF – all grants since Jan. 18, 2011 require 2-page DMP
Directorate for Mathematical and Physical Sciences (MPS), Division of Chemistry (CHE)
NIH – since Oct. 1, 2003, applications requesting $500,000 or more in direct costs in any single year must include a data sharing plan
DOE – all grants since Oct. 1, 2014 require DMP that describes how sharing/preservation will enable validation of results (or statement that describes how these goals will be met without sharing/preservation)
Many more agencies have required some level of sharing or a DMP since the Feb. 22, 2013 OSTP memorandum
Most federal grants now require this:
More scholarly publishers do, too, now:
American Economic Association journals, Nature, Science, PNAS, PLoS, etc. require or encourage that data be:
clearly documented
available for sharing
detailed enough to permit replication of analysis
“Data” journals are now published: Nature Publishing Group’s Scientific Data provides a place for curated descriptions of scientifically valuable datasets to foster sharing and re-use of data.
There are indexes that focus on data … Data Citation Index
Many grants now require this, but it’s bigger than that:
What is “data”? From the NSF FAQ on Data Management Plans:
A “DMP” covers recorded factual material commonly accepted in the [specific] scientific community as necessary to validate research findings. May include, but is not limited to:
Data
Publications
Samples
Physical collections
Software and models
But not: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. (Office of Management and Budget (OMB) Circular A-110)
Elements of a “Typical” Data Management Plan (DMP)
1-2 pages describing the project and how data will be:
Collected (including formats, size, etc.) … Secured … Analyzed … Shared … Preserved
Details about access/sharing
Potential audience(s) for the data
How access will be provided and how others will find it: “Access” (freely-available) vs. “Sharing” (by request)
Stipulations for privacy, confidentiality, IP or other rights
Allowed re-use of the data, derivative products
Metadata standards to be used
How long data will be retained -- archiving, long-term preservation and format migration
Boston College Libraries Data Management Plan
Research Guide
Guidance on content
Templates/examples
Additional resources
DMPTool
Dataverse
(More later in session)
To arrange a consultation with a subject specialist
https://libguides.bc.edu/dataplan
Key DMP Concepts
1. A “Readme” file or Code Book
2. Use of “open” (non-proprietary) file formats
3. Consistent naming practices for all files
4. Metadata
5. Back up plan
6. Long-term storage strategy
7. Data sharing
8. Plan for true “archiving”
Key DMP Concept #1
A “Readme” file or “Code Book”
This file (or document) describes the process used to collect the data, how the data is stored and backed up, the file formats chosen … and more, as described below.
Key DMP Concept #2
Use of “open” file formats, avoiding proprietary formats
Whenever possible, researchers should save data using open standards. Some examples:
TXT, PDF/PDF Archival, not Word (doc, docx)
ASCII, not Excel (xls, xlsx)
MPEG-4, not Quicktime (qtff)
TIFF or JPEG2000, not GIF or JPG
XML or RDF, not RDBMS
Ideally, files are saved in both original format AND one of the preferred ones listed above.
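One way to keep an open-format copy next to the original is sketched below. It assumes the tabular data is already available in memory as rows; the function name `save_open_copy` is illustrative and not something named in the briefing. Only the Python standard library is used.

```python
import csv
from pathlib import Path


def save_open_copy(rows, header, csv_path):
    """Write tabular data to a UTF-8 CSV file (plain text, non-proprietary)."""
    csv_path = Path(csv_path)
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)   # column names first
        writer.writerows(rows)    # then one line per record
    return csv_path
```

The original spreadsheet stays untouched; the CSV is an additional copy that can be opened without the original software.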
Key DMP Concept #3
Consistent naming practices for all files
File names should be brief and unique, and might contain:
Project acronyms, researcher initials, file type information, version, date, file status, like this one:
Internet Usage Study version 2, Sept. 2011, final draft, in csv format: IUS_v02_092011_final.csv
Evidence of maintenance of both archival (unmodified) and updated “versioned” files (clearly labelled)
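A naming convention is easier to keep consistent when the name is assembled by a small helper rather than typed by hand. The sketch below reproduces the pattern of the example above; the function name `build_filename` and the choice of a two-digit version and MMYYYY date are assumptions read off that one example.

```python
from datetime import date


def build_filename(project, version, when, status, extension):
    """Compose a name such as IUS_v02_092011_final.csv from its parts."""
    return f"{project}_v{version:02d}_{when:%m%Y}_{status}.{extension}"


print(build_filename("IUS", 2, date(2011, 9, 1), "final", "csv"))
# IUS_v02_092011_final.csv
```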
Organization of Files
Directory Structure
Use folders!
Possible ways to organize:
By types of data
fMRI, interview, video
By experiment/study
By collection method
Choose option that works best for your research group … it should be understandable to others
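Whichever scheme a group picks, creating the folders from a script keeps the layout identical across projects. This is a minimal sketch that organizes by type of data; the function name `create_project_folders` is hypothetical, and the default folder names are simply the examples listed above.

```python
from pathlib import Path


def create_project_folders(root, data_types=("fMRI", "interview", "video")):
    """Create one subfolder per data type under the project root."""
    root = Path(root)
    created = []
    for name in data_types:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(folder)
    return created
```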
Version Control of Files
Keep an archival (unmodified) version, and updated versions (clearly labelled)
Use whole numbers (1, 2, 3) for major changes and decimals for minor changes (V1.1, V1.2 …)
Version control software can help, and some software has this built-in… especially instrument software
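The numbering scheme (whole numbers for major changes, decimals for minor ones) can be sketched as a small helper. The name `next_version` and the exact label format are assumptions for illustration; dedicated version control software would normally replace this kind of manual labelling.

```python
def next_version(current, major=False):
    """Return the next label: V1.1 -> V1.2 (minor) or V1.2 -> V2 (major)."""
    whole, _, minor = current.lstrip("Vv").partition(".")
    if major:
        return f"V{int(whole) + 1}"
    return f"V{whole}.{int(minor or 0) + 1}"
```

The archival (unmodified) file keeps its original label; only working copies receive a new one.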
Key DMP Concept #4
Metadata … It’s “data about the research data”
This “data” (subject-based terminology) helps others discover your data (more about this, shortly), helps YOU remember, and may be needed for later journal publication/data deposit…
Metadata standards exist
per discipline
per type of data (.cif for crystallographic data, for example)
per individual repository (ICPSR, GenBank, etc.)
Metadata is recorded in the “readme” file or code book
Data Documentation (“Metadata”)
Metadata captures the most critical information about a particular project. Best when captured early on… helps jog memories later …
It helps others discover the research being shared.
Metadata may be required for journal publication/data deposit.
For help, contact your Subject Librarian
ISO suggested Minimum Data Elements
• Title
• Creator (Principal Investigators)
• Date Created (also versions)
• Instrument and model
• Format (and software required)
• Subject
• Unique Identifier
• Description of the specific data resource
• Coverage of the data (spatial or temporal)
• Publishing Organization
• Type of Resource
• Rights
• Funding or Grant
Data Documentation – What do you do with it once you have it?
Record it in a readme.txt file
In some fields, “codebooks” are used to record methodology and other data management notes (e.g. IRB compliance statements, etc.)
Consider including a “data dictionary”
Deposited alongside the data, these files facilitate “discovery” of your data on the Web
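As a rough illustration, a readme.txt can be generated from the minimum data elements listed earlier, with a report of which elements are still missing. The function name `write_readme`, the shortened element labels and the one-line-per-element layout are assumptions, not a format prescribed by the briefing or by any repository.

```python
MINIMUM_ELEMENTS = (
    "Title", "Creator", "Date Created", "Instrument and model", "Format",
    "Subject", "Unique Identifier", "Description", "Coverage",
    "Publishing Organization", "Type of Resource", "Rights", "Funding or Grant",
)


def write_readme(metadata, path):
    """Write known elements as 'Name: value' lines; return the missing ones."""
    missing = [name for name in MINIMUM_ELEMENTS if name not in metadata]
    with open(path, "w", encoding="utf-8") as handle:
        for name in MINIMUM_ELEMENTS:
            if name in metadata:
                handle.write(f"{name}: {metadata[name]}\n")
    return missing
```

Returning the missing elements instead of raising an error lets the readme be started early in a project and completed over time.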
Key DMP Concept #5
Back up Strategy
Regular, scheduled back up protects against loss
Back up strategy will depend on your needs, collection volume, update frequency, etc.:
Back up all versions of the files or certain ones?
How often to back up files?
List at least two back up locations (so, 3 copies in total)
Internal (researcher computer)
External (e.g. the BC Research Data Archive or departmental servers)
Assign responsibility for backing up
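A back up is only useful if the copies are intact, so a script can copy a file to each location and compare checksums. The sketch below assumes the back up locations are reachable as ordinary folders (for example, mounted drives); the function names `sha256_of` and `back_up` are illustrative.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, destinations):
    """Copy source into each destination folder and verify every copy."""
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(source, folder / source.name))
        if sha256_of(copy) != expected:
            raise IOError(f"Checksum mismatch for {copy}")
        copies.append(copy)
    return copies
```

Calling `back_up` with two destination folders yields the three copies mentioned above: the original plus two verified duplicates. Running it on a schedule is left to the operating system's task scheduler.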
Key DMP Concept #6
Long-term Storage Strategy
Plan should describe how data will be stored … in the safest long-term locations (not a laptop or flash-drive!)
Local (lab computer, flash-drive): convenient, but much less secure
Centralized – ITS Servers, departmental servers, BC ELN server
Remote – BC Dataverse, disciplinary servers (GenBank, ICPSR, etc.): most tailored to disciplinary needs, but may be open (and that may be problematic for some researchers)
Grants can sometimes cover cost of long-term storage.
Key DMP Concept #7
Data Sharing – … the ultimate goal of DMPs
Options include: personal website … but researchers can do better:
Journal “supplementary materials” (ACS, etc.) … now in figshare
Institutional repository, e.g. eScholarship@BC, BC Dataverse
Disciplinary (or multidisciplinary) repository, e.g. Cambridge Structural Database
Or, a combination: journal-designated repository ( Nature or Review of Economics and Statistics Dataverse, for two examples)
Examples of Subject Repositories
Biomedicine:
GenBank* -- sequence data
RCSB Protein Data Bank* -- biomolecule crystal structure coordinates, etc.
Chemistry:
Cambridge Structural Database (CSD)*
PubChem (Part of NCBI Entrez, covering biological activities of small molecules)
Social Sciences
ICPSR (Inter-university Consortium for Political and Social Research)
IQSS (The Institute for Quantitative Social Science)
Multidisciplinary: figshare.com (Open, Free)
Finding Aid: https://www.re3data.org/ -- search for repositories in all disciplines
*A few of the data repositories that fulfill Science magazine requirements for data deposition
Key DMP Concept #8
Archiving Plans
Archiving data means preserving it not just in the original format but also in a platform-independent format, using a standard that ensures the data can be re-used in the future.
Additional Essential Elements in the DMP:
Ethics/privacy
Data Ownership
Intellectual Property/Technology Transfer
Ethics and Privacy
Sensitive data should be redacted before depositing in a public archive or repository.
Access to data may be embargoed (access limited for a time) for confidentiality, legal, patentability or other reasons.
Dark archives ensure permanent protection of confidentiality.
Where human subjects/privacy is involved, BC’s Institutional Review Board (IRB) must approve. https://www.bc.edu/research/oric/human.html
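For tabular data, one simple form of redaction is dropping the columns that hold direct identifiers before a copy is deposited. The sketch below assumes each record is a Python dictionary; the function name `redact_rows` is hypothetical, and which fields count as sensitive is a decision for the researcher and the IRB, not for the script. Removing columns alone does not guarantee that a data set is de-identified.

```python
def redact_rows(rows, sensitive_fields):
    """Return copies of the rows with the named sensitive fields removed."""
    blocked = set(sensitive_fields)
    return [
        {key: value for key, value in row.items() if key not in blocked}
        for row in rows
    ]
```

The input rows are left unchanged, so the unredacted original can stay in restricted storage while the redacted copy is shared.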
Good DMP, then What Happens? Data is Shared, then … Cited
Data Citation
Why is proper data citation important?
Ensures that original producers of the data are credited in citation indexes*
Allows researchers to locate research data used in an article
May be required by the archive that stores the data to be repurposed
*Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
Citing Data Sets
Essential citation elements; style will vary:
• author or creator
• title or description
• year of publication
• publisher and/or the database/archive from which it was retrieved
• the URL or DOI if the data set is online
Hitchcock, Colleen; Manning, Deirdre; Keegan, Kevin; Utsun, Ekin, 2016, "Boston College tree inventory data archive", https://doi.org/10.7910/DVN/IBSB2R, Harvard Dataverse, V1
Pitt-Catsouphes, Marcie, and Steven Sweet. Talent Management Study: U.S. Workplaces In Today's Business Environment, 2009. ICPSR34836-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2013-09-09. http://doi.org/10.3886/ICPSR34836.v1
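Because the essential elements are the same whatever the style, a citation can be assembled from them programmatically. The sketch below follows the ordering of the Dataverse example above; the function name `format_data_citation` and its parameters are illustrative, and other styles would order or punctuate the same elements differently.

```python
def format_data_citation(authors, year, title, identifier, publisher, version=None):
    """Join the essential citation elements in a Dataverse-like order."""
    parts = ["; ".join(authors), str(year), f'"{title}"', identifier, publisher]
    if version:
        parts.append(version)
    return ", ".join(parts)
```

Keeping the elements as separate fields (for example, in the readme file) makes it straightforward to re-emit the citation in whatever style a journal requires.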
Library Tools and Resources to Help
Barbara, Enid and Sally! We are here to help.
Research Guide for Data Management Plans https://libguides.bc.edu/dataplan
The BC DMP Tool http://www.bc.edu/sites/libraries/dmptool/
Dataverse https://libguides.bc.edu/dataverse
eScholarship@BC https://dlib.bc.edu/
ORCID at BC – unique researcher identifier https://bc.edu/orcid
Dataverse
Map created with data deposited in Dataverse.
Visualization of inventories, species and health of trees to determine BC carbon footprint in 2010.
Data is being repurposed in BC Ecology and Evolution class
From presentation at IRODS Meeting: “Dataverse DataTags: Sharing Data You Can’t Share.” Merce Crosas, Ph.D., Director of Data Science, IQSS, Harvard University, 2014
Dataverse version 5 will include DataTags for confidential data. Coming December 2017
eScholarship@BC is our institutional repository
• Faculty can deposit scholarly work including:
– Working papers, published articles, teaching materials, conference presentations, posters
• Reasons to deposit:
– Compliance with funder mandates for open access
– Global visibility and readership
– Search engine harvesting
– Eliminates economic barriers to knowledge
– Increase citation counts
– Get a permanent URL for the CV
– Long-term preservation
• Link your data to accompanying publications
What is ORCID
✓ Unique, persistent identifier for researchers & scholars
✓ Follows you wherever you go
✓ Constant through life events and name changes
Benefits
✓ Improves visibility in field
✓ Self-updating CV
✓ Eliminates name ambiguity
Less than 60 seconds to sign up: bc.edu/orcid
[Diagram: ORCID connects the Faculty Annual Report, eScholarship@BC, Faculty Profiles, manuscript submission, grant applications and professional organizations. bc.edu/orcid]
ORCID at BC
✓ Import citations to Faculty Annual Report
✓ Display on work in eScholarship@BC
✓ Display on faculty profiles
✓ Use in grant and manuscript submissions
More Ways We Can Help: Librarians are expert searchers
Need information on the broader impacts or applications of the proposed work? This may call for a more comprehensive literature search, particularly into literature outside of the immediate field – ask your subject specialist librarian.
Who else is doing similar research? The Scopus database can make this more visible. See next slide.
Who is funding similar work? Scopus and Web of Science both report funder sources and both allow searching by funder to find funded work.
Where to publish/how to best disseminate the work?
JCR Database – use it to find journal Impact Factors (IF)
Scopus – use it to find CiteScore, SJR, SNIP – alternatives to IF