Data Management for Research
Aaron Collie, MSU Libraries
Lisa Schmidt, University Archives
Introductions
• Please tell us your name and department
• A brief description of your primary research area
• What do you consider to be your research data?
• Optional: Experience managing research data? Experience writing a data management plan?
cc http://www.flickr.com/photos/quinnanya/
Agenda
• Introductions
• Background
  • Definitions
  • Upfront Decisions
  • Data Sharing Impacts
• Fundamental Practices
  • File Organization
  • Data Documentation
  • Reliable Backup
• Data Lifecycle Strategy
Why are we here?
But why are we really here?
An Impetus: NSF recently released a mandate that all grant applications submitted after January 18th, 2011 must include a supplemental “Data Management Plan”
An Effect: This mandate from NSF has had a domino effect, and many funders now require or state guidelines for data management of grant-funded research
A Challenge: Data management (and oftentimes research methods in general) is an area that has not traditionally received a full treatment in most graduate and doctoral curricula
What is meant by “data management”?
Fundamental Practices
• File Organization
• Data Documentation
• Reliable Backups
Data Lifecycle
• Digital Sustainability
• Scholarly Communication
• Data Publishing
• Research Impact
• Effective January 18, 2011
• NSF will not evaluate any proposal missing a DMP
• May be up to two pages long
• PI may state that project will not generate data or samples
• DMP is reviewed as part of intellectual merit or broader impacts of application, or both
• Costs to implement DMP may be included in proposal’s budget
NSF’s Data Management Guidelines
• Policies for re-use, re-distribution, and creation of derivatives
• Plans for archiving data, samples, and other research outcomes, maintaining access
• Types of data, samples, physical collections, software generated
• Standards for data and metadata format and content
• Access and sharing policies, with stipulations for privacy, confidentiality, security, intellectual property, or other rights or requirements
Other Federal Policies
NASA “promotes the full and open sharing of all data”
“requires that data…be submitted to and archived by designated national data centers.”
“expects the timely release and sharing of final research data”
"IMLS encourages sharing of research data."
“…should describe how the project team will manage and disseminate data generated by the project”
Upfront Decisions for Researchers
• What is the expected lifespan of the data?
• Besides the researcher(s) on the project, who else should be given access to the data?
• Does the dataset include any sensitive information?
• Who owns or controls the research data?
• Should any restrictions be placed on the dataset?
• How are the data stored and preserved?
Upfront Decisions for Researchers (cont.)
• How might the data be used, reused, and repurposed?
• How are the data described and organized?
• Who are the expected and potential audiences for the datasets?
• What publications or discoveries have resulted from the datasets?
• How should the data be made accessible?
Data Sharing Impacts
• Reinforces open scientific inquiry
• Encourages diversity of analysis and opinion
• Promotes new research, testing of new or alternative hypotheses and methods of analysis
• Supports studies on data collection methods and measurement
Cc http://www.flickr.com/photos/pinchof_10/
Data Sharing Impacts (cont.)
• Facilitates education of new researchers
• Enables exploration of topics not envisioned by initial investigators
• Permits creation of new datasets by combining data from multiple sources
File Organization Practices: Overview
1. Create a file plan for your research project
2. Design a file naming convention that works for your project
3. Agree on a version control method to assist with file synchronization
4. Carefully choose file formats to maximize usefulness
“When I was a freshmen I named my assignments Paper Paperr Paperrr Paperrrr”-Undergrad
1. Create a file plan for your research project
File plan as a classification system
• Indexed: makes it easier to locate folders/files
• Primary subjects: main functions of research project
• Secondary subjects: more specific activities of project, including research data
• Tertiary subjects: limit by date or equivalent
• File Name (naming conventions)
1. Create a file plan for your research project (cont.)
Example documentation of Directory Hierarchy: /[Project]/[Grant Number]/[Event]/[Date]
Example documentation of File Naming Convention: [investigator]_[method]_[descriptor]_[YYYYMMDD]_[version].[ext]
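The two documented patterns above can be turned into small helper functions so that every project member produces the same paths and names. The following is a minimal sketch, not a prescribed tool: the function names are hypothetical, and it assumes the [Date] folder uses the same YYYYMMDD form as the file name.

```python
from datetime import date
from pathlib import PurePosixPath


def build_directory(project: str, grant_number: str, event: str, when: date) -> PurePosixPath:
    """Return /[Project]/[Grant Number]/[Event]/[Date].

    Assumption: the [Date] folder is written as YYYYMMDD.
    """
    return PurePosixPath("/", project, grant_number, event, when.strftime("%Y%m%d"))


def build_file_name(investigator: str, method: str, descriptor: str,
                    when: date, version: str, ext: str) -> str:
    """Return [investigator]_[method]_[descriptor]_[YYYYMMDD]_[version].[ext]."""
    return f"{investigator}_{method}_{descriptor}_{when:%Y%m%d}_{version}.{ext}"


if __name__ == "__main__":
    # All values below are made up for illustration.
    day = date(2011, 1, 17)
    print(build_directory("krillStudy", "GRANT0001", "fieldTrip", day))
    print(build_file_name("sharpeW", "micrograph", "backscatter3", day, "v001", "tif"))
```

Keeping the pattern in one place means a change to the convention is made once rather than remembered by each person.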
2. Design a file naming convention that works for your project
Why file naming conventions?
• Enable better access/retrieval of files
• Create logical sequences for file sorting
• More easily identify what you’re searching for

• Meaningful but short (255 character limit)
• Descriptive while still making sense
• Capital letters or underscores differentiate between words
• Surname first followed by initials of first name
• More on handout
2. Design a file naming convention that works for your project (cont.)
This: sharpeW_krillMicrograph_backscatter3_20110117.tif
Not This: KrillData2011.tif
This: borgesJ_collocation_20080414.xml
Not This: Borges_Textbase.xml
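A naming convention is easier to keep when it can be checked automatically. The sketch below is one possible check, under stated assumptions: the regular expression is an interpretation of the "This" examples (lowercase surname plus initials, underscore-separated descriptors, a YYYYMMDD date, an optional v/d number), and the function name is hypothetical.

```python
import re
from datetime import datetime

# Assumed pattern, inferred from the examples; adjust it to your own convention.
NAME_PATTERN = re.compile(
    r"^[a-z]+[A-Z]{1,2}"      # surname followed by initial(s), e.g. sharpeW
    r"(?:_[A-Za-z0-9]+)+"     # one or more underscore-separated descriptors
    r"_(?P<date>\d{8})"       # date written as YYYYMMDD
    r"(?:_[vd]\d{3})?"        # optional version (v) or draft (d) number
    r"\.[A-Za-z0-9]+$"        # file extension
)


def follows_convention(file_name: str) -> bool:
    """Return True if file_name matches the assumed pattern and has a real date."""
    if len(file_name) > 255:  # the 255 character limit mentioned in the slides
        return False
    match = NAME_PATTERN.match(file_name)
    if match is None:
        return False
    try:
        datetime.strptime(match.group("date"), "%Y%m%d")
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for name in ("sharpeW_krillMicrograph_backscatter3_20110117.tif", "KrillData2011.tif"):
        print(name, follows_convention(name))
```

A check like this could be run over a project folder before files are shared or archived.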
3. Agree on a version control method to assist with file synchronization
• Version number of the record is indicated in the file name by “v” followed by the version number
• Letter “d” indicates a draft
Examples of simple version control:
• waltM_lakeLansing_fieldNotes_20091012_v002.doc
• petersK_OrgChart2009_d001.svg
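The "v"/"d" suffix can be incremented by a short script instead of by hand, which avoids skipped or duplicated numbers. This is a rough sketch only: the function name is hypothetical, and it assumes the three-digit, zero-padded numbering seen in the two examples.

```python
import re

# Assumption: the marker is "_v" or "_d" plus three digits, right before the extension.
VERSION_SUFFIX = re.compile(r"_(?P<kind>[vd])(?P<number>\d{3})(?P<ext>\.[^.]+)$")


def next_version(file_name: str) -> str:
    """Return file_name with its version or draft number increased by one."""
    match = VERSION_SUFFIX.search(file_name)
    if match is None:
        raise ValueError(f"no version or draft marker in {file_name!r}")
    number = int(match.group("number")) + 1
    stem = file_name[:match.start()]
    return f"{stem}_{match.group('kind')}{number:03d}{match.group('ext')}"


if __name__ == "__main__":
    print(next_version("waltM_lakeLansing_fieldNotes_20091012_v002.doc"))
    print(next_version("petersK_OrgChart2009_d001.svg"))
```

Saving each revision under the next number, rather than overwriting, keeps earlier versions available for comparison.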
4. Carefully choose file formats to maximize usefulness
• Non-proprietary
• Open, documented standard
• Common usage by research community
• Standard representation (ASCII, Unicode)
• Unencrypted
• Uncompressed
Documentation Practices: Overview
1. At minimum create a README file that you can use to document your project
2. Utilize standards for describing data including Metadata Standards
3. If applicable, use in-line code commentary to explain code
(cc) Will Scullin
1. At minimum create a README file that you can use to document your project
At minimum, store documentation in readme.txt file or equivalent, with data
Resource: http://libraries.mit.edu/guides/subjects/data-management/metadata.html
2. Utilize standards for describing data including Metadata Standards
• “Data about data”
• Standardized way of describing data
• Explains who, what, where, when of data creation and methods of use
• Provides the essential tools for discovery, such as a bibliographic citation
2. Utilize standards for describing data including Metadata Standards (cont.)
Basic project metadata:
• Title
• Creator
• Identifier
• Subject
• Funders
• Rights
• Access Information
• Language
• Dates
• Location
• Methodology
• Data Processing
• Sources
• List of File Names
• File Formats
• File Structure
• Variable List
• Code Lists
• Versions
• Checksums
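One way to make sure none of the basic project metadata fields is forgotten is to generate the readme.txt skeleton from the list itself. The sketch below assumes a simple "Field: value" layout and a "TODO" placeholder for anything not yet filled in; both are illustrative choices, not a standard, and the function name is hypothetical.

```python
from typing import Mapping

# Field names taken from the basic project metadata list above.
README_FIELDS = [
    "Title", "Creator", "Identifier", "Subject", "Funders", "Rights",
    "Access Information", "Language", "Dates", "Location", "Methodology",
    "Data Processing", "Sources", "List of File Names", "File Formats",
    "File Structure", "Variable List", "Code Lists", "Versions", "Checksums",
]


def render_readme(values: Mapping[str, str]) -> str:
    """Return README text with one 'Field: value' line per metadata field.

    Fields missing from values are marked TODO so gaps stay visible.
    """
    lines = [f"{field}: {values.get(field, 'TODO')}" for field in README_FIELDS]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(render_readme({"Title": "Krill micrographs", "Language": "English"}))
```

The returned text can be saved as readme.txt in the same folder as the data it describes.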
Documentation Practices: Example Metadata Standards
• Dublin Core: Easy-to-create-and-maintain descriptive format to facilitate cross-domain resource discovery on the Web
• Darwin Core: Facilitates reference and sharing of biological diversity datasets
• Data Documentation Initiative (DDI): Methodology for content, presentation, transport, and preservation of metadata about datasets in the social and behavioral sciences
• Directory Interchange Format: Descriptive format for exchanging information about earth science data
• ISO 19115:2003: Describes geographic data such as maps and charts
• PBCore: Supports description and exchange of media assets, including both individual clips and full, edited, aired productions
• Science Data Literacy Project: Metadata for astronomy, biology, ecology and oceanography
• VRA Core: Data standard for description of works of visual culture as well as images that document them
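To give a feel for what a record in one of these standards looks like, the sketch below writes a few Dublin Core elements as XML using Python's standard library. The element names (title, creator, date, format) and the namespace URI come from the Dublin Core element set; the wrapping "metadata" element, the function name, and the sample values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Mapping

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields: Mapping[str, str]) -> str:
    """Return an XML string with one dc:<name> element per entry in fields."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # assumed wrapper element, not part of Dublin Core
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(dublin_core_record({
        "title": "Krill micrographs",
        "creator": "Sharpe, W.",
        "date": "2011-01-17",
        "format": "image/tiff",
    }))
```

A repository or discipline may expect a different wrapper or additional elements, so check its guidelines before depositing.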
3. If applicable, use in-line code commentary to explain code
Example of R code commentary
# Cumulative normal distribution function
pnorm(c(-1.96, 0, 1.96))
Backup Practices: Overview
1. Avoid single points of failure
2. Understand the different types of storage
3. Ensure data redundancy
4. Aim for geographic distribution of data
1. Avoid single points of failure
A single point of failure occurs when it would only take one event to destroy all data on a device (e.g. dropped hard drive)
Good practices for avoiding single points of failure:
• Use managed networked storage whenever possible
• Move data off of portable media
• Never rely on one copy of data
• Do not rely on CD or DVD copies to be readable
• Be wary of software lifespans (e.g. ANGEL)
2. Understand the different types of storage
• Flash Drives
• Internal Hard Drives
• External Hard Drives
• Server and Web Storage
• Managed Networked Storage
• Cloud Storage
3. Ensure data redundancy
Backup Do’s:
• Make 3 copies
  • E.g. original + external/local + external/remote
  • E.g. original + 2 formats on 2 drives in 2 locations
• Geographically distribute and secure
  • Local vs. remote, depending on needed recovery time
  • Personal computer, external hard drives, departmental, or university servers may be used
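The "make 3 copies" rule is only useful if the copies are intact. A minimal sketch of copying one file to two other locations and confirming each copy with a SHA-256 checksum follows; the function names and folder paths are hypothetical, and in practice one destination would be a mounted remote or networked drive.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Iterable


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(original: Path, destinations: Iterable[Path]) -> str:
    """Copy original into each destination folder and verify every copy.

    Returns the checksum of the original so it can be recorded in the README.
    """
    expected = sha256_of(original)
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy = folder / original.name
        shutil.copy2(original, copy)  # copy2 also keeps file timestamps
        if sha256_of(copy) != expected:
            raise OSError(f"checksum mismatch for {copy}")
    return expected


if __name__ == "__main__":
    # Hypothetical paths: replace with your own local and remote locations.
    source = Path("fieldNotes.csv")
    if source.exists():
        print(back_up(source, [Path("backup_local"), Path("backup_remote")]))
```

Recomputing the checksums later, and comparing them with the recorded value, shows whether a copy has been silently damaged.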
3. Ensure data redundancy (cont.)
Backup Don’ts:
• Do not rely on one copy
• Do not use CDs and DVDs
• Do not rely on ANGEL
(cc) George Ornbo
3. Ensure data redundancy (cont.)
Backup Maybe: Cloud storage
• Amazon S3
• Google
• MS Azure
• DuraCloud
• Rackspace
Note that many enterprise cloud storage services charge for data transfers in and out
$$$
Research is…
[Diagram of steps: Define a question, Gather information, Form a hypothesis, Test the hypothesis, Analyze the data, Interpret the data, Publish results, Retest]
?
The scientific method “is often misrepresented as a fixed sequence of steps,” rather than being seen for what it truly is, “a highly variable and creative process” (AAAS 2000:18).
Gauch, Hugh G. Scientific Method in Practice. New York: Cambridge University Press, 2010. Print. (Emphasis added)
The Research Depth Chart
Scientific Method
Research Design
Research Method
Research Tasks
[Axis labels: More Specific, More Generic]
Source: DDI Structural Reform Group. “Overview of the DDI Version 3.0 Conceptual Model.” DDI Alliance. 2004. http://opendatafoundation.org/ddi/srg/Papers/DDIModel_v_4.pdf
The Data Management Depth Chart
Research Data Lifecycle Model
???
Data Management Plan
Research Data Management Tasks
http://www.lib.msu.edu/about/diginfo/ldmp.jsp
Data are brainstormed
[Lifecycle stages shown: Study Concept]
DMP • Data type, purpose & value
MSU
• University Research Council guidelines
• Research Facilitation and Dissemination
• Lifecycle Data Management Planning
• Research Data Management Guidance
YOU • Start your Data Management Plan!
Data are collected and secured
[Lifecycle stages shown: Study Concept, Data Collection]
DMP • Data format, size & short term storage
MSU
• ATS Andrew File System (AFS)
• Institute for Cyber Enabled Research
• MSU Libraries Data Services
• MSU Libraries Campus Data Resources
YOU • File Plan, File Naming, Backup Plan
Data are normalized and processed
[Lifecycle stages shown: Study Concept, Data Collection, Data Processing]
DMP • Data transformations & structures
MSU
• LCT Computing Courses
• High Performance Computing Center
• Consortium of Research Consulting Services
YOU • Documentation, Methodology
Data are distributed
[Lifecycle stages shown: Study Concept, Data Collection, Data Processing, Data Distribution]
DMP • Data sharing, security & rights
MSU
• Human Research Protection Program
• University Research Council guidelines
• MSU Libraries Copyright Permissions Center
• MSU Google Apps
YOU • Roles, Responsibilities, Resources
Data are discoverable
[Lifecycle stages shown: Study Concept, Data Collection, Data Processing, Data Distribution, Data Discovery]
DMP • Data publishing & metadata
MSU
• Development of Copyrighted Materials
• MSU Libraries Data Citation Guide
YOU • README, Metadata Standard
Data are analyzed
[Lifecycle stages shown: Study Concept, Data Collection, Data Processing, Data Distribution, Data Discovery, Data Analysis]
DMP • Standards & workflow documentation
MSU
• Center for Statistical Training and Consulting
• Statistical Consulting Services
YOU • Code Commentary, Documentation
Data are stored and preserved
[Lifecycle stages shown: Study Concept, Data Collection, Data Processing, Data Distribution, Data Discovery, Data Analysis, Data Archiving]
DMP • Long term storage & management
MSU
• VPRGS Repositories and Archives
• Lifecycle Data Management Planning
• Databib.org!
YOU • Embrace stewardship
Data can be used and reused
[Lifecycle stages shown: Study Concept, Data Collection, Data Processing, Data Distribution, Data Discovery, Data Analysis, Data Archiving, Repurposing]
DMP • Broader impact
MSU
• Research Data Management CAFE
• MSU Research Centers and Institutes
• MSU Libraries Data Citation Guide
YOU • Publish your data!
Research Data Management Guidance
Face-to-face Advising
• Writing Data Management Plans
• Planning for Digital Projects
• Managing Digital Information
Group Training
• New Faculty Orientation
• Faculty Seminars
• Classroom Instruction
lib.msu.edu/about/rdmg
In Conclusion…
• Upfront Decisions Researchers Need to Make
• General Good Practices for Managing Research Data
• NSF, NIH, IMLS and Other Funders’ Requirements
• Lifecycle of Research Data
Contact
Lisa M. Schmidt
Electronic Records Archivist
University Archives & Historical [email protected]

Aaron Collie
Digital Curation Librarian
MSU [email protected]