LIS 653, Session 11: Data Management & Curation

Post on 28-Jun-2015

An introduction to data management, data curation, and librarian roles related to data.

Data Management

LIS 653

Starr Hoffman

Data

What is (are) Data?

Observations: sensor data, telemetry, survey data, sample data

Experiments: gene sequences, chromatograms

Simulations: economic models

Derivations/Compilations: text mining, data from public documents

Documents & texts themselves = data

Research Process: observational conditions, experimental procedure, instrumentation, label descriptions, units, metadata

Librarian Roles & Data

Advisory

Original data:
Consult on creating DMP
Consult on data organization, methodology, etc.
Consult on metadata practices
Consult on archiving
Help disseminate research: journal publication, OA resources, blogs, etc.
Deposit into repository (IR, 3rd party, etc.)

Secondary data:
Consult on methodology / analysis
Discovery…

Curatorial

Manage IR (institutional repository)
Create metadata for datasets
Purchase / catalog / discovery for secondary data

What is Data Management?

Planning for the short-term and long-term care of, and access to, your data.

Or: What are you going to do with that data?

How will you describe it?
How are you organizing it?
After you’re done, where will you put it?
How will you/others be able to access it?
For how long?

Data Management: Why Does it Matter?

Grant requirements
Public access to funded research

Validation
Replication
Re-use, continue research
Teaching

Natural disasters
Computer failure or theft
USB/hard drive failure or loss
Files corrupted

Funding Requirements

NSF: Proposals must include a supplementary document of no more than two pages labeled “Data Management Plan” …describe how the proposal will conform to NSF policy on the dissemination and sharing of research results.

NIH: The NIH expects and supports the timely release and sharing of final research data… for use by other researchers. …expected to include a plan for data sharing or state why data sharing is not possible.

Other funders: NEH Office of Digital Humanities, NOAA, IMLS, NIJ

DMP Considerations

What data types, from what sources, in what formats will this project produce? How much of it will there be?

How will you describe or document your data? Are there standards you will be using for this?

Will you be sharing your data? Do you have the rights to share the data? What did you tell the IRB?

How often do you need to back up your files? How do you need to be able to access your files? How many backups will you have?

How much storage space do you need? What is your budget for your storage?

Where are you going to archive or store the data? How will it be accessed?

What are the roles and responsibilities around all of these things? That is, who's going to be doing all this?

DMP Examples

Planning the Data Life-Cycle

Consider…

Files: Size, format, organization

Security
Storage/Backup system
Retention
Access/Transparency

Data Lifecycle: Create / Analyze / Edit

File Management
Consistency, brevity, description
Versioning (v01, v02, FINAL)
Avoid spaces

Directory structure: /[Project]/[Grant Number]/[Event]/[Date]

File naming: [description]_[instrument]_[location]_[YYYYMMDD].[ext]

Transparency/Sharing
Document data: codebook, metadata

File Structure & Naming Examples

Directory structure: /[Project]/[Grant Number]/[Event]/[Date]

/NYCPhysicalActivity/NOT-MH-14-033/Interview/20141109
/Dissertation/LitReview/LibraryLeadership/

File naming: [description]_[instrument]_[location]_[YYYYMMDD].[ext]

PhysicalActivity_InterviewQs_PS193_20141109.doc
PhysicalActivity_InterviewResponses_20141022.xls
LibraryLeadershipHenson_Article_2011.pdf
Leadership_Survey_20130917.doc
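The directory and file naming templates above can be applied mechanically, which helps keep names consistent across a project. The following is a minimal sketch in Python, not part of the original slides; the function names `clean` and `build_path` are hypothetical, and it assumes that spaces are simply dropped from each component and that empty components (for example, no location) are skipped.

```python
from datetime import date
from pathlib import PurePosixPath


def clean(part: str) -> str:
    """Remove all whitespace so the component follows the 'avoid spaces' rule."""
    return "".join(part.split())


def build_path(project, grant, event, day, description, instrument, location, ext):
    """Build /[Project]/[Grant Number]/[Event]/[Date]/ plus
    [description]_[instrument]_[location]_[YYYYMMDD].[ext].

    Hypothetical helper: empty name components are skipped.
    """
    stamp = day.strftime("%Y%m%d")  # YYYYMMDD sorts chronologically
    directory = PurePosixPath("/", clean(project), clean(grant), clean(event), stamp)
    name_parts = [clean(p) for p in (description, instrument, location) if p]
    filename = "_".join(name_parts + [stamp]) + "." + ext.lstrip(".")
    return directory / filename


path = build_path(
    "NYC Physical Activity", "NOT-MH-14-033", "Interview", date(2014, 11, 9),
    "PhysicalActivity", "InterviewQs", "PS193", "doc",
)
print(path)
```

With the inputs shown, this reproduces the first directory and file name from the examples above. Using a pure path object keeps the sketch from touching the file system; creating the folders is left to the reader.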

Metadata & Description

Variables: labels, meaning, how they were measured, units, codes

Survey questions
Experimental procedures
Research methodology
Statistical analyses performed
Preferred data citation:

Pew Hispanic Center. (2008). 2007 Hispanic Healthcare Survey [Data file and code book]. Retrieved from http://pewhispanic.org/datasets/
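One way to record the variable-level items listed above (labels, how each variable was measured, units, codes) is a small machine-readable codebook stored alongside the data file. The sketch below is an illustration only, not taken from the slides: the `Variable` class, the `codebook_csv` function, and the two example variables are all hypothetical.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Variable:
    """One codebook row; the field names are an assumed, minimal layout."""
    name: str
    label: str
    measurement: str
    units: str
    codes: str


def codebook_csv(variables):
    """Serialize the codebook to CSV text with one row per variable."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(Variable)],
        lineterminator="\n",
    )
    writer.writeheader()
    for variable in variables:
        writer.writerow(asdict(variable))
    return buffer.getvalue()


# Hypothetical variables, invented for illustration.
codebook = [
    Variable("activity_min", "Weekly physical activity", "self-reported in interview", "minutes", ""),
    Variable("gym_member", "Has a gym membership", "survey question 4", "", "1=yes; 0=no"),
]
print(codebook_csv(codebook))
```

A plain CSV codebook is an open format, so it stays readable even if the software used for the analysis is no longer available.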

Codebook Examples

Data Lifecycle: Publish, Store, Access, Reuse

File size & format
Open vs. proprietary

Security
Anonymize or encrypt?
Levels may vary by access (org. vs. 3rd party)

Data Citation

Sharing
Upload data & metadata
Institutional repository, data center, etc.
Persistent identifier
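For the "Anonymize or encrypt?" question above, one common first step before sharing is to replace direct identifiers with pseudonyms. The sketch below is an assumption, not the method described in the slides: it uses a keyed hash (HMAC-SHA256 from the Python standard library), and the function names, field names, and sample rows are hypothetical. Note that this is pseudonymization rather than full anonymization, since other fields in a dataset may still identify a participant, and the secret key must be stored separately from the shared data.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes, length: int = 12) -> str:
    """Return a stable pseudonym: the same identifier and key always give the same code."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]  # truncated for readability; longer values lower collision risk


def pseudonymize_rows(rows, id_field, secret_key):
    """Copy each row, replacing the identifier field with its pseudonym."""
    return [
        {**row, id_field: pseudonymize(str(row[id_field]), secret_key)}
        for row in rows
    ]


# Hypothetical interview records, invented for illustration.
key = b"keep-this-key-out-of-the-shared-dataset"
records = [
    {"participant": "Jane Doe", "activity_min": 150},
    {"participant": "John Roe", "activity_min": 90},
]
shared = pseudonymize_rows(records, "participant", key)
print(shared)
```

Because the pseudonym is stable, records for the same participant can still be linked across files without exposing the name.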

Institutional Repositories

Dataset Record in IR

Other places data can live…

Figshare
ICPSR
GitHub
DataUp
Dropbox (or other cloud storage) IF you use proper encryption

Lists of data repositories: DataCite, DataBib

Data Discovery

Data Depositories (previous slide): ICPSR, Figshare

Institutional Repositories: OpenDOAR (directory), specific institutions

Data Catalogs: Numeric Data Catalog (Columbia), GeoData (Columbia, others)

Gov & Public Sources (data producers): NYC OpenData, Data.gov, Census Bureau, Bureau of Labor Statistics, IMLS (Institute of Museum & Library Services)

Replicated Data

And Finally… Geeky puns.