Coping with Data for WHOI JP Students

Post on 26-Jan-2015


description

Data management best practices presentation for JP Students at Woods Hole Oceanographic Institution, 12 April 2014.

Transcript of Coping with Data for WHOI JP Students

Coping With Your Data

Carly Strasser California Digital Library carlystrasser@gmail.com

WHOI 10 April 2014

Tips & Tools

Roadmap

1. Background

2. Best practices

3. Toolbox

C. Strasser

From Flickr by robertpaulyoung

Scientists are bad at data management.

Many tables

Embedded figures

my spreadsheet

No headings

my spreadsheet

my spreadsheet

?

From Flickr by ransomtech

•  Didn’t share the data
•  Didn’t document the data (metadata)
•  Didn’t document provenance/workflow

From Flickr by ransomtech

Reproducibility Transparency Reuse NO

From Flickr by johntrainor

Why should I care?

Because they care:

From Flickr by Redden-McAllister

the Truth

From sandierpastures.com

You need to know about: data management, metadata, data repositories, data sharing

… “Federal agencies investing in research and development (more than $100 million in annual expenditures) must have clear and coordinated policies for increasing public access to research products.”

Feb 2013

1.  Maximize free public access
2.  Ensure researchers create data management plans
3.  Allow costs for data preservation and access in proposal budgets
4.  Ensure evaluation of data management plan merits
5.  Ensure researchers comply with their data management plans
6.  Promote data deposition into public repositories
7.  Develop approaches for identification and attribution of datasets
8.  Educate folks about data stewardship

From Flickr by Joe Crimmings Photography

data management

From Flickr by Big Swede Guy

Best Practices

From Flickr by Mark Sardella

Plan before data collection

•  Create a key (data dictionary)
•  Make sure names are unique
•  Define codes

From Flickr by zebbie

Planning Design sample naming scheme

PhDcomics.com

Planning Design file naming scheme

Use descriptive file names •  Unique •  Reflect contents

From R Cook, ESA Best Practices Workshop 2010

Bad: Mydata.xls, 2001_data.csv, best version.txt

Better: Eaffinis_nanaimo_2010_counts.xls
•  Eaffinis: study organism
•  nanaimo: site name
•  2010: year
•  counts: what was measured

*Not for everyone
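The naming pattern above can be sketched as a small helper. This is only an illustration, not part of the original slides: the function name `build_file_name` and the rule of stripping non-alphanumeric characters are assumptions.

```python
import re

def build_file_name(organism, site, year, measurement, ext="csv"):
    """Join organism, site, year and measurement into one descriptive file name."""
    parts = [organism, site, str(year), measurement]
    # Strip anything other than letters and digits so names stay portable.
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "", p) for p in parts]
    if not all(cleaned):
        raise ValueError("every part of the name must be non-empty")
    return "_".join(cleaned) + "." + ext

print(build_file_name("Eaffinis", "nanaimo", 2010, "counts", ext="xls"))
```

Generating names from one function, rather than typing them by hand, is one way to keep them unique and consistent across a project.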

Planning Design file naming scheme

From S. Hampton

Planning Design file organization

Biodiversity

Lake

Experiments

Field work

Grassland

Biodiv_H20_heatExp_2005to2008.csv
Biodiv_H20_predatorExp_2001to2003.csv
…
Biodiv_H20_PlanktonCount_2001toActive.csv
Biodiv_H20_ChlAprofiles_2003.csv
…

From S. Hampton

Planning Design file organization

Consider… •  Dependencies? •  File formats? •  Time of collection? •  Order of analysis?

Workflows!

Planning

Constrain entries Atomize Break down spreadsheets

Design your spreadsheet

From Flickr by Ulleskelf

A relational database is
•  A set of tables
•  Relationships among the tables
•  A language to specify & query the tables

A RDB provides
•  Scalability: millions+ records
•  Features for sub-setting, querying, sorting
•  Reduced redundancy & entry errors
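To make the three parts concrete (tables, relationships, a query language), here is a minimal sketch using Python's built-in `sqlite3` module. The table names, columns, and values are invented for illustration and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE site (
        site_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL UNIQUE
    );
    CREATE TABLE count_obs (
        obs_id  INTEGER PRIMARY KEY,
        site_id INTEGER NOT NULL REFERENCES site(site_id),  -- relationship
        year    INTEGER NOT NULL,
        n       INTEGER NOT NULL CHECK (n >= 0)             -- constrains entry errors
    );
""")
conn.executemany("INSERT INTO site (site_id, name) VALUES (?, ?)",
                 [(1, "nanaimo"), (2, "lake")])
conn.executemany("INSERT INTO count_obs (site_id, year, n) VALUES (?, ?, ?)",
                 [(1, 2010, 12), (1, 2011, 7), (2, 2010, 30)])

def total_by_site(connection):
    """Query across both tables: total count per site name."""
    query = """
        SELECT site.name, SUM(count_obs.n)
        FROM count_obs JOIN site USING (site_id)
        GROUP BY site.name
        ORDER BY site.name
    """
    return connection.execute(query).fetchall()

print(total_by_site(conn))
```

Each site name is stored once in `site`, so a typo cannot create a second spelling of the same site in the observations.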

From Mark Schildhauer

Planning Consider a database

You should invest time in learning databases if your data sets are large or complex

Consider investing time in learning databases if
•  your data are small and humble
•  you ever intend to share your data
•  you are < 30 years old

Planning

From Mark Schildhauer

Consider a database

Store your data in a repository Institutional archive

Discipline/specialty archive

Pick a data repository

From Flickr by torkildr

Ask a librarian

Repos of repos: databib.org re3data.org

Planning

From Flickr by sepa synod

From Flickr by taberandrew

From Flickr by withassociates

What software? What hardware? What personnel?

How often? Set up reminders!

Test system

Decide on preservation/backup Planning

…document that describes what you will do with your data throughout the research project

From Flickr by Barbies Land

Write a data management plan!

Planning

DMP components

But they all have different requirements and express them in different ways

•  What will be collected •  Methods •  Standards •  Metadata •  Sharing/access •  Long-term storage

Planning

From Flickr by Barbies Land

Step-by-step wizard for generating DMP create | edit | re-use | share Free & open to community

dmptool.org Planning

During Data Collection & Entry

From Flickr by Julia Manzerova

Realistically: •  Archive .csv version of raw data •  Make a “raw” tab in working data file •  Do all work on other tabs

During collection Keep raw data raw

Raw data as .csv

R script for processing & analysis

During collection

Ideally: •  Use scripts to process data •  Save them with data

Keep raw data raw
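The slides mention an R script for this step; the sketch below shows the same idea in Python. The file names, the `temperature_c` column, and the cleaning rule are hypothetical. The point is only that the script reads the raw .csv and writes its results to a different file.

```python
import csv
import tempfile
from pathlib import Path

def process_raw(raw_path, out_path):
    """Read the raw CSV (never written to) and write a cleaned copy elsewhere."""
    raw_path, out_path = Path(raw_path), Path(out_path)
    if raw_path.resolve() == out_path.resolve():
        raise ValueError("refusing to overwrite the raw file")
    kept = 0
    with raw_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Illustrative cleaning rule: drop rows with an empty temperature.
            if row["temperature_c"].strip() == "":
                continue
            writer.writerow(row)
            kept += 1
    return kept

# Small demonstration in a temporary folder with made-up data.
RAW_TEXT = "station,temperature_c\nA,11.2\nB,\nC,9.8\n"
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw_temperature.csv"
    raw.write_text(RAW_TEXT)
    kept_rows = process_raw(raw, Path(tmp) / "clean_temperature.csv")
    raw_unchanged = raw.read_text() == RAW_TEXT
print(kept_rows, raw_unchanged)
```

Saving a script like this next to the data means the cleaned file can always be regenerated from the untouched original.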

During collection Document your workflow

[Flow chart with boxes: Temperature data; Salinity data; Data import into Excel; Data in spreadsheet; Quality control & data cleaning; “Clean” T & S data; Analysis: mean, SD; Summary statistics; Graph production]

Workflow: how you get from the raw data to the final products of your research

Simple workflow: flow chart

During collection

Workflow: how you get from the raw data to the final products of your research

Simple workflow: commented script

•  R, SAS, MATLAB…
•  Well-documented code is
   Easier to review
   Easier to share
   Easier to use for repeat analysis
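A commented script of this kind might look like the sketch below. It is written in Python rather than the languages named on the slide, and the readings and the -999 missing-value code are made up for illustration.

```python
from statistics import mean, stdev

# Step 1: raw readings (made-up values; -999 marks a missing reading here).
MISSING = -999
temperature_c = [11.2, 11.5, MISSING, 10.9, 11.1]

# Step 2: quality control. Drop the missing-value code before any analysis.
clean_temperature_c = [t for t in temperature_c if t != MISSING]

# Step 3: analysis. Mean and sample standard deviation of the clean data.
summary = {
    "n": len(clean_temperature_c),
    "mean": round(mean(clean_temperature_c), 3),
    "sd": round(stdev(clean_temperature_c), 3),
}
print(summary)
```

The comments mirror the boxes of a flow chart (import, quality control, analysis), so the script doubles as a record of the workflow.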


Document your workflow

Fancy schmancy workflows Resulting output

https://kepler-project.org

During collection Document your workflow

Workflows enable •  Reproducibility •  Transparency •  Reuse

From Flickr by merlinprincesse

During collection Document your workflow

Constrain data entries •  Excel lists •  Data validation •  Google docs forms
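The same kind of constraint that Excel lists and data validation provide can be sketched in code. The allowed site codes and the count range below are hypothetical, as is the function name `validate_entry`.

```python
ALLOWED_SITES = {"lake", "grassland"}   # hypothetical code list
COUNT_RANGE = (0, 10_000)               # hypothetical plausible range

def validate_entry(entry):
    """Return a list of problems found in one data-entry row (empty if none)."""
    problems = []
    if entry.get("site") not in ALLOWED_SITES:
        problems.append(f"unknown site: {entry.get('site')!r}")
    try:
        count = int(entry.get("count", ""))
    except (TypeError, ValueError):
        problems.append(f"count is not an integer: {entry.get('count')!r}")
    else:
        low, high = COUNT_RANGE
        if not low <= count <= high:
            problems.append(f"count out of range: {count}")
    return problems

print(validate_entry({"site": "Lake ", "count": "twelve"}))
```

Checking entries against a fixed list at entry time catches variants such as "Lake " before they spread through the data set.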

Modified from K. Vanderbilt

During collection

Atomize During collection

One piece of information per cell
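As a small illustration of atomizing, the sketch below splits one combined cell into separate fields. The cell layout ("site, latitude, longitude") and the coordinate values are assumptions made for the example.

```python
def atomize_location(cell):
    """Split a combined 'site, latitude, longitude' cell into separate fields."""
    site, lat, lon = (part.strip() for part in cell.split(","))
    return {"site": site, "latitude": float(lat), "longitude": float(lon)}

print(atomize_location("nanaimo, 49.17, -123.94"))
```

Once each value sits in its own column, it can be sorted, filtered, and checked on its own.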

Create parameter table

From doi:10.3334/ORNLDAAC/777

From doi:10.3334/ORNLDAAC/777

From R Cook, ESA Best Practices Workshop 2010

During collection Break down spreadsheets

Fake a relational database

Create a site table
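One way to picture the site table is the sketch below, which splits flat rows into a site table and an observation table linked by an id. The column names and values are invented for illustration; in a spreadsheet the two results would live on separate tabs.

```python
def split_site_table(flat_rows):
    """Split flat rows into a site table and observation rows keyed by site_id."""
    sites = {}          # (name, latitude, longitude) -> site_id
    observations = []
    for row in flat_rows:
        key = (row["site"], row["latitude"], row["longitude"])
        site_id = sites.setdefault(key, len(sites) + 1)
        observations.append(
            {"site_id": site_id, "year": row["year"], "count": row["count"]}
        )
    site_table = [
        {"site_id": sid, "site": name, "latitude": lat, "longitude": lon}
        for (name, lat, lon), sid in sites.items()
    ]
    return site_table, observations

flat = [
    {"site": "nanaimo", "latitude": 49.17, "longitude": -123.94, "year": 2010, "count": 12},
    {"site": "nanaimo", "latitude": 49.17, "longitude": -123.94, "year": 2011, "count": 7},
]
site_table, observations = split_site_table(flat)
print(site_table)
print(observations)
```

The site details are then stored once instead of being retyped on every observation row.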

Why are you promoting Excel?

During collection Create metadata

Metadata: data reporting

•  WHO created the data?
•  WHAT is the content of the data set?
•  WHEN was it created?
•  WHERE was it collected?
•  HOW was it developed?
•  WHY was it developed?

From Flickr by /\/\ichael Patric|{

During collection Create metadata

Digital context
•  Name of the data set
•  The name(s) of the data file(s) in the data set
•  Date the data set was last modified
•  Example data file records for each data type file
•  Pertinent companion files
•  List of related or ancillary data sets
•  Software (including version number) used to prepare/read the data set
•  Data processing that was performed

Personnel & stakeholders
•  Who collected
•  Who to contact with questions
•  Funders

Scientific context
•  Scientific reason why the data were collected
•  What data were collected
•  What instruments (including model & serial number) were used
•  Environmental conditions during collection
•  Temporal & spatial resolution
•  Standards or calibrations used

Information about parameters
•  How each was measured or produced
•  Units of measure
•  Format used in the data set
•  Precision & accuracy if known

Information about data
•  Definitions of codes used
•  Quality assurance & control measures
•  Known problems that limit data use (e.g. uncertainty, sampling problems)
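A few of the items above can be captured in a simple machine-readable record saved alongside the data. The sketch below is not a formal metadata standard: the key names, the placeholder values, and the `missing_fields` helper are all assumptions made for illustration.

```python
import json

metadata = {
    "dataset_name": "Example plankton counts",        # placeholder
    "files": ["Eaffinis_nanaimo_2010_counts.xls"],
    "last_modified": "2014-04-10",                     # placeholder date
    "contact": "name@example.org",                     # placeholder
    "parameters": [
        {"name": "count", "units": "individuals per sample", "method": "placeholder"},
    ],
    "codes": {"-999": "missing value"},                # definitions of codes used
    "known_problems": [],
}

def missing_fields(record, required=("dataset_name", "files", "contact", "parameters")):
    """List required keys that are absent or empty in a metadata record."""
    return [key for key in required if not record.get(key)]

sidecar_text = json.dumps(metadata, indent=2)  # text to save next to the data file
print(missing_fields(metadata))
```

A record like this is a starting point; repositories that require a formal standard will usually provide their own tools and templates.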

During collection Create metadata

•  Provide structure to describe data
   Common terms | definitions | language | structure
•  Come in many flavors
   EML, FGDC, ISO19115, DarwinCore,…
•  Can be met using software tools
   Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)

What is metadata?

Metadata standards…

During collection

Standard < Create metadata

Back up daily During collection

From Flickr by lippo

From Flickr by see phar

Original, Near, Far
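The original / near / far idea can be sketched as a copy-and-verify routine. In this illustration the "near" and "far" folders are stand-ins inside a temporary directory; in practice they would be, for example, an external drive and an offsite or institutional location. The function names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Checksum used to confirm a copy matches the original byte for byte."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def back_up(original, destinations):
    """Copy a file into each destination directory and verify each copy."""
    original = Path(original)
    expected = sha256_of(original)
    copies = []
    for directory in destinations:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        copy_path = Path(shutil.copy2(original, directory))
        if sha256_of(copy_path) != expected:
            raise IOError(f"checksum mismatch for {copy_path}")
        copies.append(copy_path)
    return copies

# Demonstration with made-up data and stand-in "near" and "far" folders.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "raw.csv"
    source.write_text("station,temperature_c\nA,11.2\n")
    made = back_up(source, [Path(tmp) / "near", Path(tmp) / "far"])
    verified = [p.name for p in made if p.exists()]
print(verified)
```

Verifying the checksum is a simple form of the "test system" step: a backup that has never been checked may not be usable when it is needed.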

During collection

From Flickr by Barbies Land

Remember that data management plan?

Revisit Review Revise

During collection

Schedule a time each week or month

Revisit Review Revise

From Flickr by purplemattfish

From Flickr by dipster1

Toolbox

Step-by-step wizard for generating DMP create | edit | re-use | share Free & open to community

dmptool.org Write a DMP

databib.org

Where should I put my data?

Find a repository

Get help

From Flickr by thewmatt

DCXL blog: dcxl.cdlib.org Toolbox:

Get help

From Flickr by North Carolina Digital Heritage Center

From Flickr by Madison Guy

Get help from your library

carlystrasser@gmail.com

Get help from me

From Flickr by Andy Graulund

Make a resolution
•  Triage on current projects
•  Get advisor, lab mates, collaborators on board
•  Do better next time

From Flickr by twm1340

Culture Shift Ahead

science source notebook content access data government knowledge

From Flickr by cdsessums

From Flickr by dotpolka

Doing science is a privilege. Data hoarding is science malpractice.

Manage & share your data!

Website: carlystrasser.net
Email: carlystrasser@gmail.com
Twitter: @carlystrasser
Slides: slideshare.net/carlystrasser