Data and Donuts: Data organization

Data Organization C. Tobin Magle, PhD Feb. 28, 2017 10:00-11:30 a.m. Morgan Library Computer Classroom 175 *inspired by content from Data Carpentry

Transcript of Data and Donuts: Data organization

Page 1: Data and Donuts: Data organization

Data Organization

C. Tobin Magle, PhD

Feb. 28, 2017, 10:00-11:30 a.m.

Morgan Library Computer Classroom 175

*inspired by content from Data Carpentry

Page 2: Data and Donuts: Data organization

The research cycle

[Figure: cycle diagram with stages labeled Hypothesis, Experimental design, Data, Results, and Article, annotated with "Data Management Plans"]

Page 3: Data and Donuts: Data organization

Main topics

• Hierarchical organizations
  • Folders in folders
  • Open Science Framework

• File naming
  • Human readability
  • Machine readability

• “Tidy” data in spreadsheets

Page 4: Data and Donuts: Data organization

Hierarchical Organization

Putting your files into a folder system

my_project
  Data
  Notes
  protocols
  manuscripts
    Paper1
      Figures
      Text
      References
    Paper2

Page 5: Data and Donuts: Data organization

Folder systems

• Organize your data hierarchically

• Identify ways to divide your data into categories (Attributes)

• Top level organization is the most important attribute

Page 6: Data and Donuts: Data organization

Questions to ask

• What kinds of files are there? (See data inventory)

• How could you group them?
  • Project?
  • Time?
  • Location?
  • File type?

• What are the most important attributes?

Page 7: Data and Donuts: Data organization

Exercise: Organize files

• Download Lou’s files (look in the README file for insight)
  • http://tinyurl.com/hvna4mg

• Create a hierarchical folder structure for Lou
  • Drag his files into the correct folders
  • Fix Lou’s README

• Bonus: think about how you’d organize your data.

Page 8: Data and Donuts: Data organization

Example: Lou the first year

Lou is a first-year graduate student working on a project in a biomedical research laboratory. He’s trying to decipher data left by a former postdoc as a starting point for his thesis project. For one year, the postdoc recorded weight daily and cytokine levels monthly from 16 mice. Half were infected with a parasite; half were treated with saline.

• List the attributes of his project.

• How would you rank these attributes?

Page 11: Data and Donuts: Data organization

Example: Lou the first year

Lou is a first-year graduate student working on a project in a biomedical research laboratory. He’s trying to decipher data left by a former postdoc as a starting point for his thesis project. For one year, the postdoc recorded weight daily and cytokine levels monthly from 16 mice. Half were infected with a parasite; half were treated with saline.

• List the attributes of his project.

• How would you rank these attributes?

Attributes
• Time
• Infection Status
• Data Type
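
Once the attributes are ranked, the ranking translates directly into folder nesting: the most important attribute becomes the top-level folder. The sketch below shows one possible ranking for Lou's project. The ranking itself, the `lou_project` root, the attribute values, and the `folder_for` helper are all assumptions made for illustration, not part of the exercise files.

```python
from pathlib import PurePosixPath

# Assumed ranking, most important attribute first (one option among several).
RANKED_ATTRIBUTES = ["data_type", "infection_status", "month"]


def folder_for(record: dict) -> PurePosixPath:
    """Build a nested folder path from a record's attributes, in ranked order."""
    parts = [record[name] for name in RANKED_ATTRIBUTES]
    return PurePosixPath("lou_project", *parts)


# Hypothetical file description; the values are made up for the example.
example = {"data_type": "weight", "infection_status": "infected", "month": "2016-01"}
print(folder_for(example))  # lou_project/weight/infected/2016-01
```

Reordering `RANKED_ATTRIBUTES` produces a different hierarchy from the same attributes, which is a quick way to compare candidate layouts before moving any files.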

Page 12: Data and Donuts: Data organization

Tool: Open Science Framework

• Components

• Add-ons

• Contributors

• Wiki

http://help.osf.io/m/collaborating/l/524109-using-the-wiki

http://www.slideshare.net/DuraSpace/121014-slides-roadmap-to-the-future-of-share

Page 13: Data and Donuts: Data organization

Organization rules

• Be consistent

• One directory per project

• Separate subdirectories for
  • Raw data
  • Processed data
  • Code
  • Output

• Make raw data read-only

• Make README files

http://help.osf.io/m/60347/l/611391-organizing-files
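
These rules can be applied with a short script. The following is a minimal sketch using only the Python standard library; the folder names, the README text, and the helper names `create_project` and `make_read_only` are assumptions chosen for the example, and the demo runs in a temporary directory so nothing on disk is changed.

```python
import stat
import tempfile
from pathlib import Path

# Assumed folder names, one per category listed in the rules.
SUBDIRECTORIES = ["raw_data", "processed_data", "code", "output"]
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def create_project(root: Path) -> Path:
    """Create one project directory with its subdirectories and a README."""
    for name in SUBDIRECTORIES:
        (root / name).mkdir(parents=True, exist_ok=True)
    readme = root / "README.txt"
    if not readme.exists():
        readme.write_text(f"Project: {root.name}\nDescribe each folder and file here.\n")
    return root


def make_read_only(folder: Path) -> None:
    """Remove write permission from every file under a folder."""
    for path in folder.rglob("*"):
        if path.is_file():
            path.chmod(path.stat().st_mode & ~WRITE_BITS)


# Demo in a throwaway directory (the file name and contents are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "my_project")
    raw_file = project / "raw_data" / "2016-05-07-mouse-weight.tsv"
    raw_file.write_text("mouse_id\tweight_g\n")
    make_read_only(project / "raw_data")
    print(sorted(p.name for p in project.iterdir()))
```

Removing write permission protects against accidental edits; it is not a backup, so raw data should still be copied somewhere safe.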

Page 14: Data and Donuts: Data organization

Components

• “Subprojects”

• Separate privacy settings, contributors, wiki, add-ons, and files.

• Examples:
  • Different projects: https://osf.io/82fba/
  • Clinical: https://osf.io/gq4mz/
  • Mix: https://osf.io/ezcuj/
  • File types: https://osf.io/if7ug/
  • Manuscript sections: https://osf.io/zmja2/

Page 15: Data and Donuts: Data organization

Demo: add files and components

Page 16: Data and Donuts: Data organization

Don’t panic!

• Just try something

• There’s no right answer

• Be consistent

• Write a README.txt file

http://4vector.com/i/free-vector-don-t-panic-clip-art_103946_Dont_Panic_clip_art_hight.png

Page 17: Data and Donuts: Data organization

File naming conventions

Make file names both human and machine readable.

Page 18: Data and Donuts: Data organization

Use descriptive names

• Bad name: file.txt

• OK name: 05-07-2016-mouse-data.txt

• Good name: 2016-05-07-mouse-weight.tsv

• Human readability: name contains information about content

Page 19: Data and Donuts: Data organization

Go from general to specific

• Bad name: rep1-5-7-2016-gene-expression.csv

• Good name: 2016-05-07-gene-expression-rep1.csv

• Machine readability: can be sorted meaningfully
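
A quick way to see the difference is to sort both styles of name as plain text, which is what most file browsers do. In this sketch only the first name in each list comes from the slide; the others are hypothetical names invented to make the sort order visible.

```python
# Specific-to-general names with unpadded month-day-year dates.
bad_names = [
    "rep1-5-7-2016-gene-expression.csv",
    "rep2-4-12-2016-gene-expression.csv",
    "rep1-11-2-2015-gene-expression.csv",
]

# General-to-specific names with zero-padded year-month-day dates.
good_names = [
    "2016-05-07-gene-expression-rep1.csv",
    "2016-04-12-gene-expression-rep2.csv",
    "2015-11-02-gene-expression-rep1.csv",
]

# Plain alphabetical sorting groups the bad names by replicate, not by date.
print(sorted(bad_names))

# The same sort puts the good names in chronological order.
print(sorted(good_names))
```

Zero-padding matters as much as the field order: without it, month 11 would sort before month 5.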

Page 20: Data and Donuts: Data organization

Avoid abbreviations

• Bad name: “sprlbgp1”

• Good name: “spencer_lab_group_1”

• Human readability: no one understands your acronyms

Page 21: Data and Donuts: Data organization

Avoid spaces

• Alternatives
  • Dashes-are-cool.txt
  • I_also_like_underscores.txt
  • CamelCaseIsNeatToo.txt

• Machine readability: many shells and programs treat spaces as delimiters

• Human readability: dashes, underscores, and capital letters still delineate words

Page 22: Data and Donuts: Data organization

Avoid special characters

• Bad characters:  ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' " 

• Machine readability: can have special meanings in scripting languages

• Example: in Unix shells, ~ expands to your home directory

• Alternatives: underscore (_), dash (-), dot (.)
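
The rules about spaces and special characters are simple enough to automate. Below is a minimal sketch of a cleanup function; `sanitize_filename` is a hypothetical helper, and the choice to keep only ASCII letters, digits, underscore, dash, and dot is an assumption (it will also replace accented letters, which may not be what you want).

```python
import re


def sanitize_filename(name: str) -> str:
    """Replace spaces and special characters with underscores."""
    cleaned = re.sub(r"\s+", "_", name.strip())         # spaces -> underscore
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", cleaned)  # special characters -> underscore
    cleaned = re.sub(r"_+", "_", cleaned)               # collapse runs of underscores
    cleaned = re.sub(r"_(?=\.)", "", cleaned)           # no underscore right before a dot
    return cleaned.strip("_")


print(sanitize_filename("mouse weights (final)!.txt"))  # mouse_weights_final.txt
print(sanitize_filename("~temp file.csv"))              # temp_file.csv
```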

Page 23: Data and Donuts: Data organization

Be consistent

• Establishing standards makes data more findable

• Extending standards to everyone who works on a project is even better

Page 24: Data and Donuts: Data organization

Renaming files

• Ways to automate file renaming
  • Bulk Rename Utility (Windows, free)
  • Renamer 5 (Mac)
  • PSRenamer (Linux, Mac, or Windows, free)
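
A short script is another option alongside the tools listed above. This sketch separates planning from renaming so the proposed changes can be reviewed first; the helper names, the spaces-to-dashes rule, and the example file names are assumptions for illustration, and the demo runs in a temporary directory.

```python
import tempfile
from pathlib import Path


def plan_renames(folder: Path, make_name) -> list:
    """Return (old, new) path pairs for files whose names would change."""
    plan = []
    for old in sorted(folder.iterdir()):
        if old.is_file():
            new = old.with_name(make_name(old.name))
            if new != old:
                plan.append((old, new))
    return plan


def apply_renames(plan) -> None:
    """Rename each file, refusing to overwrite an existing one."""
    for old, new in plan:
        if new.exists():
            raise FileExistsError(f"refusing to overwrite {new}")
        old.rename(new)


def spaces_to_dashes(name: str) -> str:
    """Example naming rule: replace every space with a dash."""
    return name.replace(" ", "-")


# Demo with hypothetical files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "mouse weight january.tsv").write_text("")
    (folder / "cytokines-january.tsv").write_text("")
    plan = plan_renames(folder, spaces_to_dashes)
    for old, new in plan:
        print(old.name, "->", new.name)  # review the plan before applying it
    apply_renames(plan)
    renamed = sorted(p.name for p in folder.iterdir())
```

Printing the plan before calling `apply_renames` acts as a dry run, which is worth doing on a copy of the files the first time a new rule is tried.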

Page 25: Data and Donuts: Data organization

Exercise: Rename Lou’s files

• Use descriptive names

• General to specific

• Avoid abbreviations, spaces and special characters

• Be consistent

Page 26: Data and Donuts: Data organization

Tidy data

How to organize your data efficiently in spreadsheets

Page 27: Data and Donuts: Data organization

Spreadsheets as lab notebook

• Color coding

• Formatting

• Notes

• Calculations

• Graphs/Tables

Page 28: Data and Donuts: Data organization

Downsides

• Computers don’t understand notes/formatting/color coding

• Calculations/Graphs/Tables in spreadsheets are inefficient

• “Tidy data” + automation = saved time

Page 29: Data and Donuts: Data organization

Using spreadsheets wisely

• Don’t put multiple tables in one sheet

• Don’t use multiple sheets

• Use descriptive field names

• Don’t mix notes and data

Page 30: Data and Donuts: Data organization

Tidy Data

1. Columns as variables

• Don’t combine multiple pieces of info in one column

2. Rows as observations

• One measured value
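
To make the two rules concrete, the sketch below reshapes a small "wide" table, where each date has its own weight column, into a tidy one with one row per measurement. The table contents are invented for illustration (they are not Lou's data), and `wide_to_tidy` is a hypothetical helper built on the standard `csv` module.

```python
import csv
import io

# Made-up example: the date is hidden inside the column names.
wide_csv = """mouse_id,treatment,weight_2016-01-01,weight_2016-01-02
M01,infected,21.4,21.1
M02,saline,22.0,22.3
"""


def wide_to_tidy(text, id_columns, prefix, variable_name, value_name):
    """Turn each prefixed column into its own row (one observation per row)."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        for column, value in record.items():
            if column.startswith(prefix):
                tidy = {name: record[name] for name in id_columns}
                tidy[variable_name] = column[len(prefix):]  # e.g. the date
                tidy[value_name] = value                    # the measured value
                rows.append(tidy)
    return rows


tidy_rows = wide_to_tidy(wide_csv, ["mouse_id", "treatment"], "weight_", "date", "weight_g")
for row in tidy_rows:
    print(row)
```

In the tidy version, date becomes a variable in its own column, and each row holds exactly one weight measurement for one mouse on one day.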

Page 31: Data and Donuts: Data organization

Exercise: Tidy Lou’s data

• Open MouseInventory.xls
  • Is he using spreadsheets wisely?
  • Is each column a variable?
  • Is each row an observation?

• Open the January files for both weight and cytokines
  • What variables are being measured (i.e., what columns should we have)?
  • Can we combine some of these tables?

Page 32: Data and Donuts: Data organization

Exercise: Data Carpentry ecology

• Lesson: http://www.datacarpentry.org/spreadsheet-ecology-lesson/

• File: https://ndownloader.figshare.com/files/2252083

• Goal: combine data from first 2 tabs into one table
  • Make a new tab, don’t edit the raw data!

Page 33: Data and Donuts: Data organization

Example: Supplemental_data_1_xls

• https://figshare.com/articles/Supplemental_data_1_xls/4055544

• Description: “Table of the results given by HPLC analysis of the samples. Key: Rt, retention time; +, presence of peak; -, absence of peak.”

Page 34: Data and Donuts: Data organization

Example: cck8_xls

• https://figshare.com/articles/cck8_xls/3505772

• Description: “This data are from CCK-8 assay and ELISA.”

Page 35: Data and Donuts: Data organization

Example: endo_et_al_table1_xls

• https://figshare.com/articles/endo_et_al_table1_xls/2069573

• Description: Table 1 from Endo et al. PeerJ 2016 https://doi.org/10.7717/peerj.1562/table-1

Page 36: Data and Donuts: Data organization

Need help?

• Email: [email protected]

• Data Management Services website: http://lib.colostate.edu/services/data-management

• Data Carpentry: http://www.datacarpentry.org/

• Software Carpentry: http://software-carpentry.org/