Post on 16-Apr-2017
Data Organization
C. Tobin Magle, PhD
Feb. 28, 2017
10:00-11:30 a.m.
Morgan Library Computer Classroom 175
*inspired by content from Data Carpentry
The research cycle
[Figure labels: Hypothesis, Experimental design, Data, Results, Article, Data Management Plans]
Main topics
• Hierarchical organizations
    • Folders in folders
    • Open Science Framework
• File naming
    • Human readability
    • Machine readability
• “Tidy” data in spreadsheets
Hierarchical Organization
Putting your files into a folder system
my_project
    Data
    Notes
    protocols
    manuscripts
        Paper1
            Figures
            Text
            References
        Paper2
Folder systems
• Organize your data hierarchically
• Identify ways to divide your data into categories (Attributes)
• The top level of the hierarchy should reflect the most important attribute
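A folder hierarchy like the my_project example can also be created by script, which makes it easy to reuse the same layout across projects. The sketch below is only an illustration: the nesting (Paper1 and Paper2 inside manuscripts) is an assumption about the diagram, and `LAYOUT` and `make_tree` are hypothetical names.

```python
from pathlib import Path

# Assumed layout based on the my_project example; change it to match
# the attributes that matter most in your own project.
LAYOUT = {
    "Data": {},
    "Notes": {},
    "protocols": {},
    "manuscripts": {
        "Paper1": {"Figures": {}, "Text": {}, "References": {}},
        "Paper2": {},
    },
}


def make_tree(root: Path, layout: dict) -> list:
    """Create nested folders under root and return every folder path visited."""
    created = []
    for name, children in layout.items():
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(folder)
        created.extend(make_tree(folder, children))
    return created
```

Calling `make_tree(Path("my_project"), LAYOUT)` would create the folders in the current directory; running it twice does not raise an error because existing folders are left alone.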
Questions to ask
• What kinds of files are there? (See data inventory)
• How could you group them?
    • Project?
    • Time?
    • Location?
    • File type?
• What are the most important attributes?
Exercise: Organize files
• Download Lou’s files (look in the README file for insight)
    • http://tinyurl.com/hvna4mg
• Create a hierarchical folder structure for Lou
    • Drag his files into the correct folders
    • Fix Lou’s README
• Bonus: think about how you’d organize your data.
Example: Lou the first year
Lou is a first-year graduate student working on a project in a biomedical research laboratory. He’s trying to decipher data left by a former postdoc as a start for his thesis project. For one year, the postdoc recorded weight daily and cytokine levels monthly from 16 mice. Half were infected with a parasite, half were treated with saline.
• List the attributes of his project.
• How would you rank these attributes?
Attributes
• Time
• Infection Status
• Data Type
Tool: Open Science Framework
• Components
• Add-ons
• Contributors
• Wiki
http://help.osf.io/m/collaborating/l/524109-using-the-wiki
http://www.slideshare.net/DuraSpace/121014-slides-roadmap-to-the-future-of-share
Organization rules
• Be consistent
• One directory per project
• Separate subdirectories for
    • Raw data
    • Processed data
    • Code
    • Output
• Make raw data read-only
• Make README files
http://help.osf.io/m/60347/l/611391-organizing-files
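These rules can be applied with a few lines of Python. The sketch below is one possible approach, not a prescribed one: the subdirectory names in `SUBDIRS`, the README text, and the function names `init_project` and `make_read_only` are all assumptions made for illustration.

```python
import stat
from pathlib import Path

# Hypothetical subdirectory names, one per category in the rules above.
SUBDIRS = ["raw_data", "processed_data", "code", "output"]


def init_project(root: Path) -> None:
    """Create one directory per project with the standard subdirectories."""
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    readme = root / "README.txt"
    if not readme.exists():  # never overwrite an existing README
        readme.write_text("Describe the project, folder layout, and file naming rules here.\n")


def make_read_only(folder: Path) -> int:
    """Remove write permission from every file under folder; return the file count."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    count = 0
    for path in folder.rglob("*"):
        if path.is_file():
            path.chmod(path.stat().st_mode & ~write_bits)
            count += 1
    return count
```

After copying raw files into `raw_data`, calling `make_read_only(root / "raw_data")` protects them from accidental edits; analysis scripts then write only to `processed_data` and `output`.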
Components
• “Subprojects”
• Separate privacy settings, contributors, wiki, add-ons, and files.
• Examples:
    • Different projects: https://osf.io/82fba/
    • Clinical: https://osf.io/gq4mz/
    • Mix: https://osf.io/ezcuj/
    • File types: https://osf.io/if7ug/
    • Manuscript sections: https://osf.io/zmja2/
Demo: add files and components
Don’t panic!
• Just try something
• There’s no right answer
• Be consistent
• Write a README.txt file
http://4vector.com/i/free-vector-don-t-panic-clip-art_103946_Dont_Panic_clip_art_hight.png
File naming conventions
Make file names both human and machine readable.
Use descriptive names
• Bad name: file.txt
• Ok name: 05-07-2016-mouse-data.txt
• Good name: 2016-05-07-mouse-weight.tsv
• Human readability: name contains information about content
Go from general to specific
• Bad name: rep1-5-7-2016-gene-expression.csv
• Good name: 2016-05-07-gene-expression-rep1.csv
• Machine readability: can be sorted meaningfully
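A quick way to see why the date goes first, in year-month-day order, is to sort a few names as plain text. In this sketch only the first file name comes from the slide; the others are made up for illustration, and `leading_date` is a hypothetical helper.

```python
from datetime import date

# Names that start with YYYY-MM-DD (general to specific).
iso_names = [
    "2016-05-07-gene-expression-rep1.csv",
    "2015-12-01-gene-expression-rep1.csv",
    "2016-01-15-gene-expression-rep2.csv",
]

# The same files named month-day-year.
mdy_names = [
    "05-07-2016-gene-expression-rep1.csv",
    "12-01-2015-gene-expression-rep1.csv",
    "01-15-2016-gene-expression-rep2.csv",
]


def leading_date(name: str) -> date:
    """Parse the YYYY-MM-DD prefix of a file name."""
    return date.fromisoformat(name[:10])


iso_sorted = sorted(iso_names)  # plain text sort is also chronological
mdy_sorted = sorted(mdy_names)  # plain text sort groups by month, not by time

for name in iso_sorted:
    print(leading_date(name), name)
```

With the year first, an ordinary alphabetical sort in a file browser or script lists the files in the order they were collected. With the month first, the December 2015 file sorts after both 2016 files.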
Avoid abbreviations
• Bad name: “sprlbgp1”
• Good name: “spencer_lab_group_1”
• Human readability: no one understands your acronyms
Avoid spaces
• Alternatives
    • Dashes-are-cool.txt
    • I_also_like_underscores.txt
    • CamelCaseIsNeatToo.txt
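• Machine readability: shells and many scripts treat a space as a separator between arguments, so a name with spaces must be quoted or escaped
• Human readability: dashes, underscores, or capital letters still delineate words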
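The effect of a space can be seen without touching any files. The sketch below uses Python's standard `shlex` module, which splits a command line the way a POSIX shell would; the file names are made up for illustration.

```python
import shlex

# A space inside the name splits it into two separate arguments.
with_spaces = shlex.split("cat mouse weight.tsv")

# An underscore keeps the name together as one argument.
with_underscore = shlex.split("cat mouse_weight.tsv")

# Quoting works, but everyone has to remember to do it every time.
quoted = shlex.split('cat "mouse weight.tsv"')

print(with_spaces)
print(with_underscore)
print(quoted)
```

In the first case the command would look for two files, `mouse` and `weight.tsv`, and neither is the file that was meant.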
Avoid special characters
• Bad characters: ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ' "
• Machine readability: can have special meanings in scripting languages
• Example: in Unix shells, ~ expands to your home directory
• Alternatives: underscore (_), dash (-), dot (.)
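One way to enforce this rule is to clean names automatically. The following is a minimal sketch, assuming that letters, digits, dots, dashes, and underscores are the only characters to keep; `clean_name` is a hypothetical helper and the example names are made up.

```python
import re


def clean_name(name: str) -> str:
    """Replace runs of spaces and special characters with a single underscore."""
    # Anything that is not a letter, digit, dot, underscore, or dash becomes "_".
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name.strip())
    # Avoid an underscore directly before the extension dot.
    cleaned = re.sub(r"_+\.", ".", cleaned)
    # Drop underscores left at the start or end of the name.
    return cleaned.strip("_")


print(clean_name("mouse data (final)!.txt"))   # mouse_data_final.txt
print(clean_name("~weights&cytokines.tsv"))    # weights_cytokines.tsv
```

Names that already follow the rule pass through unchanged, so the function can be applied to a whole folder without harming good names.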
Be consistent
• Establishing standards makes data more findable
• Extending standards to everyone who works on a project is even better
Renaming files
• Ways to automate file renaming
    • Bulk Rename Utility (Windows, free)
    • Renamer 5 (Mac)
    • PSRenamer (Linux, Mac, or Windows, free)
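A short script is another option when the old names follow a pattern. The sketch below assumes names shaped like the earlier bad example, rep1-5-7-2016-gene-expression.csv (tag, month, day, year, description), and rewrites them as date, description, tag. `PATTERN`, `new_name`, and `bulk_rename` are hypothetical names, and the pattern would need adjusting for other naming schemes.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed old pattern: <tag>-<M>-<D>-<YYYY>-<description>.<ext>
PATTERN = re.compile(
    r"^(?P<tag>[A-Za-z0-9]+)-(?P<m>\d{1,2})-(?P<d>\d{1,2})-(?P<y>\d{4})"
    r"-(?P<desc>.+)\.(?P<ext>\w+)$"
)


def new_name(old: str) -> Optional[str]:
    """Return the general-to-specific name, or None if old does not match."""
    match = PATTERN.match(old)
    if match is None:
        return None
    g = match.groupdict()
    return f"{g['y']}-{int(g['m']):02d}-{int(g['d']):02d}-{g['desc']}-{g['tag']}.{g['ext']}"


def bulk_rename(folder: Path, dry_run: bool = True) -> list:
    """List (old, new) name pairs; rename the files only when dry_run is False."""
    plan = []
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        target = new_name(path.name)
        if target is None or (folder / target).exists():
            continue  # skip non-matching names and never overwrite a file
        plan.append((path.name, target))
        if not dry_run:
            path.rename(folder / target)
    return plan
```

Running with `dry_run=True` first prints nothing and changes nothing; it only returns the planned renames so they can be checked before any file is touched.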
Exercise: Rename Lou’s files
• Use descriptive names
• General to specific
• Avoid abbreviations, spaces and special characters
• Be consistent
Tidy data
How to organize your data efficiently in spreadsheets
Spreadsheets as lab notebook
• Color coding
• Formatting
• Notes
• Calculations
• Graphs/Tables
Downsides
• Computers don’t understand notes/formatting/color coding
• Calculations/Graphs/Tables in spreadsheets are inefficient
• “Tidy data” + automation = saved time
Using spreadsheets wisely
• Don’t put multiple tables in one sheet
• Don’t use multiple sheets
• Use descriptive field names
• Don’t mix notes and data
Tidy Data
1. Columns as variables
• Don’t combine multiple pieces of info in one column
2. Rows as observations
• One measured value
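As an illustration of these two rules, the sketch below reshapes a "wide" table, where each date is its own column, into a tidy one with one measured value per row. The mouse IDs, dates, weights, and column names are made up, and `wide_to_tidy` is a hypothetical helper written with plain Python rather than any particular library.

```python
def wide_to_tidy(rows, id_field, var_name, value_name):
    """Turn every non-id column of each row into its own observation row."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_field:
                continue
            tidy.append({id_field: row[id_field], var_name: key, value_name: value})
    return tidy


# Wide layout: the date variable is hidden in the column headers.
wide = [
    {"mouse_id": "m01", "2016-01-01": 21.3, "2016-01-02": 21.5},
    {"mouse_id": "m02", "2016-01-01": 19.8, "2016-01-02": 20.1},
]

# Tidy layout: columns are variables, each row is one weight measurement.
tidy = wide_to_tidy(wide, "mouse_id", "date", "weight_g")
for row in tidy:
    print(row)
```

The tidy version has three columns (mouse_id, date, weight_g) and four rows, so adding another day of measurements means adding rows rather than a new column.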
Exercise: Tidy Lou’s data
• Open MouseInventory.xls
    • Is he using spreadsheets wisely?
    • Is each column a variable?
    • Is each row an observation?
• Open the January files for both weight and cytokines
    • What variables are being measured (i.e., what columns should we have)?
    • Can we combine some of these tables?
Exercise: Data carpentry ecology
• Lesson: http://www.datacarpentry.org/spreadsheet-ecology-lesson/
• File: https://ndownloader.figshare.com/files/2252083
• Goal: combine data from first 2 tabs into one table
    • Make a new tab, don’t edit the raw data!
Example: Supplemental_data_1_xls
• https://figshare.com/articles/Supplemental_data_1_xls/4055544
• Description: “Table of the results given by HPLC analysis of the samples. Key: Rt, retention time; +, presence of peak; -, absence of peak.”
Example: cck8_xls
• https://figshare.com/articles/cck8_xls/3505772
• Description: “This data are from CCK-8 assay and ELISA.”
Example: endo_et_al_table1_xls
• https://figshare.com/articles/endo_et_al_table1_xls/2069573
• Description: Table 1 from Endo et al. PeerJ 2016 https://doi.org/10.7717/peerj.1562/table-1
Need help?
• Email: tobin.magle@colostate.edu
• Data Management Services website: http://lib.colostate.edu/services/data-management
• Data Carpentry: http://www.datacarpentry.org/
• Software Carpentry: http://software-carpentry.org/