Software workflows as research objects & GigaGalaxy
Rob L Davidson, Chris I Hunter
ISI CODATA International Training Workshop on Big Data
11th March 2015
DOI: 10.6084/m9.figshare.1330219
Article: http://econ.st/1o12gCN
• Big data! (The new oil)
• New dot com bubble?
Article: http://bit.ly/1AN8ysJ
Source: @flowchainsensei
Analysis
Software
Article: http://bit.ly/1xdCxbY
Article: http://bit.ly/1Mdll03
Yay, we’re all unicorns!
from: Are you recruiting a data scientist or a unicorn?
http://ubm.io/1Gpxizh
But why are we sad unicorns?
Measuring software reproducibility
• Systematic study:
• 515 papers (429 conference, 86 journal)
• <30% reproducible
http://reproducibility.cs.arizona.edu
Reasons for failure
“The good news is that I was able to find some code. I am just hoping that it is a stable working version of the code... I have lost some data... The bad news is that the code is not commented and/or clean. So, I cannot really guarantee that you will enjoy playing with it.”
http://reproducibility.cs.arizona.edu
Cost of failure
• Wasted time
• Wasted money
• Frustration
• Distrust
How to fix it
The path to enlightenment
• Look to the experts (4 x 10 simple rules)
• Share code
– Licenses
• Share environment
– Codify the environment
• Share workflows
– All parameters, versions, order of steps
– GalaxyProject.org
• Share outputs
– Share intermediate results
– Share code for figures
– Codify publications
Look to the experts
A word from the experts: 1
• Keep it simple
– Don’t be a perfectionist
– Aim for multiple versions
– Optimise/improve later
– Get feedback/help from community
• Hastings #1 + Prlic #5
A word from the experts: 2
• Versioning
– Use a version control system (e.g. Git, hosted on GitHub)
– Let others know which version they are using
– Release early, release often (Linus Torvalds)
– Get help from community
• Seemann #3, Hastings #10, Sandve #3/4
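One way to let others know which version they are using is to have the tool report it itself. The following is a minimal sketch, not something shown in the talk: the tool name mytool and the version string are hypothetical placeholders, and the version would normally be bumped alongside each tagged release in the repository.

```python
import argparse

# Hypothetical version string; update it with every tagged release.
__version__ = "0.1.0"


def build_parser():
    """Build a command-line parser that can report the tool's version."""
    # "mytool" is a placeholder name used only for this illustration.
    parser = argparse.ArgumentParser(prog="mytool")
    # argparse's built-in "version" action prints the string and exits.
    parser.add_argument(
        "--version", action="version", version=f"%(prog)s {__version__}"
    )
    return parser
```

Calling build_parser().parse_args() in the tool's entry point would make "mytool --version" print the version, so a user can quote it in a methods section or bug report.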
A word from the experts: 3
• Use good coding practice
– You don’t have to be the best
– Learn from others
– Become involved in a community
– Write as though others will be watching
• Prlic #2 + all of Seemann and Hastings
A word from the experts: highlight
• Start simple
• Release early
• Use versioning
• Build a community
• Get community feedback, testing, support
• …but wait, won’t that mean???
Sharing code
Sharing code
• “Scientific software…public release is then only considered around the time of publication” – Prlic #4
• “the fear of getting scooped”
– Reality: “staking a claim in the field”
Sharing code: don’t worry
• Share early
– Be simple
– Don’t be a perfectionist
• CRAPL license
Source: http://matt.might.net/articles/crapl/
Sharing code: licenses
• Know your licenses
– Apache License 2.0
– BSD 3-Clause “New” or “Revised”
– BSD 2-Clause “Simplified” or “FreeBSD”
– GNU General Public License (GPL)
– MIT
– Mozilla Public License 2.0
– etc.
Source: http://opensource.org/licenses
Sharing code: repositories
• GitHub
• SourceForge
• Zenodo
• GigaDB/GigaGalaxy
• Versioning, sharing, collaboration, community feedback
Sharing environment
Your environment
• How hard would it be to start from scratch?
• What if you move from Ubuntu to CentOS?
• If it took you a while to set up your box, or if you hesitate to set it up for your colleagues…
– Create a virtual machine or Docker image that can be shared whole
– Time-stamp of a working system
Share your environment
• Virtual machine
– Copy your exact environment
– If it works for you, it works for anyone
– Reproducibility, frozen in time
DOI: 10.1186/2047-217X-3-23
Share your environment
• Docker
– ‘Light’ VM
– Discrete unit of code + environment
– Can be called like a compiled tool
• New possibilities, e.g. nucleotid.es benchmarking
– Data-driven peer review
http://nucleotid.es/
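To illustrate what "called like a compiled tool" can look like from a script, here is a minimal sketch that wraps a container invocation in a Python function. It assumes Docker is installed locally; the image name in the usage note and the /data mount point are hypothetical and not taken from the talk.

```python
import subprocess


def docker_command(image, tool_args, data_dir=None):
    """Return the argument list for running a containerised tool once."""
    argv = ["docker", "run", "--rm"]  # --rm deletes the container afterwards
    if data_dir is not None:
        # Mount a host directory at /data inside the container (assumed path).
        argv += ["-v", f"{data_dir}:/data"]
    return argv + [image] + list(tool_args)


def run_tool(image, tool_args, data_dir=None):
    """Run the container and return its standard output (needs Docker)."""
    result = subprocess.run(
        docker_command(image, tool_args, data_dir),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

With a hypothetical image, run_tool("example/assembler:1.0", ["--help"]) would behave much like calling a local binary, while the pinned image tag records exactly which code and environment were used.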
Share your environment
• VM = black box?
• Docker == black box!
• http://ivory.idyll.org/blog/vms-considered-harmful.html
Codify your environment
• Provisioning scripts are ‘research objects’
• Improves adaptability (easier to recode for alternative OS etc.)
• Builds in extra documentation
• Easier to share, although GigaDB still wants a compiled snapshot (i.e. full machine)
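A lighter-weight companion to a provisioning script is a machine-readable record of the environment an analysis actually ran in. The sketch below is an illustration under stated assumptions, not a provisioning system: it only covers the Python interpreter, the operating system string and the installed Python packages, and the function names are made up for this example.

```python
import json
import platform
from importlib import metadata


def describe_environment():
    """Collect the interpreter, OS and installed Python package versions."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip broken installs that carry no name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def write_environment(path):
    """Save the environment record as JSON so it can be shared with the code."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(describe_environment(), handle, indent=2)
```

A file written this way can sit next to the analysis code in the repository; it documents the environment but does not rebuild it, which is what Vagrant, Docker or similar tools are for.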
Short list of provisioning systems
• Vagrant
• Chef
• Salt
• Puppet
• Ansible
• Many more – see link for info
http://bit.ly/1wrYiuI
Sharing workflows
Share your workflow
• Any analysis is a string of tools with a great many parameters
• The order of the sequence, the version of each part and the inputs and outputs are never fully explained
• These should be shared!
• Help is at hand: there are many ‘workflow’ systems for this
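The information a shared workflow needs to carry (order of steps, tool versions, parameters, inputs and outputs) can be sketched as a small data structure. This is only an illustration of the idea, not the format used by Galaxy or any other workflow system; the tool names, versions, file names and parameters below are invented.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    """One tool invocation in the analysis."""
    tool: str
    version: str
    inputs: list
    outputs: list
    parameters: dict = field(default_factory=dict)


def workflow_to_json(name, steps):
    """Serialise the steps in execution order, numbering them from 1."""
    record = {
        "workflow": name,
        "steps": [
            dict(order=i, **asdict(step)) for i, step in enumerate(steps, start=1)
        ],
    }
    return json.dumps(record, indent=2)


# Hypothetical two-step analysis; every value here is made up.
steps = [
    Step("trim_reads", "1.2.0", ["reads.fastq"], ["trimmed.fastq"], {"min_quality": 20}),
    Step("assemble", "2.0.1", ["trimmed.fastq"], ["contigs.fasta"], {"kmer": 31}),
]
print(workflow_to_json("example-assembly", steps))
```

Workflow systems record this kind of detail automatically, which is the main argument for using one rather than maintaining such a file by hand.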
Workflow systems
• Galaxy
• KNIME
• Taverna
• Many more…
• GigaScience uses Galaxy
– galaxy.cbiit.cuhk.edu.hk
Galaxy
Over 36,000 main Galaxy server users
Over 1,000 papers citing Galaxy use
Over 55 Galaxy servers deployed
Open source
http://galaxyproject.org
Galaxy User Interface
[Screenshot panels: Tool list | Tool parameters | History/results]
Galaxy: Under the hood
<tool name="myfunction">
  <command>python myfunction input1</command>
  <inputs>
    <param format="txt" name="input1"/>
  </inputs>
  <outputs>
    <data format="csv" name="output1"/>
  </outputs>
</tool>
• Basic XML ‘wrapper’
• Describes inputs and outputs
• Calls command
• Monitors for output
• Logs/returns to ‘history’
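Because the wrapper is plain XML, the description of a tool's command, inputs and outputs can be read by any XML parser. The sketch below uses Python's standard library on the simplified wrapper from the slide (with straight quotes and self-closed param and data elements so that it parses). It is not how Galaxy itself loads tools, and real Galaxy wrappers carry more attributes than this simplified one.

```python
import xml.etree.ElementTree as ET

# Simplified wrapper from the slide, made well-formed for parsing.
WRAPPER = """
<tool name="myfunction">
  <command>python myfunction input1</command>
  <inputs>
    <param format="txt" name="input1"/>
  </inputs>
  <outputs>
    <data format="csv" name="output1"/>
  </outputs>
</tool>
"""


def summarise_wrapper(xml_text):
    """Extract the command and the declared inputs/outputs from a wrapper."""
    root = ET.fromstring(xml_text)
    return {
        "tool": root.get("name"),
        "command": root.findtext("command", default="").strip(),
        # Map each declared name to its declared format.
        "inputs": {p.get("name"): p.get("format") for p in root.findall("inputs/param")},
        "outputs": {d.get("name"): d.get("format") for d in root.findall("outputs/data")},
    }


print(summarise_wrapper(WRAPPER))
```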
Galaxy Workflow: visualise
Galaxy Workflow: export
Citable workflow
Add as supplemental files or publish with a distinct DOI via GigaDB or figshare
Galaxy Toolshed
https://toolshed.g2.bx.psu.edu/
Many 'omics, stats, visualisations
2700+ tools!
Download; run instantly
GigaGalaxy
Web site: galaxy.cbiit.cuhk.edu.hk
SOAPdenovo2 workflows implemented in galaxy.cbiit.cuhk.edu.hk
Implemented the entire workflow in our Galaxy server, including:
• 3 pre-processing steps
• 4 SOAPdenovo modules
• 1 post-processing step
• Evaluation and visualization tools
Will also be available to download by >50K Galaxy users in galaxy.cbiit.cuhk.edu.hk
Can we reproduce results? SOAPdenovo2 S. aureus pipeline
Sharing outputs
Share outputs – intermediate results
• Workflow systems help with this
– Results in history
• If a part of your analysis can’t be replicated because it
– Requires a license
– Is no longer compatible
– Just plain won’t work
• The rest of the analysis can still be used
Share outputs – code for figures
• Data transform for figures
– Remove points?
– 3D: choose ‘best angle’?
– PCA: choose ‘best components’?
• Figure choice
– Bar chart or box & whisker?
• Allow reinterpretation!!!
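Sharing the code behind a figure makes choices such as removed points visible and reversible. The following is a minimal sketch under assumptions of my own: the function name, the drop_above threshold and the choice of box-and-whisker statistics are illustrative, and only the standard library is used so that no plotting package is required.

```python
import statistics


def prepare_figure_data(values, drop_above=None):
    """Apply the figure's data transform and record every choice made."""
    kept = [v for v in values if drop_above is None or v <= drop_above]
    removed = [v for v in values if drop_above is not None and v > drop_above]
    # Quartiles of the retained points, as a box-and-whisker plot would show.
    q1, median, q3 = statistics.quantiles(kept, n=4)
    return {
        "choices": {"drop_above": drop_above, "chart": "box-and-whisker"},
        "removed_points": removed,
        "summary": {
            "min": min(kept),
            "q1": q1,
            "median": median,
            "q3": q3,
            "max": max(kept),
        },
    }
```

A reader who disagrees with the threshold can rerun the function with drop_above=None and see what the figure would have looked like with every point included.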
Share outputs – codify publication
• “This article is an example of a literate programming document. It has been created in R using the knitr package. Figures and tables in this paper are generated dynamically as the document is compiled. Several R packages are required to run the analysis. Materials are archived in the Gigascience database”
DOI: 10.1186/2047-217X-3-3
Literate coding options
• See listing: http://www.gigasciencejournal.com/content/3/1/19
– R: knitr, Sweave, R Markdown
– JavaScript: Tangle, Active Markdown (CoffeeScript)
– Python: IPython Notebooks
– iReport links this functionality for Galaxy
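The core idea of a literate document, that numbers in the text are computed when the document is built rather than typed in by hand, can be shown without any of the tools listed above. This sketch uses Python's standard string.Template as a stand-in; it is not knitr or a notebook, and the sentence and measurements are invented for illustration.

```python
import statistics
from string import Template

# Report text with placeholders that are filled in at build time.
REPORT = Template(
    "We analysed $n samples. The mean value was $mean "
    "(standard deviation $sd)."
)


def render_report(values):
    """Compute the statistics and substitute them into the report text."""
    return REPORT.substitute(
        n=len(values),
        mean=f"{statistics.mean(values):.2f}",
        sd=f"{statistics.stdev(values):.2f}",
    )


# Invented measurements, for illustration only.
print(render_report([4.1, 3.9, 4.4, 4.0]))
```

If the data change, rebuilding the document updates every reported number, so text and analysis cannot drift apart.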
SUMMARY
The path to enlightenment
• Look to the experts (4 x 10 simple rules)
• Share code
– Licenses
• Share environment
– Codify the environment
• Share workflows
– All parameters, versions, order of steps
– GalaxyProject.org
• Share outputs
– Share intermediate results
– Share code for figures
– Codify publications
All Your Research Objects
• Project proposal
• Project experimental SOPs
• Images of equipment, subjects, conditions
• Raw data
• Metadata
• Analysis code, parameters, pipelines
• Analysis environment, VM or provisioning script
• Intermediate results
• Publication figures/images/tables: codify
• Publication text
@gigascience
facebook.com/GigaScience
Scott Edmunds
Peter Li
Chris Hunter
Rob Davidson
Jesse Si Zhe
Nicole Nogoy
Laurie Goodman
Amye Kenall (BMC)
www.gigadb.org
galaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com
blogs.biomedcentral.com/gigablog/