Towards Reproducible Data Analysis
Using Container Technologies
Sergio Maffioletti
EnhanceR project director
UZH/S3IT
https://www.enhancer.ch
Disclaimer
What I’m presenting here is the result of personal experience plus the outcomes of various discussions within the EnhanceR project.
i.e.:
if you like the talk, congratulate me… if you don’t, blame EnhanceR
What are we going to talk about ?
• Context
• What is the user story we have in mind ?
• Let’s build the infrastructure support
• Let’s not stop here: building containers for/with end-users
• One more step: what do we put inside the container ?
• Main challenges and open questions
Who is EnhanceR again ?
What problems are we facing ?
Reproducible data analysis
“Reproducibility is just collaboration with people you don’t know, including yourself next week”
- Philip Stark, UC Berkeley Statistics
Context
Repeatability (Same team, same experimental setup): The measurement can be obtained with stated precision by the same team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same location on multiple trials. For computational experiments, this means that a researcher can reliably repeat her own computation.
Replicability (Different team, same experimental setup): The measurement can be obtained with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using the author's own artifacts.
Reproducibility (Different team, different experimental setup): The measurement can be obtained with stated precision by a different team, a different measuring system, in a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using artifacts which they develop completely independently.
Let’s simplify...
Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226.
What is the user story we have in mind ?
On average, a researcher:
• develops on a personal server
• changes code and data as research progresses
• finally gets publishable results
  • sometimes running on a large-scale research IT infrastructure
• prepares slides / images / tables / manuscript
• publishes the manuscript
  • at the end of a review process
What is the user story we have in mind ?
Researcher’s side recommendations for Open Science:
● Share data, software, workflows and other digital artifacts.
● Persistent links should appear in the published article for data, code, and digital artifacts.
● Citation should be standard practice, to enable credit for shared digital scholarly objects.
● Document digital scholarly artifacts, to facilitate reuse.
● Use Open Licensing when publishing digital scholarly objects.
What does this mean for a service provider ?
● “Reproducible Data Analysis as a service” implies looking at the full stack of the service*
  ○ infrastructure + tools + competences + policies + best practices + support
● Why?
  ○ understand the user side: anticipate issues; steer adoption and development; enforce policies; better plan resources.
● In the end ?
  ○ we become a valuable asset for a research group
  ○ we actually help them
* I know - I’m intentionally skipping the business aspect of this...
Let’s build the infrastructure
what container technology
orchestration
integration with resource management
storage for data and container’s images
deployment and management
monitoring
Let’s build the infrastructure
validation and verification
automated policies
scanning
signing
https://www.docker.com
https://www.enhancer.ch/pipeline
Let’s not stop here: building containers for/with end-users
what to consider
● Automated build / integration with CI/CD
● design strategies
● naming schema
● Path binding
● documentation, metadata and runner script
competences
● version control - CI/CD
● container build process
opportunities
● development best practices
● embed policies
● standardise assumptions
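The naming schema, path binding and runner script items can be illustrated together. What follows is a minimal sketch, not the EnhanceR guideline: the `registry/project/tool:version` layout, the function names `image_reference` and `build_run_command`, and all example paths are hypothetical. The only external behaviour it relies on is `docker run` with its `--rm` flag and `-v host:container` bind syntax; the command is assembled but never executed.

```python
import shlex
from pathlib import Path
from typing import Dict, List


def image_reference(registry: str, project: str, tool: str, version: str) -> str:
    """Compose an image name from a fixed schema (hypothetical layout)."""
    return f"{registry}/{project}/{tool}:{version}"


def build_run_command(image: str, binds: Dict[str, str], command: List[str]) -> List[str]:
    """Assemble a `docker run` invocation with explicit host-to-container binds."""
    argv = ["docker", "run", "--rm"]
    for host_path, container_path in binds.items():
        # Absolute host paths keep the command independent of the working directory.
        argv += ["-v", f"{Path(host_path).resolve()}:{container_path}"]
    argv.append(image)
    argv.extend(command)
    return argv


if __name__ == "__main__":
    # All names below are placeholders chosen for illustration only.
    image = image_reference("example.org", "lab", "rnaseq-pipeline", "1.0.0")
    argv = build_run_command(
        image,
        {"input": "/data/in", "results": "/data/out"},
        ["run-analysis", "--input", "/data/in", "--output", "/data/out"],
    )
    print(shlex.join(argv))
```

Keeping the bind points (`/data/in`, `/data/out` here) fixed inside the image is one way to document the container's assumptions: users only choose the host side of each mapping.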
Container design strategies
https://www.enhancer.ch/pipeline
What do we put inside the container now ?
https://nbis-reproducible-research.readthedocs.io/en/course_1811/tutorial_intro/
what to consider:
● Track software dependencies:
● in-container executions:
competences:
● track requirements in sw development
● sw deployment - CI/CD
opportunities:
● end-user best practices
● better handling of sw dependencies
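As one possible illustration of tracking software dependencies, the sketch below records the exact versions of the Python packages installed in the current environment. It is an assumption that the analysis is written in Python; the function name `pinned_requirements` is hypothetical, and the standard-library `importlib.metadata` module is the only dependency.

```python
from importlib import metadata
from typing import List


def pinned_requirements() -> List[str]:
    """List installed distributions as `name==version` lines, sorted by name."""
    pins = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        if meta is None:
            continue  # skip installs with unreadable metadata
        name = meta.get("Name")
        version = meta.get("Version")
        if name and version:
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Writing this list into the image at build time documents what it contains.
    for line in pinned_requirements():
        print(line)
```

Run inside the container at build time, the output could be stored next to the analysis code so that the software stack behind a published result can be inspected later.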
Open questions
Infrastructure / Pull
● what containers shall I allow on my infrastructure ?
● how do I make sure the cited container is exactly what I’m getting ?
● how do I verify and validate containers when we deploy them on our infrastructure ?
● how do I know what the container is doing ?
● how do I know whether the container has the latest security patch ?
Run
● how do I make sure a deployed container runs ‘as documented’ on my data ?
● “how do I find a container that I need for running RNAseq ?”
Build
● what assumptions can I make when building a container and what should I try to avoid ?
  ○ data mapping in and out / user privileges /
● where do I publish my container and how do I get a DOI for the publication ?
● how do I publish my container so that people can find it for their purposes ? (metadata)
● how do I describe/document my container’s behaviour ?
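One partial answer to "is the cited container exactly what I’m getting ?" is to compare a content digest. The sketch below assumes the image is available as a single local file (for example an exported archive) and that the publication cites its SHA-256 digest; the function names and the `sha256:` prefix handling are illustrative choices. Container registries compute their own digests over image manifests, which this sketch does not reproduce.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_cited_digest(image_path: Path, cited_digest: str) -> bool:
    """Compare a local image file against the digest cited in a publication."""
    expected = cited_digest.strip().lower()
    if expected.startswith("sha256:"):
        expected = expected[len("sha256:"):]
    return sha256_of_file(image_path) == expected
```

A matching digest only shows that the bytes are identical to what was cited; it says nothing about what the container does or whether it is patched, which remain separate questions on this list.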
Main challenges
● Social
  ○ adoption by end-users
  ○ how to address: “is it worth the investment ?”
● Technical
  ○ scale-out / orchestration
  ○ integration of specialised resources (e.g. GPU)
  ○ multi-tenancy - privileges
  ○ documented assumptions within the containers
  ○ maintenance
    ■ bugfix and security
  ○ portability vs performance
Acknowledgments
● Guidelines for pipeline interoperability using containers
  ○ https://www.enhancer.ch/pipeline
● Survey for Research IT Infrastructure providers
  ○ https://forms.gle/JBW78qDPWabd4GDR8