2016-09-04 BioExcel SIG, ECCB, Amsterdam
Advances in Scientific Workflow Environments
Carole Goble, Stian Soiland-Reyes
The University of Manchester
[email protected]
http://esciencelab.org.uk/
What is a Workflow?
• Orchestrating multiple computational tasks
• Managing the control and data flow between them
• In a world that is homogeneous or heterogeneous
• Tasks
– Local / remote
– Local / third party
– White, grey or black boxes
– Reliable / fragile
– Reserved / dynamic
– Various underpinning infrastructure
– Various access controls
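A minimal sketch of the definition above, in Python: tasks are plain functions, the data flow is a mapping from each task to the tasks whose outputs it consumes, and a small runner orchestrates them in dependency order. The task names (`fetch`, `clean`, `analyse`) and the `run_workflow` helper are hypothetical and chosen only for illustration; no particular workflow system works exactly this way.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical tasks; each receives the outputs of its upstream tasks.
def fetch():
    return [3, 1, 2]

def clean(fetch):
    return sorted(fetch)

def analyse(clean):
    return sum(clean) / len(clean)

TASKS = {"fetch": fetch, "clean": clean, "analyse": analyse}
# Data flow: task -> tasks it depends on.
DEPENDS_ON = {"fetch": [], "clean": ["fetch"], "analyse": ["clean"]}

def run_workflow(tasks, depends_on):
    """Run every task once, after all of its dependencies have run."""
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        inputs = {dep: results[dep] for dep in depends_on[name]}
        results[name] = tasks[name](**inputs)
    return results

results = run_workflow(TASKS, DEPENDS_ON)
print(results["analyse"])
```

Keeping the dependency map separate from the task bodies is what lets a runner reorder, parallelise or relocate tasks without touching their code.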
BioExcel: Biomolecular recognition
What is a Workflow?
Automation
– Automate computational aspects
– Repetitive pipelines, sweep campaigns
Scaling – compute cycles
– Make use of computational infrastructure & handle large data
Abstraction – people cycles
– Shield complexity and incompatibilities
– Report, re-use, evolve, share, compare
– Repeat – Tweak – Repeat
– First class commodities
Provenance – reporting
– Capture, report and utilize log and data lineage auto-documentation
– Traceable evolution, audit, transparency
– Compare
With thanks to Bertram Ludascher: WORKS 2015 Keynote
Findable, Accessible, Interoperable, Reusable (Reproducible)
https://pegasus.isi.edu/2016/02/11/pegasus-powers-ligo-gravitational-waves-detection-analysis/
Laser Interferometer Gravitational-Wave Observatory – first detection of gravitational waves from colliding black holes
Morphological, hemodynamic and structural analyses linked to aneurysm genesis, growth and rupture.
[Susheel Varma] http://www.vph-share.eu/
http://taverna.org.uk
Galaxy https://usegalaxy.org/
Marine metagenomics
Workflow Driven
+ Bespoke Scripts
[Rob Finn]
Open PHACTS
https://www.knime.org/
BioExcel workflow
https://www.openphacts.org/
Targets
Pharmacological queries
target, compound and pathway data
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115460
Scripts, Ensemble toolkit, execution patterns
http://www.extasy-project.org/
http://www.myexperiment.org
WF Zoo
Workflow Patterns, templates
Data wrangling & analytics
Simulations
Instrument pipelines++
http://tpeterka.github.io/maui-project/
The Future of Scientific Workflows, Report of DOE Workshop 2015, http://science.energy.gov/~/media/ascr/pdf/programdocuments/docs/workflows_final_report.pd
Garijo et al., Common Motifs in Scientific Workflows: An Empirical Analysis, FGCS, 36, July 2014, 338–351
Workflow Patterns, templates
• Long running and complex code
• Tunable parameters and input sets
• Simulation sweeps / iterations
• Ensembles, comparisons
• Tricky set-ups, human-in-the-loop interaction
• Computational steering
• In situ workflows – multiple tasks, same box, within fixed time
– data locality
– human-in-the-loop
– capture provenance
Data wrangling & analytics
Simulations
Instrument pipelines++
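The "tunable parameters" and "simulation sweeps" bullets above can be illustrated with a short sketch. The `simulate` function is a hypothetical stand-in for a long-running simulation code, and the parameter names are invented; the point is only the pattern of fanning one code out over a grid of inputs and collecting an ensemble for comparison.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def simulate(temperature, pressure):
    """Hypothetical stand-in for a long-running simulation code."""
    return temperature * pressure

def sweep(temperatures, pressures, max_workers=4):
    """Run one simulation per parameter combination and collect the ensemble."""
    combos = list(product(temperatures, pressures))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(lambda combo: simulate(*combo), combos))
    return dict(zip(combos, outputs))

ensemble = sweep([280, 300], [1, 2])
# Compare ensemble members, e.g. pick the combination with the largest output.
best = max(ensemble, key=ensemble.get)
print(best, ensemble[best])
```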
Traction + Examples
Reuse behaviours
Exploratory vs Production
Different kinds of user / deployment
Developer – User Ratios
Biologist, Developer, Computational Scientist
Embed in Application
Embed in platform
Embed in infrastructure
Existing computational research workflow systems
https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems
WFMS Zoo
“Multi-scale” WFMS
• Workflow Management System
– Its design and reporting environment
– Its execution environment
• The tasks
– tools, codes and services and their execution environments
• Stack layer
– App level, infrastructure level
Component making
Tasks loosely coupled through files
• execute on geographically distributed clusters, clouds, grids across systems
• execute on multiple facilities
• call host services (web / grid services)
DAIC: Distributed Area/Instrument Computing
“Multi-scale” WFMS
Tasks tightly coupled
• exchanging info over memory/storage
• network of supercomputers
• In situ workflows – multiple tasks, same box, within fixed time
HPC
Interoperability, Portability, Granularity, Maintenance
Workflow Environment Ecosystem
Copernicus workflow engine for parallel adaptive molecular dynamics
• Peer-to-peer distributed computing platform
– high-level parallelization of statistical sampling problems
• Consolidation of heterogeneous compute resources
• Automatic resource matching of jobs against compute resources
• Automatic fault tolerance of distributed work
• Workflow execution engine to define a problem (reporting) and trace its results live (provenance)
• Flexible plugin facilities
– programs to be integrated to the workflow execution engine
Free Energy Workflow using GROMACS
http://copernicus-computing.org/
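As a rough illustration of the "resource matching" and "fault tolerance" bullets above, the sketch below matches a job to the smallest idle resource that satisfies its core count, and re-matches it elsewhere if that resource fails. This is not Copernicus code or its actual algorithm; the resource records, `match_job`, `run_with_failover` and `flaky_execute` are all assumptions made for the example.

```python
def match_job(job, resources):
    """Return the name of the smallest idle resource with enough cores."""
    fits = [r for r in resources if r["idle"] and r["cores"] >= job["cores"]]
    if not fits:
        return None
    return min(fits, key=lambda r: r["cores"])["name"]

def run_with_failover(job, resources, execute):
    """Try matching resources in turn; re-match the job if one fails mid-run."""
    pool = [dict(r) for r in resources]  # work on a copy
    while (name := match_job(job, pool)) is not None:
        try:
            return name, execute(job, name)
        except RuntimeError:
            # Take the failed resource out of play so the job goes elsewhere.
            for r in pool:
                if r["name"] == name:
                    r["idle"] = False
    raise RuntimeError("no resource could complete the job")

# Hypothetical resources and a deliberately unreliable executor.
RESOURCES = [
    {"name": "laptop", "cores": 4, "idle": True},
    {"name": "cluster", "cores": 64, "idle": True},
    {"name": "cloud", "cores": 16, "idle": True},
]

def flaky_execute(job, resource):
    if resource == "cloud":
        raise RuntimeError("node lost")
    return f"{job['name']} done on {resource}"

used, message = run_with_failover({"name": "md_run", "cores": 8}, RESOURCES, flaky_execute)
print(used, message)
```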
COMPSs/PyCOMPSs: Programmer Productivity framework
• Sequential programming
– Parallelisation and distribution heavy-lifting
– Dependency detection
• Infrastructure unaware
– Abstract application from underlying infrastructure
– Portability
• Standard Programming Languages
– Java, Python, C/C++
• No (or few!) APIs
– Standard Java
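The idea of writing sequential-looking code while a runtime detects dependencies and parallelises the calls can be sketched with the Python standard library. This is not the framework's own API: the `task` decorator below is a hypothetical toy built on `concurrent.futures`, assumed here only to show the programming model.

```python
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def task(func):
    """Hypothetical decorator: run func asynchronously once its inputs exist."""
    def submit(*args):
        def run():
            # Dependency detection: any Future argument is an upstream task.
            resolved = [a.result() if isinstance(a, Future) else a for a in args]
            return func(*resolved)
        return _pool.submit(run)
    return submit

@task
def square(x):
    return x * x

@task
def add(a, b):
    return a + b

# Reads like sequential code; the two square() calls are independent
# and may run in parallel, while add() waits for both.
total = add(square(3), square(4))
print(total.result())
```

The caller never names threads, hosts or queues, which is the "infrastructure unaware" property the slide describes.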
Shield the user/programmer
Exposure to the infrastructure
System Design
Resource provisioning
Adaptive/dynamic workflows
Manage/minimize data transfers
Smart parallelism
Code staging
Data staging
Fail-over
Human in the loop
OS/R Guarantees
Service Guarantees
Stop Press!
GUIs not essential!
• Canvas, drag-drop blocks, arrows, run button
• Command-line & embedding in developer or user applications
Scripts can be workflows!
• WMS <-> Scripts
• Script vs Workflows/ASAP:
– Automation: *****
– Scaling: **
– Abstraction: *
– Provenance: **
Work close to a problem-specific ad-hoc data model
Domain Specific Language "programming-lite" scripts
• wire with declarative "makefile"-like DAG
Plus
• procedural scripting and expressions in languages like Javascript and Python
Nextflow, Snakemake, Common Workflow Language
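A minimal sketch of the "makefile"-like wiring described above: each rule declares the files it reads and the file it writes, and the DAG is derived by matching inputs to outputs. The rule names and file names are hypothetical, and this is plain Python rather than the syntax of any of the systems named above.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Declarative part: each rule names the files it reads and the file it writes.
RULES = {
    "align": {"inputs": ["reads.fq"], "output": "aligned.bam"},
    "sort": {"inputs": ["aligned.bam"], "output": "sorted.bam"},
    "report": {"inputs": ["sorted.bam", "reads.fq"], "output": "report.txt"},
}

def build_order(rules):
    """Derive the DAG by matching each rule's inputs to other rules' outputs."""
    producer = {rule["output"]: name for name, rule in rules.items()}
    graph = {
        name: {producer[f] for f in rule["inputs"] if f in producer}
        for name, rule in rules.items()
    }
    return list(TopologicalSorter(graph).static_order())

order = build_order(RULES)
print(order)
```

The procedural part (what each rule actually runs) would sit alongside these declarations as ordinary script code or expressions.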
GUIs Are Essential
take-up by the user base
Workflowising script software eco-systems
prime example: provenance
ASAP
• common, interoperable provenance recording
– W3C PROV
• YesWorkflow.org
– Annotations in script yield workflow view
• Library profilers
– noWorkflow
• runtime provenance recorders
– Sumatra, RDataTracker
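To illustrate how annotations in a script can yield a workflow view, the sketch below embeds YesWorkflow-style comment tags (`@begin`, `@in`, `@out`, `@end`) in a small script and extracts the data-flow edges with a regular expression. The extractor is a toy written for this example, not the YesWorkflow tool itself, and the script and block names are hypothetical.

```python
import re

# A hypothetical script, held as text, with annotation comments.
SCRIPT = '''
# @begin clean_data
# @in raw_file
# @out clean_file
def clean_data(raw_file): ...
# @end clean_data

# @begin plot
# @in clean_file
# @out figure
def plot(clean_file): ...
# @end plot
'''

def extract_blocks(source):
    """Collect the inputs and outputs declared for each annotated block."""
    blocks, current = {}, None
    for tag, name in re.findall(r"#\s*@(begin|in|out|end)\s+(\w+)", source):
        if tag == "begin":
            current = name
            blocks[name] = {"in": [], "out": []}
        elif tag == "end":
            current = None
        elif current is not None:
            blocks[current][tag].append(name)
    return blocks

blocks = extract_blocks(SCRIPT)
# Data-flow edges: a block feeds another if it produces something the other reads.
edges = [(a, b) for a in blocks for b in blocks
         if set(blocks[a]["out"]) & set(blocks[b]["in"])]
print(edges)
```

Because the annotations are comments, the workflow view is obtained without executing or modifying the script.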
Provenance: the link between computation and results
W3C PROV model standard
record for reporting
compare diffs/discrepancies
provenance analytics
track changes, adapt
partial repeat/reproduce
carry attributions
compute credits
compute data quality/trust
select data to keep/release
optimisation and debugging
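A small sketch of recording provenance as the link between computation and results. It borrows two relation names from W3C PROV (`used`, `wasGeneratedBy`) but stores them as plain Python tuples; it is not a PROV serialisation, and the `ProvLog` class and the activity names are hypothetical.

```python
import hashlib

def digest(data: bytes) -> str:
    """Identify a data entity by a short content hash."""
    return hashlib.sha256(data).hexdigest()[:12]

class ProvLog:
    """Toy recorder using PROV relation names (not a PROV serialisation)."""

    def __init__(self):
        self.used = []              # (activity, entity)
        self.was_generated_by = []  # (entity, activity)

    def run(self, activity, func, data: bytes) -> bytes:
        result = func(data)
        self.used.append((activity, digest(data)))
        self.was_generated_by.append((digest(result), activity))
        return result

    def lineage(self, entity):
        """Walk back from an entity through the activities that led to it."""
        generated = dict(self.was_generated_by)
        used = dict(self.used)
        steps = []
        while entity in generated:
            activity = generated[entity]
            steps.append(activity)
            entity = used[activity]
        return steps

log = ProvLog()
upper = log.run("uppercase", bytes.upper, b"acgt")
final = log.run("reverse", lambda b: b[::-1], upper)
print(log.lineage(digest(final)))
```

With such a record, the uses listed above (reporting, comparing runs, partial re-execution) become queries over the log rather than manual bookkeeping.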
Metadata propagation – where was the physical sample collected, and who should be attributed?
Task-based abstractions: simplifying provenance using motifs and tool annotations
“Free energy calculation” rather than 5 steps including preparation of PDB files and GROMACS execution
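The abstraction step can be sketched as follows: low-level steps in a trace are labelled with a higher-level task via tool annotations, and consecutive steps sharing a label collapse into one. The step names and the annotation table are hypothetical and only mirror the "Free energy calculation" example above.

```python
from itertools import groupby

# Hypothetical low-level trace, in execution order.
TRACE = ["fetch_pdb", "fix_pdb", "solvate", "run_gromacs", "collect_energies", "plot"]

# Hypothetical tool annotations: low-level step -> high-level task.
ANNOTATION = {
    "fetch_pdb": "Free energy calculation",
    "fix_pdb": "Free energy calculation",
    "solvate": "Free energy calculation",
    "run_gromacs": "Free energy calculation",
    "collect_energies": "Free energy calculation",
    "plot": "Reporting",
}

def summarise(trace, annotation):
    """Collapse consecutive steps that share the same high-level label."""
    labels = (annotation.get(step, step) for step in trace)
    return [label for label, _ in groupby(labels)]

summary = summarise(TRACE, ANNOTATION)
print(summary)
```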
Provenance: the link between workflow variants, reuse and repurposing
W3C PROV model standard?
record for reporting
compare diffs/discrepancies
provenance analytics
track changes, adapt
carry attributions
compute design credits
versioning, forking, cloning
Nested workflows: functions by stealth
Copy and paste fragmentation
Designing for reuse
Find and Go
Software practices
Systematic reuse
Guidelines for persistently identifying software using DataCite
https://epubs.stfc.ac.uk/work/24058274
https://www.force11.org/software-citation-principles
ASAP WFMS for FAIR Science
Automate: workflows, programs and services folks already use or want to use
Scale: Enable computational productivity
Abstract: Enable human productivity
Provenance: Record and use
Provenance
Reproducibility
Portability
Reuse
Usability
Understanding
Validation
Workflow Plugged in Code
Reporting Comparison
Interoperability
Thanks to Bertram Ludascher
Dependency Management
Codes Behaviours & Reliability
Application Building Blocks
BioExcel Virtualised Software Library
“transversal workflow units”, higher level operations
● Task-specific “mini-workflow” fragments
– e.g. using GROMACS, CPMD, HADDOCK
● Packaged
– EGI VM images and Docker containers
● Backed by existing registries
– ELIXIR’s bio.tools and EGI App DB
● Instantiated as cloud instances
– private (OpenNebula, OpenStack)
– public (e.g. Amazon AWS)
BioExcel Use cases
● Genomics
● Ensembl Molecular simulations
● Free Energy simulations
● Multiscale modelling of molecular basis for odor and taste
● Biomolecular recognition
● Pharmacological queries
● Virtual Screening
Finding valid pathways through free-energy landscapes: implementation of the “string of swarms” method using Copernicus as a workflow manager, and GROMACS as a compute engine.
Workflow Interoperability
• Common format for bioinformatics tool & workflow execution
• Community based standards effort
• Designed for clusters & clouds
• Supports the use of containers (e.g. Docker)
• Specify data dependencies between steps
• Scatter/gather on steps
• Nest workflows in steps
• Develop your pipeline on your local computer (optionally with Docker)
• Execute on your research cluster or in the cloud
• Deliver to users via workbenches
• EDAM ontology (ELIXIR-DK) to specify file formats and reason about them: “FASTQ Sanger” encoding is a type of FASTQ file
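The kind of reasoning in the last bullet, that a “FASTQ Sanger” file is acceptable wherever a FASTQ file is expected, can be sketched as a walk up a format hierarchy. The hierarchy below is a hypothetical fragment using labels only, not real EDAM identifiers, and `is_a` is an illustrative helper rather than part of any workflow engine.

```python
# Hypothetical fragment of a format hierarchy (child -> parent).
# Labels only; these are not real EDAM identifiers.
PARENT = {
    "FASTQ Sanger": "FASTQ",
    "FASTQ": "Sequence format",
    "BAM": "Alignment format",
}

def is_a(fmt, expected, parent=PARENT):
    """True if fmt is the expected format or one of its subtypes."""
    while fmt is not None:
        if fmt == expected:
            return True
        fmt = parent.get(fmt)
    return False

# A step that accepts FASTQ can take a FASTQ Sanger file, but not a BAM file.
print(is_a("FASTQ Sanger", "FASTQ"), is_a("BAM", "FASTQ"))
```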
Workflow Research Object Bundle
researchobject.org
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects, J Web Semantics doi:10.1016/j.websem.2015.01.003
application/vnd.wf4ever.robundle+zip
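A rough sketch of packaging files into a ZIP-based bundle that carries the media type above. It assumes the common convention of a first, uncompressed `mimetype` entry and a manifest at `.ro/manifest.json`; the manifest here is deliberately simplified and should be read as an assumption, not a complete or validated Research Object manifest. The file names are hypothetical.

```python
import io
import json
import zipfile

MIMETYPE = "application/vnd.wf4ever.robundle+zip"

def make_bundle(files):
    """Build an in-memory ZIP holding the media type, a manifest and the files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        # The media type goes first and uncompressed so tools can sniff it.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        # Simplified manifest (assumed layout): list what the bundle aggregates.
        manifest = {"aggregates": [{"uri": "/" + name} for name in files]}
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2),
                        compress_type=zipfile.ZIP_DEFLATED)
        for name, content in files.items():
            bundle.writestr(name, content, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()

data = make_bundle({"workflow.cwl": "cwlVersion: v1.0\n", "inputs.json": "{}"})
with zipfile.ZipFile(io.BytesIO(data)) as bundle:
    names = bundle.namelist()
    declared_type = bundle.read("mimetype").decode()
print(names, declared_type)
```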
Generic Grid middleware
Workflow bus: provide services for
1) Interoperability and integration
2) Composition
3) Provenance
4) Enactment
5) Human in the loop computing
Taverna Kepler Triana VLAMG
Sub workflow 1
Sub workflow 2
Sub workflow 3
Scientific experiment: a meta workflow
Sub workflow 4
Z. Zhao et al., “Workflow bus for e-Science”, in IEEE e-Science 2006, Amsterdam
2007
2015
http://bioexcel.eu/events/bioexcel-workflow-training-for-computational-biomolecular-research/
Adam Hospital (IRB), Anna Montras (IRB), Stian Soiland-Reyes (UNIMAN), Alexandre Bonvin (UU), Adrien Melquiond (UU), Josep Lluís Gelpí (BSC), Daniele Lezzi (BSC), Steven Newhouse (EBI), Jose A. Dianes (EBI), Mark Abraham (KTH), Rossen Apostolov (KTH), Emiliano Ippoliti (Jülich), Adam Carter (UEDIN), Darren J. White (UEDIN)
Slides: Bertram Ludascher, Ewa Deelman, Vasa Curcin, Paolo Missier, Pinar Alper, Susheel Varma, Rob Finn, Michael Crusoe, Rizos Sakellariou
Sign up ASAP!
Bonus Slides