Introduction to Scientific Workflows and the KEPLER System Instructors: Bertram Ludaescher Ilkay...

Introduction to Scientific Workflows and the KEPLER System
Instructors:
Overview
11:15-12:00 Scientific Workflows in KEPLER live demo, brains-on session
… but first, one more time … (déjà déjà vu)
Information Integration Challenges:
Grid middleware technologies
+ e.g. single sign-on, platform independence, transparent use of remote resources, …
Syntax & Structure
heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …)
heterogeneous schemas (one for each DB ...)
Database mediation technologies
Semantics
Knowledge representation & semantic mediation technologies
+ “smart” data discovery & integration
+ e.g. ask about X (‘mafic’); find data about Y (‘diorite’); be happy anyways!
Scientific Workflows, B. Ludaescher & I. Altintas
Information Integration Challenges:
Synthesis of applications, analysis tools, data & query components, … into “scientific workflows”
How to make use of these wonderful things & put them together to solve a scientist’s problem?
Scientific Problem Solving Environments (PSEs)
GEON Portal and Workbench (“scientist’s view”)
+ ontology-enhanced data registration, discovery, manipulation
+ creation and registration of new data products from existing ones, …
GEON Scientific Workflow System (“engineer’s view”)
+ for designing, re-engineering, deploying analysis pipelines and scientific workflows; a tool to make new tools …
+ e.g., creation of new datasets from existing ones, dataset registration,…
What is a Scientific Workflow (SWF)?
Goals:
automate a scientist’s repetitive data management and analysis tasks
typical phases:
Typical requirements/characteristics:
advanced programming constructs (map(f), zip, takewhile, …)
logging, provenance, “registering back” (intermediate) products…
… easy to recognize a SWF when you see one!
Promoter Identification Workflow
Ecology: GARP Analysis Pipeline for Invasive Species Prediction
Source: NSF SEEK (Deana Pennington et al., UNM)
[Figure: GARP analysis pipeline. Recoverable box labels: EcoGrid Query; Layer Integration; +A1, +A2, +A3; Sample Data; Data Calculation; Map Generation; Validation; User; Generate Metadata; Archive; Native range prediction; Invasion]
Digression:
(Business) Workflows and Systems
or: what you need to know when someone wants to sell you one ;-)
or: the remote relatives (2nd-3rd cousins?) of scientific workflows
What is a (Business) Workflow?
Workflow management (also called Business Process Management) is the coordination of work processes through software.
A workflow management system routes pending activities to process participants according to a model of the process.
WF management systems have been around since the late 1970s (e.g. Officetalk, Xerox PARC)
marketing waves: Office Automation (70’s-80’s), Business Process Reengineering (90’s), Web Services Choreography (00’s)
roots/related: document management apps, email system apps, database apps (active DBMS’s, federated DBMS’s)
Meanwhile (’69-’71) elsewhere: Flow-based programming (J. Paul Morrison)
… not quite workflow but rather dataflow … (we’ll come to that…)
Src/cf: http://www.workflow-research.de/index.htm, M. zur Muehlen, 2003
Some History
Play Time @ Petri Nets World
Petri Nets are the underlying abstract model of many B-WfMS’s (who said I can’t do bad acronyms, too? ;-)
http://www.daimi.au.dk/PetriNets/
http://www.daimi.au.dk/PetriNets/introductions/aalst/
Formal Basis: Petri Nets
Mathematical model of discrete distributed systems (named after Carl Adam Petri, 1960’s)
Provides a modeling language w/ rich theory, analysis tools, …
A Petri net consists of places (P), transitions (T) and directed arcs (P→T or T→P). Places can hold tokens.
A transition is enabled if each of its input places contains at least one token.
An enabled transition can fire, removing input tokens and producing output tokens
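The enabling and firing rules above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: every arc has weight one, a marking is a plain dictionary from place name to token count, and the example net (T1 feeding P2 and P3, T2 consuming both) is hypothetical rather than the net drawn on the slide.

```python
from typing import Dict, List, Tuple

# A transition is described by its input places and its output places.
# All arcs are assumed to have weight 1 in this sketch.
Transition = Tuple[List[str], List[str]]


def is_enabled(marking: Dict[str, int], transition: Transition) -> bool:
    """A transition is enabled if every input place holds at least one token."""
    inputs, _ = transition
    return all(marking.get(place, 0) >= 1 for place in inputs)


def fire(marking: Dict[str, int], transition: Transition) -> Dict[str, int]:
    """Fire an enabled transition: remove one token from each input place
    and add one token to each output place. Returns a new marking."""
    if not is_enabled(marking, transition):
        raise ValueError("transition is not enabled")
    inputs, outputs = transition
    new_marking = dict(marking)
    for place in inputs:
        new_marking[place] -= 1
    for place in outputs:
        new_marking[place] = new_marking.get(place, 0) + 1
    return new_marking


# Hypothetical net: T1 takes a token from P1 and puts one in P2 and one in P3;
# T2 needs a token in both P2 and P3 and produces one in P4.
T1: Transition = (["P1"], ["P2", "P3"])
T2: Transition = (["P2", "P3"], ["P4"])

m0 = {"P1": 1, "P2": 0, "P3": 0, "P4": 0}
m1 = fire(m0, T1)  # T1 is enabled in m0, T2 is not
m2 = fire(m1, T2)  # after T1 fired, T2 becomes enabled
```

In the initial marking only T1 can fire; T2 becomes enabled once both of its input places have received a token.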
[Figure: Petri net with places P1, P2, P3, P4 and transitions T1, T2; one transition is marked “Enabled”]
Why Petri Nets
Lots of analysis techniques, tools, theory
boundedness (state space),
safety (bad things do not happen),
reversibility,
deadlock(-freeness),
In a Flux: WS-XX-“Standards”
Source: W.M.P. van der Aalst et al. http://tmitwww.tm.tue.nl/research/patterns/
http://tmitwww.tm.tue.nl/staff/wvdaalst/Publications/publications.html
Everything Flows? But what exactly?
Dataflow
Activity diagrams: data flows through actions
Process networks: data flows between processes
Control-flow
Nodes are control-flow operations that start other operations on a state
Mixed approaches
Petri nets: tokens mark control and dataflow
Workflow languages: mix control and dataflow
… many others …
Scientific “Workflows” vs Business Workflows
Business Workflows (BPEL4WS* …)
Tasks, documents, etc. undergo modifications (e.g., flight reservation from reserved to ticketed), but modified WF objects still identifiable throughout
Complex control flow, complex process composition (danger of control flow/dataflow “spaghetti”)
Dataflow and control-flow are often divorced!
Scientific “Workflows”
Grid-aspects
Data, tool, and analysis integration
Dataflow and control-flow are often married! (can be a happy marriage… at times…)
*Business Process Execution Language for Web Services (in case you wondered)
Scientific “Workflows”: Some Findings
Need for “programming extensions”
Need for abstraction and nested workflows
Need for data transformations (WS1 → DT → WS2)
Need for rich user interaction & workflow steering:
pause / revise / resume
select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF
Need for high-throughput data transfers and CPU cycles: “(Data-)Grid-enabling”, “streaming”
Need for persistence of intermediate products and provenance
Perspectives on Systems
/ Dataflow View
A Dataflow Component (“Actor”)
Actor-Oriented Design
Object orientation:
class name
data
methods
call
return
What flows through an object is sequential control (cf. CCA, MPI)
Actor/Dataflow orientation:
actor name
data (state)
Output data
What flows through an object is a stream of data tokens
(in SWFs/KEPLER also references!!)
Object-Oriented vs.
Actor-Oriented Interfaces
Actor/Dataflow
Oriented
AO interface definition says “Give me text and I’ll give you speech”
OO interface gives procedures that have to be invoked in an order not specified as part of the interface definition.
Object Oriented
TextToSpeech
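A rough Python sketch of the contrast between the two interface styles. Both classes and all method names here are hypothetical illustrations, not the Ptolemy II or KEPLER API, and "speech" is faked as a tagged string.

```python
from typing import Iterable, Iterator


class TextToSpeechObject:
    """Object-oriented style: the caller has to know the required call order
    (open, then set_text, then speak); the signatures alone do not state it."""

    def __init__(self) -> None:
        self._ready = False
        self._text = ""

    def open(self) -> None:
        self._ready = True

    def set_text(self, text: str) -> None:
        if not self._ready:
            raise RuntimeError("open() must be called first")
        self._text = text

    def speak(self) -> str:
        if not self._ready:
            raise RuntimeError("open() must be called first")
        # Stand-in for audio output: a tagged string instead of real samples.
        return f"<speech:{self._text}>"


def text_to_speech_actor(texts: Iterable[str]) -> Iterator[str]:
    """Actor-oriented style: one input stream of text tokens,
    one output stream of speech tokens ("give me text, I give you speech")."""
    for text in texts:
        yield f"<speech:{text}>"


spoken = list(text_to_speech_actor(["hello", "world"]))
```

With the actor version the only contract is the type of the token streams; with the object version a wrong call order surfaces only at run time.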
Ptolemy II
History
Parallel schedulers
C/VHDL/DSP code generators
Optimizing SDF schedulers
PtPlot (1997-??)
KEPLER:
KEPLER = “Ptolemy II + X” for Scientific Workflows
An “early” example: Promoter Identification SSDBM, AD 2003
Scientist models application as a “workflow” of connected components (“actors”)
If all components exist, the workflow can be automated/executed
Different directors can be used to pick appropriate execution model (often “pipelined” execution: PN director)
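To give a feel for “pipelined” execution in the process-network style, here is a minimal Python sketch: each actor runs in its own thread and blocks when it reads from an empty channel. It is not the Ptolemy II PN director; the three actors, the use of `queue.Queue` as a channel, and the `DONE` end-of-stream marker are assumptions made for this illustration.

```python
import queue
import threading
from typing import Callable, List

DONE = object()  # end-of-stream marker (an assumption of this sketch)


def source(items: List[int], out: queue.Queue) -> None:
    """Emit every item as a token, then signal the end of the stream."""
    for item in items:
        out.put(item)
    out.put(DONE)


def transform(f: Callable[[int], int], inp: queue.Queue, out: queue.Queue) -> None:
    """Apply f to each incoming token and pass the result downstream."""
    while True:
        token = inp.get()  # blocks while the input channel is empty
        if token is DONE:
            out.put(DONE)
            return
        out.put(f(token))


def sink(inp: queue.Queue, results: List[int]) -> None:
    """Collect tokens until the end-of-stream marker arrives."""
    while True:
        token = inp.get()
        if token is DONE:
            return
        results.append(token)


a, b = queue.Queue(), queue.Queue()
results: List[int] = []
threads = [
    threading.Thread(target=source, args=([1, 2, 3, 4], a)),
    threading.Thread(target=transform, args=(lambda x: x * x, a, b)),
    threading.Thread(target=sink, args=(b, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each channel is a FIFO with a single writer and a single reader, the collected results do not depend on how the threads are scheduled, and downstream actors can start working before the source has finished.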
Why Ptolemy II (and thus KEPLER)?
Ptolemy II Objective:
“The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.”
Dataflow Process Networks w/ natural support for abstraction, pipelining (streaming), actor-orientation, actor reuse
User-Orientation
“Application/Glue-Ware”
run-time support, monitoring, …
not a middle-/underware (we use someone else’s, e.g. Globus, SRB, …)
but middle-/underware is conveniently accessible through actors!
PRAGMATICS
Ptolemy II is mature, continuously extended & improved, well-documented (500+pp)
open source system
The KEPLER/Ptolemy II GUI (Vergil)
“Directors” define the component interaction & execution semantics
Large libraries of polymorphic components (“actors”) and directors (drag & drop)
Ptolemy II: Actor-Oriented Modeling
Component (“actor”) interaction semantics not hard-wired inside components, but “factored out” in a “director”
Different directors for different modeling and execution needs (… can even be combined!)
Better abstraction, modeling, component reuse, …
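The idea of factoring the interaction semantics out of the components can be illustrated with a small Python sketch. The two “directors” below are hypothetical toy schedulers, not Ptolemy II directors: they run the same unmodified chain of actors but in a different firing order, and a trace records which actor fired when.

```python
from typing import Callable, List, Tuple

Actor = Tuple[str, Callable[[int], int]]  # (name, firing function)


def token_at_a_time_director(actors: List[Actor], tokens: List[int]):
    """Push each token through the whole chain before taking the next one."""
    trace: List[str] = []
    outputs: List[int] = []
    for token in tokens:
        for name, fire in actors:
            token = fire(token)
            trace.append(name)
        outputs.append(token)
    return outputs, trace


def stage_at_a_time_director(actors: List[Actor], tokens: List[int]):
    """Let each actor process the whole batch before the next actor starts."""
    trace: List[str] = []
    batch = list(tokens)
    for name, fire in actors:
        batch = [fire(token) for token in batch]
        trace.extend([name] * len(batch))
    return batch, trace


# The actors know nothing about scheduling; they only map a token to a token.
chain: List[Actor] = [("double", lambda x: 2 * x), ("inc", lambda x: x + 1)]

out_a, trace_a = token_at_a_time_director(chain, [1, 2])
out_b, trace_b = stage_at_a_time_director(chain, [1, 2])
```

Both runs produce the same outputs for these stateless actors; only the firing order differs, and that order is decided entirely by the director.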
Behavioral Polymorphism in Ptolemy
These polymorphic methods implement the communication semantics of a domain in Ptolemy II. The receiver instance used in communication is supplied by the director, not by the component.
(cf. CCA, WS-??, [G]BPL4??, … !)
[Figure labels: producer actor, consumer actor, IOPort, Receiver]
Behavioral polymorphism is the idea that components can be defined to operate with multiple models of computation and multiple middleware frameworks.
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
Director
Domains and Directors: Semantics for
Component Interaction
CT – continuous-time modeling
DE – discrete-event systems
Giotto – synchronous periodic
PN – process networks
SDF – synchronous dataflow
For (coarse grained) Scientific Workflows!
For (finer-grained) concurrent jobs!?
Polymorphic Actor Components Working Across Data Types and Domains
Actor Data Polymorphism:
Add strings (concatenation)
Add user-defined types
Actor Behavioral Polymorphism:
In dataflow, add when all connected inputs have data
In a time-triggered model, add when the clock ticks
In discrete-event, add when any connected input has data, and add in zero time
In process networks, execute an infinite loop in a thread that blocks when reading empty inputs
In CSP, execute an infinite loop that performs rendezvous on input or output
In push/pull, ports are push or pull (declared or inferred) and behave accordingly
In real-time CORBA, priorities are associated with ports and a dispatcher determines when to add
By not choosing among these when defining the component, we get a huge increment in component re-usability. But how do we ensure that the component will work in all these circumstances?
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
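A small Python sketch of the two kinds of polymorphism for an “add” component. Everything here is an assumption made for illustration: `None` stands for “no token on this input”, `Vec` is a made-up user-defined type, and only the dataflow and discrete-event rules from the list above are shown.

```python
from dataclasses import dataclass
from functools import reduce
from operator import add
from typing import Any, List, Optional


@dataclass(frozen=True)
class Vec:
    """Hypothetical user-defined token type that supports '+'."""
    x: float
    y: float

    def __add__(self, other: "Vec") -> "Vec":
        return Vec(self.x + other.x, self.y + other.y)


def add_tokens(tokens: List[Any]) -> Any:
    """Data polymorphism: works for any token type that defines '+'."""
    return reduce(add, tokens)


def dataflow_add(inputs: List[Optional[Any]]) -> Optional[Any]:
    """Dataflow rule: add only when all connected inputs have data."""
    if any(token is None for token in inputs):
        return None  # not enough data yet, so the actor does not fire
    return add_tokens(inputs)


def discrete_event_add(inputs: List[Optional[Any]]) -> Optional[Any]:
    """Discrete-event rule: add when any connected input has data."""
    present = [token for token in inputs if token is not None]
    if not present:
        return None
    return add_tokens(present)
```

The same `add_tokens` core is reused under both firing rules, which is the reuse argument made on the slide; deciding when to fire is left to the surrounding model of computation.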
Directors and Combining Different Component Interaction Semantics
Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Possible app. in SWF:
Component Composition & Interaction
each component is its own director!
But still useful for special applications, e.g. parallel programs (MPI, …)
Source: GRIST/SC4DEVO workshop, July 2004, Caltech
[Figure labels: DIR1, DIR2, DIR3, DIR4]
CCA via special (“look the other way”) Director(s)?
Dataflow in CCA
a CCA “convention” can be used to accommodate actor-oriented/dataflow modeling
CCA/Message Passing in KEPLER
Kepler/Ptolemy can be extended to accommodate message passing semantics (CSP is already in Ptolemy II)
CCA!?
Data/Control-Flow Spectrum
WYSIWYG (usually)
References flow
generic handling still possible
Application specific tokens flow
“invisible contract” between components
Director is unaware of what’s going on … (sounds familiar? ;-)
Specific messages passing protocols (e.g., CSP, MPI)
for systems of tightly coupled components
“clean” data(=ctl)-flow
special tokens flow
KEPLER/CSP: Contributors, Sponsors, Projects
Ilkay Altintas SDM, Resurgence
Kim Baldridge Resurgence, NMI
Zhengang Cheng SDM
Dan Higgins SEEK
Efrat Jaeger GEON
Matt Jones SEEK
Werner Krebs EOL
Kai Lin GEON
Mark Miller EOL
Steve Mock NMI
KEPLER: An Open Collaboration
Initiated by members from NSF SEEK and DOE SDM/SPA; now several other projects
Open Source (BSD-style license)
get a CVS account (read-only)
local development + contribution via existing KEPLER member
be voted “in” as a member/co-developer
Software & social engineering
How to better accommodate different usage/contribution models (core dev … special purpose extender … user)?
GEON Dataset Generation & Registration
(a co-development in KEPLER)
% Makefile
KEPLER then …
… and KEPLER today…
… so, you see,
scientific workflows need domain and data-polymorphic actors & must scale to HPC!
KEPLER Pedigree (to be determined…)
Ptolemy
KEPLER
A Few Specific Kepler Features
Web Services Actors
Similarly: MM workflow design & sharing w/o implemented components
Recent Actor Additions
Digression: Who are the clients?
Domain scientists
C/Perl/Python/Java/WS/DB-enabled ones
Goal: make the life better for both!
Workflow automation
Plumbing support
For the Geoscientist:
GEON Mineral Classification Workflow
This triangular diagram for classification and nomenclature of gabbroic rock is chosen for a specific point according to the values it has in the ModalData DB. It is used when the point has values for Plagioclase, Pyroxene and Olivine.
The ModalData provides the mineral info.
… inside the Classifier
BrowserUI actor w/ SVG client display
in KEPLER (interactive session)
Source: Dan Higgins, Kepler/SEEK
in KEPLER (w/ editable script)
Source: Dan Higgins, Kepler/SEEK
A Closer Look at Dataflow …
(or: Do you know what’s going on under your carpet?)
control tokens flow, e.g., from “$”-actor to FileReader and ImageReader actors
actual dataflow is “under the carpet” and through handles (file system, GridFTP, scp, SRB, …)
Dataflow: what you see is what you get (almost…)
Need for a general way to handle references!
GEON Data Registration UI
GEON Data Registration in KEPLER
Registered Resources show up in Vergil (joint SEEK, SPA, GEON, … Registry!?)
Data Analysis: Biodiversity Indices
Traffic info for a list of highways: Uses iterate (higher-order “map”) actor to access highway info web service repeatedly, sending out one email per highway.
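The shape of this workflow can be sketched in Python as follows. The web service and the email step are replaced by hypothetical local stand-ins (`traffic_info_service`, `send_email`) with canned data, and `iterate` is a plain-function approximation of the higher-order actor, not KEPLER code.

```python
from typing import Callable, Dict, List


def traffic_info_service(highway: str) -> str:
    """Hypothetical stand-in for the highway info web service (canned data)."""
    canned: Dict[str, str] = {"I-5": "slow near downtown", "I-8": "clear"}
    return canned.get(highway, "no data")


def send_email(outbox: List[str], highway: str, report: str) -> None:
    """Hypothetical stand-in for the email actor: records instead of sending."""
    outbox.append(f"Traffic on {highway}: {report}")


def iterate(f: Callable[[str], str], items: List[str]) -> List[str]:
    """Higher-order "map": apply the sub-workflow f to every list element."""
    return [f(item) for item in items]


highways = ["I-5", "I-8", "I-15"]
reports = iterate(traffic_info_service, highways)

outbox: List[str] = []
for highway, report in zip(highways, reports):
    send_email(outbox, highway, report)  # one message per highway
```

The service call is written once and lifted over the list, so the workflow needs no hand-built loop counter or branching logic.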
Re-engineered PIW w/ Iteration Constructs AD 2004
map(GenbankWS)
Streaming Real-time Data
ORB
Job Management (here: NIMROD)
Results database: under development
Goal: 1000’s of GAMESS jobs (quantum mechanics) – Fall/Winter’04
KEPLER Today
Design, share, prototype, run, monitor, deploy, …
Coarse-grained scientific workflows, e.g.,
Fine grained workflows and simulations, e.g.,
Database access, XSLT transformations, …
real-time data streaming (ROADNet)
Status
nightly builds w/ version tests
“Link-Up Sister Project” w/ other SWF systems (UK Taverna, Triana, …)
Participation in various workshops and conferences (GGF10, SSDBMs, eScience WF workshop, …)
KEPLER Tomorrow
Application-driven extensions:
SciRUN?, PnetCDF?, PVFS(2)?, MPI-IO?, parallel-R?, ASPECT?, FastBit, …
support for execution of new SWF domains
Astrophysics: TSI/Blondin (SPA/NCSU)
(C-z; bg; fg)-ing (“detach” and reconnect)
workflow deployment models
time series, parameter sweeps, job scheduling, …
hybrid type system with semantic types
Consolidation
Desiderata for and Features of Scientific Workflow Automation
SWF design support
better component reuse through actor-oriented modeling w/ (largely) independent directors
Rapid prototyping support
Shell/command line actor
Workflow “plumbing” support
Runtime support
Execution monitoring
animation for SDF, planned “heartbeat” for PN, …
listening to and logging of token flow through ports and control messages of directors
Pause-inspect-modify-resume cycle
F I N
Additional material ahead
Research (and Development) Issues
…some challenges and ideas…
“Service Composition, Orchestration” and all that stuff
Instead of asking which WS-XXX solves this for you, ask: What is my WF composition problem?
Also: there is a good amount of previous work, most notably from the Ptolemy group itself:
How do you model systems as interacting components
How do you model component interaction

“Programming Patterns”
Traffic info for a list of highways: Uses iterate (higher-order “map”) actor to access highway info web service repeatedly, sending out one email per highway.
hand-crafted control solution; also: forces sequential execution!
designed to fit
A Scientific Workflow Problem:
Solution based on declarative, functional dataflow process network
(= also a data streaming model!)
Higher-order constructs: map(f)
no control-flow spaghetti
PIW := map(piw) over [GeneId]
A Scientific Workflow Problem:
map(GenbankWS)
A Research Problem:
Optimization by Rewriting
optimization via functional rewriting possible
e.g. map(f o g) = map(f) o map(g)
Technical report & PIW specification in Haskell
map(f o g) instead of map(f) o map(g)
Combination of map and zip
http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
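The rewrite rule map(f o g) = map(f) o map(g) can be checked on a toy example in Python. The two step functions are hypothetical placeholders, not the actual PIW steps from the technical report, and the equality is only guaranteed when f and g are free of side effects.

```python
from typing import Callable, List


def compose(f: Callable, g: Callable) -> Callable:
    """Function composition: (f o g)(x) = f(g(x))."""
    return lambda x: f(g(x))


def map_then_map(f: Callable, g: Callable, xs: List) -> List:
    """map(f) o map(g): two passes with an intermediate list in between."""
    intermediate = [g(x) for x in xs]
    return [f(y) for y in intermediate]


def fused_map(f: Callable, g: Callable, xs: List) -> List:
    """map(f o g): a single pass, no intermediate list."""
    h = compose(f, g)
    return [h(x) for x in xs]


# Hypothetical pure steps standing in for two workflow stages.
def normalize(gene_id: str) -> str:
    return gene_id.strip().upper()


def tag(gene_id: str) -> str:
    return f"GB:{gene_id}"


gene_ids = [" ab123 ", "cd456"]
unfused = map_then_map(tag, normalize, gene_ids)
fused = fused_map(tag, normalize, gene_ids)
```

Both variants yield the same list; the fused form avoids materializing the intermediate product, which is the kind of saving an optimizer could aim for when intermediate results are large.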
More Research…
Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney
KEPLER Tomorrow
Application-driven extensions:
SciRUN?, PnetCDF?, PVFS(2)?, MPI-IO?, parallel-R?, ASPECT?, FastBit, …
support for execution of new SWF domains
Astrophysics: TSI/Blondin (SPA/NCSU)
(C-z; bg; fg)-ing (“detach” and reconnect)
workflow deployment models
time series, parameter sweeps, job scheduling, …
hybrid type system with semantic types
Consolidation
More installers, regular releases, improved documentation, …
Towards a more concise Presentation Style …
Due to lack of time, some slides will be “by reference” only ;-)
…Each speaker was given four minutes to present his paper, as there were so many scheduled -- 198 from 64 different countries. To help expedite the proceedings, all reports had to be distributed and studied beforehand, while the lecturer would speak only in numerals, calling attention in this fashion to the salient paragraphs of his work. ... Stan Hazelton of the U.S. delegation immediately threw the hall into a flurry by emphatically repeating: 4, 6, 11, and therefore 22; 5, 9, hence 22; 3, 7, 2, 11, from which it followed that 22 and only 22!! Someone jumped up, saying yes but 5, and what about 6, 18, or 4 for that matter; Hazelton countered this objection with the crushing retort that, either way, 22. I turned to the number key in his paper and discovered that 22 meant the end of the world… [The Futurological Congress, Stanislaw Lem, translated from the Polish by Michael Kandel, Futura 1977]
References
http://c2.com/cgi/wiki?FlowBasedProgramming
http://c2.com/cgi/wiki?DataflowProgramming
http://c2.com/cgi/wiki?ActorsModel