Towards Scientific Workflows Based on Dataflow Process Networks (or from Ptolemy to Kepler)

Bertram Ludäscher
San Diego Supercomputer Center
ludaesch@SDSC.edu
A Note on the Style of the following Slides
Due to lack of time, most of the following slides will be “by reference” only ;-)
…Each speaker was given four minutes to present his paper, as there were so many scheduled -- 198 from 64 different countries. To help expedite the proceedings, all reports had to be distributed and studied beforehand, while the lecturer would speak only in numerals, calling attention in this fashion to the salient paragraphs of his work. ... Stan Hazelton of the U.S. delegation immediately threw the hall into a flurry by emphatically repeating: 4, 6, 11, and therefore 22; 5, 9, hence 22; 3, 7, 2, 11, from which it followed that 22 and only 22!! Someone jumped up, saying yes but 5, and what about 6, 18, or 4 for that matter; Hazelton countered this objection with the crushing retort that, either way, 22. I turned to the number key in his paper and discovered that 22 meant the end of the world… [The Futurological Congress, Stanislaw Lem, translated from the Polish by Michael Kandel, Futura 1977]
SEEK meeting, UCSB, 10/22-26/2003
www.nbirn.net
seek.ecoinformatics.org
sdm.lbl.gov/sdmcenter/
From: SciDAC/SDM project and collaboration w/ Matt Coleman (LLNL)
Details of the Functional MRI (Magnetic Resonance Imaging) Analysis Workflow (Jeffrey Grethe)
Collect data (K-Space images in Fourier space) from MR scanner while subject performs a specific task
Reconstruct K-Space data to image data (this requires scanner parameters for the reconstruction)
Now have anatomical and functional data
Pre-process the functional data
Correct for differences in slice acquisition timing (each slice in a volume is collected at a slightly different time), so that all slices appear to have been acquired at the same time
Now correct for subject motion (head movement in scanner) by realigning all functional images
Register the functional images with the anatomical image; all images are now in the same space (all aligned with one another)
Move all subjects into template space through non-linear spatial normalization. There exist atlas templates (made from many subjects) that one can normalize to so that all subjects are in the same space, allowing for direct comparison across subjects.
DATA VERIFICATION - check if all these procedures worked. If not, go back and try again (possibly tweaking some parameters for the routines or by re-doing some of it by hand).
Move on to statistics. First we do single-subject statistics: in addition to the images, information about the experimental paradigm is required. The results can be overlaid onto an anatomical image to create visual displays of brain activation during a particular task.
Can also combine statistical data from multiple subjects and do a group/population analysis and display these results.
Interactive nature of these workflows is critical (data verification) - can these steps be automated or semi-automated?
Need metadata from collection equipment and experimental design!
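One way the data verification step could be semi-automated is a bounded retry loop: run a stage, check the result, and re-run with tweaked parameters before handing control back to the user. The Haskell sketch below is illustrative only. `Params`, `preprocess`, `verified` and `tweak` are hypothetical stand-ins with toy behaviour, not part of any fMRI package.

```haskell
-- Hypothetical stage parameters (a single toy knob).
newtype Params = Params { gain :: Double } deriving (Eq, Show)

-- Hypothetical processing stage: scales every sample by the gain.
preprocess :: Params -> [Double] -> [Double]
preprocess p = map (* gain p)

-- Hypothetical verification criterion: all samples stay below 10.
verified :: [Double] -> Bool
verified = all (< 10)

-- Hypothetical parameter tweak applied after a failed check.
tweak :: Params -> Params
tweak p = Params (gain p / 2)

-- Run the stage, verify, and retry with tweaked parameters.
-- The first argument is the number of attempts still allowed.
runWithVerification :: Int -> Params -> [Double] -> Either String (Params, [Double])
runWithVerification n p xs
  | n <= 0       = Left "verification failed: manual intervention needed"
  | verified out = Right (p, out)
  | otherwise    = runWithVerification (n - 1) (tweak p) xs
  where
    out = preprocess p xs
```

With these toy definitions, `runWithVerification 3 (Params 4) [1, 2, 4]` succeeds on the second attempt, while a budget of one attempt ends in the `Left` case, which is where interactive steering would take over.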
GARP Invasive Species Pipeline
[Figure: pipeline diagram. Labels: EcoGrid Query, Layer Integration, +A1, +A2, +A3, Sample Data, Data Calculation, Map Generation, Validation, User, Generate Metadata, Archive, Native range prediction, Invasion]
not: documents/objects undergoing modifications
Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …)
Need for abstraction and nested workflows
Need for data transformations (compute/transform alternations)
Need for rich user interaction / steering:
pause & resume
select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF
Need for high-throughput transfers (“grid-enabling”, “streaming”)
Need for persistence of intermediate products
data provenance (“virtual data”; cf. several ITR and e-Science projects)
(Analytical) Pipelines …. (Scientific) Workflows
Spectrum of languages & formalisms:
Pipelines (a la Unix)
“Web page-flow”:
lots of standards to choose from: WfMC, BPML, BPEL4WS, …, XPDL, …
but often no clear semantics for constructs as simple as this:
Source: Expressiveness and Suitability of Languages for Control Flow Modelling in Workflows, PhD thesis, Bartosz Kiepuszewski, 2002
http://tmitwww.tm.tue.nl/research/patterns/
Business WF
Complex control flow, task-oriented

SWF
data-in and data-out of an analysis step are not the same object!
dataflow, data-oriented (cf. AVS/Express, Khoros, …)
re-run automatically (a la distrib. comp., e.g. Condor) or user-driven/interactively (based on failure type)
data integration & semantic mediation as part of SWF framework!

Batch oriented
HPC resource allocation & scheduling
Often highly interactive for decision making/steering of the WF and visualization (data analysis)
Transparent data access (Grid) and integration (database mediation & semantic extensions)
Desktop metaphor (“microworkflow”!?); often (but not always!) light-weight web service invocation
Recommendations following:
must read
must see (now: snippets following; watch for new ways to compress slides ;-)
must try
Bottom line:
a sophisticated system to do “simple” things (dataflows) as well as highly complex things (hybrid models)
(compare to your favorite standard/approach/system)
see!
try!
read!
In our (SEEK) terminology:
Kahn Process Networks (PN)
Concurrent processes communicating through one-way FIFO channels with unbounded capacity
A functional process F maps a set of input sequences into a set of output sequences (sounds like XSM!)
Consider increasing chains (wrt. prefix ordering “<”) of streams
F is monotonic if, along an increasing chain of sets of input sequences, the outputs never decrease: X < X' implies F(X) < F(X')
F is continuous if lub(Xs) exists for all increasing chains Xs and
F(lub(Xs)) < lub(F(Xs)) (together with monotonicity this gives F(lub(Xs)) = lub(F(Xs)))
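As a rough illustration of the prefix ordering above, a process with one input and one output channel can be modelled as a function on lazy lists. This is a sketch under that assumption, not Ptolemy-II code. The names `Process`, `leq`, `monotonicOn`, `double` and `lastOnly` are invented for the example, and the check only tests one pair of finite inputs rather than proving monotonicity.

```haskell
import Data.List (isPrefixOf)

-- A process with one input and one output channel, modelled as a
-- function on (possibly infinite) lazy lists.
type Process a b = [a] -> [b]

-- Prefix ordering on finite approximations of streams.
leq :: Eq a => [a] -> [a] -> Bool
leq = isPrefixOf

-- Check monotonicity for one pair of finite inputs:
-- whenever x is a prefix of x', f x must be a prefix of f x'.
monotonicOn :: (Eq a, Eq b) => Process a b -> [a] -> [a] -> Bool
monotonicOn f x x' = not (x `leq` x') || (f x `leq` f x')

-- A well-behaved process: output grows as input grows.
double :: Process Int Int
double = map (* 2)

-- Not a valid Kahn process: it must see the end of its input, so a
-- longer input can retract output that was already produced.
lastOnly :: Process Int Int
lastOnly [] = []
lastOnly xs = [last xs]
```

For the inputs `[1, 2]` and `[1, 2, 3]`, `monotonicOn double` holds and `monotonicOn lastOnly` does not. Because `double` never waits for the end of its input, it also works on unbounded streams such as `[1 ..]`.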
Process Networks (cont’d)
Network of functional processes can be described by a mapping
X = F(X,I)
X denotes all the sequences in the network (inputs I+outputs)
X that forms a solution is a fixed point
Continuity implies exactly one “minimal” fixed point
minimal in the sense of prefix ordering, for any inputs I
execution of the network: given I = ⊥, find the minimal fixed point (works because of the monotonic property)
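The fixed-point reading of X = F(X, I) can be tried out with lazy lists. The sketch below assumes a small invented network: an adder whose output is fed back through a unit delay holding the initial token 0. The names `step`, `runningSum` and `approximations` are made up for the example. Iterating from the empty stream (bottom) gives an increasing chain of prefixes whose limit is the least fixed point.

```haskell
import Data.Function (fix)

-- The network equation x = F(x, i) for an adder with a delayed
-- feedback loop (initial token 0 on the feedback channel).
step :: [Integer] -> [Integer] -> [Integer]
step i x = zipWith (+) i (0 : x)

-- Least fixed point, obtained directly through lazy evaluation.
runningSum :: [Integer] -> [Integer]
runningSum i = fix (step i)

-- Kleene iteration starting from the empty stream: every
-- approximation is a prefix of the next one.
approximations :: [Integer] -> [[Integer]]
approximations i = iterate (step i) []
```

For the input `[1, 2, 3, 4]` the approximations are `[]`, `[1]`, `[1,3]`, `[1,3,6]`, `[1,3,6,10]`, after which the chain is stationary and equals `runningSum [1, 2, 3, 4]`.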
Special case of PN
Ptolemy-II SDF overview
SDF supports efficient execution of dataflow graphs that lack control structures
for graphs with control structures, use Process Networks (PN)
requires that the rates on the ports of all actors be known beforehand
and do not change during execution
in systems with feedback, delays (represented by initial tokens on relations) must be explicitly noted
SDF uses this rate and delay information to determine the execution sequence of the actors before execution begins.
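To make the role of the fixed rates concrete, here is a minimal sketch of how a static schedule can be derived from them. It assumes a simple chain of actors without feedback and with positive rates. For each edge the balance equation is (firings of producer) × (tokens produced) = (firings of consumer) × (tokens consumed). `repetitions` and `schedule` are names invented for this example and are not the Ptolemy-II scheduler.

```haskell
import Data.Ratio (denominator, numerator, (%))

-- Each edge (p, c) connects actor k to actor k+1 in a chain:
-- the producer emits p tokens per firing, the consumer reads c.
-- Returns the smallest positive firing counts that balance every edge.
repetitions :: [(Integer, Integer)] -> [Integer]
repetitions edges = map (\q -> numerator (q * fromInteger scale)) qs
  where
    -- firing rates relative to the first actor
    qs = scanl (\r (p, c) -> r * (p % c)) 1 edges
    -- clear all denominators
    scale = foldr (lcm . denominator) 1 qs

-- One valid schedule for a feedback-free chain: fire each actor its
-- full number of repetitions, upstream actors first.
schedule :: [String] -> [Integer] -> [String]
schedule names reps =
  concat (zipWith (\n r -> replicate (fromInteger r) n) names reps)
```

For a producer emitting 2 tokens and a consumer reading 3, `repetitions [(2, 3)]` gives `[3, 2]`: three firings produce six tokens and two firings consume them, so the buffer returns to its initial state after each schedule iteration.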
Extended Kahn-MacQueen Process Networks
A process is considered active from its creation until its termination
An active process can block when trying to read from a channel (read-blocked), when trying to write to a channel (write-blocked) or when waiting for a queued topology change request to be processed (mutation-blocked)
A deadlock is when all the active processes are blocked
real deadlock: all the processes are blocked on a read
artificial deadlock: all processes are blocked, and at least one process is blocked on a write
to break an artificial deadlock, increase the capacity of the receiver with the smallest capacity among all the receivers on which a process is blocked on a write
If the increase results in a capacity that exceeds the value of maximumQueueCapacity, then instead of breaking the deadlock, an exception is thrown. This can be used to detect erroneous models that require unbounded queues.
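The deadlock-handling policy above can be sketched as a pure function. This is an illustration, not the Ptolemy-II implementation. The `Receiver` and `Outcome` types are invented, receiver names are assumed to be unique, and doubling the capacity is an assumption, since the slides do not say by how much the capacity grows.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Hypothetical view of a channel's receiving end at deadlock time.
data Receiver = Receiver
  { name         :: String
  , capacity     :: Int
  , writeBlocked :: Bool   -- is some process blocked writing to it?
  } deriving (Eq, Show)

data Outcome
  = RealDeadlock            -- nobody is write-blocked: all block on reads
  | Grown [Receiver]        -- artificial deadlock broken
  | QueueOverflow String    -- growth would exceed the maximum capacity
  deriving (Eq, Show)

-- Called when all active processes are blocked.
resolveDeadlock :: Int -> [Receiver] -> Outcome
resolveDeadlock maxQueueCapacity rs =
  case filter writeBlocked rs of
    [] -> RealDeadlock
    blocked ->
      let victim = minimumBy (comparing capacity) blocked
          newCap = 2 * capacity victim   -- assumption: double the capacity
      in if newCap > maxQueueCapacity
           then QueueOverflow (name victim)
           else Grown [ if name r == name victim
                          then r { capacity = newCap, writeBlocked = False }
                          else r
                      | r <- rs ]
```

The `QueueOverflow` case corresponds to the exception mentioned above: a model that keeps hitting it probably needs unbounded queues.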
Kepler = current Ptolemy-II plus X, where X = …
Extended type system (structural & semantic extensions)
Collection programming extensions (declarative/FP) and
Rich user interactions/workflow steering
(Eco-)Grid extensions:
Data and service repositories, discovery
Data provenance
… minus upcoming Ptolemy-II extensions!
The slower we are, the less we have to do ourselves ;-)
SemType m1 ::
Observation & itemMeasured.AbundanceCount &
See why we said user-definable (or auto-generated) actor libraries?
Complex backward control-flow
genBankG :: GeneId -> GeneSeq
genBankP :: PromoterId -> PromoterSeq
blast :: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac :: PromoterRegion -> [TFBS]
d1 = genBankG d0 -- get its gene sequence from GenBank
d2 = blast d1 -- BLAST to get a list of potential promoters
d3 = map genBankP d2 -- get list of promoter sequences
d4 = map promoterRegion d3 -- compute list of promoter regions and ...
d5 = map transfac d4 -- ... get transcription factor binding sites
d6 = zip d2 d4 -- create list of pairs promoter-id/region
d7 = map gpr2str d6 -- pretty print into a list of strings
d8 = concat d7 -- concat into a single "file"
d9 = putStr d8 -- output that file
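The d0 to d9 steps above can be made executable by plugging in stand-ins for the external services. In the sketch below every function body is a hypothetical stub that only tags its input string. The real GenBank, BLAST and TRANSFAC calls are not modelled, and the signature of `gpr2str` and the `piwAll` wrapper are assumptions.

```haskell
type GeneId = String
type GeneSeq = String
type PromoterId = String
type PromoterSeq = String
type PromoterRegion = String
type TFBS = String

-- Hypothetical stubs: each one just tags its argument.
genBankG :: GeneId -> GeneSeq
genBankG g = "G:" ++ g

genBankP :: PromoterId -> PromoterSeq
genBankP p = "P:" ++ p

blast :: GeneSeq -> [PromoterId]
blast s = [s ++ "/p1", s ++ "/p2"]

promoterRegion :: PromoterSeq -> PromoterRegion
promoterRegion s = "R:" ++ s

transfac :: PromoterRegion -> [TFBS]
transfac r = ["T:" ++ r]

-- Assumed signature: pretty-print one promoter-id/region pair.
gpr2str :: (PromoterId, PromoterRegion) -> String
gpr2str (p, r) = p ++ " -> " ++ r ++ "\n"

-- The pipeline for a single gene: binding sites plus the report text.
piw :: GeneId -> ([[TFBS]], String)
piw d0 = (d5, d8)
  where
    d1 = genBankG d0
    d2 = blast d1
    d3 = map genBankP d2
    d4 = map promoterRegion d3
    d5 = map transfac d4
    d6 = zip d2 d4
    d7 = map gpr2str d6
    d8 = concat d7

-- Lifting the single-gene pipeline to a list of genes.
piwAll :: [GeneId] -> [([[TFBS]], String)]
piwAll = map piw
```

`putStr (snd (piw "g1"))` prints one line per candidate promoter. Because `piw` is a pure function, it lifts to a list of gene ids with `map` alone.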
Simplified Process Network PIW
(= a data streaming model!)
no control-flow spaghetti
free concurrent execution
free type checking
automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]
map(f)-style
iterators
PIW as a declarative, referentially transparent functional process
optimization via functional rewriting possible
e.g. map(f o g) = map(f) o map(g)
Details:
map(f o g) instead of map(f) o map(g)
Combination of map and zip
http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
Rewritings require that data transformation semantics is known
e.g., Haskell-like for FP and SQL (XQuery)-like for (XML) database querying
Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney
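The map rewriting mentioned above can be checked on a small example. The sketch uses two arbitrary functions `f` and `g`, chosen here only for illustration. The map/zip law shown is a standard one and may differ from the combination discussed in the technical note.

```haskell
f, g :: Int -> Int
f = (+ 1)
g = (* 2)

-- Two traversals with an intermediate list ...
twoPasses :: [Int] -> [Int]
twoPasses = map f . map g

-- ... rewritten into a single traversal.
onePass :: [Int] -> [Int]
onePass = map (f . g)

-- A standard law combining map and zip:
-- zip (map f xs) (map g ys) == zipWith (\x y -> (f x, g y)) xs ys
pairTwoPasses :: [Int] -> [Int] -> [(Int, Int)]
pairTwoPasses xs ys = zip (map f xs) (map g ys)

pairOnePass :: [Int] -> [Int] -> [(Int, Int)]
pairOnePass = zipWith (\x y -> (f x, g y))
```

Both versions of each pair return the same result for every input. The rewriting is safe because `f` and `g` are pure, which is why the transformation semantics must be known before a workflow engine can apply it.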
… (NOT)
(Semi-)automatic
… a hot topic
(e.g., AI-style planning born-again; functional composition; query composition; … )
… a separate topic
Flow-based Programming, http://www.jpaulmorrison.com/fbp/index.shtm