UCSD

SAN DIEGO SUPERCOMPUTER CENTER

Ilkay Altintas

Scientific Workflow Automation Technologies

Provenance Collection Support in the Kepler Scientific Workflow System

Ilkay Altintas, Assistant Director, National Laboratory for Advanced Data Research; Manager, Scientific Workflow Automation Technologies Laboratory, San Diego Supercomputer Center, University of California, San Diego

Oscar Barney, Scientific Computing and Imaging Institute, The University of Utah

Efrat Jaeger-Frank, San Diego Supercomputer Center, University of California, San Diego

What is a scientific workflow?

What does the user want?

• “To get work done” and “Make hard things easy”

• How to do this?
  1. Combine tools with disparate strengths
  2. Make them work efficiently
  3. Focus on interfaces
  4. Enable consistent user interfaces

Real-time Weather Sensor Data Display Workflow

Basic Steps
1. Get real-time weather data from the ORB.
2. Convert this data into a graphical plot via image manipulation tools such as JAI, Java2D/3D, Gnuplot or Matlab.
3. Display the resulting weather plot images in Kepler.
4. Refresh the images so that they reflect the most recent data.

A very basic pipeline!
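
The four steps can be read as a fetch, plot, display, refresh loop. The following is a minimal Python sketch of that shape only, not the Kepler workflow itself: `fetch_weather`, `render_plot` and `run_pipeline` are hypothetical stand-ins, the ORB connection and the plotting tools are not modeled, and the readings are made-up sample values.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-ins for the four steps; none of these are Kepler or ORB APIs.
def fetch_weather(tick: int) -> List[float]:
    """Step 1: pretend to pull the latest temperature readings from a sensor feed."""
    return [20.0 + tick, 21.5 + tick, 19.0 + tick]

def render_plot(readings: Iterable[float]) -> str:
    """Step 2: turn readings into a 'plot' (here a one-line text bar chart)."""
    return " ".join("#" * int(r // 5) for r in readings)

def run_pipeline(refreshes: int, display: Callable[[str], None]) -> None:
    """Steps 3 and 4: display the plot, then refresh it with the newest data."""
    for tick in range(refreshes):
        display(render_plot(fetch_weather(tick)))

# Collect the displayed frames in a list instead of drawing them on screen.
frames: List[str] = []
run_pipeline(3, frames.append)
```

Passing `print` instead of `frames.append` would show each refreshed frame on the console.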

Promoter Identification Workflow

Scientific Workflow is a Set of Steps To…

• Combine different cyberinfrastructure (CI) technologies
  – To promote “scientific discovery” by providing tools and methods to generate scientific workflows
    • Often through an extensible and customizable graphical user interface
  – For scientists from different scientific domains
  – To support computational experiment creation, execution, sharing, reuse and provenance
  – To connect to existing data and integrate heterogeneous data from multiple resources in the efficient ways provided by a scientific workflow system
  – To bring CI to the user’s monitor!

Why do we need to track provenance in a scientific workflow system?

Because science is an evolving process…

“A process cannot be understood by stopping it. Understanding must move with the flow of the process, must join it and flow with it.”

(First Law of Mentat), Frank Herbert, Dune.

• Recreate results and rebuild workflows using the evolution information

• Associate the workflow with the results it produced

• Create links between generated data in different runs, and compare different runs

• Recover from a system failure
  – Checkpoint a workflow
  – Debug and explain results (via lineage tracing, …)

• Smart Reruns

Ptolemy II: A laboratory for investigating design

KEPLER: A problem-solving environment for Scientific Workflow

KEPLER = “Ptolemy II + X” for Scientific Workflows

Kepler is a Scientific Workflow System

• … and a cross-site collaboration

• 1st Beta release (Out next week…)

www.kepler-project.org

• Builds upon the open-source Ptolemy II framework

Kepler is a Team Effort

Ptolemy II

Resurgence

Griddles

SRB

LOOKING

BIRN

Cipres

NLADR

Contributor names and funding info are at the Kepler website!

Other contributors:
– Chesire (UK Text Mining Center)
– DART (Great Barrier Reef, Australia)
– National Digital Archives + UCSD-TV (US)

Vergil is the GUI for Kepler

• Actor ontology and semantic search for actors
• Search -> Drag and drop -> Link via ports
• Metadata-based search for datasets

[Figure labels: Actor Search, Data Search, Director, Actor]
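
To make the actor, port and director vocabulary concrete, here is a small Python sketch of the general idea: actors are steps with input ports, links connect one actor's output to another's input, and a director decides how the actors are fired. The `Actor` and `SequentialDirector` classes are hypothetical illustrations and are not the Kepler or Ptolemy II API; firing actors in insertion order is just one assumed execution policy.

```python
from typing import Callable, Dict, List, Tuple

class Actor:
    """A processing step with named input ports and a single output."""
    def __init__(self, name: str, inputs: List[str], fire: Callable[..., object]):
        self.name, self.inputs, self.fire = name, inputs, fire

class SequentialDirector:
    """Hypothetical director: fires actors in the order they were added."""
    def __init__(self) -> None:
        self.actors: List[Actor] = []
        # (target actor, input port) -> source actor whose output feeds that port
        self.links: Dict[Tuple[str, str], str] = {}

    def add(self, actor: Actor) -> None:
        self.actors.append(actor)

    def link(self, source: str, target: str, port: str) -> None:
        self.links[(target, port)] = source

    def run(self) -> Dict[str, object]:
        outputs: Dict[str, object] = {}
        for actor in self.actors:
            # Gather the value arriving on each input port, then fire the actor.
            args = [outputs[self.links[(actor.name, p)]] for p in actor.inputs]
            outputs[actor.name] = actor.fire(*args)
        return outputs

director = SequentialDirector()
director.add(Actor("source", [], lambda: [3, 1, 2]))
director.add(Actor("sort", ["in"], sorted))
director.add(Actor("first", ["in"], lambda xs: xs[0]))
director.link("source", "sort", "in")
director.link("sort", "first", "in")
results = director.run()
```

Keeping the firing policy in the director rather than in the actors means the same actors could be reused under a different execution policy.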

Actor Search

• Kepler Actor Ontology
• Used in searching actors and creating conceptual views (= folders)

Currently 160 Kepler actors added!

Data Search and Usage of Results

• Kepler DataGrid
  – Discovery of data resources through local and remote services
    • SRB, Grid and Web Services, database connections
  – Registry of datasets on the fly using workflows

Kepler System Architecture

[Architecture diagram components: Authentication; GUI (Vergil); SMS; Kepler Core Extensions; Ptolemy; Kepler GUI Extensions; Actor & Data Search; Type System Ext; Provenance Framework; Kepler Object Manager; Documentation; Smart Re-run / Failure Recovery]

Initial Work on the Provenance Framework

• OPTIONAL!
  – Modeled as a separate concern in the system
  – Listens to the execution and saves information customized by a set of parameters
    • Context: the who, what, where, when, and why associated with the run
    • Input data and its associated metadata
    • Workflow outputs and intermediate data products
    • Workflow definition (entities, parameters, connections): a specification of what exists in the workflow, which can have a context of its own
    • Information about the workflow evolution (the workflow trail)

• Types of provenance information:
  – Data provenance
    • Intermediate and end results, including files and database references
  – Process provenance
    • Keep the workflow definition with the data and parameters used in the run
  – Error and execution logs
  – Workflow design provenance
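
The kinds of information listed above can be pictured as one record per run. The following Python sketch is only an illustration under assumed names: `RunContext` and `ProvenanceRecord` are hypothetical, not Kepler classes, and the sample values (user, paths, timestamp) are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RunContext:
    """Who, what, where, when, and why of a run."""
    who: str
    what: str
    where: str
    when: str
    why: str

@dataclass
class ProvenanceRecord:
    """Hypothetical container for the kinds of information listed above."""
    context: RunContext
    workflow_definition: str                                      # process provenance
    parameters: Dict[str, str] = field(default_factory=dict)      # parameters used in the run
    inputs: Dict[str, str] = field(default_factory=dict)          # input data references
    intermediates: Dict[str, str] = field(default_factory=dict)   # intermediate data products
    outputs: Dict[str, str] = field(default_factory=dict)         # end results
    logs: List[str] = field(default_factory=list)                 # error and execution logs
    trail: List[str] = field(default_factory=list)                # workflow design changes

# Made-up sample values, for illustration only.
record = ProvenanceRecord(
    context=RunContext("alice", "promoter search", "lab-host", "2006-01-01T12:00", "test run"),
    workflow_definition="<workflow name='demo'/>",
)
record.parameters["threshold"] = "0.8"
record.outputs["plot"] = "results/plot.png"
record.trail.append("changed threshold from 0.5 to 0.8")
```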

Kepler Provenance Recording Utility

• Parametric and customizable
  – Different report formats
  – Variable levels of detail
    • Verbose-all, verbose-some, medium, on error
  – Multiple cache destinations

• Saves information on
  – User name, date, run, etc.
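
A recorder with variable levels of detail can be sketched as a listener that filters events before saving them. This Python sketch is an assumption-laden illustration: the `Detail` and `ProvenanceRecorder` names, the `importance` scale (0 to 2) and the meaning given to each level are hypothetical and do not come from the Kepler utility.

```python
from enum import Enum
from typing import List

class Detail(Enum):
    """Level names follow the slide; the numeric ordering is an assumption."""
    VERBOSE_ALL = 3
    VERBOSE_SOME = 2
    MEDIUM = 1
    ON_ERROR = 0

class ProvenanceRecorder:
    """Hypothetical listener that keeps events according to a detail level."""
    def __init__(self, detail: Detail, user: str, run_id: str):
        self.detail = detail
        self.entries: List[str] = [f"user={user}", f"run={run_id}"]

    def on_event(self, actor: str, message: str, importance: int, error: bool = False) -> None:
        # importance is assumed to range from 0 (trivial) to 2 (important).
        # Errors are always kept; other events need importance >= 3 - level.
        if error or importance >= 3 - self.detail.value:
            self.entries.append(f"{actor}: {message}")

quiet = ProvenanceRecorder(Detail.ON_ERROR, user="alice", run_id="run-1")
quiet.on_event("Plotter", "fired", importance=2)
quiet.on_event("Plotter", "missing input file", importance=2, error=True)

verbose = ProvenanceRecorder(Detail.VERBOSE_ALL, user="alice", run_id="run-2")
verbose.on_event("Plotter", "token received", importance=0)
```

With `ON_ERROR` only the failure is recorded, while `VERBOSE_ALL` keeps even the low-importance event.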

What other system functions does provenance relate to in Kepler?

• Failure recovery
• Smart re-runs
• Semantic extensions
• Kepler Data Grid
• Reporting and Documentation
• Authentication
• Data registration

Re-run only the updated/failed parts

Guided documentation generation and updates

“Smart” Re-runs

• Instead of running a workflow from scratch, re-run only the parts of the workflow that have not been run before
  – Example: change a parameter downstream, and the actors that lead up to the one with the parameter change are not re-run

• Especially useful:
  – In visualization pipelines
  – In long-running workflows
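
The parameter-change example amounts to a graph traversal: the changed actor and everything downstream of it must run again, and everything upstream can be skipped. Below is a minimal Python sketch of that reasoning; the `actors_to_rerun` function and the four-actor graph are hypothetical and are not taken from Kepler.

```python
from collections import deque
from typing import Dict, List, Set

def actors_to_rerun(downstream: Dict[str, List[str]], changed: str) -> Set[str]:
    """Return the changed actor plus everything reachable downstream of it."""
    dirty: Set[str] = {changed}
    queue = deque([changed])
    while queue:
        for nxt in downstream.get(queue.popleft(), []):
            if nxt not in dirty:
                dirty.add(nxt)
                queue.append(nxt)
    return dirty

# Hypothetical four-actor pipeline: read -> filter -> plot -> display
graph = {"read": ["filter"], "filter": ["plot"], "plot": ["display"]}
rerun = actors_to_rerun(graph, "plot")  # "read" and "filter" are left alone
```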

“Smart” Re-runs

• Uses VisTrails’ cache manager algorithm*

• Idea:
  – To re-run as little of the network as possible by combining intermediate results from different workflow runs
  – Past results are stored in a provenance store (currently a cache)

• Queries and recreates the input to actors that need to be re-fired

* L. Bavoil, et al. VisTrails: Enabling Interactive Multiple-View Visualizations. IEEE Visualization, 2005.
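
One way to picture the idea of reusing intermediate results is to key cached outputs by a signature built from each actor's name, its parameters, and the signatures of its upstream actors. The Python sketch below illustrates that general caching idea only; it is not the VisTrails cache manager or Kepler's implementation, and `signature`, `run` and `make_steps` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# (function, parameters, names of upstream actors)
Step = Tuple[Callable[..., object], Dict[str, object], List[str]]

def signature(name: str, steps: Dict[str, Step]) -> str:
    """Hash of the actor name, its parameters, and its upstream signatures.
    The function body itself is not hashed in this simplified sketch."""
    _, params, upstream = steps[name]
    parts = [name, repr(sorted(params.items()))] + [signature(u, steps) for u in upstream]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

def run(name: str, steps: Dict[str, Step], cache: Dict[str, object], fired: List[str]) -> object:
    """Return the actor's output, firing it only if no cached result matches."""
    key = signature(name, steps)
    if key not in cache:
        func, params, upstream = steps[name]
        inputs = [run(u, steps, cache, fired) for u in upstream]
        cache[key] = func(*inputs, **params)
        fired.append(name)
    return cache[key]

def make_steps(scale: int) -> Dict[str, Step]:
    return {
        "load": (lambda: [1, 2, 3], {}, []),
        "scale": (lambda xs, factor: [x * factor for x in xs], {"factor": scale}, ["load"]),
    }

cache: Dict[str, object] = {}
first: List[str] = []
second: List[str] = []
run("scale", make_steps(2), cache, first)             # fires load, then scale
result = run("scale", make_steps(10), cache, second)  # only scale fires again
```

Because the cache is shared across runs, the second run reuses the output of `load` from the first run and re-fires only the actor whose parameter changed.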

What is needed for “Smart” Re-runs?

• Need to keep track of
  – what has been done before
  – which actors have been given which inputs, and with what outputs

• Uses the stored provenance data
  – From the cache

Next Steps

• Deciding on terms and definitions for all Kepler
• A relational schema for the provenance info in addition to the existing XML
• Collect data/metadata in different formats
• .kar file generation, registration and search for provenance information
• Adding provenance repositories
• Automatic report generation from accumulated data
• A GUI to keep track of the changes
• Continue work on the “Smart” Re-runs system
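
As an illustration of what a relational schema for provenance information could look like, here is a small sketch using Python's built-in `sqlite3` module. The table and column names (`run`, `firing`, `input_ref`, `output_ref`) and the sample rows are assumptions made for this example; they do not describe the schema the Kepler team planned.

```python
import sqlite3

# Hypothetical two-table schema: one row per run, one row per actor firing.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE run (
    run_id   INTEGER PRIMARY KEY,
    username TEXT NOT NULL,
    started  TEXT NOT NULL
);
CREATE TABLE firing (
    firing_id  INTEGER PRIMARY KEY,
    run_id     INTEGER NOT NULL REFERENCES run(run_id),
    actor      TEXT NOT NULL,
    input_ref  TEXT,
    output_ref TEXT
);
""")

# Made-up sample data.
conn.execute("INSERT INTO run VALUES (1, 'alice', '2006-01-01T12:00')")
conn.executemany(
    "INSERT INTO firing (run_id, actor, input_ref, output_ref) VALUES (?, ?, ?, ?)",
    [(1, "read", None, "raw.dat"), (1, "plot", "raw.dat", "plot.png")],
)

# Lineage query: which actor produced plot.png, and from what input?
lineage = conn.execute(
    "SELECT actor, input_ref FROM firing WHERE output_ref = ?", ("plot.png",)
).fetchall()
```

Storing firings as rows makes lineage questions plain SQL queries, which is one possible motivation for a relational store alongside XML.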

To Sum Up

• … is an open-source system and collaboration

• Kepler provenance framework and smart rerun manager are in their initial steps
  – Aims to support different scientific domains
    • Saving data in different repositories and metadata formats
  – Successful results in the initial runs

• Short demonstration…

Ilkay Altintas
+1 (858) 822-5453
http://www.sdsc.edu

Questions…

Thanks!