An Introduction to Scientific Workflows


Page 1: An Introduction to Scientific Workflows

An Introduction to Scientific Workflows

Dr. Shiyong Lu

[email protected]

Department of Computer Science

Wayne State University

Page 2: An Introduction to Scientific Workflows

The scientific workflow paradigm

• Workflows are used to automate various data analysis tasks, which might produce further data;

• Provenance is captured automatically to record the history of workflow evolution and data derivation so that data and workflows can be reproduced when necessary.

Page 3: An Introduction to Scientific Workflows

Growing areas of scientific workflow applications

• Bioinformatics

• Neuroinformatics

• Oceanography

• Astronomy

Page 4: An Introduction to Scientific Workflows

What is a scientific workflow?

• A scientific workflow is a formal specification of a scientific process, which represents, streamlines, and automates the steps from dataset selection and integration, computation and analysis, to final data product presentation and visualization.

• An artifact in its own right, which scientists can patent, reuse, publish, and share (myexperiment.org).

• Who-discovers-first might become who-comes-up-with-the-right-scientific-workflow-first!

Page 5: An Introduction to Scientific Workflows

A scientific workflow example

(Lin et al., 2008)

Page 6: An Introduction to Scientific Workflows

A more complex scientific workflow example

(Alhayafi et al., 2008)

Page 7: An Introduction to Scientific Workflows

What is a Scientific Workflow Management System?

• A scientific workflow management system (SWFMS) is a system that supports the specification, modification, execution, re-execution, and monitoring of scientific workflows.

• Supports a high-level workflow specification language (e.g., WSL/TSL, XSCUFL)

• An end user describes their workflows using that language, typically with a graphical workflow designer.

• The SWFMS interprets a workflow written in that language through a coordinated execution of the tasks in the workflow.

Page 8: An Introduction to Scientific Workflows

Major components of a SWFMS

(Lin et al., 2008)

Page 9: An Introduction to Scientific Workflows

Workflow design

• Workflow design can be performed with the assistance of any workflow design tool, typically with a graphical user interface for the ease of manipulation by scientists;

• The resulting scientific workflow is usually represented in a scientific workflow specification language (e.g., SWL, XSCUFL, MOML).

• A standard scientific workflow language has yet to emerge, which is one major reason for the poor interoperability among SWFMSs today.

Page 10: An Introduction to Scientific Workflows

Workflow enactment

• The workflow engine performs workflow enactment: it creates a workflow case and schedules task invocations in an order determined by the workflow logic.

• The movement of controlflows and dataflows;

• Workflow status management;

• Provenance collection: workflow evolution and data derivation history.
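As a rough illustration of enactment, the sketch below treats a workflow as a set of tasks with data dependencies, creates one run (a "workflow case"), and invokes the tasks in an order consistent with the workflow logic while tracking their status. The names `enact`, `workflow`, and the three toy tasks are hypothetical and are not taken from any particular SWFMS; Python's standard `graphlib.TopologicalSorter` stands in for a real scheduler.

```python
from graphlib import TopologicalSorter


def enact(workflow, inputs):
    """Run one workflow case: invoke each task once its inputs are available.

    `workflow` maps a task name to (function, list of upstream names).
    Upstream names refer either to other tasks or to keys of `inputs`.
    """
    # Dependency graph restricted to task-to-task edges.
    graph = {
        name: [d for d in deps if d in workflow]
        for name, (_, deps) in workflow.items()
    }
    data = dict(inputs)                       # data products available so far
    status = {name: "pending" for name in workflow}
    order = []                                # invocation order, for inspection
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        status[name] = "running"
        data[name] = func(*(data[d] for d in deps))
        status[name] = "done"
        order.append(name)
    return data, status, order


# Hypothetical three-step analysis: select -> normalize -> summarize.
workflow = {
    "select": (lambda raw: [x for x in raw if x >= 0], ["raw"]),
    "normalize": (lambda xs: [x / max(xs) for x in xs], ["select"]),
    "summarize": (lambda xs: sum(xs) / len(xs), ["normalize"]),
}

data, status, order = enact(workflow, {"raw": [4, -1, 2, 8]})
```

A real engine would dispatch independent tasks concurrently and persist the status table; the sequential loop here only shows how the dataflow determines a valid invocation order.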

Page 11: An Introduction to Scientific Workflows

Task management

• It is responsible for the resource provisioning, scheduling, and monitoring of task execution;

• Abstractions of various local and remote heterogeneous services and software tools as workflow tasks;

• Abstraction of a subworkflow as a composite task (with internal implementation hidden);

• Registration, annotation, and searching of tasks.
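One way to picture these abstractions is the minimal sketch below: an atomic task wraps a function (standing in for a local tool or a remote service), a composite task wraps a subworkflow behind the same interface, and a small registry supports annotation and search. All class and task names (`Task`, `CompositeTask`, `TaskRegistry`, `clean`, `scale`, `prepare`) are hypothetical and chosen only for illustration.

```python
class Task:
    """An atomic task wrapping a local function or a remote service call."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def run(self, value):
        return self.func(value)


class CompositeTask(Task):
    """A subworkflow exposed as a single task; its steps stay hidden."""

    def __init__(self, name, steps):
        self.name = name
        self._steps = list(steps)

    def run(self, value):
        # Callers see one task; internally each step feeds the next.
        for step in self._steps:
            value = step.run(value)
        return value


class TaskRegistry:
    """Registration, annotation, and keyword search of tasks."""

    def __init__(self):
        self._entries = {}

    def register(self, task, *annotations):
        self._entries[task.name] = (task, set(annotations))

    def search(self, keyword):
        return sorted(
            name for name, (_, tags) in self._entries.items() if keyword in tags
        )


clean = Task("clean", lambda xs: [x for x in xs if x is not None])
scale = Task("scale", lambda xs: [x * 10 for x in xs])
prepare = CompositeTask("prepare", [clean, scale])

registry = TaskRegistry()
registry.register(clean, "preprocessing")
registry.register(prepare, "preprocessing", "composite")
result = prepare.run([1, None, 3])
```

Because `CompositeTask` shares the `run` interface of `Task`, a workflow can nest subworkflows to any depth without the caller knowing how a given task is implemented.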

Page 12: An Introduction to Scientific Workflows

Data product management

• Responsible for managing the large amounts of source, intermediate, and final data products produced by the execution of scientific workflows.

• Registration, annotation, searching, and replication of data products for reuse, publishing, sharing, presentation, and visualization.

• Representation and location transparency: abstract various heterogeneous and distributed data sets as data products.

• A petascale data product system is needed for large-scale data-intensive scientific workflows.
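The registration, replication, and location-transparency ideas above can be sketched as a small in-memory catalog. This is only an illustration under stated assumptions: `DataProduct`, `DataProductCatalog`, the annotation keys, and the example locations are all hypothetical, and a real system would back the catalog with a database and a distributed storage layer.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    product_id: str
    kind: str                                       # "source", "intermediate", or "final"
    replicas: list = field(default_factory=list)    # physical locations
    annotations: dict = field(default_factory=dict)


class DataProductCatalog:
    def __init__(self):
        self._products = {}

    def register(self, product_id, kind, location, **annotations):
        product = DataProduct(product_id, kind, [location], dict(annotations))
        self._products[product_id] = product
        return product

    def replicate(self, product_id, location):
        self._products[product_id].replicas.append(location)

    def resolve(self, product_id):
        """Return one physical location; callers only know the logical id."""
        return self._products[product_id].replicas[0]

    def search(self, **criteria):
        return sorted(
            pid for pid, p in self._products.items()
            if all(p.annotations.get(k) == v for k, v in criteria.items())
        )


catalog = DataProductCatalog()
# Locations below are made-up placeholders, not real endpoints.
catalog.register("scan01", "source", "file:///data/raw/scan01.dat", domain="neuro")
catalog.register("scan01_mask", "intermediate", "file:///data/tmp/mask01.dat", domain="neuro")
catalog.replicate("scan01", "remote-site-b:/archive/scan01.dat")
matches = catalog.search(domain="neuro")
location = catalog.resolve("scan01")
```

Tasks refer to `"scan01"` rather than to a path, so the catalog is free to move or replicate the underlying data without changing any workflow.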

Page 13: An Introduction to Scientific Workflows

Provenance management

•Scientific workflow provenance is one kind of metadata that captures the derivation history of a data product, including the original data sources, intermediate data products, and the workflow tasks that were applied to produce a data product.

•Capturing provenance is critical for scientific workflows to support reproducibility, result interpretation, and problem diagnosis.

•Provenance management concerns the efficiency and effectiveness of recording, representing, storing, querying, and visualizing provenance.
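To make the notion of derivation history concrete, here is a minimal sketch that records one entry per task invocation and answers two typical provenance queries: which steps led to a data product, and which original data sources it depends on. `ProvenanceStore`, `Derivation`, and the file and task names are hypothetical; real systems use richer provenance models and persistent storage.

```python
from collections import namedtuple

# One record per task invocation: which inputs produced which output.
Derivation = namedtuple("Derivation", ["output", "task", "inputs"])


class ProvenanceStore:
    def __init__(self):
        self._by_output = {}

    def record(self, output, task, inputs):
        self._by_output[output] = Derivation(output, task, tuple(inputs))

    def lineage(self, product):
        """Return every derivation that contributed to `product`,
        with each step listed after the steps it depends on."""
        seen, ordered = set(), []

        def visit(name):
            rec = self._by_output.get(name)
            if rec is None or name in seen:   # original source or already visited
                return
            seen.add(name)
            for parent in rec.inputs:
                visit(parent)
            ordered.append(rec)

        visit(product)
        return ordered

    def sources(self, product):
        """Original data sources: inputs that no recorded task produced."""
        found = set()
        for rec in self.lineage(product):
            found.update(i for i in rec.inputs if i not in self._by_output)
        return sorted(found)


store = ProvenanceStore()
store.record("aligned", "align", ["reads.fq", "genome.fa"])
store.record("variants", "call_variants", ["aligned", "genome.fa"])
store.record("report", "summarize", ["variants"])

steps = [rec.task for rec in store.lineage("report")]
origins = store.sources("report")
```

Replaying `steps` against `origins` is what makes a result reproducible, and inspecting the same lineage is the starting point for diagnosing a suspicious data product.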

Page 14: An Introduction to Scientific Workflows

Workflow monitoring

•The monitoring of the progress and status of the execution of scientific workflows is very important, particularly for long-running scientific workflows.

•Because scientific workflows can be changed dynamically by end users and can orchestrate heterogeneous services over unreliable networks, many exceptions and failures might occur.

•The complexity and scale of data analysis and computation in scientific workflows impose additional challenges on workflow monitoring and failure handling.
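A very small sketch of status tracking and failure handling follows, assuming a simple retry policy. The function `run_with_monitoring`, the status labels, and the simulated flaky service are all hypothetical; production systems typically add timeouts, backoff, checkpointing, and user-facing dashboards.

```python
def run_with_monitoring(tasks, max_attempts=3):
    """Run tasks in order, logging status changes and retrying failures.

    `tasks` is a list of (name, zero-argument callable) pairs.
    Returns (status, events); execution stops at the first task that
    still fails after `max_attempts` tries.
    """
    status = {name: "pending" for name, _ in tasks}
    events = []                      # (task, attempt, outcome) log for a monitor
    for name, func in tasks:
        for attempt in range(1, max_attempts + 1):
            status[name] = "running"
            try:
                func()
            except Exception as exc:
                events.append((name, attempt, f"failed: {exc}"))
                status[name] = "failed"
            else:
                events.append((name, attempt, "succeeded"))
                status[name] = "succeeded"
                break
        if status[name] == "failed":
            break                    # downstream tasks stay "pending"
    return status, events


# Hypothetical flaky remote service: fails twice, then answers.
calls = {"n": 0}


def flaky_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("timeout")


status, events = run_with_monitoring(
    [("fetch", flaky_service), ("analyze", lambda: None)]
)
```

The `events` log is what a monitoring view would display for a long-running workflow, and the `pending` status left on downstream tasks shows exactly where a failed run can be resumed.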

Page 15: An Introduction to Scientific Workflows

Scientific workflows vs. business workflows: a user’s perspective

1. Dataflow-oriented vs. controlflow-oriented.

2. Reproducible vs. non-reproducible.

3. Data-centric vs. business-centric => data parallelism vs. concurrency control (ACID)?

4. Scalable vs. correct. Scientific workflows use a trial-and-error approach in which an error is also “a success” in its own right, so scalability is the main concern; business workflows cannot tolerate errors, which often result in economic loss.

5. Explicit vs. implicit.

6. Mutable vs. immutable.

7. Static vs. dynamic binding to resources.

Page 16: An Introduction to Scientific Workflows

Scientific workflows vs. business workflows: an architectural perspective

Business workflows (Hollingsworth, 1995) vs. scientific workflows (Lin et al., 2008)

Page 17: An Introduction to Scientific Workflows

Scientific workflows vs. business workflows: a workflow language perspective

•Visual programming-in-the-large or not?

•Dataflow programming model vs. imperative programming model

•Dataflow constructs vs. controlflow constructs

•Hierarchical workflow composition

•Single assignment property?

•Physical and logical data models

•Task and workflow level exception handling
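The dataflow model and the single assignment property listed above can be illustrated with a short sketch, under the assumption that each data product is bound exactly once and that a task fires as soon as all of its inputs are bound. `SingleAssignmentStore`, `run_dataflow`, and the arithmetic tasks are hypothetical names used only for this example.

```python
class SingleAssignmentStore:
    """Data products that can be bound once and never overwritten."""

    def __init__(self):
        self._values = {}

    def bind(self, name, value):
        if name in self._values:
            raise ValueError(f"{name!r} is already bound")
        self._values[name] = value

    def __contains__(self, name):
        return name in self._values

    def __getitem__(self, name):
        return self._values[name]


def run_dataflow(tasks, store):
    """Fire any task whose inputs are all bound, until none can fire.

    `tasks` maps an output name to (function, input names). No explicit
    ordering is given: availability of data alone drives execution.
    """
    fired = []
    progress = True
    while progress:
        progress = False
        for output, (func, inputs) in tasks.items():
            if output not in store and all(i in store for i in inputs):
                store.bind(output, func(*(store[i] for i in inputs)))
                fired.append(output)
                progress = True
    return fired


store = SingleAssignmentStore()
store.bind("a", 2)
store.bind("b", 3)

# "total" is listed first but can only fire after "sum" and "product".
tasks = {
    "total": (lambda s, p: s + p, ["sum", "product"]),
    "sum": (lambda a, b: a + b, ["a", "b"]),
    "product": (lambda a, b: a * b, ["a", "b"]),
}
fired = run_dataflow(tasks, store)
```

In an imperative language the statements would run in the order written; here the listing order of `tasks` is irrelevant, and because no product is ever rebound, `sum` and `product` could safely run in parallel.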