PhD Thesis: Mining abstractions in scientific workflows

Date: 03/12/2015 Mining Abstractions in Scientific Workflows Daniel Garijo * Supervisors: Oscar Corcho *, Yolanda Gil Ŧ * Universidad Politécnica de Madrid, Ŧ USC Information Sciences Institute

Transcript of PhD Thesis: Mining abstractions in scientific workflows

Page 1: PhD Thesis: Mining abstractions in scientific workflows

Date: 03/12/2015

Mining Abstractions in Scientific Workflows

Daniel Garijo *

Supervisors: Oscar Corcho *, Yolanda Gil Ŧ

* Universidad Politécnica de Madrid, Ŧ USC Information Sciences Institute

Page 2: PhD Thesis: Mining abstractions in scientific workflows

Introduction

Lab book

Digital Log

Laboratory Protocol (recipe)

Scientific Workflow

Experiment

In silico experiment

Page 3: PhD Thesis: Mining abstractions in scientific workflows

Benefits of workflows

Time savings
• Copy & paste fragments of workflows

Teaching
• Reduce the learning curve of new students

Visualization
• Simplify workflows

Design for modularity
• Highlight the most relevant steps on a workflow

Design for standardization

Debugging

• Provenance exploration

Reproducibility and inspectability

Page 4: PhD Thesis: Mining abstractions in scientific workflows

Motivation of this work

Workflow Repositories

Workflow Systems

Let’s Share!

I want to reuse…

?

I want to understand…?

I want to repurpose…

?

Page 5: PhD Thesis: Mining abstractions in scientific workflows

Open research challenges

•Workflow representation heterogeneity

Workflow Repositories

How can we represent a description of workflows and their metadata?

How can we facilitate the homogeneous consumption of workflows and their resources?

Page 6: PhD Thesis: Mining abstractions in scientific workflows

Open research challenges

•Workflow representation heterogeneity

•Inadequate level of workflow abstraction

What are the most relevant parts of a workflow

Dataset

PorterStemmer

Result

IDF

FinalResult

Dataset

LovinsStemmer

Result

ResidualIDF

FinalResult

Dataset

Stemmer

Result

Term Weighting

FinalResult

Are two seemingly disparate workflows related at a higher level of abstraction?

Page 7: PhD Thesis: Mining abstractions in scientific workflows

Open research challenges

•Workflow representation heterogeneity

•Inadequate level of workflow abstraction

•Difficulties for workflow reuse

How is a workflow related to other workflows?

Which workflow (parts) are potentially useful for reuse?

???

Page 8: PhD Thesis: Mining abstractions in scientific workflows

Open research challenges

•Workflow representation heterogeneity

•Inadequate level of workflow abstraction

•Difficulties for workflow reuse

•Lack of support for workflow annotation

+ +

How can we facilitate the annotation process?

Page 9: PhD Thesis: Mining abstractions in scientific workflows

Outline

1. Introduction and motivation

2. Hypothesis and contributions

3. Workflow representation: Open Provenance Model for Workflows

4. Workflow abstraction and reuse

5. Mining abstractions from workflows using graph mining techniques

6. Evaluation

7. Conclusions and future work

Page 10: PhD Thesis: Mining abstractions in scientific workflows

Hypothesis

Scientific workflow repositories can be automatically analyzed to extract commonly occurring patterns and abstractions that are useful for workflow developers aiming to reuse existing workflows.

• H.1: It is possible to define a catalog of common domain-independent patterns based on the common functionality of workflow steps.

• H.2: It is possible to detect commonly occurring patterns and abstractions automatically.

• H.3: Commonly occurring patterns are potentially useful for users designing workflows.

Workflow abstraction

Workflow representation

Workflow reuse

Workflow annotation

Workflow reuse

Page 11: PhD Thesis: Mining abstractions in scientific workflows

Contributions

Workflow representation and publication

Model for representing workflow templates and executions

Workflow abstraction

Methodology to publish workflows on the Web

Workflow annotation

A model and means for semi-automatically annotating the abstractions in workflows

A catalog of common domain independent workflow patterns based on the functionality of workflow steps

A method to extract generic commonly occurring workflow fragments automatically

Workflow reuse

Metrics for assessing the usefulness of a fragment for reuse

A model to describe and annotate workflow fragments

OPMW

Linked Data

Wf-motifs

Wf-fd

Workflow motifs

Graph mining

Page 12: PhD Thesis: Mining abstractions in scientific workflows

Outline

1. Introduction and motivation

2. Hypothesis and contributions

3. Workflow representation: Open Provenance Model for Workflows
a) Requirements
b) The OPMW model
c) Publishing workflows as Linked Data

4. Workflow abstraction and reuse

5. Mining abstractions from workflows using graph mining techniques

6. Evaluation

7. Conclusions and future work

Page 13: PhD Thesis: Mining abstractions in scientific workflows

Workflow representation: Structures interchanged in the workflow lifecycle

[Figure: the same text analytics workflow at the three stages of the workflow lifecycle.

Design: Workflow Template (Dataset → Stemmer algorithm → Result → Term weighting algorithm → FinalResult).

Instantiation: Workflow Instances that bind the template to concrete files and algorithms (File: Dataset123 → LovinsStemmer algorithm → Id:resultaa1 → IDF algorithm → Id:fresultaa2; File: Dataset124 → PorterStemmer algorithm → Id:resultaa1 → IDF algorithm → Id:fresultaa2).

Execution: Workflow Execution Traces, one per run of an instance (File: Dataset123 → LovinsStemmer execution → Id:resultaa1 → IDF execution → Id:fresultaa2; File: Dataset124 → PorterStemmer execution → Id:resultaa1 → IDF execution → Id:fresultaa2).]

Page 14: PhD Thesis: Mining abstractions in scientific workflows

Requirements

Workflow template description
Plan: P-Plan [Garijo et al 2012] http://purl.org/net/p-plan

Workflow execution trace description
Provenance: PROV (W3C) [Lebo et al 2013] http://www.w3.org/ns/prov#

Workflow attribution
Dublin Core, PROV (W3C)

Workflow metadata

Link between templates and executions

Related languages and models shown in the figure, with their citations:

• Scufl [Oinn et al 2004], DAX, AGWL [Fahringer et al 2005], Dispel [Atkinson et al 2013], IWIR [Plankensteiner et al 2005]

• OPM [Moreau et al 2011], OBI [Brinkman et al 2010], EXPO [Soldatova and King 2006], ISA [Rocca et al 2008], PAV [Ciccarese et al 2013], RO [Belhajjame et al 2012], D-PROV [Missier et al 2013]

Page 15: PhD Thesis: Mining abstractions in scientific workflows

OPMW: Extending provenance standards and plan models

template1

opmw:isVariableOfTemplate

opmw:isVariableOfTemplate

Input Dataset

Term Weighting

Topics

p-plan:isOutputVarOf

p-plan:hasInputVar

opmw:isStepOfTemplate

opmw:correspondsToTemplate

opmw:correspondsToTemplateArtifact

opmw:correspondsToTemplateProcess

opmw:correspondsToTemplateArtifact

opmw:WorkflowExecutionProcess

opmw:WorkflowExecutionAccount

prov:Entity

prov:Activity

prov:Bundle

PROV, OPM Extension

opmv:Artifact

opmo:Account

opmv:Process

opmw:WorkflowExecutionArtifact

opmw:WorkflowTemplateArtifact

opmw:WorkflowTemplateProcess

opmw:WorkflowTemplate

p-plan:Plan

p-plan:Step

p-plan:Variable

P-Plan extension

Legend: Class, Object property, Instance of, Instance, Subclass of

execution1

File: Dataset123

IDF(java)

File: FResultaa2

prov:wasGeneratedBy

prov:used

opmo:account

opmo:account

opmo:account

http://www.opmw.org/ontology/
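The link between a template and one of its executions, as labelled in the figure above, can be sketched with plain subject-predicate-object tuples. This is only an illustration: it uses Python tuples instead of an RDF library, the `ex:` resource names are hypothetical, and the direction in which each property is written is an assumption read off the figure labels.

```python
# Illustrative triples only. Resource names under "ex:" are hypothetical.
TEMPLATE = [
    ("ex:termWeighting", "rdf:type", "opmw:WorkflowTemplateProcess"),
    ("ex:termWeighting", "opmw:isStepOfTemplate", "ex:template1"),
    ("ex:inputDataset", "opmw:isVariableOfTemplate", "ex:template1"),
    ("ex:topics", "opmw:isVariableOfTemplate", "ex:template1"),
    ("ex:termWeighting", "p-plan:hasInputVar", "ex:inputDataset"),
    ("ex:topics", "p-plan:isOutputVarOf", "ex:termWeighting"),
]

EXECUTION = [
    ("ex:execution1", "opmw:correspondsToTemplate", "ex:template1"),
    ("ex:idfJava", "opmw:correspondsToTemplateProcess", "ex:termWeighting"),
    ("ex:dataset123", "opmw:correspondsToTemplateArtifact", "ex:inputDataset"),
    ("ex:fresultaa2", "opmw:correspondsToTemplateArtifact", "ex:topics"),
    ("ex:idfJava", "prov:used", "ex:dataset123"),
    ("ex:fresultaa2", "prov:wasGeneratedBy", "ex:idfJava"),
]


def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is asserted."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def template_step_for(triples, execution_process):
    """Follow the execution-to-template link for one executed step."""
    found = objects(triples, execution_process, "opmw:correspondsToTemplateProcess")
    return found[0] if found else None


# From an executed step, recover the abstract step and its declared inputs.
step = template_step_for(EXECUTION, "ex:idfJava")
inputs = objects(TEMPLATE, step, "p-plan:hasInputVar")
```

Starting from the executed component, the two lookups recover the template step it implements and the input variable that step declares.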

Page 16: PhD Thesis: Mining abstractions in scientific workflows

Outline

1. Introduction and motivation

2. Hypothesis and work methodology

3. Workflow representation: OPMW
a) Requirements
b) The OPMW model
c) Publishing workflows as Linked Data

4. Workflow abstraction and reuse

5. Mining abstractions from workflows using graph mining techniques

6. Evaluation

7. Conclusions and future work

Page 17: PhD Thesis: Mining abstractions in scientific workflows

Publishing workflows as Linked Data

Specification

Why Linked Data?
• Facilitates exploitation of workflow resources in a homogeneous manner

Adapted methodology from [Villazón-Terrazas et al 2011]

Tested for the Wings workflow system

1

Base URI = http://www.opmw.org/
Ontology URI = http://www.opmw.org/ontology/
Assertion URI = http://www.opmw.org/export/resource/ClassName/instanceName

Examples:
http://www.opmw.org/export/resource/WorkflowTemplate/ABSTRACTSUBWFDOCKING
http://www.opmw.org/export/resource/WorkflowExecutionAccount/ACCOUNT1348629350796
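The URI scheme above can be expressed as a small helper. This is a minimal sketch: the function name `assertion_uri` is hypothetical, and percent-encoding the two path segments is an added precaution, not something stated on the slide.

```python
from urllib.parse import quote

BASE_URI = "http://www.opmw.org/"
ONTOLOGY_URI = BASE_URI + "ontology/"
RESOURCE_URI = BASE_URI + "export/resource/"


def assertion_uri(class_name: str, instance_name: str) -> str:
    """Build <base>/export/resource/ClassName/instanceName."""
    # quote() leaves plain alphanumeric names untouched (assumed precaution).
    return f"{RESOURCE_URI}{quote(class_name, safe='')}/{quote(instance_name, safe='')}"


template_uri = assertion_uri("WorkflowTemplate", "ABSTRACTSUBWFDOCKING")
account_uri = assertion_uri("WorkflowExecutionAccount", "ACCOUNT1348629350796")
```

With the two instance names from the slide, the helper reproduces the two example URIs.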

Page 18: PhD Thesis: Mining abstractions in scientific workflows

Publishing workflows as Linked Data

Specification Modeling

Why Linked Data?
• Facilitates exploitation of workflow resources in a homogeneous manner

Adapted methodology from [Villazón-Terrazas et al 2011]

Tested for the Wings workflow system

1 2

OPMW

P-Plan

OPM DC

PROV

Page 19: PhD Thesis: Mining abstractions in scientific workflows

Publishing workflows as Linked Data

Specification Modeling Generation

Why Linked Data?
• Facilitates exploitation of workflow resources in a homogeneous manner

Adapted methodology from [Villazón-Terrazas et al 2011]

Tested for the Wings workflow system

1 2 3

Workflow system

Workflow Template

Workflow execution

OPMW export

OPMW RDF

Page 20: PhD Thesis: Mining abstractions in scientific workflows

Publishing workflows as Linked Data

Specification Modeling Generation Publication

Why Linked Data?
• Facilitates exploitation of workflow resources in a homogeneous manner

Adapted methodology from [Villazón-Terrazas et al 2011]

Tested for the Wings workflow system

1 2 3 4

RDF triple store

Permanent web-accessible file store

RDF Upload Interface

SPARQL Endpoint

OPMW RDF

Page 21: PhD Thesis: Mining abstractions in scientific workflows

Publishing workflows as Linked Data

Specification Modeling Generation Publication

Why Linked Data?
• Facilitates exploitation of workflow resources in a homogeneous manner

Adapted methodology from [Villazón-Terrazas et al 2011]

Tested for the Wings workflow system

1 2 3 4

Exploitation

5

Curl, Linked Data browser, Workflow Explorer

SPARQL endpoint
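As an illustration of the exploitation step, the sketch below builds the GET request a client such as curl could send to a SPARQL endpoint to list published workflow templates. The endpoint address is a hypothetical placeholder, and the query is an assumed example, not one taken from the thesis.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; replace with the address of a real endpoint.
ENDPOINT = "http://example.org/sparql"

# Assumed example query: list up to ten resources typed as workflow templates.
QUERY = """
PREFIX opmw: <http://www.opmw.org/ontology/>
SELECT ?template WHERE {
  ?template a opmw:WorkflowTemplate .
} LIMIT 10
"""


def query_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = query_url(ENDPOINT, QUERY)
```

The resulting `url` can be passed to curl or opened in a browser. Nothing is sent over the network by the snippet itself.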

Page 22: PhD Thesis: Mining abstractions in scientific workflows

Outline

1. Introduction and motivation

2. Hypothesis and contributions

3. Workflow representation: Open Provenance Model for Workflows

4. Workflow abstraction and reuse
a) A catalog of common workflow abstractions
b) Workflow reuse analysis

5. Mining abstractions from workflows using graph mining techniques

6. Evaluation

7. Conclusions and future work

Page 23: PhD Thesis: Mining abstractions in scientific workflows

A catalog of common workflow abstractions

Generalization of workflow steps based on functionality.

Workflow motif: Domain independent conceptual abstraction on the workflow steps.

1. Data-oriented motifs: What kind of manipulations does the workflow have?

E.g.:
• Data retrieval
• Data preparation
• Data curation
• Data visualization
• etc.

Page 24: PhD Thesis: Mining abstractions in scientific workflows

A catalog of common workflow abstractions

Generalization of workflow steps based on functionality.

Workflow motif: Domain independent conceptual abstraction on the workflow steps.

1. Data-oriented motifs: What kind of manipulations does the workflow have?

E.g.:
• Data retrieval
• Data preparation
• etc.

2. Workflow-oriented motifs: How does the workflow perform its operations?

E.g.:
• Stateful steps
• Stateless steps
• Human interactions
• etc.

Page 25: PhD Thesis: Mining abstractions in scientific workflows

Methodology for finding workflow motifs

Goal: Reverse-engineer the set of current practices in workflow development through an analysis of empirical evidence

Collect workflows: 89 + 125 + 26 + 20 = 260 workflows

Page 26: PhD Thesis: Mining abstractions in scientific workflows

Methodology for finding workflow motifs

Goal: Reverse-engineer the set of current practices in workflow development through an analysis of empirical evidence

Preliminary workflow analysis

Researcher 1 Researcher 2 Researcher 3

Page 27: PhD Thesis: Mining abstractions in scientific workflows

Methodology for finding workflow motifs

Goal: Reverse-engineer the set of current practices in workflow development through an analysis of empirical evidence

Agreement and cross validation

Page 28: PhD Thesis: Mining abstractions in scientific workflows

Result Summary

•Over 60% of the motifs are data preparation motifs

•Some differences are motivated by the workflow systems in the analysis

•Around 40% of workflows contain motifs related to workflow reuse

composite workflows, internal macros

But how do users perceive workflow reuse? What about fragments of workflows?

Page 29: PhD Thesis: Mining abstractions in scientific workflows

Outline

1. Introduction and motivation

2. Hypothesis and contributions

3. Workflow representation: Open Provenance Model for Workflows

4. Workflow abstraction and reuse
a) A catalog of common workflow abstractions
b) Workflow reuse survey

5. Mining abstractions from workflows using graph mining techniques

6. Evaluation

7. Conclusions and future work

Page 30: PhD Thesis: Mining abstractions in scientific workflows

Use case: The LONI Pipeline

Workflow system for neuroimaging analysis
http://pipeline.loni.usc.edu/explore/library-navigator/

Discussions with scientists

User survey

Collect responses from users

21 responses

Discuss results

Page 31: PhD Thesis: Mining abstractions in scientific workflows

Summary results

The majority of users agree that reusing and sharing workflows is useful

Unlike workflows, reusing groupings from one’s own work is more useful than reusing groupings from others

Most respondents agreed that groupings help simplify workflows.

Groupings also make workflows more understandable by others

Can we detect groupings automatically?

Page 32: PhD Thesis: Mining abstractions in scientific workflows

Outline

1. Introduction and motivation

2. Hypothesis and contributions

3. Workflow representation: Open Provenance Model for Workflows

4. Workflow abstraction and reuse

5. Mining abstractions from workflows using graph mining techniques
a) Corpus preparation
b) Graph mining
c) Fragment filtering
d) Fragment linking

6. Evaluation

7. Conclusions and future work

Page 33: PhD Thesis: Mining abstractions in scientific workflows

Workflow mining approaches

Clustering [Montani and Leonardi 2012] , [García-Jimenez and Wilkinson, 2014]

Workflow corpus

Cluster 1

Cluster 2

Cluster 3

Workflow corpus

Page 34: PhD Thesis: Mining abstractions in scientific workflows

Workflow mining approaches

Clustering [Montani and Leonardi 2012] , [García-Jimenez and Wilkinson, 2014]

Topic 1

Topic 2

P(Topic1) = 0.7, P(Topic2) = 0.3

P(Topic1) = 0.5, P(Topic2) = 0.5

P(Topic1) = 0.2, P(Topic2) = 0.8 …

Topic modeling [Stoyanovich et al 2010]

Page 35: PhD Thesis: Mining abstractions in scientific workflows

Workflow mining approaches

Clustering [Montani and Leonardi 2012] , [García-Jimenez and Wilkinson, 2014]

Topic modeling [Stoyanovich et al 2010]

Case-based reasoning [Leake and Kendall-Morwick 2008], [Müller and Bergmann 2014]

Workflow corpus ?

Page 36: PhD Thesis: Mining abstractions in scientific workflows

?

Workflow mining approaches

Clustering [Montani and Leonardi 2012] , [García-Jimenez and Wilkinson, 2014]

Topic modeling [Stoyanovich et al 2010]

Case-based reasoning [Leake and Kendall-Morwick 2008] [Müller and Bergmann 2014]

Log mining [van der Aalst et al 2003] [Gómez-Pérez and Corcho, 2008]

Workflow corpus ?

PSM

Page 37: PhD Thesis: Mining abstractions in scientific workflows

Workflow mining approaches

Clustering [Montani and Leonardi 2012] , [García-Jimenez and Wilkinson, 2014]

Topic modeling [Stoyanovich et al 2010]

Case-based reasoning [Leake and Kendall-Morwick 2008] [Müller and Bergmann 2014]

Log mining [van der Aalst et al 2003] [Gómez-Pérez and Corcho, 2008]

Graph mining [Diamantini et al., 2012]

Page 38: PhD Thesis: Mining abstractions in scientific workflows

Workflow Mining in FragFlow

1

2

3

4

Page 39: PhD Thesis: Mining abstractions in scientific workflows

Corpus Preparation

Workflows are converted to Labeled Directed Acyclic Graphs (LDAG)
• The label of a node in the graph corresponds to the type of the step in the workflow
• Edges capture the dependencies between different steps

[Figure: the workflow Dataset → Stemmer algorithm → Result → Term weighting algorithm → FinalResult is reduced to the graph Stemmer algorithm → Term weighting algorithm]

Duplicated workflows are removed

Single-step workflows are removed
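The preparation rules on this slide can be sketched as follows. The data layout (a dictionary from step id to type label plus a set of dependency pairs) and the function names are assumptions made for illustration. Duplicates are detected by comparing a simple signature of labels and labelled edges, which only approximates a real graph isomorphism check.

```python
def signature(workflow):
    """Order-independent summary of a labelled graph (an approximation)."""
    labels, edges = workflow
    node_part = tuple(sorted(labels.values()))
    edge_part = tuple(sorted((labels[a], labels[b]) for a, b in edges))
    return node_part, edge_part


def prepare_corpus(workflows):
    """Drop single-step workflows and workflows already seen."""
    seen = set()
    kept = []
    for workflow in workflows:
        labels, _edges = workflow
        if len(labels) < 2:   # single-step workflows are removed
            continue
        sig = signature(workflow)
        if sig in seen:       # duplicated workflows are removed
            continue
        seen.add(sig)
        kept.append(workflow)
    return kept


# Hypothetical corpus: two copies of the same graph and one single-step workflow.
wf1 = ({"s1": "Stemmer", "s2": "TermWeighting"}, {("s1", "s2")})
wf2 = ({"a": "Stemmer", "b": "TermWeighting"}, {("a", "b")})
wf3 = ({"x": "Stemmer"}, set())

corpus = prepare_corpus([wf1, wf2, wf3])
```

Only `wf1` survives: `wf2` has the same labelled structure under different step ids, and `wf3` has a single step.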

Page 40: PhD Thesis: Mining abstractions in scientific workflows

Graph Mining

We use popular graph mining techniques:

Inexact FSM: uses heuristics to calculate the similarity between two graphs. The solution might not be complete.
• SUBDUE: 2 heuristics, Minimum Description Length (MDL) and Size

Exact FSM: delivers all the possible fragments to be found in the dataset.
• gSpan: depth-first search strategy
• FSG: breadth-first search strategy
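To give an intuition of what a frequent fragment is, the sketch below counts only the smallest possible multistep fragments (a single labelled dependency between two steps) and keeps those that appear in at least `min_support` workflows. It is deliberately not gSpan, FSG or SUBDUE, which grow such candidates into larger subgraphs. The corpus layout and names are assumptions.

```python
from collections import Counter


def frequent_edges(corpus, min_support):
    """Count each labelled edge once per workflow and keep the frequent ones."""
    support = Counter()
    for labels, edges in corpus:
        # A set, so that repeated edges inside one workflow count only once.
        support.update({(labels[a], labels[b]) for a, b in edges})
    return {edge: n for edge, n in support.items() if n >= min_support}


# Hypothetical corpus of three labelled graphs.
corpus = [
    ({"a": "Stemmer", "b": "TermWeighting"}, {("a", "b")}),
    ({"a": "Stemmer", "b": "TermWeighting", "c": "Merge"}, {("a", "b"), ("b", "c")}),
    ({"a": "Filter", "b": "Sort"}, {("a", "b")}),
]

frequent = frequent_edges(corpus, min_support=2)
```

With a minimum support of 2, only the Stemmer to TermWeighting dependency is frequent, since it occurs in two of the three workflows.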

Page 41: PhD Thesis: Mining abstractions in scientific workflows

Filtering Relevant Fragments

The number of resulting fragments can be very large. We distinguish:

Multistep fragments:
• More than one step

Filtered multistep fragments:
• Multistep fragments
• Contain all smaller fragments with the same number of occurrences

Stemmer

Term Weighting

Stemmer

Term Weighting

Filter

Filter

Sort

Filter

Sort

Query

F1

F2

F3

F4

(found 4 times)

(found 4 times)

(found 10 times)

(found 3 times)
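One way to read the filtering rule is: keep a multistep fragment unless a bigger fragment contains it and is found the same number of times. The sketch below implements that reading under stated assumptions. Fragments are represented as sets of labelled edges, containment is approximated by a proper-subset test on those edges, and the fragment names and counts are hypothetical (they are not the F1 to F4 of the figure).

```python
def filter_fragments(fragments):
    """Keep multistep fragments not absorbed by a bigger, equally frequent one."""
    kept = {}
    for name, (edges, count) in fragments.items():
        if not edges:  # no dependency edge means a single step
            continue
        absorbed = any(
            edges < other_edges and count == other_count
            for other, (other_edges, other_count) in fragments.items()
            if other != name
        )
        if not absorbed:
            kept[name] = (edges, count)
    return kept


# Hypothetical mining output: fragment name -> (labelled edges, occurrences).
fragments = {
    "A": (frozenset({("Filter", "Sort")}), 4),
    "B": (frozenset({("Filter", "Sort"), ("Sort", "Query")}), 4),
    "C": (frozenset({("Stemmer", "TermWeighting")}), 10),
}

filtered = filter_fragments(fragments)
```

Fragment A is dropped because B contains it and occurs equally often, so reporting A adds no information. B and C are kept.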

Page 42: PhD Thesis: Mining abstractions in scientific workflows

Linking to the Corpus: Example

Workflow 1

Stemmer

Term Weighting

Stemmer

Term Weighting

Merge

Stemmer

Term Weighting

Fragment1 in Wf1 (1)

Fragment1

Fragment1 in Wf1 (2)

Workflow fragment description vocabulary: http://purl.org/net/wf-fd

(Extends P-Plan)

[Figure edge labels: wffd:foundAs (×2), wffd:foundIn, p-plan:isPrecededBy (×5), p-plan:isStepOfPlan (×5)]

p-plan:Step

wffd:TiedWorkflowFragment

wffd:DetectedResultWorkflowFragment

Page 43: PhD Thesis: Mining abstractions in scientific workflows

Outline

1. Introduction and motivation

2. Hypothesis and contributions

3. Workflow representation: Open Provenance Model for Workflows

4. Workflow abstraction and reuse

5. Mining abstractions from workflows using graph mining techniques

6. Evaluation
a) Finding generic motifs in workflows
b) Workflow fragment assessment

7. Conclusions and future work

Page 44: PhD Thesis: Mining abstractions in scientific workflows

Finding generic motifs in workflows

?

Research question: Can we find commonly occurring abstractions?

composite workflows, internal macros

Page 45: PhD Thesis: Mining abstractions in scientific workflows

Finding generic motifs in workflows

?

Metrics used: precision and recall

Fragments (F)

Annotated motifs (M)
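With F the set of detected fragments and M the set of manually annotated motifs, the usual definitions are precision = |F ∩ M| / |F| and recall = |F ∩ M| / |M|. The sketch below computes both. Treating a fragment and a motif as a match when their identifiers are equal is an assumption, and the identifiers are hypothetical.

```python
def precision_recall(fragments, motifs):
    """Precision and recall of detected fragments against annotated motifs."""
    hits = fragments & motifs
    precision = len(hits) / len(fragments) if fragments else 0.0
    recall = len(hits) / len(motifs) if motifs else 0.0
    return precision, recall


# Hypothetical identifiers: four detected fragments, three annotated motifs.
F = {"f1", "f2", "f3", "f4"}
M = {"f1", "f2", "m3"}

precision, recall = precision_recall(F, M)
```

Here two of the four fragments are annotated motifs (precision 0.5) and two of the three motifs are found (recall of about 0.67).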

Page 46: PhD Thesis: Mining abstractions in scientific workflows

Finding generic motifs in workflows

?

Corpus: 22 templates from the same domain, annotated manually

Wings workflow corpus + domain knowledge

Dataset

PorterStemmer

Result

IDF

FinalResult

Dataset

LovinsStemmer

Result

ResidualIDF

FinalResult

+

Dataset

Stemmer

Result

Term Weighting

FinalResult

Component taxonomy
• Stemmer
  • Porter Stemmer
  • Lovins Stemmer
• Term Weighting
  • Inverse Document Frequency (IDF)
  • Residual IDF
  • Query Term Weighting
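The role of the component taxonomy can be sketched as a lookup that replaces each concrete component by its parent type, so that the two concrete workflows on this slide become the same abstract workflow. The dictionary layout and the function name are assumptions. Workflows are simplified to ordered lists of step types.

```python
# Child type -> parent type, following the component taxonomy on the slide.
PARENT = {
    "PorterStemmer": "Stemmer",
    "LovinsStemmer": "Stemmer",
    "IDF": "TermWeighting",
    "ResidualIDF": "TermWeighting",
    "QueryTermWeighting": "TermWeighting",
}


def generalize(step_types):
    """Replace each step type by its parent type, if it has one."""
    return [PARENT.get(step_type, step_type) for step_type in step_types]


wf_a = ["PorterStemmer", "IDF"]
wf_b = ["LovinsStemmer", "ResidualIDF"]

abstract_a = generalize(wf_a)
abstract_b = generalize(wf_b)
```

Both workflows generalize to Stemmer followed by TermWeighting, which is how two seemingly disparate workflows become comparable before mining.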

Page 47: PhD Thesis: Mining abstractions in scientific workflows

Finding generic motifs in workflows

?

Results of the evaluation

H.2: It is possible to detect commonly occurring patterns and abstractions automatically.

Internal macros:
• Inexact FSM: 2 out of 3 found (r = 0.67); 4 out of 5 (r = 0.8) when applying generalization

Composite workflows:
• Exact FSM: all motifs are found, although the precision is low (p = 0.18)

Can we find commonly occurring abstractions?

Page 48: PhD Thesis: Mining abstractions in scientific workflows

Outline

1. Introduction and motivation

2. Hypothesis and contributions

3. Workflow representation: Open Provenance Model for Workflows

4. Workflow abstraction and reuse

5. Mining abstractions from workflows using graph mining techniques

6. Evaluation
a) Finding generic motifs in workflows
b) Workflow fragment assessment

7. Conclusions and future work

Page 49: PhD Thesis: Mining abstractions in scientific workflows

Workflow fragment assessment

?

Research question: Are our proposed workflow fragments useful?
• A fragment is useful if it has been designed and (re)used by a user.
• Comparison between proposed fragments and user-designed groupings and workflows

Page 50: PhD Thesis: Mining abstractions in scientific workflows

Workflow fragment assessment

?

Metrics: Precision and recall

Fragments (F)

Workflows (W)

Groupings (G)

Page 51: PhD Thesis: Mining abstractions in scientific workflows

Workflow fragment assessment

?

Workflow corpora

User Corpus 1 (WC1)
• Designed mostly by a single user
• 790 workflows (475 after data preparation)

User Corpus 2 (WC2)
• Created by a user, with collaborations of others
• 113 workflows (96 after data preparation)

Multi User Corpus 3 (WC3)
• Workflows submitted by 62 users during the month of Jan 2014
• 5859 workflows (357 after data preparation)

User Corpus 4 (WC4)
• Designed mostly by a single user
• 53 workflows (50 after data preparation)

Page 52: PhD Thesis: Mining abstractions in scientific workflows

Workflow fragment assessment

?

Result assessment

• 30%-60% of proposed fragments are equal to user-defined groupings or workflows

• 40%-80% of proposed fragments are equal or similar to user-defined groupings or workflows

H.3: Commonly occurring patterns are potentially useful for users designing workflows

What about the rest of the fragments? Are those useful?

Page 53: PhD Thesis: Mining abstractions in scientific workflows

Workflow fragment assessment

?

User feedback: user survey

Q1: Would you consider the proposed fragment a valuable grouping?
• I would not select it as a grouping (0)
• I would use it as a grouping with major changes (i.e., adding/removing more than 30% of the steps) (1)
• I would use it as a grouping with minor changes (i.e., adding/removing less than 30% of the steps) (2)
• I would use it as a grouping as it is (3)

Q2: What do you think about the complexity of the fragment?
• The fragment is too simple (0)
• The fragment is fine as it is (1)
• The fragment has too many steps (2)

Not enough evidence to state that all proposed workflow fragments are useful

Page 54: PhD Thesis: Mining abstractions in scientific workflows

Outline

1. Introduction and motivation

2. Hypothesis and contributions

3. Workflow representation: Open Provenance Model for Workflows

4. Workflow abstraction and reuse

5. Mining abstractions from workflows using graph mining techniques

6. Evaluation

7. Conclusions and future work

Page 55: PhD Thesis: Mining abstractions in scientific workflows

Conclusions: Results

H.1: It is possible to define a catalog of common domain independent patterns based on the common functionality of workflow steps.

Daniel Garijo and Yolanda Gil. A new approach for publishing workflows: Abstractions, standards, and Linked Data. (WORKS'11)

Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An empirical analysis (extended version). Future Generation Computer Systems. 2013.

Model for representing workflows (OPMW) and publishing them as Linked Data

Catalog of workflow motifs + workflow annotation

H.2: It is possible to detect commonly occurring patterns and abstractions automatically.

Graph mining approach + workflow generalization

Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble. Common motifs in scientific workflows: An empirical analysis. 8th IEEE International Conference on e-Science (eScience 2012)

Daniel Garijo, Oscar Corcho and Yolanda Gil. Detecting common scientific workflow fragments using templates and execution provenance. Proceedings of the seventh international conference on Knowledge capture, (K-CAP 2013).

Page 56: PhD Thesis: Mining abstractions in scientific workflows

Conclusions: Results

Daniel Garijo, Oscar Corcho, Yolanda Gil, Boris Gutman, Ivo D. Dinov, Paul Thompson and Arthur W. Toga. FragFlow: Automated fragment detection in scientific workflows. 10th IEEE Conference on e-Science, (eScience 2014)

Daniel Garijo, Oscar Corcho, Yolanda Gil, Meredith N. Braskie, Dereck Hibar, Xie Hua, Neda Jahanshad, Paul Thompson and Arthur W. Toga. Workflow reuse in practice: A study of neuroimaging pipeline users. 10th IEEE Conference on e-Science, (eScience 2014)

H.3: Commonly occurring patterns are potentially useful for users designing workflows.

Graph mining approach + reusability metrics for assessment + workflow annotation

Reuse survey

Page 57: PhD Thesis: Mining abstractions in scientific workflows

Conclusions: Impact and future work

Impact:

OPMW
• Workflow annotation [García-Jiménez and Wilkinson 2014b]

Motif catalog
• Expansion for distributed environments [Olabarriaga et al 2013]
• Workflow summarization [Alper et al 2013]

Future work:
• Towards workflow ecosystems

[Garijo et al 2014] (WORKS’14)

Page 58: PhD Thesis: Mining abstractions in scientific workflows

Conclusions: Impact and future work

•Automatic detection of workflow abstractions

•Improvement of workflow reuse

Custom fragments

Ranking fragments

Suggestions of workflows

Page 59: PhD Thesis: Mining abstractions in scientific workflows

Date: 03/12/2015

Mining Abstractions in Scientific Workflows

Daniel Garijo *

Supervisors: Oscar Corcho *, Yolanda Gil Ŧ

* Universidad Politécnica de Madrid, Ŧ USC Information Sciences Institute

All materials are available as Research Objects (with pointers to Figshare)

http://w3id.org/dgarijo/ro/mining-abstractions-in-scientific-wfs

Page 60: PhD Thesis: Mining abstractions in scientific workflows

Supporting material

Page 61: PhD Thesis: Mining abstractions in scientific workflows

Methodology

Workflow representation and publication

Approach

Workflow abstraction and reuse

Empirical analysis of workflow corpora

Problem

Evaluation

Requirement validation and user feedback

Model

Competency question validation

Provenance

Plan

Publication

Methodology for publication

Extension of existing standards and web technologies

Workflow abstraction analysis for reuse

Agreement on a catalog of common abstractions

Automatic detection and annotation of workflow abstractions

Graph mining techniques, generalization

Precision, recall and user feedback

Page 62: PhD Thesis: Mining abstractions in scientific workflows

Provenance Models

“A record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing”-PROV-DM: The PROV Data Model (W3C)

Page 63: PhD Thesis: Mining abstractions in scientific workflows

Replace this slide with a methodological one

prov:used

p-plan:Variable

p-plan:isStepOfPlan

p-plan:isVariableOfPlan

p-plan:hasInputVar

p-plan:isOutputVarOf

p-plan:Activity

p-plan:correspondsToStep

p-plan:Entity

prov:wasGeneratedBy

p-plan:isPrecededBy

p-plan:Bundle

Class Object property

Legend

Subclass of

prov:Bundle

prov:Plan

prov:Entity

prov:Activity

PROV extended classes

Statements contained in a p-plan:Bundle

p-plan:Step

p-plan:Plan

p-plan:correspondsToVariable

Page 64: PhD Thesis: Mining abstractions in scientific workflows

Assumptions and restrictions

Restriction:
• Workflows are represented as directed acyclic graphs

Assumptions:
• Available workflow repositories exist for exploiting definitions of workflows and workflow executions.
• All the workflow steps can be assigned a label with their type.
• Two steps of a workflow with the same function have the same type.
• Researchers aim to reuse workflows and workflow fragments if they find them useful.

Page 65: PhD Thesis: Mining abstractions in scientific workflows

9

Other models for representing workflow instances, templates and executions

Page 66: PhD Thesis: Mining abstractions in scientific workflows

Publishing as LD

•Maybe paste here an example instead of the big picture

Page 67: PhD Thesis: Mining abstractions in scientific workflows

Data-Oriented Motifs

Data Retrieval

Data Preparation

Format Transformation

Input Augmentation and Output Splitting

Data Organisation

Data Analysis

Data Curation/Cleaning

Data Movement

Data Visualisation

Page 73: PhD Thesis: Mining abstractions in scientific workflows

Workflow-Oriented Motifs

Intra-Workflow Motifs

Stateful (Asynchronous) Invocations

Stateless (Synchronous) Invocations

Internal Macros

Human Interactions

Inter-Workflow Motifs

Atomic Workflows

Composite Workflows

Workflow Overloading

Page 78: PhD Thesis: Mining abstractions in scientific workflows

Result Summary: Data Oriented Motifs

•Over 60% of the motifs are data preparation motifs

•Some differences are motivated by the workflow systems in the analysis

•Data analysis is often the main functionality of the workflow

Page 79: PhD Thesis: Mining abstractions in scientific workflows

Result Summary: Workflow Oriented Motifs

• Around 40% composite workflows and internal macros

But how do users perceive workflow reuse?
• What about fragments of workflows?

Page 80: PhD Thesis: Mining abstractions in scientific workflows

Differences and commonalities of the workflow systems

• Data moving/retrieval, stateful interactions and human interaction steps are not present in Wings
• Web services (Taverna) versus software components (Wings)
• Wings has layered execution through Pegasus

• Data preparation steps are common in both systems

• Use of sub-workflows is high

Page 81: PhD Thesis: Mining abstractions in scientific workflows

Reusing workflows…

According to the respondents, the major benefits of workflows include:
• Time savings
• Organizing and storing code
• Having a visualization of the overall analysis
• Facilitating reproducibility

Page 82: PhD Thesis: Mining abstractions in scientific workflows

Reusing groupings…

•Reuse is not the only reason why groupings are created. Unlike workflows, reusing groupings from one’s own work is more useful than reusing groupings from others

•Most respondents agreed that groupings help simplify workflows. Groupings also make workflows more understandable by others

Page 83: PhD Thesis: Mining abstractions in scientific workflows

Graph Mining

We use popular graph mining techniques:

Inexact FSM: uses heuristics to calculate the similarity between two graphs. The solution might not be complete.

SUBDUE
• 2 heuristics: Minimum Description Length (MDL) and Size
• Frequency based

Exact FSM: delivers all the possible fragments to be found in the dataset.

gSpan
• Depth-first search strategy
• Support based

FSG
• Breadth-first search strategy
• Support based
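
To make the "frequency based" and "support based" labels concrete, the sketch below counts the smallest possible fragments (single edges between typed steps) in a toy corpus. It assumes that frequency means the total number of occurrences across the corpus and that support means the number of workflows containing the fragment at least once. It is not an implementation of SUBDUE, gSpan or FSG, and the workflows and function names are hypothetical.

```python
from collections import Counter


def labeled_edges(workflow):
    """Map each (source, target) edge to its (source type, target type) pair."""
    step_types, edges = workflow
    return [(step_types[src], step_types[dst]) for src, dst in edges]


def edge_frequency(workflows):
    """Total number of occurrences of each labeled edge in the corpus."""
    counts = Counter()
    for wf in workflows:
        counts.update(labeled_edges(wf))
    return counts


def edge_support(workflows):
    """Number of workflows in which each labeled edge occurs at least once."""
    counts = Counter()
    for wf in workflows:
        counts.update(set(labeled_edges(wf)))
    return counts


def frequent_edges(workflows, min_support):
    """Labeled edges whose support reaches the given threshold."""
    return {e for e, s in edge_support(workflows).items() if s >= min_support}


# Hypothetical corpus of three workflows: (step types, edges).
corpus = [
    ({"a": "Filter", "b": "Sort", "c": "Sort", "d": "Plot"},
     [("a", "b"), ("a", "c"), ("b", "d")]),
    ({"x": "Filter", "y": "Sort"}, [("x", "y")]),
    ({"m": "Sort", "n": "Plot"}, [("m", "n")]),
]
```

In this corpus the Filter to Sort edge occurs three times but in only two workflows, so the two measures disagree on it. Exact miners typically grow larger candidate fragments from such frequent edges, depth first or breadth first depending on the algorithm.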

Page 84: PhD Thesis: Mining abstractions in scientific workflows

Linking to the Corpus: Workflow fragment description vocabulary

Page 85: PhD Thesis: Mining abstractions in scientific workflows

Workflow fragment assessment: Summary of results

Page 86: PhD Thesis: Mining abstractions in scientific workflows

Conclusions: Limitations

L1: OPMW has been designed for data-intensive workflows (without loops or conditionals)

L2: When publishing as Linked Data, it is assumed that all resources will be made public (no privacy issues)

L3: Motif catalog may be expanded with additional motifs

L4: Size and time needed to calculate some workflow fragments

L5: A taxonomy of components is needed when generalizing workflows. This taxonomy is provided by domain experts modeling the domain.
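
The role of the taxonomy in L5 can be sketched as follows: each step type is replaced by its parent class, so that workflows using different but related components become comparable. The taxonomy, the type names and the functions below are hypothetical placeholders for what a domain expert would provide, not the method used in the thesis.

```python
# Hypothetical taxonomy: child type -> parent type.
TAXONOMY = {
    "QuickSort": "Sort",
    "MergeSort": "Sort",
    "Sort": "DataPreparation",
    "Filter": "DataPreparation",
}


def generalize(step_type, levels=1, taxonomy=TAXONOMY):
    """Climb the taxonomy by the given number of levels.
    Types without a parent in the taxonomy are left unchanged."""
    for _ in range(levels):
        step_type = taxonomy.get(step_type, step_type)
    return step_type


def generalize_workflow(step_types, levels=1, taxonomy=TAXONOMY):
    """Generalize the type label of every step in a workflow."""
    return {s: generalize(t, levels, taxonomy) for s, t in step_types.items()}


# Two workflows that differ only in the sorting component they use.
wf_a = {"s1": "QuickSort", "s2": "Plot"}
wf_b = {"s1": "MergeSort", "s2": "Plot"}
```

After one level of generalization both workflows carry the same labels, so a fragment miner would treat their sorting steps as the same type.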
