P-Plan

Augmenting PROV with Plans in P-PLAN: Scientific Processes as Linked Data. Daniel Garijo (OEG-DIA, Facultad de Informática, Universidad Politécnica de Madrid) and Yolanda Gil (Information Sciences Institute and Department of Computer Science, University of Southern California, http://www.isi.edu/~gil).

description

Provenance models are crucial for describing experimental results in science. The W3C Provenance Working Group has recently released the PROV family of specifications for provenance on the Web. While provenance focuses on what is executed, it is important in science to publish the general methods that describe scientific processes at a more abstract and general level. In this paper, we propose P-PLAN, an extension of PROV to represent the plans that guided the execution and their correspondence to the provenance records that describe the execution itself. We motivate and discuss the use of P-PLAN and PROV to publish scientific workflows as Linked Data.

Transcript of P-Plan

Page 1: P-Plan

Yolanda Gil, USC Information Sciences Institute

Augmenting PROV with Plans in P-PLAN: Scientific Processes as Linked Data

Daniel Garijo

OEG-DIA, Facultad de Informática

Universidad Politécnica de Madrid

Yolanda Gil

Information Sciences Institute and Department of Computer Science, University of Southern California

http://www.isi.edu/~gil

Page 2: P-Plan

W3C PROV http://www.w3.org/2011/prov/

Page 3: P-Plan

A Workflow Execution in PROV

Benefits:
• Makes the work inspectable

Shortcomings:
• Hard to reproduce
• Not efficient to reuse

Page 4: P-Plan

Reproducibility

Page 5: P-Plan

Replication of Crohn’s Disease Association Study from [Duerr et al, Science 06]

Page 6: P-Plan

Replication of Early-Onset Parkinson’s Disease Study from [Bayrakli et al, Human Mutation 07]

Page 7: P-Plan

Reusability

Lower cost
• “Scientists and engineers spend more than 60% of their time just preparing the data for model input or data-model comparison” (NASA A40)

Better quality
• “We write QC without thinking about the best way to do the WC. Such approaches perpetuate mediocrity. If someone did it right once, it would benefit many people.” (EC WF CQ)

More efficient
• “I often see that I’m repeating the work that 100 other people have been doing to obtain and process the data.” (EC WF CQ)

Page 8: P-Plan

Access to Data Analytics Expertise [Science 2011]

Page 9: P-Plan

The TB-Drugome [Kinnings et al., PLoS CompBio 2010]

“We report a computational approach to construct a drug-target network… applied to the genome of tuberculosis…”

“The TB-drugome reveals that approximately one-third of the drugs examined have the potential to… treat tuberculosis…”

“The methodology can be applied to other pathogens of interest …”

Page 10: P-Plan

Executable and Abstract Workflow

Executable workflow: what I actually run

Abstract workflow: the method that I followed

Page 11: P-Plan

The Ontology for Biomedical Investigations http://obi-ontology.org/

Page 12: P-Plan

Semantic Web Applications in Neuromedicine (SWAN) Ontology http://www.w3.org/TR/hcls-swan/

Page 13: P-Plan

Research Objects http://www.wf4ever-project.org/research-object-model

Page 14: P-Plan

Executable and Abstract Workflow

Executable workflow: what I actually run

Abstract workflow: the method that I followed

Page 15: P-Plan

Semantic Workflows in Wings [Gil et al 10] [Gil et al 09] [Kim & Gil et al 08] [Kim et al 06]

Workflows are augmented with semantic constraints

• Each workflow constituent has a variable associated with it

– Workflow components, arguments, datasets

• Constraints are used to restrict workflow variables

• Can define abstract classes of components

– Concrete components model exec. codes

Workflow reasoners propagate and use semantic constraints

Uses semantic web standards: OWL/RDF, SPARQL, rules

Page 16: P-Plan

Ontologies for Data and Workflow Components

[Figure: ontology diagrams for data and workflow components. Labels in the figure: Documents, Plain text, MarkupInDoc, htmlDoc, latexDoc, Language, En, Fr, Model, DecTree, SVM, FeatureVector, Size, WSJ-2010, CorrelationScoring, ChiSq, InfoGain, MutInfo, Modeler, LinearRegression, DecTreeModeler, C4.5, J48, MatLab_LR, R_LR, Weka-C4.5]

Page 17: P-Plan

Semantic Workflows: Abstractions Based on Ontologies [Gil et al 2011]

[Figure labels: Term Weighting, Correlation Scoring, TF-IDF, Chi Squared, CODE, CODE]

Page 18: P-Plan

Publishing Workflows on the Web with OPMW http://www.opmw.org

Extension of the Open Provenance Model

[Figure: OPMW diagram relating a Workflow Template to its Workflow Execution Results. Template-side labels: Workflow template, Abstract template Node, Abstract component, Input artifact1, Input artifact2, Output artifact1. Execution-side labels: Execution account, Execution Node, Specific component, Execution Input1, Execution Input2, Execution result, user. Property labels: hasWorkflowTemplate, hasArtifactTemplate, hasProcessTemplate, hasArtifact, hasProcess, hasSpecificComponent, hasAbstractComponent, subClassOf, used, wasGeneratedBy, wasControlledBy, account. OPM labels: OPM Graph, Account, Process, Artifact, Agent.]

Red: OPM model

Black: OPMW profile (extension)

Page 19: P-Plan

Published as Linked Data: Executed Workflow + Abstract Workflow + Data + Steps + Codes…

Page 20: P-Plan

P-PLAN: Extending PROV to represent plans

Plan representations can be very complex
• Iteration, conditionals, decomposition, etc.

P-PLAN is a core representation with only:
• Sequences of steps
• Parallel steps

P-PLAN, like PROV, is a DAG
• Simplest representation of plans
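The sequence-and-parallel structure above can be illustrated with a small DAG of steps linked by variables. This is only a minimal Python sketch, not part of P-PLAN itself: the step and variable names are hypothetical, and the standard-library `graphlib.TopologicalSorter` (Python 3.9+) is used here simply as one convenient way to order a DAG.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each step lists the variables it reads and writes.
# Step and variable names are illustrative, not taken from the slides.
PLAN = {
    "tokenize": {"inputs": {"documents"}, "outputs": {"tokens"}},
    "weight": {"inputs": {"tokens"}, "outputs": {"weights"}},
    "score": {"inputs": {"tokens"}, "outputs": {"scores"}},
    "train": {"inputs": {"weights", "scores"}, "outputs": {"model"}},
}


def step_dependencies(plan):
    """Map each step to the steps that produce one of its input variables."""
    producer = {var: step for step, io in plan.items() for var in io["outputs"]}
    return {
        step: {producer[var] for var in io["inputs"] if var in producer}
        for step, io in plan.items()
    }


def execution_levels(plan):
    """Group steps into levels; steps within one level may run in parallel.

    Raises graphlib.CycleError if the plan is not a DAG.
    """
    sorter = TopologicalSorter(step_dependencies(plan))
    sorter.prepare()
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        levels.append(ready)
        sorter.done(*ready)
    return levels


if __name__ == "__main__":
    for level in execution_levels(PLAN):
        print(level)
```

In this sketch, "weight" and "score" land in the same level because both depend only on "tokenize", which is the parallel case; the levels themselves form the sequence.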

Page 21: P-Plan

P-Plan

Page 22: P-Plan

Queries about Workflows Published as Linked Data

Find all abstract workflows (?plan) whose executions used a given entity (?entity)

PREFIX p-plan: <http://purl.org/net/p-plan#>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT DISTINCT ?plan WHERE {
  ?entity a p-plan:Entity, prov:Entity ;
          p-plan:correspondsTo ?templVariable .
  ?templVariable a p-plan:Variable ;
                 p-plan:isVariableOfPlan ?plan .
}
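To make the join in the query above concrete, here is a minimal pure-Python sketch that evaluates the same pattern over a handful of in-memory triples. The triples, the `ex:` identifiers, and the `objects` and `plans_using` helpers are all hypothetical; in practice the SPARQL query would be run against an endpoint or an RDF library rather than reimplemented like this.

```python
# Hypothetical triples (subject, predicate, object); "a" abbreviates rdf:type.
TRIPLES = {
    ("ex:dataset1", "a", "p-plan:Entity"),
    ("ex:dataset1", "a", "prov:Entity"),
    ("ex:dataset1", "p-plan:correspondsTo", "ex:inputVar"),
    ("ex:inputVar", "a", "p-plan:Variable"),
    ("ex:inputVar", "p-plan:isVariableOfPlan", "ex:planA"),
    # ex:dataset2 is typed only as prov:Entity, so the pattern rejects it.
    ("ex:dataset2", "a", "prov:Entity"),
    ("ex:dataset2", "p-plan:correspondsTo", "ex:otherVar"),
    ("ex:otherVar", "a", "p-plan:Variable"),
    ("ex:otherVar", "p-plan:isVariableOfPlan", "ex:planB"),
}


def objects(triples, subject, predicate):
    """Return all objects o such that (subject, predicate, o) is a triple."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def plans_using(triples, entity):
    """Return the plans that have a variable the given entity corresponds to."""
    # First triple pattern: the entity must carry both types.
    if not {"p-plan:Entity", "prov:Entity"} <= objects(triples, entity, "a"):
        return set()
    plans = set()
    for variable in objects(triples, entity, "p-plan:correspondsTo"):
        # Second triple pattern: the target must be a p-plan:Variable.
        if "p-plan:Variable" in objects(triples, variable, "a"):
            plans |= objects(triples, variable, "p-plan:isVariableOfPlan")
    return plans


if __name__ == "__main__":
    print(plans_using(TRIPLES, "ex:dataset1"))
```

Returning a set mirrors the `DISTINCT` in the query: a plan reached through several variables is reported once.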

Page 23: P-Plan

Conclusions

Linked data as a vehicle to publish science processes
• Workflows, experiments, …

Important to publish method, not just provenance
• Reproducibility, efficiency, access to expertise

W3C PROV useful to publish execution

P-PLAN is an extension of PROV for publishing methods
• Plan, step, variable

P-PLAN is applicable beyond science