Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Paolo Missier (1), Bertram Ludascher (2), Shawn Bowers (3), Saumen Dey (2), Anandarup Sarkar (3), Biva Shrestha (4), Ilkay Altintas (5), Manish Kumar Anand (5), Carole Goble (1). Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science. (1) School of Computer Science, University of Manchester; (2) Dept. of Computer Science, University of California, Davis; (3) Dept. of Computer Science, Gonzaga University; (4) Dept. of Computer Science, Appalachian State University; (5) San Diego Supercomputer Center, University of California, San Diego. WORKS’10, New Orleans.

Missier, P., Ludascher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., et al. (2010). Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science. Proc. 5th Workshop on Workflows in Support of Large-Scale Science (WORKS).

Page 1: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Paolo Missier(1), Bertram Ludascher(2), Shawn Bowers(3),

Saumen Dey(2), Anandarup Sarkar(3), Biva Shrestha(4),

Ilkay Altintas(5), Manish Kumar Anand(5), Carole Goble(1)

Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science

(1) School of Computer Science, University of Manchester

(2) Dept. of Computer Science, University of California, Davis

(3) Dept. of Computer Science, Gonzaga University

(4) Dept. of Computer Science, Appalachian State University

(5) San Diego Supercomputer Center, University of California, San Diego

WORKS’10, New Orleans

Page 2: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Linking Provenance Traces … P Missier, B Ludascher et al. WORKS’10

Context: Data Sharing

• Implicit collaboration through data sharing

– Alice uses nth generation input dataset x and produces n+1st output dataset z

– … as part of run RA of workflow WA

– … output z is published in some data-space.

– Bob uses Alice’s outputs z and produces n+2nd generation dataset v

– … using workflow WB, possibly with pre-processing f

– Alice and Bob may not know each other

Page 3: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Motivation: Virtual Joint Experiments

• How do we ensure that Charlie gets a complete account of the history of WC’s outputs?

• How do we ensure that Alice gets her due (partial) credit when Charlie uses Bob’s data v?

– traces TA and TB will be critical

– need to compose them to obtain TC

We can view the composition WC as a new, virtual workflow

Page 4: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Provenance Composition: the Data Tree of Life (DToL)

• We can formulate our questions in terms of provenance of the datasets produced by virtual workflow WC:

– What is the complete provenance of v?

• Answering the question requires tracing v’s derivation all the way to x

• But, to achieve this, we need to ensure:

• TA and TB are properly connected

• Provenance queries run seamlessly over and across TA and TB

Page 5: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Test scenario: 1st Provenance Challenge Workflow

• DataONE Summer-of-Code Project

– Split First Provenance Challenge workflow at various points

– Publish Part-I from system X, use as input for Part-II on system Y

• X, Y in { Kepler/SDF, Kepler/COMAD, Taverna }

Page 6: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Common Model of Provenance (approx. OPM)

Data provenance for a single workflow run is well understood

Workflow spec: digraph W = (VW, EW)

– VW = A ∪ C: actors A (processors), channels C (FIFO data buffers)

– EW = Ein ∪ Eout: in edges Ein ⊆ A × C, out edges Eout ⊆ C × A

Trace graph: acyclic digraph T = (VT, ET)

– VT = I ∪ D (invocations I, data D)

– ET = Eread ∪ Ewrite: read edges Eread ⊆ D × I, write edges Ewrite ⊆ I × D

TA is a trace instance of WA: h: TA → WA is a homomorphism, e.g.

– h(x1 → a1) = h(x2 → a2) = X → A

– h(a1 → y1) = h(a2 → y2) = A → Y

– ...
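The model on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Workflow` and `Trace`, the function `is_homomorphism`, and the choice to store all workflow edges in one set of (source, target) pairs are hypothetical and not taken from the paper. The check mirrors the slide's example, where the read edges x1 → a1 and x2 → a2 both map to X → A.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    actors: set      # A: processors
    channels: set    # C: FIFO data buffers
    edges: set       # E_W as (source, target) pairs between actors and channels


@dataclass
class Trace:
    invocations: set  # I
    data: set         # D
    read: set         # E_read: (d, i) pairs, a subset of D x I
    write: set        # E_write: (i, d) pairs, a subset of I x D


def is_homomorphism(trace, workflow, h):
    """Return True if h maps invocations to actors, data to channels,
    and every read/write edge of the trace to an edge of the workflow."""
    if any(h[i] not in workflow.actors for i in trace.invocations):
        return False
    if any(h[d] not in workflow.channels for d in trace.data):
        return False
    return all((h[u], h[v]) in workflow.edges for (u, v) in trace.read | trace.write)


# The example from the slide: two invocations of actor A.
wf = Workflow(actors={"A"}, channels={"X", "Y"}, edges={("X", "A"), ("A", "Y")})
tr = Trace(
    invocations={"a1", "a2"},
    data={"x1", "x2", "y1", "y2"},
    read={("x1", "a1"), ("x2", "a2")},
    write={("a1", "y1"), ("a2", "y2")},
)
h = {"a1": "A", "a2": "A", "x1": "X", "x2": "X", "y1": "Y", "y2": "Y"}
print(is_homomorphism(tr, wf, h))
```

Mapping an output such as y1 to channel X instead of Y makes the check fail, because the write edge a1 → y1 would then map to A → X, which is not a workflow edge.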

Page 7: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Data and Invocation Dependencies (ddep, idep)

- read, write are natural observables for a workflow run

- possible additional relations (recorded or inferred):

• invocation dependencies (idep): “a2 depends on a1” because a1 has written data d and a2 has read d

– explicit, or inferred via write(a1, d) and read(d, a2)

• data dependencies (ddep): “d2 depends on d1” because some actor invocation a read d1 prior to writing d2

– explicit, or inferred via read(d1, a) and write(a, d2)

(Note: in some models of computation the rules above are not correct)
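A minimal sketch of how idep and ddep could be inferred from the observed read and write edges. The function names `infer_idep` and `infer_ddep` and the pair orientation (dependent first, source second) are assumptions made for illustration. The sketch also ignores the "prior to" timing condition, so, as the note above warns, it is not correct for every model of computation.

```python
def infer_idep(read, write):
    """Invocation dependencies: (a2, a1) means a2 depends on a1,
    because a1 wrote some data item d that a2 read."""
    return {(a2, a1) for (a1, d) in write for (d_read, a2) in read if d == d_read}


def infer_ddep(read, write):
    """Data dependencies: (d2, d1) means d2 depends on d1,
    because some invocation a read d1 and wrote d2."""
    return {(d2, d1) for (d1, a) in read for (a_w, d2) in write if a == a_w}


# read edges are (data, invocation); write edges are (invocation, data)
read = {("x", "a1"), ("y", "a2")}
write = {("a1", "y"), ("a2", "z")}

print(infer_idep(read, write))
print(infer_ddep(read, write))
```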

Page 8: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Provenance queries

• Local (“non-closure”) queries on a trace T:

– (1) Find the data and traces published by Alice / Bob

– (2) Find the inputs, outputs, and intermediate data products of T

– (3) Find (selected) actors and channels used in T

– (4) Find inputs and outputs of an invocation ai in T

Easy and not very interesting. E.g. the answer to (3) is just the set of nodes in h(T)

• Closure queries:

• operate on the transitive closure ddep* over ddep:

• suppose ddep* spans multiple traces TA, TB

• we must define the standard query:

so that it operates on the composition of TA, TB
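A closure query over ddep can be sketched as a graph traversal. The function name `ddep_star` and the pair orientation (dependent, source) are assumptions for illustration. The example also shows the problem this slide raises: when the dependencies live in two separate traces with different identifiers, the closure stops at the trace boundary.

```python
from collections import defaultdict


def ddep_star(ddep, target):
    """Return every data item that `target` transitively depends on.
    `ddep` is a set of (dependent, source) pairs."""
    sources = defaultdict(set)
    for dependent, source in ddep:
        sources[dependent].add(source)

    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for src in sources[node]:
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


# Within a single trace the closure reaches the original input.
single = {("z", "y"), ("y", "x")}
print(ddep_star(single, "z"))

# Alice's trace knows z; Bob's trace only knows his local copy z2.
ddep_alice = {("z", "x")}
ddep_bob = {("v", "z2")}
print(ddep_star(ddep_alice | ddep_bob, "v"))  # stops at z2, never reaches x
```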

Page 9: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Issues in Provenance Composition

• Closure queries now must span multiple provenance traces

– heterogeneity of both workflow and provenance models

• Main problems and approaches:

• I - Trace disconnect:

– problem: traces that should “join” on the shared data are really disconnected

– approach: make the data sharing process itself provenance-aware

• II - Model heterogeneity:

– problem: different workflow and provenance models

– approach: common provenance model with local → global mapping

• III - Data identifiers mismatch:

– problem: different workflows adopt different data identification schemes

– approach: assert data equivalence as part of provenance

Page 10: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Part I – Provenance Stitching

• The missing link: make every data copy step provenance-aware

- r : data reference in store S

- trace-equivalence of data items d in S, d’ in S’: d ≃ d’ if d’ is obtained by copying d from S to S’:

Page 11: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Part II - Mapping to a Common Provenance Model

• Mapping rules (= code, queries) defined from Kepler and Taverna provenance models to common model (details omitted):

In the result TP each reference r found in TS is replaced with ρ(r)

– OPM used as intermediate target model

– … doesn’t “nail” everything

– a mixed blessing …

– … but team-work made it work!
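The renaming step mentioned above, where each reference r in the source trace is replaced with ρ(r), might look like the following sketch. The function name `rename_trace`, the representation of ρ as a Python dict, and the choice to leave unmapped references unchanged are all assumptions; the actual Kepler and Taverna mapping rules are omitted in the talk and are not reproduced here.

```python
def rename_trace(read, write, rho):
    """Return copies of the read and write edge sets in which every data
    reference r is replaced with rho[r]. References without an entry in
    rho are kept as they are (an assumption of this sketch)."""
    def rename(r):
        return rho.get(r, r)

    new_read = {(rename(d), i) for (d, i) in read}
    new_write = {(i, rename(d)) for (i, d) in write}
    return new_read, new_write


# Hypothetical local references from one system mapped to public ones.
rho = {"local:17": "pub:z", "local:18": "pub:v"}
read = {("local:17", "b1")}
write = {("b1", "local:18"), ("b1", "tmp")}

print(rename_trace(read, write, rho))
```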

Page 12: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Part III – Data Identifier Reconciliation

• We have seen that the copy operation …

r’ = copy(r, S, S’)

• … on shared data store S generates a data equivalence assertion

• It also keeps track of ID mappings, which are added to a renaming map from a set of S-specific references to a set of public references
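A provenance-aware copy could be sketched as follows. This is an illustration only: the `CopyLog` class, the extra `log` and `dst_name` parameters, the reference naming scheme, and the decision to give the original and its copy the same public reference are assumptions that go beyond the r’ = copy(r, S, S’) signature shown on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class CopyLog:
    equivalences: set = field(default_factory=set)  # (r, r') pairs, i.e. d ≃ d'
    renaming: dict = field(default_factory=dict)    # store-specific ref -> public ref
    next_id: int = 1

    def public_ref(self, r):
        """Return the public reference for r, minting one on first use."""
        if r not in self.renaming:
            self.renaming[r] = f"pub:{self.next_id}"
            self.next_id += 1
        return self.renaming[r]


def copy(r, src, dst, log, dst_name):
    """Copy the item referenced by r from store `src` to store `dst`
    (both plain dicts here), record the equivalence, and return r'."""
    r_new = f"{dst_name}/{len(dst) + 1}"
    dst[r_new] = src[r]
    log.equivalences.add((r, r_new))
    # Both store-specific references resolve to one public reference.
    log.renaming[r_new] = log.public_ref(r)
    return r_new


store_a = {"A/z": b"alice output"}
store_b = {}
log = CopyLog()
r2 = copy("A/z", store_a, store_b, log, "B")
print(r2, log.equivalences, log.renaming)
```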

Page 13: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Extended (across-runs) Provenance Queries

• Closure queries are redefined on the extended provenance trace that includes trace-equivalences d ≃ d’ as follows:

for instance between
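One way to picture a closure query over the extended trace is to let the traversal follow trace-equivalences as well as ddep edges. The sketch below is an assumption-laden illustration, not the paper's definition: the name `extended_ddep_star` is hypothetical, and each equivalence is stored as an (original, copy) pair and followed only from the copy back to the original.

```python
from collections import defaultdict


def extended_ddep_star(ddep, equiv, target):
    """Return every data item `target` transitively depends on, following
    ddep edges (dependent, source) and copy equivalences (original, copy)."""
    upstream = defaultdict(set)
    for dependent, source in ddep:
        upstream[dependent].add(source)
    for original, duplicate in equiv:
        upstream[duplicate].add(original)  # a copy derives from its original

    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for src in upstream[node]:
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


ddep_alice = {("z", "x")}    # trace TA: z derived from x
ddep_bob = {("v", "z2")}     # trace TB: v derived from Bob's copy z2
equiv = {("z", "z2")}        # recorded when z was copied to z2

print(extended_ddep_star(ddep_alice | ddep_bob, equiv, "v"))
```

With the equivalence recorded, the provenance of v reaches Alice's input x; without it, the traversal stops at z2.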

Page 14: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Prototype Architecture

Page 15: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Conclusions 1/2

• In theory, provenance interoperability should be solved/easy using e.g. OPM

• In practice it isn’t (cf. Provenance Challenge workshops), e.g.

– different mappings to OPM

– different identifier schemes

– traces broken “at the seams”

• Summer-of-code DToL prototype demonstrates feasibility of provenance-aware collaboration / workflow interoperation through data

– Extends potential of provenance analysis beyond isolated workflow-based experiments

• Findings relevant for data preservation in

– Tracing data access is key

Page 16: Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010

Conclusions 2/2

• DataONE:

– http://www.dataone.org/

• Data Tree-of-Life (DToL Summer Project)

– https://sites.google.com/site/datatolproject/

• Runtime wf systems interoperability can be very hard

– … and benefits not clear (unless “layered” approach w/ different roles of wf systems)

– wf provenance interoperability to the rescue!

• Next Steps:

– DataONE Working Group on Provenance for Scientific Workflows

– Develop DOPM (DataONE Provenance Model; OPM++)