From Data to Knowledge with Workflows & Provenance

From Data to Knowledge with Workflows & Provenance Bertram Ludäscher Graduate School of Library and Information Science (GSLIS) Affiliate: National Center for Supercomputing Applications (NCSA) Department of Computer Science (CS @ Illinois)


NCSA colloquium on Sept 12, 2014: http://illinois.edu/calendar/detail/1435?eventId=32072828&calMin=201409&cal=20140209&skinId=160

Transcript of From Data to Knowledge with Workflows & Provenance

Page 1: From Data to Knowledge with Workflows & Provenance

From Data to Knowledge with Workflows & Provenance

Bertram Ludäscher

Graduate School of Library and Information Science (GSLIS) Affiliate:

National Center for Supercomputing Applications (NCSA) Department of Computer Science (CS @ Illinois)

Page 2: From Data to Knowledge with Workflows & Provenance

NCSA Colloquium Sep 12, 2014 Data to Knowledge w/ Scientific Workflows & Provenance B. Ludäscher

Outline

•  About Yours Truly
–  … where I’m coming from
–  … strange loops …

•  From Data To Knowledge …
•  … Scientific Workflows (CI “Upper-Ware”)
•  … and Provenance (part of CI “Underware”)

•  Other Research Interests & Projects
–  Reprise (… me not)
–  Sept. 19: CIRSS Seminar @ GSLIS (Reasoning about Taxonomies)
–  Sept. 23: (Oct 7) Yahoo!-DAIS Seminar @ CS (First-order Provenance Games)

Page 3: From Data to Knowledge with Workflows & Provenance


Some Personal Provenance …

•  Studies of Computer Science at Uni Karlsruhe (TH)
–  … my Alma Mater now defunct!?? :-(
–  … deus ex machina: K.I.T. (Karlsruhe Institute of Technology) :-)
–  Fridericiana Polytechnic (1825) ... TU Karlsruhe (1865) ... KIT (2009)

•  Undergrad work: Task-Setup Service (TSS)
–  part of HECTOR (HEterogeneous Computers TOgetheR, IBM & U-KA), top-layer above DACNOS (Distributed Academic Network Operating System)
–  early “upper-ware”!

•  … (scientific) workflows!!

Page 4: From Data to Knowledge with Workflows & Provenance


Sacred Scrolls … Prophesizing the Grid (DACNOS) & workflows (TSS)

Foerster, Cora. "Controlling Distributed User Tasks in Heterogeneous Networks." In HECTOR: Heterogeneous Computers Together. A Joint Project of IBM and the University of Karlsruhe. Springer Berlin Heidelberg, 1988.

“All this has happened before, and all this will happen again”

Page 5: From Data to Knowledge with Workflows & Provenance


… too much C hacking … on to AI & Logic!

•  Workflows? Hacking?
–  Boring…

•  Databases??
–  Boooring!!

•  AI, Logic Programming?
–  Sounds good!
–  Non-monotonic reasoning
•  Well-founded semantics
•  Stable models (now ASP)

•  MSc (Diplom)
–  First-order theorem prover (BDD variant)

“All this has happened before, and all this will happen again”

Page 6: From Data to Knowledge with Workflows & Provenance


… and onto (logic) databases!

•  PhD at University of Freiburg

“All this has happened before, and all this will happen again”

Page 7: From Data to Knowledge with Workflows & Provenance


… fast forward to the present (back to the future!)

•  Datalog becomes popular again:
–  Datalog 2.0 in Oxford and Vienna: The resurgence of Datalog in academia and industry

•  Statelog is in demand again
–  The Declarative Imperative: Experiences and Conjectures in Distributed Logic. Joe Hellerstein. PODS Keynote, 2010.

•  LogicBlox Inc. (Atlanta)
–  Re-invent how enterprise software is built
–  Under the hood: LogiQL
•  … a high-performance Datalog engine

Page 8: From Data to Knowledge with Workflows & Provenance


Vision: Re-invent how enterprise software is built

Unified Runtime based on Datalog

Language: Datalog Plus
•  Skolem functions
•  Existentials in the head
•  Meta-Programming layer
•  Integration with LP Solvers
•  Expressive constraints
•  ...

Execution-Engine
Cloud:
•  Cost-based optimizer
•  Versioned data-structures
•  Full serializability
Browser:
•  Compiled to Javascript

Molham Aref, LogicBlox

Page 9: From Data to Knowledge with Workflows & Provenance


•  1998-2004 SDSC & CSE Dept
–  NARA, digital libraries
•  w/ Reagan Moore
–  Data Integration research
–  Started Kepler
•  w/ Matt Jones, Ilkay Altintas, …
•  Head start: Ptolemy II (open source)
–  EECS @ Berkeley (E.A. Lee)
–  Naming things is fun!
•  Mediation of Information in XML (MIX)
•  Blended Browsing & Querying (BBQ)
•  Knowledge-based Information Integration of Neuroscience Data (KIND)
•  Ptolemy.. Copernicus … Kepler!
•  Neon… Geosciences … Network … GEON!

… down by the sea!

Page 10: From Data to Knowledge with Workflows & Provenance


… from SoCal to NorCal … to the Midwest!

•  2004-2014 UC Davis

•  Major projects (finished)
–  Kepler/CORE, pPOD, ChIP-chip, COMET, SDM, REAP

•  Ongoing & new:
–  FilteredPush
–  Euler, Exploring Taxon Concepts
–  DataONE
–  Kurator

•  Research themes (& names :-)
–  Scientific data mgmt, workflows, provenance, KR&R, data curation …
–  Kepler/COMAD, X-CSR, Euler …

UC DAVIS Department of Computer Science

Page 11: From Data to Knowledge with Workflows & Provenance


The 4th Paradigm

•  CI, e-Science •  bioinformatics •  ecoinformatics •  geoinformatics

•  Big Data •  Data Science •  Information Science •  Digital Humanities …

Page 12: From Data to Knowledge with Workflows & Provenance


Scientific Workflows: Cyberinfrastructure “Upperware”

Underware

Middleware

Upper Middleware

Upperware

NSF/SEEK ITR collaboration (2002-2008): SDSC, UCSB, UC Davis, UNM, UK, …

Page 13: From Data to Knowledge with Workflows & Provenance


Problem: Stitching together Tools and Databases

•  Tool Integration
–  local, remote, tools, services, databases, applications
–  BLAST on myPC?
–  My R script on the cluster?

•  Data Handling
–  Where’s the data? Access methods?
–  A.out doesn’t fit B.in
–  Many runs, experiments

•  Automate, optimize, scale, reuse, share wfs

•  “Explain” results

Page 14: From Data to Knowledge with Workflows & Provenance


“Integration Technologies” for Data, Tools, Models

•  State of the art in tool integration often involves plumbing, stitching, and stapling …

Page 15: From Data to Knowledge with Workflows & Provenance


Scientific Workflows: ASAP!

•  Automation
–  wfs to automate computational aspects of science
–  batch processing, scripting

•  Scaling (exploit and optimize machine cycles)
–  wfs should make use of parallel compute resources
•  dataflow-orientation avoids von Neumann bottleneck
•  use parallel MoCs when deploying on cluster, cloud
–  wfs should be able to handle large data

•  Abstraction, Evolution, Reuse (human cycles)
–  wfs should be easy to change, evolve, share, reuse

•  Provenance
–  wfs should capture processing history, data lineage => traceable data- and wf-evolution => Reproducible Science

Page 16: From Data to Knowledge with Workflows & Provenance


WATERS: Workflow for Alignment, Taxonomy, Ecology of Ribosomal Sequences (Amber Hartman; Eisen Lab; UC Davis)

Boxes shown on the slide:
•  Find OTUs (OTUHunter)
•  Assign Taxonomy (STAP)
•  Profile alignment (STAP or Infernal)
•  Build phylogenetic tree (RaxML or Quicktree)
•  View tree: Dendroscope
•  UniFrac: tree & environment file
•  Assembled contigs
•  Chimera check (Mallard)
•  Diversity statistics: Text: OTU list, Chao1, Shannon; Graphs: rarefaction curves, rank-abundance curves
•  Visualization tools: Cytoscape networks & Heat map

Markers attached to individual boxes: +/- cipres (once), +/- cluster (three times)

Page 17: From Data to Knowledge with Workflows & Provenance


Executable WATERS Workflow in Kepler

Page 18: From Data to Knowledge with Workflows & Provenance


Example Bioinformatics Workflow: Motif-Catcher

Marc Facciotti et al. UC Davis Genome Center

Page 19: From Data to Knowledge with Workflows & Provenance


Motif-Catcher workflow, implemented in Kepler

S Köhler et al. Improved Motif Detection in Large Sequence Sets with Random Sampling in a Kepler workflow, ICCS-WS, 2012

Page 20: From Data to Knowledge with Workflows & Provenance


A Data-Streaming Workflow over Sensor Data

Page 21: From Data to Knowledge with Workflows & Provenance


Kepler Workflows & Decision Making (Kruger Natl. Park, South Africa)

SANParks Matt Jones, NCEAS @ UC Santa Barbara

Page 22: From Data to Knowledge with Workflows & Provenance


Scientific workflows: a(nother) silver bullet?

Beware of the Turing tar-pit in which everything is possible but nothing of interest is easy.

—Alan Perlis

Page 23: From Data to Knowledge with Workflows & Provenance


Scientific Workflow Design: Some Challenges

“And the graphical UI makes our scientific workflows so much easier to develop, understand and maintain!”

Page 24: From Data to Knowledge with Workflows & Provenance


Human Cycles vs Machine Cycles

•  Traditional Computer Science and HPC focus:
–  optimize algorithms, save machine cycles
–  massively parallelize execution

•  The most expensive cycles:
–  Human cycles!
–  Big scalability issues …
•  cf. Bernie’s “Big Data” ~ big problems with data!

•  Not either one or the other:
–  … better together! (cf. BSG)

Page 25: From Data to Knowledge with Workflows & Provenance


Overview: My Scientific Workflow Research

Modeling & Design

Provenance

Parallel Execution

Fault-Tolerance, Crash Recovery

Page 26: From Data to Knowledge with Workflows & Provenance


Programming in the large?

•  Monitor and control supercomputer simulations
–  50+ composite actors (subworkflows)
–  4 levels of hierarchy
–  1000+ atomic (Java) actors

[Figure: hierarchy of subworkflows annotated with actor counts and nesting depth, e.g. 43 actors (3 levels); 196, 206, and 243 actors (4 levels each); smaller subworkflows with 12 to 137 actors]

Norbert Podhorszki, ORNL (then: UC Davis)

Page 27: From Data to Knowledge with Workflows & Provenance


"Structured Plumbing" in Kepler

Cabellos et al. Computer Physics Communications 182, 2011

Page 28: From Data to Knowledge with Workflows & Provenance


Modeling & Design: “The limits of my language mean the limits of my world” (Wittgenstein)

•  Vanilla Process Network

•  Functional Programming Dataflow Network

•  XML Transformation Network

•  Collection-oriented Modeling & Design framework (COMAD)

–  “Look Ma: No Shims!”

Page 29: From Data to Knowledge with Workflows & Provenance


Problems with [too many] Shims and Wires

•  Shims need to be placed and connected
–  Tedious, error-prone

•  Distract from scientifically meaningful actors
–  Non-descriptive workflows: worth sharing?

•  Data organization is encoded in workflow structure
–  Not robust to data changes

•  Shims often lead to complex designs
–  Imagine all previous “design patterns” intertwined
–  GOTO-programming

COMAD/VDAL: Raising the level of abstraction
•  Localized control-flow
•  Data management not done via wires
•  Actors are coupled not by wire but by data!

Page 30: From Data to Knowledge with Workflows & Provenance


Pipelined Collection-Oriented Workflows Collection-Oriented Modeling & Design (COMAD)

–  fully embrace the assembly line metaphor

–  data = tagged nested collections

–  e.g. represented as flattened, pipelined (XML) token streams:

Actors (like assembly line workers), pass on what they don’t work on

T McPhillips, S Bowers, D Zinn, B Ludäscher
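
The assembly-line idea above can be sketched in a few lines of Python. This is only an illustration, not the Kepler/COMAD implementation: the token encoding (`("open", tag)`, `("data", tag, value)`, `("close", tag)`), the function name `comad_actor`, and the sample tags are all assumptions made for this example. The actor works only on data inside the collections it is configured for and forwards every other token untouched.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical token encoding for a flattened nested collection:
# ("open", tag) | ("data", tag, value) | ("close", tag)
Token = Tuple


def comad_actor(stream: Iterable[Token], scope: str, in_tag: str,
                out_tag: str, fn: Callable) -> Iterator[Token]:
    """Apply fn to in_tag data inside scope collections; pass everything else on."""
    depth = 0  # number of enclosing scope collections currently open
    for tok in stream:
        if tok[0] == "open" and tok[1] == scope:
            depth += 1
        elif tok[0] == "close" and tok[1] == scope:
            depth -= 1
        yield tok  # every incoming token is forwarded unchanged
        if depth > 0 and tok[0] == "data" and tok[1] == in_tag:
            # The result is put back into the stream right after its input.
            yield ("data", out_tag, fn(tok[2]))


stream = [
    ("open", "Project"),
    ("open", "Run"), ("data", "x", 2), ("data", "note", "keep me"), ("close", "Run"),
    ("open", "Other"), ("data", "x", 5), ("close", "Other"),
    ("close", "Project"),
]
result = list(comad_actor(stream, scope="Run", in_tag="x",
                          out_tag="x_squared", fn=lambda v: v * v))
```

Only the `x` inside the `Run` collection is processed; the `x` inside `Other` and the unrelated `note` token flow through unchanged, so no shim actors or extra wires are needed to route them around this step.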

Page 31: From Data to Knowledge with Workflows & Provenance


Layers in COMAD / VDAL Pipelines

WF Graph

Configurations (white-box)

Scientific Functions (black-boxes)

CipresRAxML In: DNASeq+

Thres: Float

Method: String

Out: (t:Tree, s:score)+

• Access data in XML stream • Call Scientific Functions (Services) • Put results back into stream

Page 32: From Data to Knowledge with Workflows & Provenance


COMAD/VDAL Actor Execution Semantics

Page 33: From Data to Knowledge with Workflows & Provenance


Two different workflow designs

•  Hardwiring vs. configurable data/collection management •  brittle vs. change resilient designs •  scientist can recognize napkin drawing/conceptual model •  Human cycles are expensive

Page 34: From Data to Knowledge with Workflows & Provenance


ADIOS in Kepler

Page 35: From Data to Knowledge with Workflows & Provenance


ADIOS in COMAD

Page 36: From Data to Knowledge with Workflows & Provenance


Conceptual Pipeline w/ Scopes & Types

Daniel Zinn et al. ICDE’09

Page 37: From Data to Knowledge with Workflows & Provenance


Optimizing Execution Schedules: Paral·lel

Paral·lel (Barcelona Metro)

Page 38: From Data to Knowledge with Workflows & Provenance


X-CSR (“XML Scissor”): Cut-Ship-Reassemble

Daniel Zinn et al. ICDE’09

Page 39: From Data to Knowledge with Workflows & Provenance


Workflow Execution Analysis and Optimization

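[Figure: Actor A and Actor B connected by a queue, processing data items d1, d2, d3 inside a collection <C> … </C>; shown for COMAD (with a COMAD layer around the actors) and for Kepler PN, together with the optimal schedule. Invocation labels: A:1, B:1, B:1:1, B:1:2, B:1:3.]

Analysis + Data mining

Sven Köhler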

Sven Köhler

Page 40: From Data to Knowledge with Workflows & Provenance


Dataflow Network (generic) and Views

Page 41: From Data to Knowledge with Workflows & Provenance


Kahn Process Networks

Kahn, Gilles & David MacQueen. "Coroutines and networks of parallel processes." (1976).

Kahn, Gilles. "The semantics of a simple language for parallel programming." (1974)

Page 42: From Data to Knowledge with Workflows & Provenance


Synchronous Dataflow (SDF)

Lee, Edward A., and David G. Messerschmitt. "Synchronous data flow." Proc. of the IEEE 75, no. 9 (1987): 1235-1245.

Page 43: From Data to Knowledge with Workflows & Provenance


Workflow Recovery in SDF

Page 44: From Data to Knowledge with Workflows & Provenance


Idea: “Rescue DAG” (cf. Condor/DAGMan)

Sven Köhler et al. Improving Workflow Fault Tolerance through Provenance-Based Recovery. SSDBM 2011

Page 45: From Data to Knowledge with Workflows & Provenance


COMAD

Page 46: From Data to Knowledge with Workflows & Provenance


VisTrails [Juliana Freire, et al]

Page 47: From Data to Knowledge with Workflows & Provenance


Restflow (Tim McPhillips)

Page 48: From Data to Knowledge with Workflows & Provenance


So many MoCs, so little time …

Page 49: From Data to Knowledge with Workflows & Provenance


Outline

•  About Yours Truly
–  … where I’m coming from
–  … strange loops …

•  From Data To Knowledge …
•  … Scientific Workflows (CI “Upper-Ware”)
•  … and Provenance (part of CI “Underware”)

•  Other Research Interests & Projects
–  Reprise (… me not)
–  Sept. 19: CIRSS Seminar @ GSLIS (Reasoning about Taxonomies)
–  Sept. 23: Yahoo!-DAIS Seminar @ CS (First-order Provenance Games)

Page 50: From Data to Knowledge with Workflows & Provenance


From “Climate Gate” to Reproducible Science

Capturing provenance is crucial for transparency, interpretation, debugging, … => repeatable experiments, => reproducible science => need workflow-system agnostic model

Page 51: From Data to Knowledge with Workflows & Provenance


Data & Provenance Management: Model Chains

Page 52: From Data to Knowledge with Workflows & Provenance


The Data Life Cycle

Page 53: From Data to Knowledge with Workflows & Provenance


From Data Life-Cycle to Curation Life-Cycle

Uncanny Resemblance: Eye of Jupiter (“Vision Thing”?)

DCC Curation Lifecycle

Page 54: From Data to Knowledge with Workflows & Provenance


Common Uses of Provenance Data in Science

•  Audit trail: trace data generation and possible errors
•  Attribution: determine ownership and responsibility for data and scientific results
•  Data quality: assess from the quality of input data and computations
•  Discovery: enable searching of data, methodologies and experiments
•  Replication: facilitate repeatable derivation of data to maintain currency

⇒  Reproducible Science

But: different MoCs imply different Observables (and “Knowables”) ⇒ different MoPs

Page 55: From Data to Knowledge with Workflows & Provenance


The Executable Paper

Executable Paper Grand Challenge, International Conference on Computational Science, ICCS 2011

The Collage Authoring Environment, Piotr Nowakowski et al.

Page 56: From Data to Knowledge with Workflows & Provenance


Motivation: Virtual Joint Experiments

•  How do we ensure that Charlie gets a complete account of the history of WC’s outputs?

•  How do we ensure that Alice gets her due (partial) credit when Charlie uses Bob’s data v?
=> traces TA and TB will be critical
=> need to compose them to obtain TC

We can view the composition WC as a new, virtual workflow

[Figure: Alice (1) develops WA and (2) runs RA (data x, z); Bob (3) develops WB and (5) runs RB (data u, v); (4) data sharing, with mappings f and f⁻¹; Charlie (6) inspects provenance and (7) understands and generates WC := WA WS WB; traces TA and TB.]

Page 57: From Data to Knowledge with Workflows & Provenance


Provenance Composition: the Data Tree of Life (DToL)

•  We can formulate our questions in terms of provenance of the datasets produced by virtual workflow WC:
–  What is the complete provenance of v?
•  Answering the question requires tracing v’s derivation all the way to x
•  But, to achieve this, we need to ensure:
–  TA and TB are properly connected
–  Provenance queries run seamlessly over and across TA and TB
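
The question "what is the complete provenance of v?" can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: traces are modeled as lists of `(input, step, output)` edges, Alice's run derives z from x, Bob's run derives v from u, and the shared data is connected by a single mapping step `f` from z to u (our reading of the scenario, not a documented format).

```python
from collections import defaultdict

# Hypothetical traces as (input, step, output) derivation edges.
T_A = [("x", "R_A", "z")]   # Alice's run: x -> z
T_B = [("u", "R_B", "v")]   # Bob's run:   u -> v
LINK = [("z", "f", "u")]    # assumed mapping f that connects the shared data


def lineage(target, *traces):
    """Return every derivation edge that target depends on, across all traces."""
    producers = defaultdict(list)  # data item -> edges that produced it
    for trace in traces:
        for edge in trace:
            producers[edge[2]].append(edge)
    seen, result, stack = set(), [], [target]
    while stack:
        item = stack.pop()
        if item in seen:
            continue
        seen.add(item)
        for edge in producers[item]:
            result.append(edge)
            stack.append(edge[0])  # continue upstream from the edge's input
    return result


full = lineage("v", T_A, LINK, T_B)   # composed trace, reaches x
partial = lineage("v", T_B)           # Bob's trace alone, stops at u
```

Over Bob's trace alone the query stops at u, so Alice's contribution is invisible; only when the traces are properly connected does the lineage of v reach x.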

[Figure: the Alice / Bob / Charlie scenario again: workflows WA and WB, runs RA and RB, data x, z, u, v, data sharing via f and f⁻¹, traces TA and TB, and the virtual workflow WC.]

Page 58: From Data to Knowledge with Workflows & Provenance


Scientific Workflow Provenance in Action

WF Engine

ProvExplorer

ReproZip DataONE

ReproZip

WF Engine

Page 59: From Data to Knowledge with Workflows & Provenance


Data Quality & Curation Workflows

•  Collections & occurrence data is all over the map
–  … literally (off the map!)

•  Issues:
–  Lat/Long transposition, coordinate & projection issues
–  Data entry/creation, “fuzzy” data, naming issues, bit rot, data conversions and transformations, schema mappings, … (you name it)

•  Filtered-Push Collaboration

Page 60: From Data to Knowledge with Workflows & Provenance


Page 61: From Data to Knowledge with Workflows & Provenance


Filtered-Push: Kurator (Data Curation Workflows)

Tianhong Song

Lei Dou (former member)

Sven Köhler

Page 62: From Data to Knowledge with Workflows & Provenance


Data Curation Pipeline (w/ your friends in the loop)

[SPHNC’2011]

Page 63: From Data to Knowledge with Workflows & Provenance


Curation Workflow: Features

•  Human-in-the-loop
–  You “wrapped” your buddies/experts into the workflow!

•  Uses Open Authorization

•  Certain changes captured in the data
–  ... by workflow developer/engineer
–  Highlighted in the spreadsheets (cf. “duplicate records”)

•  Automatic capture of provenance information
–  data lineage and processing history

•  Provenance information
–  can be visualized, browsed, and queried

[SPHNC’2011]

Page 64: From Data to Knowledge with Workflows & Provenance


Koogle: Google Cloud + Kepler

[SPHNC’2011]

Page 65: From Data to Knowledge with Workflows & Provenance


Koogle Kuration package: Kepler + Google cloud (esp. spreadsheet) services

Actor: Function
importer: import data to a spreadsheet
exporter: export data from a spreadsheet
copy: copy a spreadsheet from a template
share: share the spreadsheet with another user
query: query data from the spreadsheet
auditor: allow human interaction during the execution of the workflow

[SPHNC’2011]

Page 66: From Data to Knowledge with Workflows & Provenance


You’ve got Mail! (Two curation requests)

[SPHNC’2011]

Page 67: From Data to Knowledge with Workflows & Provenance


Inspect, edit (if necessary), submit!

[SPHNC’2011]

Page 68: From Data to Knowledge with Workflows & Provenance


… second request

[SPHNC’2011]

Page 69: From Data to Knowledge with Workflows & Provenance


DONE! Summary message…

[SPHNC’2011]

Page 70: From Data to Knowledge with Workflows & Provenance


[SPHNC’2011]

Page 71: From Data to Knowledge with Workflows & Provenance


http://www.youtube.com/watch?v=DEkPbvLsud0

[SPHNC’2011]

Page 72: From Data to Knowledge with Workflows & Provenance


FilteredPush Curation Provenance (Spreadsheet View)

Page 73: From Data to Knowledge with Workflows & Provenance


… and then there is One More Thing …

Page 74: From Data to Knowledge with Workflows & Provenance


An End-to-End Climate Workflow

Configure Climate Model

Data Repository

Search Data

Process Data

Model Inputs

Build Climate Model

Run Climate Model

Model Outputs

Exploration, Visualization, & Analysis

Uncertainty Quantification

Diagnostics Generation

Exploratory Analysis

Model Benchmarking Archive Data

Repository

Src: Yaxing Wei, ORNL (EVA WG)

Page 75: From Data to Knowledge with Workflows & Provenance


Model Benchmarking using UV-CDAT

Workflow

Result

Src: Yaxing Wei, ORNL (EVA WG)

Page 76: From Data to Knowledge with Workflows & Provenance


DataONE Provenance & Semantics Use Case

The North American Carbon Program Multi-Scale Synthesis and Terrestrial Model Intercomparison Project

D. N. Huntzinger, C. Schwalm, A. M. Michalak, K. Schaefer, A. W. King, Y. Wei, A. Jacobson, S. Liu, R. B. Cook, W. M. Post, G. Berthier, D. Hayes, M. Huang, A. Ito, H. Lei, C. Lu, J. Mao, C. H. Peng, S. Peng, B. Poulter, D. Ricciuto, X. Shi, H. Tian, W. Wang, N. Zeng, F. Zhao, and Q. Zhu

Provenance •  Externally facing •  Internally facing

Page 77: From Data to Knowledge with Workflows & Provenance


D-OPM: DataONE version of OPM for sci-wf

D-OPM (DataONE ProvWG)

OPM-W: Daniel Garijo, Yolanda Gil

Page 78: From Data to Knowledge with Workflows & Provenance


Structural Integrity: Traces => Workflows

Structural integrity

Implied temporal constraints

Temporal constraint declaration

Page 79: From Data to Knowledge with Workflows & Provenance


Logic / Rule-based Provenance Analyzer

Related: Prov-WG

Saumen Dey

Page 80: From Data to Knowledge with Workflows & Provenance


From Models of Computation to Models of Provenance

M. Anand, S. Bowers, et al., SSDBM’09

Page 81: From Data to Knowledge with Workflows & Provenance


Fine-grained, Data & MoC-aware MoP

M. Anand, S. Bowers, et al., SSDBM’09

Page 82: From Data to Knowledge with Workflows & Provenance


Hamming Numbers (executable Kepler workflow)

Compute Hamming numbers H in order, where H = { 2^i · 3^j · 5^k | i, j, k ≥ 0 }, a.k.a. regular numbers or 5-smooth numbers (numbers whose prime divisors are less than or equal to 5).

Babylonian clay tablet with annotations. The diagonal displays an approximation of the square root of 2 in four sexagesimal figures, which is about six decimal figures: 1 + 24/60 + 51/60^2 + 10/60^3 = 1.41421296...
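
As a point of comparison for the workflow version, here is a minimal Python sketch of the classic way to produce Hamming numbers in order. It is not the Kepler workflow from the slide; the function name `hamming` and the three read positions are our own choices. The sequence feeds back into itself: each new number is the smallest of 2·h, 3·h and 5·h over numbers already produced. The last line also re-checks the sexagesimal arithmetic quoted above.

```python
def hamming(n):
    """Return the first n Hamming numbers (2^i * 3^j * 5^k) in increasing order."""
    h = [1]
    i2 = i3 = i5 = 0  # read positions of the x2, x3 and x5 streams within h
    while len(h) < n:
        nxt = min(2 * h[i2], 3 * h[i3], 5 * h[i5])
        h.append(nxt)
        # Advance every stream that produced nxt, so duplicates are merged away.
        if nxt == 2 * h[i2]:
            i2 += 1
        if nxt == 3 * h[i3]:
            i3 += 1
        if nxt == 5 * h[i5]:
            i5 += 1
    return h[:n]


first = hamming(15)

# Babylonian approximation of sqrt(2) with sexagesimal digits 1; 24, 51, 10
approx = 1 + 24 / 60 + 51 / 60**2 + 10 / 60**3
```

Here `first` is `[1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24]`, and `approx` differs from the square root of 2 by less than one millionth.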

Page 83: From Data to Knowledge with Workflows & Provenance


Two Hamming workflow variants: H1 vs. H3

Page 84: From Data to Knowledge with Workflows & Provenance


It's Quiz-Time again!

[Figure: two diagrams, each with nodes labeled X2, X3, X5, S2, S3, S5, M1, M2 and Q1 to Q8, shown next to a Hamming trace]

Hamming Trace

Does it match Hamming Workflow H1?

… or Hamming Workflow H3 ??

Page 85: From Data to Knowledge with Workflows & Provenance


Hamming Traces – "Debugged"

[Figure: two trace graphs over the Hamming numbers from 1 to 1000: Trace of H1 ("Fish") and Trace of H3 ("Sail")]

Page 86: From Data to Knowledge with Workflows & Provenance


Provenance & Privacy (ProPub: Provenance Publisher)

Saumen Dey, UC Davis

Page 87: From Data to Knowledge with Workflows & Provenance


Meet Prof. Nico Franz: Curator of Insects @ ASU

Page 88: From Data to Knowledge with Workflows & Provenance


From Tool Users to Tool Makers

Screen capture… back to the original definition

Page 89: From Data to Knowledge with Workflows & Provenance


Conclusion: Better Together

•  Human & Machine Cycles
–  Better information and workflow modeling (COMAD/VDAL)
–  and more scalable execution (X-CSR, tagged dataflow, …)

•  Theory & Practice
–  Experimental theory (CS problems + ASP + Info Vis)
•  e.g. rediscovering Dedekind numbers via taxonomy debugging
–  D(N) = |monotone Boolean functions over N variables|
–  Information Science & Software-Carpentry
•  Support tool makers!

•  Big Data, Data Science, and all the rest!
–  Excited to work at the intersection of GSLIS & NCSA & CS!
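
The definition D(N) = |monotone Boolean functions over N variables| given above can be checked directly for small N. The following Python sketch is a brute-force count written for illustration only (it is not the ASP-based taxonomy debugging mentioned in the talk, and the function name `dedekind` is our own); it enumerates every truth table and keeps those that never decrease when an input bit flips from 0 to 1.

```python
from itertools import product


def dedekind(n):
    """Count monotone Boolean functions over n variables by brute force (small n only)."""
    points = list(product((0, 1), repeat=n))
    # Index pairs (i, j) with points[i] <= points[j] componentwise;
    # a function f is monotone iff f(points[i]) <= f(points[j]) for all of them.
    ordered = [(i, j)
               for i, a in enumerate(points)
               for j, b in enumerate(points)
               if all(x <= y for x, y in zip(a, b))]
    count = 0
    for table in product((0, 1), repeat=len(points)):  # every truth table
        if all(table[i] <= table[j] for i, j in ordered):
            count += 1
    return count


values = [dedekind(n) for n in range(5)]
```

This yields 2, 3, 6, 20, 168 for N = 0 to 4. The number of truth tables is 2^(2^N), so the brute-force approach stops being practical almost immediately after that.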